
Top 100+ Apache Impala Interview Questions And Answers - May 26, 2020


Question 1. How Do I Try Impala Out?

Answer :

To explore the core features and functionality of Impala, the easiest way to try it out is to download the Cloudera QuickStart VM and start the Impala service through Cloudera Manager, then use impala-shell in a terminal window or the Impala Query UI in the Hue web interface.

To do performance testing and try out the management features for Impala on a cluster, you need to move beyond the QuickStart VM with its virtualized single-node environment. Ideally, download the Cloudera Manager software to set up the cluster, then install the Impala software through Cloudera Manager.

Question 2. Does Cloudera Offer A Vm For Demonstrating Impala?

Answer :

Cloudera offers a demonstration VM called the QuickStart VM, available in VMware, VirtualBox, and KVM formats. For more information, see the Cloudera QuickStart VM. After booting the QuickStart VM, many services are turned off by default; in the Cloudera Manager UI that appears automatically, turn on Impala and any other components that you want to try out.

Question 3. Where Can I Find Impala Documentation?

Answer :

Starting with Impala 1.3.0, Impala documentation is integrated with the CDH 5 documentation, in addition to the standalone Impala documentation for use with CDH 4. For CDH 5, the core Impala developer and administrator information remains in the associated Impala documentation portion. Information about Impala release notes, installation, configuration, startup, and security is embedded in the corresponding CDH 5 guides.

New features
Known and fixed issues
Incompatible changes
Installing Impala
Upgrading Impala
Configuring Impala
Starting Impala
Security for Impala
CDH Version and Packaging Information
Question 4. Where Can I Get Sample Data To Try?

Answer :

You can get scripts that produce data files and set up an environment for TPC-DS style benchmark tests from this GitHub repository. In addition to being useful for experimenting with performance, the tables are well suited to experimenting with many aspects of SQL on Impala: they contain a good mixture of data types, data distributions, partitioning, and relational data suitable for join queries.

Question 5. How Much Memory Is Required?

Answer :

Although Impala is not an in-memory database, when dealing with large tables and large result sets, you should expect to dedicate a significant portion of physical memory to the impalad daemon. Recommended physical memory for an Impala node is 128 GB or higher. If practical, dedicate about 80% of physical memory to Impala.

The amount of memory required for an Impala operation depends on several factors:

The file format of the table. Different file formats represent the same data in more or fewer data files. The compression and encoding for each file format may require a different amount of temporary memory to decompress the data for analysis.
Whether the operation is a SELECT or an INSERT. For example, Parquet tables require relatively little memory to query, because Impala reads and decompresses data in 8 MB chunks. Inserting into a Parquet table is a more memory-intensive operation because the data for each data file (potentially hundreds of megabytes, depending on the value of the PARQUET_FILE_SIZE query option) is held in memory until it is encoded, compressed, and written to disk.
Whether the table is partitioned or not, and whether a query against a partitioned table can take advantage of partition pruning.
Whether the final result set is sorted by the ORDER BY clause. Each Impala node scans and filters a portion of the total data, and applies the LIMIT to its own portion of the result set. In Impala 1.4.0 and higher, if the sort operation requires more memory than is available on any particular host, Impala uses a temporary disk work area to perform the sort. The intermediate result sets are all sent back to the coordinator node, which does the final sorting and then applies the LIMIT clause to the final result set.
For example, if you execute the query:

select * from giant_table order by some_column limit 1000;

and your cluster has 50 nodes, then each of those 50 nodes will transmit a maximum of 1000 rows back to the coordinator node. The coordinator node needs enough memory to sort (LIMIT * cluster_size) rows, even though in the end the final result set is at most LIMIT rows, 1000 in this case.

Likewise, if you execute the query:

select * from giant_table where test_val > 100 order by some_column;

then each node filters out a set of rows matching the WHERE conditions, sorts the results (with no size limit), and sends the sorted intermediate rows back to the coordinator node. The coordinator node might need substantial memory to sort the final result set, and so might use a temporary disk work area for that final phase of the query.

Whether the query contains any join clauses, GROUP BY clauses, analytic functions, or DISTINCT operators. These operations all require some in-memory work areas that vary depending on the volume and distribution of data. In Impala 2.0 and later, these kinds of operations utilize temporary disk work areas if memory usage grows too large to handle.
The size of the result set. When intermediate results are passed around between nodes, the amount of data depends on the number of columns returned by the query. For example, it is more memory-efficient to query only the columns that are actually needed in the result set rather than always issuing SELECT *.
The mechanism by which work is divided for a join query. You use the COMPUTE STATS statement, and query hints in the most difficult cases, to help Impala pick the most efficient execution plan. A typical workflow is sketched after this list.
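
As a minimal sketch (the table and column names here are hypothetical), you would gather statistics on each table involved in a large join before running it:

-- Gather table and column statistics so the planner can choose a good join order.
compute stats sales;
compute stats customers;

-- The join now benefits from the gathered statistics.
select c.name, sum(s.amount)
from sales s join customers c on s.customer_id = c.id
group by c.name;
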
Question 6. What Are The Main Features Of Impala?

Answer :

A large set of SQL statements, including SELECT and INSERT, with joins, subqueries in Impala SELECT statements, and Impala analytic functions. Highly compatible with HiveQL, and also including some vendor extensions.
Distributed, high-performance queries.
Using Cloudera Manager, you can deploy and manage your Impala services. Cloudera Manager is the best way to get started with Impala on your cluster.
Using Hue for queries.
Appending and inserting data into tables through the INSERT statement.
ODBC: Impala is certified to run against MicroStrategy and Tableau, with restrictions. For more information, see Configuring Impala to Work with ODBC.
Querying data stored in HDFS and HBase in a single query.
In Impala 2.2.0 and higher, querying data stored in the Amazon Simple Storage Service (S3).
Concurrent client requests. Each Impala daemon can handle multiple concurrent client requests. The effects on performance depend on your particular hardware and workload.
Kerberos authentication.
Partitions. With Impala SQL, you can create partitioned tables with the CREATE TABLE statement, and add and drop partitions with the ALTER TABLE statement. Impala also takes advantage of the partitioning present in Hive tables (see the sketch after this list).
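
A minimal sketch of those partitioning statements (the table and column names are hypothetical):

-- Create a table partitioned by year and month.
create table logs (msg string) partitioned by (year int, month int);

-- Add and drop partitions explicitly.
alter table logs add partition (year=2020, month=5);
alter table logs drop partition (year=2019, month=1);
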
Question 7. What Features From Relational Databases Or Hive Are Not Available In Impala?

Answer :

Querying streaming data.
Deleting individual rows. You delete data in bulk by overwriting an entire table or partition, or by dropping a table.
Indexing (not currently). LZO-compressed text files can be indexed outside of Impala, as described in Using LZO-Compressed Text Files.
Full text search on text fields. The Cloudera Search product is appropriate for this use case.
Custom Hive Serializer/Deserializer classes (SerDes). Impala supports a set of common native file formats that have built-in SerDes in CDH.
Checkpointing within a query. That is, Impala does not save intermediate results to disk during long-running queries. Currently, Impala cancels a running query if any host on which that query is executing fails. When one or more hosts are down, Impala reroutes future queries to use only the available hosts, and Impala detects when the hosts come back up and begins using them again. Because a query can be submitted through any Impala node, there is no single point of failure. In the future, we may consider adding additional work allocation features to Impala, so that a running query would complete even in the presence of host failures.
Encryption of data transmitted between Impala daemons.
Hive indexes.
Non-Hadoop data stores, such as relational databases.
Question 8. Does Impala Support Generic Jdbc?

Answer :

Impala supports the HiveServer2 JDBC driver.

Question 9. Is Avro Supported?

Answer :

Yes, Avro is supported. Impala has always been able to query Avro tables. You can use the Impala LOAD DATA statement to load existing Avro data files into a table. Starting with Impala 1.4, you can create Avro tables with Impala. Currently, you still use the INSERT statement in Hive to copy data from another table into an Avro table.
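
A minimal sketch, assuming Impala 1.4 or higher and a release recent enough to derive the Avro schema from the column list (the table name and HDFS path are hypothetical):

-- Create an Avro table directly from Impala.
create table avro_events (id int, payload string) stored as avro;

-- Load existing Avro data files from HDFS into the table.
load data inpath '/staging/events.avro' into table avro_events;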

Question 10. How Do I Know How Many Impala Nodes Are In My Cluster?

Answer :

The Impala statestore keeps track of how many impalad nodes are currently available. You can see this information through the statestore web interface. For example, at the URL http://statestore_host:25010/metrics you might see lines like the following:

statestore.live-backends:3

statestore.live-backends.list:[host1:22000, host1:26000, host2:22000]

The number of impalad nodes is the number of list items referring to port 22000, in this example two. (Typically, this number is one less than the number reported by the statestore.live-backends line.) If an impalad node became unavailable or came back after an outage, the information reported on this page would change appropriately.

Question 11. Are Results Returned As They Become Available, Or All At Once When A Query Completes?

Answer :

Impala streams results whenever they are available, when possible. Certain SQL operations (aggregation or ORDER BY) require all of the input to be ready before Impala can return results.

Question 12. Why Does My Select Statement Fail?

Answer :

When a SELECT statement fails, the cause normally falls into one of the following categories:

A timeout due to a performance, capacity, or network issue affecting one particular node.
Excessive memory use for a join query, resulting in automatic cancellation of the query.
A low-level issue affecting how native code is generated on each node to handle particular WHERE clauses in the query. For example, a machine instruction could be generated that is not supported by the processor of a certain node. If the error message in the log suggests the cause was an illegal instruction, consider turning off native code generation temporarily, and trying the query again.
Malformed input data, such as a text data file with an extremely long line, or with a delimiter that does not match the character specified in the FIELDS TERMINATED BY clause of the CREATE TABLE statement.
Question 13. Why Does My Insert Statement Fail?

Answer :

When an INSERT statement fails, it is often the result of exceeding some limit within a Hadoop component, typically HDFS.

An INSERT into a partitioned table can be a strenuous operation due to the possibility of opening many files and associated threads simultaneously in HDFS. Impala 1.1.1 includes some improvements to distribute the work more efficiently, so that the values for each partition are written by a single node, rather than as a separate data file from each node.
Certain expressions in the SELECT part of the INSERT statement can complicate the execution planning and result in an inefficient INSERT operation. Try to make the column data types of the source and destination tables match up, for example by doing ALTER TABLE ... REPLACE COLUMNS on the source table if necessary. Try to avoid CASE expressions in the SELECT part, because they make the result values harder to predict than passing a column unchanged or passing the column through a built-in function.
Be prepared to raise some limits in the HDFS configuration settings, either temporarily during the INSERT or permanently if you frequently run such INSERT statements as part of your ETL pipeline.
The resource usage of an INSERT statement can vary depending on the file format of the destination table. Inserting into a Parquet table is memory-intensive, because the data for each partition is buffered in memory until it reaches 1 gigabyte, at which point the data file is written to disk. Impala can distribute the work for an INSERT more efficiently when statistics are available for the source table that is queried during the INSERT statement.
Question 14. Does Impala Performance Improve As It Is Deployed To More Hosts In A Cluster In Much The Same Way That Hadoop Performance Does?

Answer :

Yes. Impala scales with the number of hosts. It is important to install Impala on all the DataNodes in the cluster, because otherwise some of the nodes must do remote reads to retrieve data not available for local reads. Data locality is an important architectural aspect of Impala performance.

Question 15. Is The Hdfs Block Size Reduced To Achieve Faster Query Results?

Answer :

No. Impala does not make any changes to the HDFS or HBase data sets.

The default Parquet block size is relatively large (256 MB in Impala 2.0 and later; 1 GB in earlier releases). You can control the block size when creating Parquet files using the PARQUET_FILE_SIZE query option.
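
For example, a minimal sketch in impala-shell (the table names are hypothetical; the size-suffix form of the value assumes Impala 2.0 or higher):

-- Write smaller Parquet files for the duration of this session.
set PARQUET_FILE_SIZE=128m;
insert overwrite table parquet_events select * from staging_events;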

Question 16. Does Impala Use Caching?

Answer :

Impala does not cache table data. It does cache some table and file metadata. Although queries might run faster on subsequent iterations because the data set was cached in the OS buffer cache, Impala does not explicitly control this.

Impala takes advantage of the HDFS caching feature in CDH 5. You can designate which tables or partitions are cached through the CACHED and UNCACHED clauses of the CREATE TABLE and ALTER TABLE statements. Impala can also take advantage of data that is pinned in the HDFS cache through the hdfs cacheadmin command.
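
A minimal sketch of those clauses (the table names are hypothetical, and the cache pool named here is assumed to already exist):

-- Cache a whole table in an existing HDFS cache pool.
create table hot_lookup (id int, val string) cached in 'impala_pool';

-- Cache or uncache the data of an existing table.
alter table big_table set cached in 'impala_pool';
alter table big_table set uncached;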

Question 17. What Are Good Use Cases For Impala As Opposed To Hive Or Mapreduce?

Answer :

Impala is well-suited to executing SQL queries for interactive exploratory analytics on large data sets. Hive and MapReduce are appropriate for very long running, batch-oriented tasks such as ETL.

Question 18. Is Mapreduce Required For Impala? Will Impala Continue To Work As Expected If Mapreduce Is Stopped?

Answer :

Impala does not use MapReduce at all.

Question 19. Can Impala Be Used For Complex Event Processing?

Answer :

For example, in an industrial environment, many agents may generate large amounts of data. Can Impala be used to analyze this data, checking for notable changes in the environment?

Complex Event Processing (CEP) is usually performed by dedicated stream-processing systems. Impala is not a stream-processing system; it most closely resembles a relational database.

Question 20. Is Impala Intended To Handle Real Time Queries In Low-latency Applications Or Is It For Ad Hoc Queries For The Purpose Of Data Exploration?

Answer :

Ad-hoc queries are the primary use case for Impala. We anticipate it being used in many other situations where low latency is required. Whether Impala is appropriate for any particular use case depends on the workload, data size, and query volume.

Question 21. How Does Impala Compare To Hive And Pig?

Answer :

Impala is different from Hive and Pig because it uses its own daemons that are spread across the cluster for queries. Because Impala does not rely on MapReduce, it avoids the startup overhead of MapReduce jobs, allowing Impala to return results in real time.

Question 22. Can I Do Transforms Or Add New Functionality?

Answer :

Impala provides support for UDFs as of Impala 1.2. You can write your own functions in C++, or reuse existing Java-based Hive UDFs. The UDF support includes scalar functions and user-defined aggregate functions (UDAs). User-defined table functions (UDTFs) are not currently supported.

Impala does not currently support an extensible serialization-deserialization framework (SerDes), so adding extra functionality to Impala is not as straightforward as it is for Hive or Pig.
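
A minimal sketch of registering and calling a native UDF, assuming Impala 1.2 or higher (the library path, symbol, and table name are hypothetical):

-- Register a C++ UDF from a compiled shared library stored in HDFS.
create function my_lower(string) returns string
location '/user/impala/udfs/libmyudfs.so' symbol='MyLower';

-- Use it like a built-in function.
select my_lower(name) from customers;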

Question 23. Can Any Impala Query Also Be Executed In Hive?

Answer :

Yes. There are some minor differences in how certain queries are handled, but Impala queries can also be completed in Hive. Impala SQL is a subset of HiveQL, with some functional limitations such as transforms.

Question 24. Can I Use Impala To Query Data Already Loaded Into Hive And Hbase?

Answer :

There are no additional steps to allow Impala to query tables managed by Hive, whether they are stored in HDFS or HBase. Make sure that Impala is configured to access the Hive metastore correctly and you should be ready to go. Keep in mind that impalad, by default, runs as the impala user, so you might need to adjust some file permissions depending on how strict your permissions currently are.

Question 25. Is Hive An Impala Requirement?

Answer :

The Hive metastore service is a requirement. Impala shares the same metastore database as Hive, allowing Impala and Hive to access the same tables transparently.

Hive itself is optional, and does not need to be installed on the same nodes as Impala. Currently, Impala supports a wider range of read (query) operations than write (insert) operations; you use Hive to insert data into tables that use certain file formats.

Question 26. Is Impala Production Ready?

Answer :

Impala has completed its beta release cycle, and the 1.0, 1.1, and 1.2 GA releases are production ready. The 1.1.x series includes additional security features for authorization, an important requirement for production use in many organizations. The 1.2.x series includes important performance features, particularly for large join queries. Some Cloudera customers are already using Impala for heavy workloads.

The Impala 1.3.0 and higher releases are bundled with corresponding levels of CDH 5. The number of new features grows with each release.

Question 27. How Do I Configure Hadoop High Availability (ha) For Impala?

Answer :

You can set up a proxy server to relay requests back and forth to the Impala servers, for load balancing and high availability.

Question 28. What Happens If There Is An Error In Impala?

Answer :

There is not a single point of failure in Impala. All Impala daemons are fully able to handle incoming queries. If a machine fails, however, all queries with fragments running on that machine will fail. Because queries are expected to return quickly, you can just rerun the query if there is a failure.

The longer answer: Impala must be able to connect to the Hive metastore. Impala aggressively caches metadata, so the metastore host should have minimal load. Impala relies on the HDFS NameNode, and, in CDH 4, you can configure HA for HDFS. Impala also has centralized services, called the statestore and catalog services, that run on one host only. Impala continues to execute queries if the statestore host is down, but it will not get state updates. For example, if a host is added to the cluster while the statestore host is down, the existing instances of impalad running on the other hosts will not find out about this new host. Once the statestore process is restarted, all the information it serves is automatically reconstructed from all running Impala daemons.

Question 29. What Is The Maximum Number Of Rows In A Table?

Answer :

There is no defined maximum. Some customers have used Impala to query a table with over a trillion rows.

Question 30. On Which Hosts Does Impala Run?

Answer :

Cloudera strongly recommends running the impalad daemon on every DataNode for good performance. Although this topology is not a hard requirement, if there are data blocks with no Impala daemons running on any of the hosts containing replicas of those blocks, queries involving that data could be very inefficient. In that case, the data must be transmitted from one host to another for processing by "remote reads", a condition Impala normally tries to avoid.

Question 31. How Are Joins Performed In Impala?

Answer :

By default, Impala automatically determines the most efficient order in which to join tables using a cost-based method, based on their overall size and number of rows. (This is a new feature in Impala 1.2.2 and higher.) The COMPUTE STATS statement gathers information about each table that is crucial for efficient join performance. Impala chooses between two techniques for join queries, known as "broadcast joins" and "partitioned joins"; see the sketch below.
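
If the automatic choice is not ideal, you can override it with hints; a minimal sketch (the table names are hypothetical):

-- Keep the table order as written and force a partitioned (shuffle) join
-- instead of a broadcast join.
select straight_join f.id, d.name
from fact_table f join [shuffle] dim_table d on f.dim_id = d.id;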

Question 32. How Does Impala Process Join Queries For Large Tables?

Answer :

Impala utilizes multiple techniques to allow joins between tables and result sets of various sizes. When joining a large table with a small one, the data from the small table is transmitted to each node for intermediate processing. When joining two large tables, the data from one of the tables is split into pieces, and each node processes only selected pieces.

Question 33. What Is Impala's Aggregation Strategy?

Answer :

Impala currently only supports in-memory hash aggregation. In Impala 2.0 and higher, if the memory requirements for a join or aggregation operation exceed the memory limit for a particular host, Impala uses a temporary work area on disk to help the query complete successfully.

Question 34. How Is Impala Metadata Managed?

Answer :

Impala uses two pieces of metadata: the catalog information from the Hive metastore and the file metadata from the NameNode. Currently, this metadata is lazily populated and cached when an impalad needs it to plan a query.

The REFRESH statement updates the metadata for a particular table after loading new data through Hive. The INVALIDATE METADATA statement refreshes all metadata, so that Impala recognizes new tables or other DDL and DML changes performed through Hive.

In Impala 1.2 and higher, a dedicated catalogd daemon broadcasts metadata changes due to Impala DDL or DML statements to all nodes, reducing or eliminating the need to use the REFRESH and INVALIDATE METADATA statements.
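
A minimal sketch of the two statements (the table name is hypothetical):

-- After loading new data files into an existing table through Hive or HDFS:
refresh sales;

-- After creating or altering tables through Hive:
invalidate metadata sales;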

Question 35. What Load Do Concurrent Queries Produce On The Namenode?

Answer :

The load Impala generates is very similar to MapReduce. Impala contacts the NameNode during the planning phase to get the file metadata (this is only run on the host the query was sent to). Every impalad will read files as part of normal processing of the query.

Question 36. How Does Impala Achieve Its Performance Improvements?

Answer :

These are the main factors in the performance of Impala versus that of other Hadoop components and related technologies.

Impala avoids MapReduce. While MapReduce is a great general parallel processing model with many advantages, it is not designed to execute SQL. Impala avoids the inefficiencies of MapReduce in these ways:

Impala does not materialize intermediate results to disk. SQL queries often map to multiple MapReduce jobs with all intermediate data sets written to disk.
Impala avoids MapReduce start-up time. For interactive queries, the MapReduce start-up time becomes very noticeable. Impala runs as a service and essentially has no start-up time.
Impala can more naturally disperse query plans instead of having to fit them into a pipeline of map and reduce jobs. This enables Impala to parallelize multiple stages of a query and avoid overheads such as sort and shuffle when they are unnecessary.

Impala uses a more efficient execution engine by taking advantage of modern hardware and technologies:

Impala generates runtime code. Impala uses LLVM to generate assembly code for the query that is being run. Individual queries do not have to pay the overhead of running on a system that needs to be able to execute arbitrary queries.
Impala uses available hardware instructions when possible. Impala uses the supplemental SSE3 (SSSE3) instructions, which can provide great speedups in some cases. (Impala 2.0 and 2.1 required the SSE4.1 instruction set; Impala 2.2 and higher relax the restriction again so that only SSSE3 is required.)
Impala uses better I/O scheduling. Impala is aware of the disk location of blocks and is able to schedule the order in which to process blocks to keep all disks busy.
Impala is designed for performance. A lot of time has been spent designing Impala with sound performance-oriented fundamentals, such as tight inner loops, inlined function calls, minimal branching, better use of cache, and minimal memory usage.
Question 37. What Happens When The Data Set Exceeds Available Memory?

Answer :

Currently, if the memory required to process intermediate results on a node exceeds the amount available to Impala on that node, the query is cancelled. You can adjust the memory available to Impala on each node, and you can fine-tune the join strategy to reduce the memory required for the largest queries. We do plan on supporting external joins and sorting in the future.

Keep in mind, though, that the memory usage is not directly based on the input data set size. For aggregations, the memory usage is proportional to the number of rows after grouping. For joins, the memory usage is the combined size of the tables excluding the largest table, and Impala can use join strategies that divide up large joined tables among the various nodes rather than transmitting the entire table to each node.

Question 38. What Are The Most Memory-intensive Operations?

Answer :

If a question fails with an error indicating "memory restriction passed", you would possibly suspect a memory leak. The problem ought to clearly be a query this is structured in a manner that reasons Impala to allocate extra memory than you count on, surpassed the memory allocated for Impala on a specific node. Some examples of query or table structures which are specially reminiscence-in depth are:

INSERT statements using dynamic partitioning, into a table with many different partitions. (Particularly for tables using Parquet format, where the data for each partition is held in memory until it reaches the full block size before it is written to disk.) Consider breaking up such operations into several smaller INSERT statements, for example to load data one year at a time rather than for all years at once.
GROUP BY on a unique or high-cardinality column. Impala allocates some handler structures for each distinct value in a GROUP BY query. Having millions of distinct GROUP BY values could exceed the memory limit.
Queries involving very wide tables, with thousands of columns, particularly with many STRING columns. Because Impala allows a STRING value to be up to 32 KB, the intermediate results during such queries could require substantial memory allocation.
Question 39. When Does Impala Hold On To Or Return Memory?

Answer :

Impala allocates memory using tcmalloc, a memory allocator that is optimized for high concurrency. Once Impala allocates memory, it keeps that memory reserved to use for future queries. Thus, it is normal for Impala to show high memory usage while idle. If Impala detects that it is about to exceed its memory limit (defined by the -mem_limit startup option or the MEM_LIMIT query option), it deallocates memory not needed by the current queries.

When issuing queries through the JDBC or ODBC interfaces, make sure to call the appropriate close method afterwards. Otherwise, some memory associated with the query is not freed.
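
A minimal sketch of capping memory per query in impala-shell (the 2gb value and the table name are arbitrary examples; the size-suffix form of the value is an assumption about your release):

-- Limit queries in this session to roughly 2 GB of memory per node.
set MEM_LIMIT=2gb;
select count(*) from big_table;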

Question 40. Is There An Update Statement?

Answer :

Impala does not currently have an UPDATE statement, which would typically be used to change a single row, a small group of rows, or a particular column. The HDFS-based files used by typical Impala queries are optimized for bulk operations across many megabytes of data at a time, making traditional UPDATE operations inefficient or impractical.

You can use the following techniques to achieve the same goals as the familiar UPDATE statement, in a way that preserves efficient file layouts for subsequent queries:

Replace the entire contents of a table or partition with updated data that you have already staged in a different location, using INSERT OVERWRITE, LOAD DATA, or manual HDFS file operations followed by a REFRESH statement for the table (see the sketch after this list). Optionally, you can use built-in functions and expressions in the INSERT statement to transform the copied data in the same way you would normally do in an UPDATE statement, for example to turn a mixed-case string into all uppercase or all lowercase.
To update a single row, use an HBase table, and issue an INSERT ... VALUES statement using the same key as the original row. Because HBase handles duplicate keys by only returning the most recent row with a particular key value, the newly inserted row effectively hides the previous one.
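
A minimal sketch of the INSERT OVERWRITE technique from the first item above (the table names are hypothetical):

-- Rewrite the whole table, applying the "update" during the copy.
insert overwrite table customers
select id, upper(name), address from customers_staged;
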
Question 41. Can Impala Do User-defined Functions (udfs)?

Answer :

Impala 1.2 and higher does support UDFs and UDAs. You can either write native Impala UDFs and UDAs in C++, or reuse UDFs (but not UDAs) originally written in Java for use with Hive.

Question 42. Why Do I Have To Use Refresh And Invalidate Metadata, What Do They Do?

Answer :

In Impala 1.2 and higher, there is much less need to use the REFRESH and INVALIDATE METADATA statements:

The new impala-catalog service, represented by the catalogd daemon, broadcasts the results of Impala DDL statements to all Impala nodes. Thus, if you issue a CREATE TABLE statement in Impala while connected to one node, you do not need to run INVALIDATE METADATA before issuing queries through a different node.
The catalog service only recognizes changes made through Impala, so you must still issue a REFRESH statement if you load data through Hive or by manipulating files in HDFS, and you must issue an INVALIDATE METADATA statement if you create a table, modify a table, add or drop partitions, or perform other DDL statements in Hive.
Because the catalog service broadcasts the results of REFRESH and INVALIDATE METADATA statements to all nodes, in the cases where you do still need to issue those statements, you can do so on a single node rather than on every node, and the changes are automatically recognized across the cluster, making it more convenient to load balance by issuing queries through arbitrary Impala nodes rather than always using the same coordinator node.
Question 43. Why Is Space Not Freed Up When I Issue Drop Table?

Answer :

Impala deletes data files when you issue a DROP TABLE on an internal table, but not an external one. By default, the CREATE TABLE statement creates internal tables, where the files are managed by Impala. An external table is created with a CREATE EXTERNAL TABLE statement, where the files reside in a location outside the control of Impala. Issue a DESCRIBE FORMATTED statement to check whether a table is internal or external. The keyword MANAGED_TABLE indicates an internal table, from which Impala can delete the data files. The keyword EXTERNAL_TABLE indicates an external table, where Impala leaves the data files untouched when you drop the table.
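
A minimal sketch (the table name is hypothetical):

-- Look for MANAGED_TABLE or EXTERNAL_TABLE in the output.
describe formatted my_table;

-- Dropping an internal table removes its data files; dropping an external
-- table leaves them in place.
drop table my_table;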

Even after you drop an internal table and the files are removed from their original location, you might not get the disk space back immediately. By default, files that are deleted in HDFS go into a special trashcan directory, from which they are purged after a period of time (by default, 6 hours). For background information on the trashcan mechanism, see the HDFS documentation.

Question 44. Is There A Dual Table?

Answer :

You might be used to running queries against a single-row table named DUAL to try out expressions, built-in functions, and UDFs. Impala does not have a DUAL table. To achieve the same result, you can issue a SELECT statement without any table name:

select 2+2;
select substr('hello',2,1);
select pow(10,6);

Question 45. How Do I Load A Big Csv File Into A Partitioned Table?

Answer :

To load a data file into a partitioned table, when the data file includes fields like year, month, and so on that correspond to the partition key columns, use a two-stage process. First, use the LOAD DATA or CREATE EXTERNAL TABLE statement to bring the data into an unpartitioned text table. Then use an INSERT ... SELECT statement to copy the data from the unpartitioned table to a partitioned one. Include a PARTITION clause in the INSERT statement to specify the partition key columns, as in the sketch below.
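
A minimal sketch of the two-stage process (the table names, columns, and HDFS path are hypothetical):

-- Stage 1: bring the raw CSV into an unpartitioned text table.
create table raw_events (year int, month int, msg string)
row format delimited fields terminated by ',';
load data inpath '/staging/events.csv' into table raw_events;

-- Stage 2: copy into the partitioned table; the partition key columns
-- come last in the select list.
create table events (msg string) partitioned by (year int, month int);
insert into events partition (year, month)
select msg, year, month from raw_events;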

Question 46. Can I Do Insert ... Select * Into A Partitioned Table?

Answer :

When you use the INSERT ... SELECT * syntax to copy data into a partitioned table, the columns corresponding to the partition key columns must appear last in the columns returned by the SELECT *. You can create the table with the partition key columns defined last. Or, you can use the CREATE VIEW statement to create a view that reorders the columns: put the partition key columns last, then do the INSERT ... SELECT * from the view, as in the sketch below.
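
A minimal sketch of the view-based workaround (the table and column names are hypothetical):

-- The view reorders the columns so the partition key columns come last.
create view source_reordered as
select msg, year, month from source_table;

insert into events partition (year, month)
select * from source_reordered;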

Question 47. What Kinds Of Impala Queries Or Data Are Best Suited For Hbase?

Answer :

HBase tables are ideal for queries where you would typically use a key-value store. That is, where you retrieve a single row or a few rows, by testing a special unique key column using the = or IN operators.

HBase tables are not suitable for queries that produce large result sets with thousands of rows. HBase tables are also not suitable for queries that perform full table scans, because the WHERE clause does not request specific values from the unique key column.

Use HBase tables for data that is inserted one row or a few rows at a time, such as by the INSERT ... VALUES syntax. Loading data piecemeal like this into an HDFS-backed table produces many tiny files, which is a very inefficient layout for HDFS data files.

If the lack of an UPDATE statement in Impala is a problem for you, you can simulate single-row updates by doing an INSERT ... VALUES statement using an existing value for the key column. The old row value is hidden; only the new row value is visible to queries.
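
A minimal sketch against an HBase-backed table (the table name and values are hypothetical):

-- Re-inserting with an existing key value hides the old row.
insert into hbase_users values ('user123', 'new-email@example.com');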

HBase tables are often wide (containing many columns) and sparse (with most column values NULL). For example, you might record hundreds of different data points for each user of an online service, such as whether the user had registered for an online game or enabled particular account features. With Impala and HBase, you could look up all the information for a particular customer efficiently in a single query. For any given customer, most of these columns might be NULL, because a typical customer might not make use of most features of an online service.



