
Big Data and Hadoop Interview Questions and Answers

Q1. What are real-time enterprise applications of Hadoop?

Ans: Hadoop, better known as Apache Hadoop, is an open-source software platform for scalable and distributed computing over massive volumes of data. It provides fast, high-performance and cost-effective analysis of structured and unstructured data generated on digital platforms and within the enterprise. It is used in almost all departments and sectors today. Some of the instances where Hadoop is used:

Managing traffic on streets.

Streaming processing.

Content Management and Archiving Emails.

Processing rat brain neuronal signals using a Hadoop computing cluster.

Fraud detection and Prevention.

Advertisement targeting platforms use Hadoop to capture and analyze clickstream, transaction, video and social media data.

Managing content, posts, images and videos on social media platforms.

Analyzing customer data in real time to improve business performance.

Public sector fields such as intelligence, defense, cyber security and scientific research.

Financial companies use Big Data Hadoop to reduce risk, analyze fraud patterns, identify rogue traders, target their marketing campaigns more precisely based on customer segmentation, and improve customer satisfaction.

Getting access to unstructured data like output from medical devices, doctor's notes, lab results, imaging reports, medical correspondence, clinical data, and financial data.

Q2. Compare Hadoop & Spark.

Ans:

Criteria             Hadoop                     Spark
Dedicated storage    HDFS                       None
Speed of processing  Average                    Excellent
Libraries            Separate tools available   Spark Core, SQL, Streaming, MLlib, GraphX

Q3. How is Hadoop different from other parallel computing systems?

Ans: Hadoop is a distributed file system that lets you store and handle large amounts of data on a cloud of machines, handling data redundancy. The primary benefit is that, since data is stored on several nodes, it is better to process it in a distributed manner. Each node can process the data stored on it instead of spending time moving it over the network.

In contrast, in a relational database computing system you can query data in real time, but it is not efficient to store data in tables, records and columns when the data is huge.

Hadoop also provides a scheme to build a column database with Hadoop HBase, for runtime queries on rows.

Q4. In which modes can Hadoop be run?

Ans: Hadoop can run in 3 modes:

Standalone Mode: The default mode of Hadoop; it uses the local file system for input and output operations. This mode is mainly used for debugging purposes, and it does not support the use of HDFS. Further, in this mode, no custom configuration is required for the mapred-site.xml, core-site.xml and hdfs-site.xml files. It is much faster compared to the other modes.

Pseudo-Distributed Mode (Single Node Cluster): In this case, you need configuration for all three files mentioned above. All daemons run on one node, and thus the Master and Slave node are the same.

Fully Distributed Mode (Multiple Cluster Node): This is the production phase of Hadoop (what Hadoop is known for), where data is used and distributed across several nodes on a Hadoop cluster. Separate nodes are allotted as Master and Slave.

Q5. Explain the major difference between HDFS block and InputSplit.

Ans: In simple terms, a block is the physical representation of data, while a split is the logical representation of the data present in the block. A split acts as an intermediary between the block and the mapper.

Suppose we have two blocks:

Block 1: myTectra

Block 2: my Tect

Now, considering the map, it will read the first block from its beginning to its end, but it does not know how to process the second block at the same time. Here Split comes into play: it forms a logical group of Block 1 and Block 2 as a single block.

It then forms key-value pairs using the InputFormat and RecordReader and sends them to the map for further processing. With InputSplit, if you have limited resources, you can increase the split size to limit the number of maps. For instance, if there are 10 blocks totaling 640MB (64MB each) and resources are limited, you can set the 'split size' to 128MB. This forms logical groups of 128MB, with only 5 maps executing at a time.

However, if the file is marked as not splittable, the whole file forms one InputSplit and is processed by a single map, which takes more time when the file is large.
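The split arithmetic above can be checked with a small Python sketch. The clamping rule mirrors how FileInputFormat derives a split size from the block size and the configured minimum and maximum; the helper names are illustrative, and the sketch ignores the small slack Hadoop allows on the last split.

```python
import math

MB = 1024 * 1024

def compute_split_size(block_size, min_size, max_size):
    # Clamp the block size between the configured minimum and maximum split sizes.
    return max(min_size, min(max_size, block_size))

def count_map_tasks(file_size, split_size):
    # One map task per split; the last split may be smaller than the others.
    return math.ceil(file_size / split_size)

file_size = 640 * MB    # 10 blocks of 64MB each, as in the example above
block_size = 64 * MB
no_limit = 2**63 - 1

# Default settings: the split size equals the block size.
default_split = compute_split_size(block_size, min_size=1, max_size=no_limit)
# Raising the minimum split size to 128MB groups two blocks per split.
larger_split = compute_split_size(block_size, min_size=128 * MB, max_size=no_limit)

maps_default = count_map_tasks(file_size, default_split)
maps_larger = count_map_tasks(file_size, larger_split)
print(maps_default, maps_larger)
```

With these numbers the default configuration needs 10 maps and the 128MB split size needs 5.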

Q6. What is distributed cache and what are its advantages?

Ans: Distributed Cache, in Hadoop, is a service provided by the MapReduce framework to cache files when needed. Once a file is cached for a specific job, Hadoop makes it available on each data node where map and reduce tasks are executing, both on the file system and in memory. Later, you can easily access and read the cached file and populate any collection (like an array or hashmap) in your code.

Benefits of using distributed cache are:

It distributes simple, read-only text/data files and/or complex types like jars, archives and others. These archives are then un-archived at the slave node.

Distributed cache tracks the modification timestamps of cache files, which signals that the files should not be modified while a job is executing.
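As a rough illustration of the "read the cached file and populate a collection" step, here is a Python sketch that stands in for a task reading a small lookup file. It uses a local temporary file instead of the real Distributed Cache, and the file name, its contents and the helper are all hypothetical.

```python
import os
import tempfile

def load_lookup(path):
    # Read a small, read-only tab-separated file into an in-memory dict,
    # as a task might do once before processing its records.
    lookup = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            key, value = line.split("\t", 1)
            lookup[key] = value
    return lookup

# Hypothetical cached file: department id -> department name.
with tempfile.TemporaryDirectory() as workdir:
    cache_path = os.path.join(workdir, "departments.tsv")
    with open(cache_path, "w", encoding="utf-8") as handle:
        handle.write("10\tSales\n20\tEngineering\n")
    departments = load_lookup(cache_path)

# Each input record is enriched from the in-memory lookup.
records = [("alice", "10"), ("bob", "20"), ("carol", "30")]
enriched = [(name, departments.get(dept, "UNKNOWN")) for name, dept in records]
print(enriched)
```

Because the lookup is loaded once and kept in memory, every record can be enriched without another read of the file.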

Q7. Explain the difference between NameNode, Checkpoint NameNode and BackupNode.

Ans:

NameNode is the core of HDFS that manages the metadata: the information of which file maps to which block locations and which blocks are stored on which DataNode. In simple terms, it is the data about the data being stored. NameNode supports a directory tree-like structure consisting of all the files present in HDFS on a Hadoop cluster. It uses the following files for the namespace:

fsimage file: It keeps track of the latest checkpoint of the namespace.

edits file: It is a log of changes that have been made to the namespace since the checkpoint.

Checkpoint NameNode has the same directory structure as NameNode, and creates checkpoints for the namespace at regular intervals by downloading the fsimage and edits files and merging them in the local directory. The new image after merging is then uploaded to NameNode.

There is a similar node like Checkpoint, commonly known as Secondary Node, but it does not support the 'upload to NameNode' functionality.

Backup Node provides similar functionality as Checkpoint, enforcing synchronization with NameNode. It maintains an up-to-date in-memory copy of the file system namespace and does not need to fetch changes at regular intervals. The Backup Node only needs to save its current in-memory state to an image file to create a new checkpoint.
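The checkpoint merge described above (last fsimage plus the edits log gives the new fsimage) can be sketched in Python as a replay of logged operations. The data layout and operation names here are simplified assumptions for illustration, not the real on-disk formats.

```python
def apply_edits(fsimage, edits):
    # Replay each logged operation on a copy of the last checkpoint.
    namespace = dict(fsimage)
    for op, path, blocks in edits:
        if op == "create":
            namespace[path] = list(blocks)
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace

# Last checkpoint: one file made of two blocks.
fsimage = {"/logs/a.txt": ["blk_1", "blk_2"]}

# Changes recorded since that checkpoint.
edits = [
    ("create", "/logs/b.txt", ["blk_3"]),
    ("delete", "/logs/a.txt", None),
]

new_fsimage = apply_edits(fsimage, edits)
print(new_fsimage)
```

After the merge, the edits log can start empty again, since everything it recorded is now reflected in the new image.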

Q8. What are the most common Input Formats in Hadoop?

Ans: The three most common input formats in Hadoop are:

Text Input Format: the default input format in Hadoop.

Key Value Input Format: used for plain text files where the files are broken into lines.

Sequence File Input Format: used for reading files in sequence.
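To make the first two formats concrete, the Python sketch below imitates how records are presented to the mapper: the text format uses the byte offset of each line as the key and the line as the value, while the key-value format splits each line at the first separator (a tab by default). The function names are illustrative and this is not Hadoop code.

```python
def text_input_format(data):
    # Key: byte offset where the line starts. Value: the line contents.
    records = []
    offset = 0
    for raw in data.splitlines(keepends=True):
        records.append((offset, raw.rstrip(b"\n").decode("utf-8")))
        offset += len(raw)
    return records

def key_value_input_format(data, separator="\t"):
    # Key: text before the first separator. Value: the rest of the line.
    records = []
    for line in data.decode("utf-8").splitlines():
        key, _, value = line.partition(separator)
        records.append((key, value))
    return records

data = b"id1\tfirst\nid2\tsecond\n"
text_records = text_input_format(data)
kv_records = key_value_input_format(data)
print(text_records)
print(kv_records)
```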

Q9. Define DataNode and how does NameNode tackle DataNode failures?

Ans: A DataNode stores data in HDFS; it is a node where the actual data resides in the file system. Each DataNode sends a heartbeat message to signal that it is alive. If the NameNode does not receive a message from a DataNode for 10 minutes, it considers that DataNode dead or out of place, and starts replicating the blocks that were hosted on it so that they are hosted on some other DataNode. A BlockReport contains the list of all blocks on a DataNode. The system then starts to replicate what was stored on the dead DataNode.

The NameNode manages the replication of data blocks from one DataNode to another. In this process, the replication data transfers directly between DataNodes, so the data never passes through the NameNode.
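The failure handling above can be sketched as two steps: find DataNodes whose last heartbeat is older than the timeout, then plan DataNode-to-DataNode copies for the blocks they hosted. This Python sketch is a simplified model; the data structures, node names and the target-selection rule are assumptions.

```python
HEARTBEAT_TIMEOUT = 10 * 60  # seconds, matching the 10 minutes mentioned above

def find_dead_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    # A node is considered dead when no heartbeat arrived within the timeout.
    return sorted(n for n, seen in last_heartbeat.items() if now - seen > timeout)

def plan_replication(block_map, dead, live):
    # For each replica lost on a dead node, copy the block from a surviving
    # holder directly to a live node that does not have it yet.
    plan = []
    for block, holders in sorted(block_map.items()):
        survivors = [n for n in holders if n not in dead]
        if not survivors:
            continue  # no replica left to copy from
        lost = len(holders) - len(survivors)
        candidates = [n for n in live if n not in survivors]
        for target in candidates[:lost]:
            plan.append((block, survivors[0], target))
    return plan

# Timestamps (in seconds) of the last heartbeat seen from each DataNode.
last_heartbeat = {"dn1": 1000, "dn2": 1590, "dn3": 1595, "dn4": 1598}
now = 1601

dead = find_dead_nodes(last_heartbeat, now)
live = sorted(n for n in last_heartbeat if n not in dead)

# Block report summary: which DataNodes hold each block.
block_map = {
    "blk_1": ["dn1", "dn2"],
    "blk_2": ["dn2", "dn3"],
    "blk_3": ["dn1", "dn3"],
}
plan = plan_replication(block_map, dead, live)
print(dead, plan)
```

Each entry in the plan is (block, source DataNode, target DataNode), which reflects that the copy goes between DataNodes and only the decision is made centrally.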

Q10. What are the core methods of a Reducer?

Ans: The three core methods of a Reducer are:

setup(): this method is used for configuring various parameters like input data size and distributed cache. public void setup (context)

reduce(): the heart of the reducer, always called once per key with the associated reduce task. public void reduce(Key, Value, context)

cleanup(): this method is called to clean temporary files, only once at the end of the task. public void cleanup (context)
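The call order of these three methods can be illustrated with a small Python model of a reduce task: setup once, reduce once per key, cleanup once. The driver function and the dictionary used as a context are stand-ins chosen for this sketch, not Hadoop classes.

```python
from itertools import groupby
from operator import itemgetter

class SumReducer:
    def setup(self, context):
        # Called once before any key is processed.
        context["calls"] = ["setup"]
        context["output"] = []

    def reduce(self, key, values, context):
        # Called once per key with all values for that key.
        context["calls"].append(f"reduce:{key}")
        context["output"].append((key, sum(values)))

    def cleanup(self, context):
        # Called once after the last key.
        context["calls"].append("cleanup")

def run_reduce_task(reducer, pairs):
    # Sort by key and group, which stands in for the shuffle and sort phase.
    context = {}
    reducer.setup(context)
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        reducer.reduce(key, [value for _, value in group], context)
    reducer.cleanup(context)
    return context

context = run_reduce_task(SumReducer(), [("b", 1), ("a", 2), ("b", 3)])
print(context["calls"])
print(context["output"])
```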

Q11. What is SequenceFile in Hadoop?

Ans: Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary key/value pairs. The map outputs are stored as SequenceFile internally. It provides Reader, Writer and Sorter classes. The three SequenceFile formats are:

Uncompressed key/value records.

Record compressed key/value records: only 'values' are compressed here.

Block compressed key/value records: both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.

Q12. What is the Job Tracker's role in Hadoop?

Ans: Job Tracker's primary function is resource management (managing the Task Trackers), tracking resource availability, and task life cycle management (tracking task progress and fault tolerance). It is a process that runs on a separate node, often not on a DataNode. Job Tracker communicates with the NameNode to identify data location. It finds the best Task Tracker nodes to execute tasks on given nodes. It monitors individual Task Trackers and submits the overall job back to the client. It tracks the execution of MapReduce workloads local to the slave node.

Q13. What is the use of RecordReader in Hadoop?

Ans: Since Hadoop splits data into various blocks, RecordReader is used to read the split data into a single record. For instance, if our input data is split like:

Row1: Welcome to

Row2: Intellipaat

it will be read as “Welcome to Intellipaat” using RecordReader.

Q14. What is Speculative Execution in Hadoop?

Ans: One limitation of Hadoop is that by distributing the tasks on several nodes, there are chances that a few slow nodes limit the rest of the program. There are various reasons for tasks to be slow, which are sometimes not easy to detect. Instead of identifying and fixing the slow-running tasks, Hadoop tries to detect when a task runs slower than expected and then launches another equivalent task as a backup. This backup mechanism in Hadoop is Speculative Execution.

It creates a duplicate task on another disk. The same input can be processed multiple times in parallel. When most tasks in a job come to completion, the speculative execution mechanism schedules duplicate copies of the remaining (slower) tasks across the nodes that are currently free. When these tasks finish, the JobTracker is notified. If other copies are executing speculatively, Hadoop notifies the TaskTrackers to quit those tasks and reject their output.

Speculative execution is enabled by default in Hadoop. To disable it, set the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false.
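The idea can be sketched in Python in two parts: pick running tasks that lag well behind the rest, then keep the first attempt that finishes and discard the others. The lag threshold and the selection rule below are assumptions made for illustration; Hadoop's actual heuristic differs.

```python
def pick_speculative_candidates(progress, lag=0.2):
    # Hypothetical heuristic: an unfinished task is a straggler when its
    # progress trails the average progress of all tasks by more than `lag`.
    average = sum(progress.values()) / len(progress)
    return sorted(
        task for task, done in progress.items()
        if done < 1.0 and average - done > lag
    )

def resolve_attempts(finish_times):
    # The attempt that finishes first wins; the others are stopped and
    # their output is rejected.
    winner = min(finish_times, key=finish_times.get)
    killed = sorted(a for a in finish_times if a != winner)
    return winner, killed

# Fraction of work completed per task (hypothetical values).
progress = {"task_0": 1.0, "task_1": 1.0, "task_2": 0.95, "task_3": 0.3}
candidates = pick_speculative_candidates(progress)

# The straggler gets a backup attempt; times are seconds until completion.
winner, killed = resolve_attempts(
    {"task_3_attempt_0": 120.0, "task_3_attempt_1": 45.0}
)
print(candidates, winner, killed)
```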

Q15. What happens if you try to run a Hadoop job with an output directory that is already present?

Ans: It will throw an exception saying that the output file directory already exists.

To run the MapReduce job, you need to make sure that the output directory does not already exist in HDFS.

To delete the directory before running the job, you can use the shell:

hadoop fs -rmr /path/to/your/output/

Or via the Java API:

FileSystem.getLocal(conf).delete(outputDir, true);
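The check-then-delete pattern can be illustrated with a Python sketch that works on the local file system rather than HDFS. The function name and the overwrite flag are hypothetical; the point is only the order of the steps.

```python
import os
import shutil
import tempfile

def prepare_output_dir(path, overwrite=False):
    # Refuse to reuse an existing output directory unless the caller
    # explicitly asks for it to be deleted first.
    if os.path.exists(path):
        if not overwrite:
            raise FileExistsError(f"Output directory {path} already exists")
        shutil.rmtree(path)
    return path

workdir = tempfile.mkdtemp()
output_dir = os.path.join(workdir, "output")
os.makedirs(output_dir)  # simulate leftovers from a previous run

try:
    prepare_output_dir(output_dir)
    failed = False
except FileExistsError:
    failed = True

# Deleting first lets the job start with a clean path.
prepare_output_dir(output_dir, overwrite=True)
exists_after = os.path.exists(output_dir)

shutil.rmtree(workdir)
print(failed, exists_after)
```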

Q16. How can you debug Hadoop code?

Ans: First, check the list of MapReduce jobs currently running. Next, check whether there are any orphaned jobs running; if there are, you need to determine the location of the RM (ResourceManager) logs.

Run “ps -ef | grep -i ResourceManager” and look for the log directory in the displayed result. Find the job ID in the displayed list and check whether there is any error message associated with that job.

On the basis of the RM logs, identify the worker node that was involved in the execution of the task.

Now, log in to that node and run “ps -ef | grep -i NodeManager”.

Examine the NodeManager log. The majority of errors come from the user-level logs for each MapReduce job.

Q17. How to configure Replication Factor in HDFS?

Ans: hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml will change the default replication for all files placed in HDFS.

You can also modify the replication factor on a per-file basis using the Hadoop FS shell:

[training@localhost ~]$ hadoop fs -setrep -w 3 /my/file

Conversely, you can also change the replication factor of all the files under a directory:

[training@localhost ~]$ hadoop fs -setrep -R -w 3 /my/dir


Q18. How to compress mapper output but not the reducer output?

Ans: To achieve this compression, you should set:

conf.setBoolean("mapreduce.map.output.compress", true)

conf.setBoolean("mapreduce.output.fileoutputformat.compress", false)

Q19. What is the difference between Map Side Join and Reduce Side Join?

Ans: A map-side join is performed at the map side, when the data reaches the map. You need a strict structure for defining a map-side join. On the other hand, a reduce-side join (repartitioned join) is simpler than a map-side join since the input datasets need not be structured. However, it is less efficient because it has to go through the sort and shuffle phases, which come with network overhead.
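The contrast can be sketched in Python with two small in-memory datasets. In this simplified model the map-side join assumes one dataset is small enough to hold in memory as a lookup, while the reduce-side join groups both inputs by key first, which stands in for the shuffle. The dataset contents and function names are illustrative.

```python
from collections import defaultdict

def map_side_join(large, small):
    # The small dataset is held in memory as a lookup, so each record
    # is joined as it is read and no shuffle is needed.
    lookup = dict(small)
    return [(key, value, lookup[key]) for key, value in large if key in lookup]

def reduce_side_join(left, right):
    # Map phase: tag each record with its source. Grouping by key below
    # stands in for the sort and shuffle phase.
    shuffled = defaultdict(lambda: {"L": [], "R": []})
    for key, value in left:
        shuffled[key]["L"].append(value)
    for key, value in right:
        shuffled[key]["R"].append(value)
    # Reduce phase: combine matching records for each key.
    joined = []
    for key in sorted(shuffled):
        for left_value in shuffled[key]["L"]:
            for right_value in shuffled[key]["R"]:
                joined.append((key, left_value, right_value))
    return joined

orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
customers = [(1, "Ana"), (2, "Raj")]

map_result = map_side_join(orders, customers)
reduce_result = reduce_side_join(orders, customers)
print(map_result)
print(reduce_result)
```

Both approaches produce the same joined rows; the difference is where the work happens and how much data has to move across the network.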

Q20. How can you transfer data from Hive to HDFS?

Ans: By writing the query:

hive> insert overwrite directory '/' select * from emp;

You can write your own query for the data you want to move from Hive to HDFS. The output is stored in part files in the specified HDFS path.

Q21. Which companies use Hadoop, any idea?

Ans: Yahoo! (the biggest contributor to the creation of Hadoop; the Yahoo search engine uses Hadoop), Facebook (developed Hive for analysis), Amazon, Netflix, Adobe, eBay, Spotify, Twitter.

Q22. In Hadoop, what is InputSplit?

Ans: It splits input files into chunks and assigns each split to a mapper for processing.

Q23. Mention Hadoop core components?

Ans: Hadoop core components include:

HDFS

MapReduce

Q24. What is NameNode in Hadoop?

Ans: NameNode in Hadoop is where Hadoop stores all the file location information in HDFS. It is the master node on which the job tracker runs, and it holds the metadata.

Q25. Mention what are the data components used by Hadoop?

Ans: Data components used by Hadoop are:

Pig.

Hive.