Top 100+ Hadoop Mapreduce Interview Questions And Answers

Question 1. What Is Hadoop Mapreduce?

Answer :

MapReduce is a programming model and set of programs used to process or analyze large volumes of data over a Hadoop cluster. It processes huge datasets in parallel across the cluster in a fault-tolerant way within the Hadoop framework.

Question 2. Can You Elaborate On A Mapreduce Job?

Answer :

Based on the configuration, the MapReduce job first splits the input data into independent chunks called blocks. These blocks are processed by the Map() and Reduce() functions. The Map function processes the data first, and its output is then processed by the Reduce function. The framework takes care of sorting the map outputs and scheduling the tasks.

Question 3. Why Compute Nodes And The Storage Nodes Are The Same?

Answer :

Compute nodes process the data and storage nodes store the data. By default, the Hadoop framework tries to minimize network traffic; to achieve that goal, the framework follows the data locality principle. The compute code executes where the data is stored, so the data node and the compute node are the same.

Question 4. What Is The Configuration Object Importance In Mapreduce?

Answer :

It is used to set and get parameter name/value pairs stored in XML files. It is used to initialize values, read them from external files, and set them as parameter values. Parameter values set inside the program always overwrite the values coming from external configuration files; otherwise, parameter values are taken from Hadoop's defaults.

Question 5. Where Is Mapreduce Not Recommended?

Answer :

MapReduce is not recommended for iterative processing, that is, repeating the output in a loop. MapReduce is also not suitable for processing a series of dependent MapReduce jobs: every job persists its data to local disk and then loads it into the next job. That is a costly operation and not recommended.

Question 6. What Is Namenode And Its Responsibilities?

Answer :

NameNode is a logical daemon name for a particular node. It is the heart of the entire Hadoop system. It stores the metadata in the FsImage and receives all block reports in the form of heartbeats.

Question 7. What Is Jobtracker’s Responsibility?

Answer :

Scheduling the job's tasks on the slaves. Slaves execute the tasks as directed by the JobTracker. Monitoring the tasks and, if a task fails, re-executing the failed tasks.

Question 8. What Are The Jobtracker & Tasktracker In Mapreduce?

Answer :

The MapReduce framework consists of a single JobTracker per cluster and one TaskTracker per node. A cluster usually has multiple nodes, so each cluster has a single JobTracker and multiple TaskTrackers. The JobTracker schedules the job and monitors the TaskTrackers. If a TaskTracker fails to execute tasks, the framework tries to re-execute the failed tasks.

A TaskTracker follows the JobTracker's instructions and executes the tasks. As a slave node, it reports the job status to the master JobTracker in the form of heartbeats.

Question 9. What Is Job Scheduling Importance In Hadoop Mapreduce?

Answer :

Scheduling is a systematic procedure of allocating resources in the best possible way among multiple tasks. The Hadoop job tracker runs many processes, and sometimes a particular job needs to finish fast and be given more priority; that is where job schedulers come into the picture. The default scheduler is FIFO. FIFO, Fair Scheduler and Capacity Scheduler are the most popular Hadoop schedulers.

Question 10. When Is A Reducer Used?

Answer :

A reducer is used to combine the output of multiple mappers. The reducer has three primary phases: shuffle, sort and reduce. It is possible to process data without a reducer, but a reducer is used when shuffle and sort are needed.

Question 11. What Is Replication Factor?

Answer :

The replication factor is the number of different nodes within a cluster on which a chunk of data is stored. By default the replication factor is 3, but it is possible to change it. Every file is automatically split into blocks and spread across the cluster.

Question 12. Where Does The Shuffle And Sort Process Happen?

Answer :

After the mapper generates its output, the intermediate data is temporarily kept on the local file system. Usually this temporary location is configured in core-site.xml in the Hadoop configuration directory. The Hadoop framework aggregates and sorts this intermediate data and then hands it to the Reduce function. The framework deletes this temporary data from the local system after the job completes.

Question 13. Is Java Mandatory To Write Mapreduce Jobs?

Answer :

No. Hadoop itself is implemented in Java, but MapReduce programs need not be written in Java. Hadoop supports Python, Ruby, C++ and other programming languages. The Hadoop Streaming API allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. Hadoop Pipes allows programmers to implement MapReduce applications using C++.

Question 14. What Methods Can Control The Map And Reduce Function's Output?

Answer :

setOutputKeyClass() and setOutputValueClass() set the types of the final (reducer) output.

If the map output types are different from the final output types, they can be set using:

setMapOutputKeyClass() and setMapOutputValueClass()
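
A minimal driver sketch (new MapReduce API) illustrating both pairs of methods; the concrete key/value classes here are illustrative assumptions, not part of the original answer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

// Inside a driver's main()/run() method:
Job job = Job.getInstance(new Configuration(), "type-config-example");
// Final (reducer) output types:
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// Map output types, set explicitly only because they differ from the final types:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);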

Question 15. What Is The Main Difference Between Mapper And Reducer?

Answer :

The map method is called separately for each key/value pair to be processed. It processes input key/value pairs and emits intermediate key/value pairs.
The reduce method is called separately for each key/values-list pair. It processes intermediate key/value pairs and emits final key/value pairs.
Both are initialized and called before any other processing happens. Neither method returns a value directly; output is emitted through the output collector/context.
Question 16. Why Compute Nodes And The Storage Nodes Are Same?

Answer :

Compute nodes are logical processing units; storage nodes are physical storage units. Both run on the same node because of the data locality principle. As a result Hadoop minimizes network traffic and can process data faster.

Question 17. What Is The Difference Between Map Side Join And Reduce Side Join? Or When Do We Go For Map Side Join And Reduce Side Join?

Answer :

Joining multiple tables on the mapper side is called a map side join. Note that a map side join requires the datasets to follow a strict format and to be sorted properly. It is suitable when one of the datasets is small; the data must also be partitioned properly.

Joining multiple tables on the reducer side is called a reduce side join. If you have huge tables to join, for example one table with a large number of rows and columns and another with only a few, go for a reduce side join. It is the best way to join multiple large tables.

Question 18. What Happens If The Number Of Reducers Is 0?

Answer :

Setting the number of reducers to zero is also a valid configuration in MapReduce. In that case no reducer executes, so the mapper output is treated as the final output and Hadoop stores this data in a separate output folder.
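
A small sketch of a map-only job configuration (new API); the output path and surrounding driver are assumptions:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Inside the driver:
job.setNumReduceTasks(0);   // map-only job: no shuffle, sort or reduce phase runs
FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/map-only-output"));  // mapper output lands here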

Question 19. When Do We Go For A Combiner? Why Is It Recommended?

Answer :

Mappers and reducers are independent; they do not talk to each other. When the function is commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c), we go for a combiner to optimize the MapReduce process. Many MapReduce jobs are limited by bandwidth, so by default the Hadoop framework tries to minimize network bandwidth usage. To achieve this goal, MapReduce allows a user-defined combiner function to run on the map output. It is a MapReduce optimization technique, but it is optional.

Question 20. What Is The Main Difference Between Mapreduce Combiner And Reducer?

Answer :

Both combiner and reducer are optional, but they are very frequently used in MapReduce. There are three primary differences:

A combiner gets input from only one mapper, while a reducer gets input from multiple mappers.
If aggregation is required, use a reducer; but if the function follows the commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c) laws, a combiner can be used.
The input and output key and value types must be identical for a combiner, whereas a reducer can accept any input type and emit any output format.
Question 21. What Is Combiner?

Answer :

A combiner is a logical aggregation of the key/value pairs produced by the mapper. It greatly reduces duplicated data transfer between nodes and therefore optimizes job performance. The framework decides whether the combiner runs zero, one or multiple times. It is not suitable where the operation is not commutative and associative, for example computing a mean.
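
A minimal sketch of registering a combiner in the driver, assuming a sum-style reducer (here called IntSumReducer, an illustrative name) whose logic is commutative and associative:

// The same class can serve as combiner and reducer only because summing is
// commutative and associative; an averaging reducer could not be reused this way.
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);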

Question 22. What Is Partition?

Answer :

After the combiner produces the intermediate map output, the partitioner controls how the keys are distributed for sort and shuffle. The partitioner divides the intermediate data according to the number of reducers so that all the data in a single partition gets executed by a single reducer; every partition is processed by only one reducer. When you use a reducer, the partitioner is invoked automatically.

Question 23. When Do We Go For Partitioning?

Answer :

By default Hive reads the entire dataset even if the application needs only a slice of the data. This is a bottleneck for MapReduce jobs. So Hive provides a special option called partitions. When you create a table, Hive can partition the table based on the requirement.

Question 24. What Are The Important Steps When You Are Partitioning Table?

Answer :

Don't over-partition the data into too many small partitions; it puts overhead on the NameNode.

If you use dynamic partitioning, at least one static partition should exist when running in strict mode; dynamic partitioning is enabled and the mode relaxed using the following commands:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

First load the data into a non-partitioned table, then load that data into the partitioned table. It is not possible to load data directly from a local file into a partitioned table.
INSERT OVERWRITE TABLE table_name PARTITION(year) SELECT * FROM non_partitioned_table;

Question 25. Can You Elaborate Mapreduce Job Architecture?

Answer :

First the Hadoop programmer submits the MapReduce program to the JobClient.

The JobClient requests a job ID from the JobTracker, and the JobTracker provides the job ID in the form job_<jobtracker start time>_0001. It is a unique ID.

Once the JobClient receives the job ID, it copies the job resources (job.xml, job.jar) to the file system (HDFS) and submits the job to the JobTracker. The JobTracker initializes the job and schedules it.

Based on the configuration, the job divides the input into input splits and submits them to HDFS. The TaskTracker retrieves the job resources from HDFS and launches a child JVM. In this child JVM it runs the map and reduce tasks and notifies the JobTracker of the job status.

Question 26. Why Does Task Tracker Launch A Child Jvm?

Answer :

Quite frequently, a Hadoop developer mistakenly submits a wrong job or one that has bugs. If the TaskTracker used its own existing JVM, a faulty task could interrupt the main JVM, so other tasks would also be affected. With a child JVM, if the task tries to harm existing resources, the TaskTracker kills that child JVM and retries or relaunches a new child JVM.

Question 27. Why Do Jobclient And Job Tracker Submit Job Resources To The File System?

Answer :

Data locality. Moving computation is cheaper than moving data. So the logic/computation (in the jar file) and the splits are copied to where the data is available, that is, to the file system's DataNodes. Every resource is copied to where the data is available.

Question 28. How Many Mappers And Reducers Can Run?

Answer :

By default Hadoop can run 2 mappers and 2 reducers on one DataNode; that is, each node has 2 map slots and 2 reduce slots. It is possible to change these default values in mapred-site.xml in the conf directory.

Question 29. What Is Inputsplit?

Answer :

A chunk of data processed by a single mapper is called an InputSplit. In other words, the logical chunk of data which is processed by a single mapper is called an input split; by default, input split size = block size.

Question 30. How To Configure The Split Value?

Answer :

By default the block size is 64 MB, but to process the data the job tracker splits it into input splits. Hadoop uses this formula to determine the split size:

split size = max(min_split_size, min(block_size, max_split_size))

By default, split size = block size.
The number of splits always equals the number of mappers.

Applying the above formula (assuming min_split_size = 512 KB, max_split_size = 10 GB, block_size = 64 MB):

split size = max(512 KB, min(64 MB, 10 GB)) = max(512 KB, 64 MB) = 64 MB
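
Assuming the new-API FileInputFormat, the minimum and maximum split sizes that feed the formula above can be hinted from the driver; the numbers here are illustrative:

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// split size = max(min_split_size, min(block_size, max_split_size))
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // lower bound: 64 MB
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // upper bound: 128 MB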
Question 31. How Much RAM Is Required To Process 64 MB Of Data?

Answer :

Let's assume a 64 MB block size and that the machine runs 2 mappers and 2 reducers, so 64 * 4 = 256 MB of memory; the OS takes at least 30% more space, so at least 256 + 80 = 336 MB of RAM is required to process one chunk of data. In this way, more memory is required to run the independent processes in parallel.

Question 32. What Is Difference Between Block And Split?

Answer :

Block: the physical chunk of data as it is stored on disk is called a block.
Split: the logical chunk of data processed by a single mapper is called a split.
Question 33. Why Does The Hadoop Framework Read A File In Parallel But Write Sequentially?

Answer :

To retrieve data faster, Hadoop reads data in parallel; the main reason is that it can access the data quicker. However, it writes in sequence, not in parallel; the main reason is that one node's write could otherwise be overwritten by another, and the location of the next chunk would be unknown. Parallel processing is independent, so there is no relation between two nodes; if data were written in parallel, it would not be possible to know where the next chunk of data belongs. For example, writing 100 MB of data in parallel, 64 MB in one block and 36 MB in another, the first block would not know where the remaining data is. So Hadoop reads in parallel and writes sequentially.

Question 34. If I Change Block Size From 64 To 128, Then What Happens?

Answer :

Even if you change the block size, existing data is not affected. After the block size is changed, every new file is chunked into blocks of 128 MB. This means old data stays in 64 MB chunks, but new data is stored in 128 MB blocks.

Question 35. What Is isSplitable()?

Answer :

By default this value is true. It is used to split the data in the input format. For unstructured data it is not recommended to split the data, so the entire file is processed as one split. To do that, change isSplitable() to return false.
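
A minimal sketch of an input format that turns splitting off, so each file goes to exactly one mapper; the class name is an assumption:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // treat every file as a single split, regardless of block size
    }
}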

Question 36. What Are The Maximum And Minimum Block Sizes Hadoop Allows?

Answer :

Minimum: 512 bytes. That is the local OS file system block size; no one can reduce the block size below it.
Maximum: depends on the environment. There is no upper bound.
Question 37. What Are The Job Resource Files?

Answer :

job.xml and job.jar are the core resources needed to process the job. The JobClient copies these resources to HDFS.

Question 38. What Does A Mapreduce Job Consist Of?

Answer :

A MapReduce job is a unit of work that the client wants to be performed. It consists of the input data, the MapReduce program in a jar file, and the configuration settings in XML files. Hadoop runs this job by dividing it into different tasks with the help of the JobTracker.

Question 39. What Is The Data Locality?

Answer :

Wherever the data is, process the data there: computation/processing happens where the data is available, and this principle is called data locality. "Moving computation is cheaper than moving data"; to achieve this goal, Hadoop follows data locality. It is possible when the data is splittable, which is true by default.

Question 40. What Is Speculative Execution?

Answer :

Hadoop runs its processes on commodity hardware, so it is possible for systems with low memory to fail; if a system fails, the process also fails, which is not acceptable. Speculative execution is a process performance optimization technique: the computation/logic is distributed to multiple systems, and the result of whichever system executes fastest is used. By default this value is true. Even if a system crashes, it is not a problem; the framework picks up the logic from the other systems.

E.g.: the logic is distributed to systems A, B, C and D, to complete within a certain time.

Systems A, B, C and D finish in 10 minutes, 8 minutes, 9 minutes and 12 minutes respectively when run concurrently. So the result from system B is taken and the remaining systems' processes are killed; the framework takes care of killing the other processes.

Question 41. When Do We Go For A Reducer?

Answer :

Only when sort and shuffle are required do we go for reducers; otherwise there is no need for partitioning. For a simple filter, no sort and shuffle is needed, so that operation is possible without a reducer.

Question 42. What Is Chain Mapper?

Answer :

The ChainMapper class is a special mapper class that allows a set of mappers to run in a chain fashion within a single map task. One mapper's output acts as the next mapper's input; in this way any number of mappers can be connected in chain fashion.
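
A hedged sketch of chaining two mappers inside one map task with the new-API ChainMapper; TokenizeMapper and UpperCaseMapper are assumed example classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;

// First mapper: raw line -> token; second mapper consumes the first mapper's output.
ChainMapper.addMapper(job, TokenizeMapper.class,
        LongWritable.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainMapper.addMapper(job, UpperCaseMapper.class,
        Text.class, Text.class, Text.class, Text.class, new Configuration(false));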

Question 43. How To Do Value Level Comparison?

Answer :

Hadoop supports key-level comparison only; it does not compare at the value level.

Question 44. What Are The Setup And Cleanup Methods?

Answer :

If you don't know the starting and ending points/lines of a split, it is much harder to solve such problems; setup and cleanup can solve them. For N blocks, by default one mapper is called per split, and each split has one setup and one cleanup method. Setup initializes the task's resources.

The purpose of cleanup is to close the task's resources. Map processes the data, and once the last map call is finished, cleanup is invoked. This improves data transfer performance. Block-level comparison can also be done in the reducer. If you need to compare one key/value to another key/value, or compare at the record level, use setup and cleanup: they open once, the processing runs many times, and they close once, which saves a lot of network overhead during processing.
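
A minimal mapper sketch showing setup() and cleanup() bracketing the per-record map() calls; class, key and counter names are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private int lineCount;

    @Override
    protected void setup(Context context) {
        lineCount = 0;   // runs once per split, before the first map() call
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        lineCount++;     // runs once per record
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // runs once per split, after the last record: one write instead of one per record
        context.write(new Text("lines"), new IntWritable(lineCount));
    }
}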

Question 45. How Many Slots Are Allocated For Each Task?

Answer :

By default every TaskTracker has 2 slots for mappers and 2 slots for reducers, so each node has 4 slots to process the data.

Question 46. Why Does Tasktracker Launch A Child Jvm To Do A Task? Why Not Use The Existing Jvm?

Answer :

Sometimes child threads corrupt parent threads; because of a programmer's mistake, the entire MapReduce task could be disrupted. So the TaskTracker launches a child JVM to process each individual map or reduce task. If the TaskTracker used its existing JVM, the task might damage the main JVM. If any bugs occur, the TaskTracker kills the child process and relaunches another child JVM to do the same task. Usually the TaskTracker relaunches and retries the task four times.

Question 47. What Are The Main Components Of Mapreduce Job?

Answer :

Main Driver Class: provides the job configuration parameters.
Mapper Class: must extend the org.apache.hadoop.mapreduce.Mapper class and performs execution of the map() method.
Reducer Class: must extend the org.apache.hadoop.mapreduce.Reducer class.
Question 48. What Main Configuration Parameters Are Specified In Mapreduce?

Answer :

MapReduce programmers need to specify the following configuration parameters to perform the map and reduce jobs (a minimal driver sketch follows the list):

The input location of the job in HDFS.
The output location of the job in HDFS.
The input and output formats.
The classes containing the map and reduce functions, respectively.
The .jar file containing the mapper, reducer and driver classes.
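
A minimal driver sketch wiring up the parameters above; the paths and the WordCountMapper/WordCountReducer class names are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);                   // jar with mapper/reducer/driver
        FileInputFormat.addInputPath(job, new Path("/input"));      // input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path("/output"));   // output location in HDFS
        job.setInputFormatClass(TextInputFormat.class);             // input format
        job.setOutputFormatClass(TextOutputFormat.class);           // output format
        job.setMapperClass(WordCountMapper.class);                  // class containing map()
        job.setReducerClass(WordCountReducer.class);                // class containing reduce()
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}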
Question 49. What Is Partitioner And Its Usage?

Answer :

The partitioner is yet another important phase that controls the partitioning of the intermediate map output keys using a hash function. The partitioning process determines to which reducer a key/value pair (of the map output) is sent. The number of partitions is equal to the total number of reduce tasks for the job.

HashPartitioner is the default partitioner class available in Hadoop, which implements the following function: int getPartition(K key, V value, int numReduceTasks)

The function returns the partition number; numReduceTasks is the number of configured reducers.
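
A minimal custom partitioner sketch; the hash logic mirrors what the default HashPartitioner does, and the class name is an assumption:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // every pair with the same key lands in the same partition, hence the same reducer
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
// Registered in the driver with: job.setPartitionerClass(WordPartitioner.class);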

Question 50. What Is Identity Mapper?

Answer :

IdentityMapper is the default Mapper class provided by Hadoop. When no other Mapper class is defined, IdentityMapper is executed. It only writes the input data to the output and does not perform any computations or calculations on the input data. The class name is org.apache.hadoop.mapred.lib.IdentityMapper.

Question 51. What Is Recordreader In A Map Reduce?

Answer :

RecordReader is used to read key/value pairs from the InputSplit by converting the byte-oriented view into a record-oriented view for the Mapper.

Question 52. What Is Outputcommitter?

Answer :

OutputCommitter describes the commit of MapReduce task output. FileOutputCommitter is the default class available for OutputCommitter in MapReduce. It performs the following operations:

Creates a temporary output directory for the job during initialization.
Then cleans up the job: it removes the temporary output directory after job completion.
Sets up the task's temporary output.
Identifies whether a task needs a commit; the commit is applied if required.
JobSetup, JobCleanup and TaskCleanup are important tasks during output commit.
Question 53. What Are The Parameters Of Mappers And Reducers?

Answer :

The 4 parameters for mappers are:

LongWritable (input)
Text (input)
Text (intermediate output)
IntWritable (intermediate output)

The 4 parameters for reducers are (see the signature sketch after this list):

Text (intermediate output)
IntWritable (intermediate output)
Text (final output)
IntWritable (final output)
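
Typical word-count style class signatures matching the parameter types listed above; the class names are illustrative and each class would live in its own source file:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: <LongWritable, Text> in  ->  <Text, IntWritable> intermediate out
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

// Reducer: <Text, IntWritable> intermediate in  ->  <Text, IntWritable> final out
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }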
Question 54. What Is A "reducer" In Hadoop?

Answer :

In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.

Question 55. What Is A “map” In Hadoop?

Answer :

In Hadoop, a map is a phase in HDFS query solving. A map reads data from an input location and outputs key/value pairs according to the input type.

Question 56. Explain Jobconf In Mapreduce?

Answer :

It is the primary interface to define a map/reduce job in Hadoop for job execution. JobConf specifies the Mapper, Combiner, Partitioner, Reducer, InputFormat and OutputFormat implementations and other advanced job facets like Comparators.

Question 57. What Is Jobtracker?

Answer :

JobTracker is a Hadoop service used for processing MapReduce jobs in the cluster. It submits and tracks the jobs to specific nodes that have the data. Only one JobTracker runs on a Hadoop cluster, on its own JVM process. If the JobTracker goes down, all the jobs halt.

Question 58. Explain Job Scheduling Through Jobtracker?

Answer :

The JobTracker communicates with the NameNode to discover the data location and submits the work to a TaskTracker node. The TaskTracker plays a major role, as it notifies the JobTracker of any task failure; its heartbeat reassures the JobTracker that it is still alive. The JobTracker is then responsible for the follow-up actions: it can either resubmit the task, mark a particular record as unreliable, or blacklist it.

Question 59. What Is Sequencefileinputformat?

Answer :

A compressed binary input file format used to read sequence files; it extends FileInputFormat. It passes data between the output-input phases of MapReduce jobs (from the output of one MapReduce job to the input of another).

Question 60. How To Set Mappers And Reducers For Hadoop Jobs?

Answer :

Users can configure the JobConf variable to set the number of mappers and reducers:

conf.setNumMapTasks(int num)
conf.setNumReduceTasks(int num)
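
A small sketch using the old-API JobConf, as the answer describes; the driver class name and the numbers are illustrative:

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(WordCountDriver.class);
conf.setNumMapTasks(10);    // only a hint: the real number of map tasks follows the input splits
conf.setNumReduceTasks(4);  // honored exactly: four reduce tasks will run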



