Top 50 Hadoop Mapreduce Interview Questions

Q1. What Is The Main Difference Between Mapper And Reducer?

The map() method is called once for each input key/value pair. It processes input key/value pairs and emits intermediate key/value pairs.

The reduce() method is called once for each key and its list of values. It processes intermediate key/value pairs and emits final key/value pairs.

In both classes, the setup() method is initialized and called before any other method; it takes no parameters and produces no output. A minimal word-count sketch of the two classes follows.
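For illustration only, here is a minimal word-count style sketch of the two classes; the class names are hypothetical and not part of the original question:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: called once per input record; emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate key/value pair
            }
        }
    }
}

// Reducer: called once per key with the list of its values; emits final pairs.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // final key/value pair
    }
}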

Q2. How To Do Value Level Comparison?

Out of the box, Hadoop compares and sorts records only at the key level, not at the value level. If values need to be compared, the usual technique is a secondary sort: the value (or part of it) is promoted into a composite key together with custom comparator and grouping classes.

Q3. What Main Configuration Parameters Are Specified In Mapreduce?

The MapReduce programmer needs to specify the following configuration parameters to run the map and reduce jobs:

The input location of the job in HDFS.

The output location of the job in HDFS.

The input and output formats.

The classes containing the map and reduce functions, respectively.

The .jar file containing the mapper, reducer and driver classes.

A sketch of a driver that sets these parameters is shown below.
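A hedged driver sketch that sets these parameters; the class names reuse the hypothetical word-count classes above, and the input/output paths are taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);          // the .jar with mapper/reducer/driver
        job.setMapperClass(WordCountMapper.class);          // class containing the map function
        job.setReducerClass(WordCountReducer.class);        // class containing the reduce function

        job.setInputFormatClass(TextInputFormat.class);     // input format
        job.setOutputFormatClass(TextOutputFormat.class);   // output format
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}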

Q4. What Are The Main Components Of Mapreduce Job?

Main driver class: provides the job configuration parameters.

Mapper class: must extend the org.apache.hadoop.mapreduce.Mapper class and performs the map() method.

Reducer class: must extend the org.apache.hadoop.mapreduce.Reducer class.

Q5. What Is Combiner?

A combiner performs a local, logical aggregation of the key/value pairs produced by a mapper. It reduces the amount of duplicated data transferred between nodes and therefore improves job performance. The framework decides whether the combiner runs zero, one or multiple times, so it is not suitable for functions such as mean/average. Registering one in the driver is a single call, as sketched below.
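A minimal sketch, assuming the hypothetical word-count reducer above also serves as the combiner (valid because summation is commutative and associative):

// Inside the driver, after setMapperClass()/setReducerClass():
job.setCombinerClass(WordCountReducer.class);   // may run zero, one or many times on the map output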

Q6. Why Are The Compute Nodes And The Storage Nodes The Same?

Compute nodes are the logical processing units; storage nodes are the physical storage units. In Hadoop both run on the same node because of data locality: the computation is shipped to the node that already holds the data, which reduces network traffic and lets the cluster process data quickly.

Q7. What Are The Setup And Cleanup Methods?

If you do not know where a task starts and finishes, such problems are much harder to solve; setup() and cleanup() mark those points. For N blocks, by default one mapper is created per split, and each mapper runs setup() once, map() once per record, and cleanup() once. setup() initializes the task resources.

The purpose of cleanup() is to close the task resources; it is invoked after the last map() call completes. map() itself processes the records. Opening resources once in setup(), using them for many records, and closing them once in cleanup() improves data transfer performance and saves a lot of per-record overhead. The same pattern can be used in the reducer, for example for comparisons that span a whole file or block rather than a single key/value pair. A sketch follows.
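A hedged sketch of a mapper that uses setup() and cleanup(); the per-task record counter is a hypothetical example of task-level state:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SetupCleanupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private long recordCount;   // per-task state initialized in setup()

    @Override
    protected void setup(Context context) {
        // Called once per task, before the first map() call:
        // open connections, load side data, initialize counters.
        recordCount = 0;
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        recordCount++;                          // called once per record
        context.write(value, new IntWritable(1));
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Called once per task, after the last map() call:
        // close resources and emit any task-level summary.
        context.write(new Text("records-in-this-split"),
                      new IntWritable((int) recordCount));
    }
}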

Q8. When Do We Go For A Combiner? Why Is It Recommended?

Mappers and reducers are independent; they do not talk to each other. When the reduce function is commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c), we can use a combiner to optimize the MapReduce job. Many MapReduce jobs are limited by cluster bandwidth, so by default the Hadoop framework tries to minimize the data sent over the network. To help achieve this goal, MapReduce lets the user supply a combiner function that runs on the map output. It is a MapReduce optimization technique, but it is optional.

Q9. What Is Difference Between Block And Split?

Block: the physical chunk in which data is stored on disk; it is the HDFS storage unit.

Split: the logical chunk of data handed to a single mapper for processing; it is the MapReduce processing unit.

Q10. Why Do The Jobclient And Job Tracker Submit Job Resources To The File System?

Data locality. Moving computation is cheaper than moving data. The logic/computation lives in the jar file and the splits, while the data lives on the file system's DataNodes, so the job resources are copied to the places where the data is already available.

Q11. Can You Elaborate Mapreduce Job Architecture?

First, the Hadoop programmer submits the MapReduce job to the JobClient.

The JobClient asks the JobTracker for a job ID; the JobTracker returns one in the form job_<JobTracker start time>_<sequence number>, which is unique.

Once the JobClient receives the job ID, it copies the job resources (job.xml, job.jar) to the file system (HDFS) and submits the job to the JobTracker. The JobTracker initializes the job and schedules it.

Based on the configuration, the job computes the input splits and submits them to HDFS. The TaskTracker retrieves the job resources from HDFS and launches a child JVM. In this child JVM it runs the map and reduce tasks and notifies the JobTracker of the job status.

Q12. Why Are The Compute Nodes And The Storage Nodes The Same?

Compute nodes process the data; storage nodes store the data. By default the Hadoop framework tries to minimize network traffic, and to achieve that goal it follows the data locality principle: the compute code executes where the data is stored, so the data node and the compute node are the same machine.

Q13. What Is Recordreader In A Map Reduce?

RecordReader is used to read key/value pairs from the InputSplit by converting the byte-oriented view of the input into a record-oriented view for the Mapper.

Q14. What Is Sequencefileinputformat?

SequenceFileInputFormat is an input format for reading sequence files, a compressed binary file format; it extends FileInputFormat. Sequence files are typically used to pass data between the output of one MapReduce job and the input of another. Wiring them up in a driver takes two calls, as sketched below.
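A minimal sketch, assuming two Job objects configured elsewhere; wireJobs is a hypothetical helper:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceFileWiring {
    // Job A writes sequence files; Job B reads them back as its input.
    static void wireJobs(Job jobA, Job jobB) {
        jobA.setOutputFormatClass(SequenceFileOutputFormat.class);
        jobB.setInputFormatClass(SequenceFileInputFormat.class);
    }
}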

Q15. What Is Jobtracker?

JobTracker is the Hadoop service responsible for processing MapReduce jobs in the cluster. It submits and tracks tasks on the specific nodes that hold the data. Only one JobTracker runs per Hadoop cluster, in its own JVM process. If the JobTracker goes down, all running jobs halt.

Q16. Explain Jobconf In Mapreduce?

JobConf is the primary interface for describing a map-reduce job to the Hadoop framework for execution. JobConf specifies the Mapper, Combiner, Partitioner, Reducer, InputFormat and OutputFormat implementations, along with other advanced job facets such as Comparators.

Q17. How To Set Mappers And Reducers For Hadoop Jobs?

Users can configure the JobConf object to set the number of mappers and reducers:

conf.setNumMapTasks(int)

conf.setNumReduceTasks(int)

A short configuration sketch follows.
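A minimal sketch using the older org.apache.hadoop.mapred API, which exposes both setters (in the newer API only the reducer count is set directly, via Job.setNumReduceTasks; the mapper count follows the input splits):

import org.apache.hadoop.mapred.JobConf;

public class TaskCountConfig {
    public static JobConf configure() {
        JobConf conf = new JobConf(TaskCountConfig.class);
        conf.setNumMapTasks(10);     // a hint only; the actual count follows the input splits
        conf.setNumReduceTasks(4);   // the exact number of reduce tasks
        return conf;
    }
}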

Q18. What Is Partitioner And Its Usage?

The partitioner is another important phase; it controls the partitioning of the intermediate map output keys, by default using a hash function. The partitioning determines to which reducer a key/value pair of the map output is sent. The number of partitions is equal to the total number of reduce tasks for the job.

HashPartitioner is the default class available in Hadoop; it implements the following method: int getPartition(K key, V value, int numReduceTasks)

The method returns the partition number, where numReduceTasks is the number of configured reducers. A custom partitioner sketch follows.
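For illustration, a hedged sketch of a custom partitioner that routes keys by their first character; the class name and routing rule are hypothetical:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;                       // map-only jobs have a single partition
        }
        char first = key.toString().isEmpty() ? ' ' : key.toString().charAt(0);
        // Non-negative value derived from the first character, spread across the reducers.
        return (Character.toLowerCase(first) & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// Registered in the driver with: job.setPartitionerClass(FirstLetterPartitioner.class);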

Q19. If I Change The Block Size From 64 MB To 128 MB, What Happens?

Changing the block size does not affect existing data. After the change, every new file is chunked into 128 MB blocks; the old data remains in 64 MB chunks, while new data is stored in 128 MB blocks.

Q20. What Are The Job Resource Files?

job.xml and job.jar are the core resources needed to run the job. The JobClient copies these resources to HDFS.

Q21. How Many Slots Allocate For Each Task?

By default each node has 2 slots for mappers and 2 slots for reducers, so each node has 4 slots available to process data.

Q22. When Do We Go For Partitioning?

By default Hive reads the entire dataset even when the application needs only a slice of the data, which is a bottleneck for MapReduce jobs. Hive therefore offers a special option called partitions: when you create a table, Hive partitions the table based on the stated requirement, so queries can read only the relevant partitions.

Q23. What Is Replication Factor?

The replication factor is the number of different nodes in the cluster on which a chunk of data is stored. By default the replication factor is 3, but it can be changed. Every file is automatically split into blocks and spread across the cluster.

Q24. Why Does The Hadoop Framework Read A File In Parallel But Not Write In Parallel?

To retrieve data faster, Hadoop reads data in parallel; that is the main reason it can access data quickly. Writes, however, happen sequentially, not in parallel: parallel writers are independent, with no relation between the two nodes, so one node's data could be overwritten by the other and neither would know where the next chunk of data belongs. For instance, if 100 MB were written in parallel as a 64 MB block on one node and a 36 MB block on another, the first block would not know where the remaining data lives. So Hadoop reads in parallel and writes sequentially.

Q25. Can You Elaborate About Mapreduce Job?

Based on the configuration, the MapReduce job first splits the input data into independent chunks called blocks (input splits). These chunks are processed by the map() and reduce() functions: the map function processes the data first, and its output is then processed by the reduce function. The framework takes care of sorting the map outputs and scheduling the tasks.

Q26. What Is Outputcommitter?

OutputCommitter describes the commit protocol of a MapReduce task. FileOutputCommitter is the default OutputCommitter class available in MapReduce. It performs the following operations:

Creates a temporary output directory for the job during initialization.

Cleans up the job by removing the temporary output directory after job completion.

Sets up the task's temporary output.

Identifies whether a task needs a commit, and applies the commit if required.

JobSetup, JobCleanup and TaskCleanup are the important tasks during output commit.

Q27. What Is Identity Mapper?

IdentityMapper is the default Mapper class provided by Hadoop. When no other Mapper class is defined, IdentityMapper is executed. It simply writes the input records to the output and does not perform any computations or calculations on the input data. The class name is org.apache.hadoop.mapred.lib.IdentityMapper.

Q28. What Is Inputsplit?

An InputSplit is the chunk of data processed by a single mapper. In other words, the logical chunk of data processed by one mapper is called an input split; by default the input split size equals the block size.

Q29. What Is Chain Mapper?

ChainMapper is a special mapper class that lets a set of Mapper classes run in a chain within a single map task. One mapper's output acts as the next mapper's input, so any number of mappers can be connected in chain fashion. A sketch follows.
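A hedged sketch using the newer chain API; the two mapper classes and the chain they form are hypothetical:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;

public class ChainDriver {

    // First link: splits each line into (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                ctx.write(new Text(token), new IntWritable(1));
            }
        }
    }

    // Second link: consumes the first link's output inside the same map task.
    public static class UppercaseMapper
            extends Mapper<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void map(Text key, IntWritable value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(key.toString().toUpperCase()), value);
        }
    }

    public static void configureChain(Job job) throws IOException {
        ChainMapper.addMapper(job, TokenizerMapper.class,
                LongWritable.class, Text.class,   // input types of this link
                Text.class, IntWritable.class,    // output types of this link
                new Configuration(false));
        ChainMapper.addMapper(job, UppercaseMapper.class,
                Text.class, IntWritable.class,
                Text.class, IntWritable.class,
                new Configuration(false));
    }
}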

Q30. What Happens If The Number Of Reducers Is 0?

Setting the number of reducers to 0 is a valid MapReduce configuration. In this scenario no reducer executes, so the mapper output is treated as the final output and Hadoop stores it directly in the output directory. In the driver this is a single call, shown below.
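A minimal sketch; makeMapOnly is a hypothetical helper applied to a Job configured as in the earlier driver sketch:

import org.apache.hadoop.mapreduce.Job;

public class MapOnlyConfig {
    public static void makeMapOnly(Job job) {
        job.setNumReduceTasks(0);   // no shuffle/sort or reduce; map output becomes the job output
    }
}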

Q31. When Is A Reducer Used?

A reducer is used to combine the output of multiple mappers. A reducer has three primary phases: shuffle, sort and reduce. It is possible to process data without a reducer, but one is used whenever shuffle and sort are required.

Q32. What Are The Parameters Of Mappers And Reducers?

The four parameters for mappers (in a typical word-count style job) are:

LongWritable (input)

Text (input)

Text (intermediate output)

IntWritable (intermediate output)

The four parameters for reducers are:

Text (intermediate output)

IntWritable (intermediate output)

Text (final output)

IntWritable (final output)

Q33. What Is The Configuration Object Importance In Mapreduce?

The Configuration object is used to set and get parameter name/value pairs stored in XML files. It is used to initialize values, read them from external files and set them as parameters. Parameter values start from Hadoop's defaults, are overridden by values read from external configuration files, and can be further overridden by values set in the program.

Q34. Why Does The Tasktracker Launch A Child Jvm To Do A Task? Why Not Use The Existing Jvm?

Sometimes child threads corrupt their parent threads, meaning one programmer mistake could disrupt the entire MapReduce service. So the TaskTracker launches a child JVM to process each individual map or reduce task. If the TaskTracker used its existing JVM, a buggy task could harm the main JVM. If a bug occurs, the TaskTracker kills the child process and relaunches another child JVM to do the same task. Usually the TaskTracker relaunches and retries a task up to four times.

Q35. What Is Issplitable()?

By default this method returns true. It tells the input format whether the data may be split. For unsplittable data it is not advisable to split, so the whole file should be processed as a single split; to do that, override isSplitable() to return false. A sketch follows.
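A hedged sketch of an input format that refuses to split its files; the class name is hypothetical:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Process each file as a single split, regardless of its size.
        return false;
    }
}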

Q36. What Is The Data Locality?

Processing the data on the node where the data already resides is called data locality. "Moving computation is cheaper than moving data", and following data locality is how Hadoop achieves that goal. It is possible when the data is splittable, which it is by default.

Q37. Why Does The Task Tracker Launch A Child Jvm?

Quite often, Hadoop developers mistakenly submit wrong jobs or jobs with bugs. If the TaskTracker ran tasks in its existing JVM, a faulty task could interrupt the main JVM and affect other tasks. With a child JVM, if a task tries to damage existing resources, the TaskTracker simply kills that child JVM and retries or relaunches a new child JVM.

Q38. What Are The Maximum And Minimum Block Sizes Hadoop Allows?

Minimum: 512 bytes, the local OS file system block size; a block cannot be smaller than this.

Maximum: depends on the environment; there is no hard upper bound.

Q39. Explain Job Scheduling Through Jobtracker?

The JobTracker communicates with the NameNode to identify the data location and submits the work to TaskTracker nodes. The TaskTracker plays a major role: it notifies the JobTracker of any task failure and acts as the heartbeat reporter, reassuring the JobTracker that it is still alive. The JobTracker is then responsible for the follow-up actions: it can resubmit the task, mark a specific record as unreliable, or blacklist the TaskTracker.

Q40. Where Does The Shuffle And Sort Process Happen?

After the mapper generates its output, the intermediate data is kept temporarily on the local file system; this temporary location is configured in core-site.xml. The Hadoop framework aggregates and sorts this intermediate data, then passes it to the reduce function. The framework deletes the temporary data from the local system after the job completes.

Q41. What Is The Main Difference Between Mapreduce Combiner And Reducer?

Both the Combiner and the Reducer are optional, but both are very commonly used in MapReduce. There are three main differences:

A combiner receives its input from only one mapper, while a reducer receives input from multiple mappers.

If general aggregation is required, use a reducer; if the function follows the commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c) laws, a combiner can be used.

In a combiner the input and output key/value types must be the same, whereas a reducer can take input of one type and emit output of another type.

Q42. Is Java Mandatory To Write Mapreduce Jobs?

No. Hadoop itself is implemented in Java, but MapReduce applications need not be written in Java. Hadoop supports Python, Ruby, C++ and other programming languages. The Hadoop Streaming API allows Map/Reduce jobs to be created and run with any executable or script as the mapper and/or the reducer, and Hadoop Pipes lets programmers implement MapReduce applications using C++.

Q43. What Is Hadoop Mapreduce?

MapReduce is a programming framework used to process or analyze huge amounts of data over a Hadoop cluster. It processes very large datasets in parallel across the cluster in a fault-tolerant manner within the Hadoop framework.

Q44. Where Is Mapreduce Not Recommended?

MapReduce is not recommended for iterative processing, that is, where the output is fed back into the computation in a loop. It is also a poor fit for a long series of chained MapReduce jobs: each job persists its output to local disk and the next job has to load it again, which is an expensive operation and not recommended.

Q45. How Much Ram Required To Process 64mb Data?

As a rough estimate, assume a 64 MB block size and a node running 2 mappers and 2 reducers: 64 * 4 = 256 MB of memory, plus at least roughly 30% extra for the OS, so at least 256 + 80 = 336 MB of RAM is needed to process that chunk of data. Unstructured data generally requires even more memory to process.

Q46. What Is The Difference Between Map-Side Join And Reduce-Side Join? Or When Do We Go For A Map-Side Join And A Reduce-Side Join?

Joining multiple tables on the mapper side is called a map-side join. Note that a map-side join requires the inputs to be in a strict format and properly sorted and partitioned; it is the choice when the datasets are small enough or already organized that way.

Joining the multiple tables on the reducer side is called a reduce-side join. If you have large amounts of data in the tables you plan to join, for example one table with a huge number of rows and columns and another with only a few, a reduce-side join is the simplest and most general way to join the tables.

Q47. When Do We Go For A Reducer?

Only when sort and shuffle are required do we go for reducers; otherwise no partitioning is needed. For a pure filter, no sort and shuffle is needed, so the operation can be done without a reducer.

Q48. What Is Job Scheduling Importance In Hadoop Mapreduce?

Scheduling is a systematic way of allocating resources in the best possible manner among multiple tasks. The Hadoop JobTracker runs many jobs; sometimes a particular job must finish quickly and be given higher priority, and that is where job schedulers come into the picture. The default scheduler is FIFO. FIFO, Fair Scheduler and Capacity Scheduler are the most popular schedulers in Hadoop.

Q49. What Are The Jobtracker & Tasktracker In Mapreduce?

The MapReduce framework consists of a single JobTracker per cluster and one TaskTracker per node. A cluster usually has multiple nodes, so each cluster has a single JobTracker and multiple TaskTrackers. The JobTracker schedules jobs and monitors the TaskTrackers; if a TaskTracker fails to execute its tasks, the JobTracker tries to re-execute the failed tasks.

The TaskTracker follows the JobTracker's instructions and executes the tasks. As a slave node, it reports task status to the master JobTracker in the form of heartbeats.

Q50. How Many Mappers And Reducers Can Run?

By default Hadoop runs 2 mappers and 2 reducers per DataNode; in other words, each node has 2 map slots and 2 reducer slots. These defaults can be changed in mapred-site.xml in the conf directory.



