

Top 50 Hadoop Interview Questions - Jul 25, 2022



Q1. Which Are The Two Types Of 'writes' In Hdfs?

There are two types of writes in HDFS: posted and non-posted writes. A posted write is when we write it and forget about it, without worrying about the acknowledgement. It is similar to our conventional Indian post. In a non-posted write, we wait for the acknowledgement. It is similar to today's courier services. Naturally, a non-posted write is more expensive than a posted write, though both writes are asynchronous.

Q2. How Does The Mapper's Run() Method Work?

The Mapper.run() method calls map(KeyInType, ValInType, Context) for every key/value pair in the InputSplit for that task.
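For reference, a minimal sketch of that loop as it appears in the new-API org.apache.hadoop.mapreduce.Mapper class (simplified; exact details vary between Hadoop versions):

public void run(Context context) throws IOException, InterruptedException {
  setup(context);                                         // called once before any input
  while (context.nextKeyValue()) {                        // one iteration per key/value pair in the InputSplit
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);                                       // called once after the last record
}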

Q3. How Many Daemon Processes Run On A Hadoop System?

Hadoop comprises five separate daemons, and each of these daemons runs in its own JVM. The following three daemons run on master nodes:

NameNode : This daemon stores and maintains the metadata for HDFS.

Secondary NameNode : Performs housekeeping functions for the NameNode.

JobTracker : Manages MapReduce jobs, distributes individual tasks to machines running the TaskTracker.

The following two daemons run on each slave node:

DataNode : Stores actual HDFS data blocks.

TaskTracker : Responsible for instantiating and monitoring individual Map and Reduce tasks.

Q4. How The Client Communicates With Hdfs?

The client communicates with HDFS using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can communicate directly with a DataNode once the NameNode has provided the location of the data.

Q5. Does Mapreduce Programming Model Provide A Way For Reducers To Communicate With Each Other? In A Mapreduce Job Can A Reducer Communicate With Another Reducer?

No, the MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.

Q6. Can You Give A Detailed Overview About The Big Data Being Generated By Facebook?

As of December 31, 2012, there were 1.06 billion monthly active users on Facebook and 680 million mobile users. On average, 3.2 billion likes and comments are posted every day on Facebook. 72% of the web audience is on Facebook. And why not! There are so many activities happening on Facebook, from wall posts, sharing photos and videos, to writing comments and liking posts. In fact, Facebook started using Hadoop in mid-2009 and was one of the initial users of Hadoop.

Q7. Since The Data Is Replicated Thrice In Hdfs, Does It Mean That Any Calculation Done On One Node Will Also Be Replicated On The Other Two?

Since the data is replicated across three nodes, when we send the MapReduce programs, calculations will be performed only on the original data. The master node knows which node exactly holds that particular data. If one of the nodes is not responding, it is assumed to have failed. Only then is the required calculation performed on the second replica.

Q8. Who Is A 'user' In Hdfs?

A user is like you or me, someone who has a query or who needs some sort of information.

Q9. Replication Causes Data Redundancy, Then Why Is It Pursued In Hdfs?

HDFS works with commodity hardware (systems with average configurations) that has a high chance of crashing at any time. Thus, to make the whole system highly fault-tolerant, HDFS replicates and stores data in different places. Any data on HDFS gets stored in at least three different locations. So, even if one of them is corrupted and another is unavailable for a while for any reason, the data can still be accessed from the third one. Hence, there is no chance of losing the data. This replication factor helps us achieve the feature of Hadoop called fault tolerance.

Q10. What Is 'key Value Pair' In Hdfs?

A key-value pair is the intermediate data generated by the maps and sent to the reduces for generating the final output.

Q11. How Can I Install Cloudera Vm In My System?

When you enrol for the Hadoop course at Edureka, you can download the Hadoop Installation Steps.pdf file from our Dropbox.

Q12. Explain The Core Methods Of The Reducer?

The API of Reducer is very similar to that of Mapper: there is a run() method that receives a Context containing the job's configuration, as well as interfacing methods that return data from the reducer itself back to the framework. The run() method calls setup() once, reduce() once for every key associated with the reduce task, and cleanup() once at the end. Each of these methods can access the job's configuration data by using Context.getConfiguration().

As in Mapper, any or all of these methods can be overridden with custom implementations. If none of these methods are overridden, the default reducer operation is the identity function; values are passed through without further processing.

The heart of Reducer is its reduce() method. This is called once per key; the second argument is an Iterable which returns all the values associated with that key.
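A rough sketch of how run() drives these calls (simplified from the org.apache.hadoop.mapreduce.Reducer class; real versions add more bookkeeping):

public void run(Context context) throws IOException, InterruptedException {
  setup(context);                                         // once, before any keys
  while (context.nextKey()) {                             // once per distinct key in this reduce task
    reduce(context.getCurrentKey(), context.getValues(), context);
  }
  cleanup(context);                                       // once, at the end
}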

Q13. What Are Combiners? When Should I Use A Combiner In My Mapreduce Job?

Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed: Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs must not depend on the combiner's execution.
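As a sketch, a combiner is wired into the Job just like a reducer class; here a word-count style IntSumReducer (an illustrative class name; imports from org.apache.hadoop.* omitted) is reused as the combiner because summing is commutative and associative:

Job job = Job.getInstance(new Configuration(), "word count");
job.setMapperClass(TokenizerMapper.class);     // illustrative mapper class
job.setCombinerClass(IntSumReducer.class);     // combiner: may run zero, one or more times per mapper
job.setReducerClass(IntSumReducer.class);      // the same class reused as the reducer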

Q14. What If Rack 2 And Datanode Fails?

If both rack 2 and the DataNode present in rack 1 fail, then there is no chance of getting data from them. In order to avoid such situations, we need to replicate that data a greater number of times instead of replicating it only three times. This can be done by changing the value of the replication factor, which is set to three by default.
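A sketch of raising the replication factor from Java (dfs.replication is the standard HDFS property; the path and factor shown are illustrative, imports from org.apache.hadoop.* omitted):

Configuration conf = new Configuration();
conf.setInt("dfs.replication", 4);                              // default replication for files created with this conf
FileSystem fs = FileSystem.get(conf);
fs.setReplication(new Path("/data/critical.log"), (short) 4);   // raise replication of an existing file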

Q15. Are Job Tracker And Task Trackers Present In Separate Machines?

Yes, the JobTracker and TaskTrackers are present on different machines. The reason is that the JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.

Q16. What Is Structured And Unstructured Data?

Structured data is data that is easily identifiable because it is organized in a structure. The most common form of structured data is a database, where specific data is stored in tables, that is, rows and columns. Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text. It is not in the form of rows and columns.

Q17. If Datanodes Increase, Then Do We Need To Upgrade Namenode?

While installing the Hadoop system, the NameNode is sized based on the size of the cluster. Most of the time we do not need to upgrade the NameNode, because it does not store the actual data, but just the metadata, so such a requirement rarely arises.

Q18. How Many Maximum Jvm Can Run On A Slave Node?

One or multiple Task Instances can run on each slave node. Each task instance is run as a separate JVM process. The number of Task Instances can be controlled through configuration. Typically, a high-end machine is configured to run more task instances.

Q19. Are Namenode And Job Tracker On The Same Host?

No, in a practical environment, the NameNode is on a separate host and the JobTracker is on a separate host.

Q20. How Can You Add Arbitrary Key-value Pairs In Your Mapper?

You can set arbitrary (key, value) pairs of configuration data in your Job, e.g. with job.getConfiguration().set("myKey", "myVal"), and then retrieve this data in your mapper with context.getConfiguration().get("myKey"). This kind of functionality is typically done in the Mapper's setup() method.
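A small sketch of the pattern (myKey/myVal come from the answer above; SomeMapper and the type parameters are illustrative, imports from org.apache.hadoop.* omitted):

// Driver side: stash the value in the job configuration
Job job = Job.getInstance(new Configuration(), "example");
job.getConfiguration().set("myKey", "myVal");

// Mapper side: read it back in setup()
public class SomeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private String myVal;

  @Override
  protected void setup(Context context) {
    myVal = context.getConfiguration().get("myKey");      // available to every map() call afterwards
  }
}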

Q21. If A Data Node Is Full How It's Identified?

When data is stored in a DataNode, the metadata of that data is stored in the NameNode. So the NameNode will identify whether the DataNode is full.

Q22. What Is The Jobtracker And What Does It Perform In A Hadoop Cluster?

The JobTracker is a daemon service that submits and tracks MapReduce tasks on the Hadoop cluster. It runs in its own JVM process, and normally on a separate machine; every slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.

The JobTracker in Hadoop performs the following actions:

Client applications submit jobs to the JobTracker.

The JobTracker talks to the NameNode to determine the location of the data.

The JobTracker locates TaskTracker nodes with available slots at or near the data.

The JobTracker submits the work to the chosen TaskTracker nodes.

The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.

A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the task elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.

When the work is completed, the JobTracker updates its status.

Client applications can poll the JobTracker for information.

Q23. Explain The Shuffle?

Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

Q24. Why 'reading' Is Done In Parallel And 'writing' Is Not In Hdfs?

Reading is done in parallel because by doing so we can access the data quickly. But we do not perform the write operation in parallel. The reason is that if we perform the write operation in parallel, it might result in data inconsistency. For example, if you have a file and two nodes are trying to write data into the file in parallel, the first node does not know what the second node has written and vice versa. This makes it ambiguous which data should be stored and accessed.

Q25. What Are Some Of The Characteristics Of Hadoop Framework?

The Hadoop framework is written in Java. It is designed to solve problems that involve analyzing large data (e.g. petabytes). The programming model is based on Google's MapReduce. The infrastructure is based on Google's Big Data and Distributed File System. Hadoop handles large file/data throughput and supports data-intensive distributed applications. Hadoop is scalable, as more nodes can easily be added to it.

Q26. What Is A Secondary Namenode? Is It A Substitute To The Namenode?

The Secondary NameNode constantly reads the data from the RAM of the NameNode and writes it to the hard disk or the file system. It is not a substitute for the NameNode, so if the NameNode fails, the entire Hadoop system goes down.

Q27. What Is Configuration Of A Typical Slave Node On Hadoop Cluster? How Many Jvms Run On A Slave Node?

A single instance of a TaskTracker is run on each slave node. The TaskTracker is run as a separate JVM process.

A single instance of a DataNode daemon is run on each slave node. The DataNode daemon is run as a separate JVM process.

One or multiple Task Instances are run on each slave node. Each task instance is run as a separate JVM process. The number of Task Instances can be controlled by configuration. Typically, a high-end machine is configured to run more task instances.

Q28. How Does A Namenode Handle The Failure Of The Data Nodes?

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.

In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. The NameNode and DataNode are pieces of software designed to run on commodity machines.

The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. Since its blocks may now be under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes and the data never passes through the NameNode.

Q29. What Is Hadoop Framework?

Hadoop is an open-source framework written in Java by the Apache Software Foundation. This framework is used to write software applications that need to process vast amounts of data (it can handle multiple terabytes of data). It works in parallel on large clusters, which can have thousands of computers (nodes). It also processes data in a very reliable and fault-tolerant manner.

Q30. Is A Job Split Into Maps?

No, a job is not split into maps. A split is created for the file. The file is placed on DataNodes in blocks. For each split, a map is needed.

Q31. When Are The Reducers Started In A Mapreduce Job?

In a MapReduce job, reducers do not start executing the reduce method until all map jobs have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The programmer-defined reduce method is called only after all the mappers have finished.

If reducers do not start before all mappers finish, then why does the progress of a MapReduce job show something like Map(50%) Reduce(10%)? Why is the reducer progress percentage displayed when the mapper is not finished yet?

Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account the data transfer that is done by the reduce process, so the reduce progress starts showing up as soon as any intermediate key-value pair for a mapper is available to be transferred to a reducer.

Although the reducer progress is updated, the programmer-defined reduce method is called only after all the mappers have completed.
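As a side note, the point at which reducers begin copying can be tuned through the job configuration; a sketch, assuming the commonly used slow-start property (the property name differs on very old Hadoop releases, and imports are omitted):

Configuration conf = new Configuration();
// Reducers start fetching map output only after 80% of the maps have finished (the default is much lower)
conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);
Job job = Job.getInstance(conf, "tuned shuffle start");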

Q32. What Is A Metadata?

Metadata is the information about the data stored in DataNodes, such as the location of the file, the size of the file and so on.

Q33. What Are The Two Main Parts Of The Hadoop Framework?

Hadoop consists of two main parts:

Hadoop Distributed File System, a distributed file system with high throughput, and

Hadoop MapReduce, a software framework for processing large data sets.

Q34. What Is The Reducer Used For?

The Reducer reduces a set of intermediate values which share a key to a (usually smaller) set of values.

The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).
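For example (a sketch; the count of 10 is arbitrary and imports are omitted):

Job job = Job.getInstance(new Configuration(), "reduce count example");
job.setNumReduceTasks(10);    // run 10 reduce tasks for this job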

Q35. According To Ibm, What Are The Three Characteristics Of Big Data?

According to IBM, the three characteristics of Big Data are:

Volume: Facebook generating 500+ terabytes of data per day.

Velocity: Analyzing 2 million records every day to identify the reason for losses.

Variety: images, audio, video, sensor data, log files, and so on.

Q36. What Is Streaming Access?

As HDFS works on the principle of 'Write Once, Read Many', the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.

Q37. How Mapper Is Instantiated In A Running Job?

The Mapper itself is instantiated in the running job, and will be passed a MapContext object which it can use to configure itself.

Q38. What Is The Difference Between Mapreduce Engine And Hdfs Cluster?

HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. The MapReduce Engine is the programming module which is used to retrieve and analyze data.

Q39. What Is The Communication Channel Between Client And Namenode/datanode?

The mode of communication is SSH.

Q40. Where Do You Specify The Mapper Implementation?

Generally, the mapper implementation is specified in the Job itself.

Q41. Is Client The End User In Hdfs?

No, the Client is an application which runs on your machine, and which is used to interact with the NameNode (JobTracker) or DataNode (TaskTracker).

Q42. Explain The Wordcount Implementation Via Hadoop Framework ?

We will count the words in all the input files as shown below.

Input: Assume there are two files, each containing one sentence: Hello World Hello World (in file 1) and Hello World Hello World (in file 2).

Mapper : There will be one mapper for each file. For the given sample input, the first map output is:

< Hello, 1>

< World, 1>

< Hello, 1>

< World, 1>

The 2nd map output:

< Hello, 1>

< World, 1>

< Hello, 1>

< World, 1>

Combiner/Sorting (this is done for each individual map), so the output looks like this. The output of the first map:

< Hello, 2>

< World, 2>

The output of the second map:

< Hello, 2>

< World, 2>

Reducer : It sums up the above output and generates the output as below:

< Hello, 4>

< World, 4>

Output

The final output would look like:

Hello 4 times

World 4 times
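A minimal sketch of the corresponding mapper and reducer using the new MapReduce API (class and field names are illustrative, closely following the standard WordCount example):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);                  // emits < Hello, 1>, < World, 1>, ...
    }
  }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();                            // adds up the 1s (or combined partial counts)
    }
    context.write(key, new IntWritable(sum));    // emits < Hello, 4>, < World, 4>
  }
}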

Q43. What Is The Difference Between Hdfs And Nas ?

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.

Following are the differences between HDFS and NAS:

In HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS data is stored on dedicated hardware.

HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not suitable for MapReduce since data is stored separately from the computation.

HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is provided by a single machine and therefore does not provide data redundancy.

Q44. What Is A Heartbeat In Hdfs?

A heartbeat is a signal indicating that a node is alive. A DataNode sends a heartbeat to the NameNode and a TaskTracker sends its heartbeat to the JobTracker. If the NameNode or JobTracker does not receive a heartbeat, they will decide that there is some problem with the DataNode or that the TaskTracker is unable to perform the assigned task.

Q45. Can Reducers Talk With Each Other?

No, reducers run in isolation.

Q46. When Are The Reducers Started In A Mapreduce Job?

In a MapReduce job, reducers do not start executing the reduce method until all map jobs have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The programmer-defined reduce method is called only after all the mappers have finished.

Q47. Which Interface Needs To Be Implemented To Create A Mapper And Reducer For Hadoop?

org.apache.hadoop.mapreduce.Mapper

org.apache.hadoop.mapreduce.Reducer
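In the new (org.apache.hadoop.mapreduce) API these are classes to extend; a bare sketch, with illustrative class names and type parameters (each class in its own source file, imports omitted):

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  // override map(LongWritable key, Text value, Context context) here
}

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  // override reduce(Text key, Iterable<IntWritable> values, Context context) here
}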

Q48. How The Hdfs Blocks Are Replicated?

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.

The NameNode makes all decisions regarding the replication of blocks. HDFS uses a rack-aware replica placement policy. In the default configuration there are a total of three copies of a data block on HDFS; two copies are stored on DataNodes on the same rack and the third replica on a different rack.

Q49. Can I Set The Number Of Reducers To Zero?

Yes, setting the number of reducers to zero is a valid configuration in Hadoop. When you set the reducers to zero, no reducers will be executed, and the output of each mapper will be stored in a separate file on HDFS. [This is different from the condition when reducers are set to a number greater than zero, where the mappers' output (intermediate data) is written to the local file system (NOT HDFS) of each mapper slave node.]
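A sketch of a map-only job configuration (imports omitted):

Job job = Job.getInstance(new Configuration(), "map-only job");
job.setNumReduceTasks(0);    // no reduce phase: each mapper's output goes straight to HDFS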

Q50. What Is A Datanode? How Many Instances Of Datanode Run On A Hadoop Cluster?

A DataNode stores data in the Hadoop File System (HDFS). There is only one DataNode process run on any Hadoop slave node. A DataNode runs in its own JVM process. On startup, a DataNode connects to the NameNode. DataNode instances can talk to each other, mostly during data replication.



