Hadoop Administration Interview Questions and Answers
Q1. How will you decide whether you want to use the Capacity Scheduler or the Fair Scheduler?
Ans: Fair Scheduling is the method in which resources are assigned to jobs such that all jobs get, on average, an equal share of resources over time. The Fair Scheduler can be used under the following circumstances:
If you want jobs to make equal progress rather than follow FIFO order, you should use Fair Scheduling.
If you have slow connectivity and data locality plays a vital role, making a big difference to the job runtime, you should use Fair Scheduling.
Use Fair Scheduling if there is a lot of variability in the utilization between pools.
The Capacity Scheduler allows the Hadoop MapReduce cluster to run as a shared, multi-tenant cluster to maximize cluster utilization and throughput. The Capacity Scheduler can be used under the following situations:
If the jobs require scheduler determinism, the Capacity Scheduler can be beneficial.
The Capacity Scheduler's memory-based scheduling approach is useful if the jobs have varying memory requirements.
If you want to enforce resource allocation because you know the cluster usage and workload thoroughly, use the Capacity Scheduler.
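As a rough sketch of how the choice is applied in practice (the property and class names below are from the Hadoop 1.x / JobTracker line and may differ in your version), the scheduler is selected in mapred-site.xml:

```xml
<!-- mapred-site.xml (Hadoop 1.x; class names are version-dependent) -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <!-- or org.apache.hadoop.mapred.CapacityTaskScheduler for the Capacity Scheduler -->
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```

The scheduler-specific allocation details (pools for Fair, queues for Capacity) are then configured in their own files.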
Q2. What are the daemons required to run a Hadoop cluster?
Ans: NameNode, DataNode, Secondary NameNode, TaskTracker and JobTracker.
Q3. How will you restart a NameNode?
Ans: The easiest way of doing this is to run the stop-all.sh shell script to stop the running daemons. Once this is done, restart the NameNode by running start-all.sh.
Q4. Explain the different schedulers available in Hadoop.
FIFO Scheduler: This scheduler does not consider the heterogeneity of the system but orders the jobs based on their arrival times in a queue.
COSHH: This scheduler considers the workload, cluster and user heterogeneity when making scheduling decisions.
Fair Sharing: This Hadoop scheduler defines a pool for each user. The pool contains a number of map and reduce slots on a resource, and each user can use their own pool to execute jobs.
Q5. List a few Hadoop shell commands that are used to carry out a copy operation.
Ans: hadoop fs -copyFromLocal, hadoop fs -copyToLocal, hadoop fs -put, hadoop fs -get and hadoop fs -cp.
Q6. What is jps command used for?
Ans: The jps command is used to verify whether the daemons that run the Hadoop cluster are running or not. The output of the jps command shows the status of the NameNode, Secondary NameNode, DataNode, TaskTracker and JobTracker.
Q7. What are the critical hardware considerations when deploying Hadoop in a production environment?
Memory - The system's memory requirements will vary between the worker services and management services based on the application.
Operating System - A 64-bit operating system avoids restrictions on the amount of memory that can be used on worker nodes.
Storage - It is best to design a Hadoop platform by moving the compute activity to the data in order to achieve scalability and high performance.
Capacity - Large Form Factor (3.5") disks cost less and allow more storage compared to Small Form Factor disks.
Network - Two TOR switches per rack provide better redundancy.
Computational Capacity - This can be determined by the total number of MapReduce slots available across all the nodes in a Hadoop cluster.
Q8. How many NameNodes can you run on a single Hadoop cluster?
Ans: Only one.
Q9. What happens when the NameNode on the Hadoop cluster goes down?
Ans: The file system goes offline whenever the NameNode is down.
Q10. What is the conf/hadoop-env.sh file, and which variable in the file must be set for Hadoop to work?
Ans: This file provides the environment for Hadoop to run and includes variables such as HADOOP_CLASSPATH, JAVA_HOME and HADOOP_LOG_DIR. The JAVA_HOME variable must be set for Hadoop to run.
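A minimal sketch of such a file (the paths shown are illustrative placeholders for your environment, not defaults):

```shell
# conf/hadoop-env.sh -- paths below are illustrative placeholders
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64      # required for Hadoop to run
export HADOOP_LOG_DIR=/var/log/hadoop                   # optional log-location override
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/opt/hadoop/extra-jars/*"
```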
Q11. Apart from using the jps command, is there any other way to check whether the NameNode is working or not?
Ans: Use the command /etc/init.d/hadoop-0.20-namenode status.
Q12. In a MapReduce system, if the HDFS block size is 64 MB and there are three files of size 127 MB, 64 KB and 65 MB with FileInputFormat, how many input splits are likely to be made by the Hadoop framework?
Ans: 2 splits each for the 127 MB and 65 MB files, and 1 split for the 64 KB file, i.e. 5 splits in total.
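The arithmetic behind this answer can be sketched as follows (the helper name is made up, and the model is deliberately simplified to one split per full or partial block; the real FileInputFormat logic has a few extra edge cases):

```python
import math

# Simplified model of classic FileInputFormat splitting:
# one input split per full or partial HDFS block.
def num_splits(file_size_bytes, block_size=64 * 1024 * 1024):
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / block_size)

MB, KB = 1024 * 1024, 1024
for name, size in [("127 MB", 127 * MB), ("64 KB", 64 * KB), ("65 MB", 65 * MB)]:
    print(name, "->", num_splits(size), "split(s)")
```

Running this prints 2, 1 and 2 splits respectively, matching the answer above.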
Q13. Which command is used to verify whether HDFS is corrupt or not?
Ans: The hadoop fsck (File System Check) command, e.g. hadoop fsck /, is used to check for missing or corrupt blocks.
Q14. List a few use cases of the Hadoop Ecosystem.
Ans: Text Mining, Graph Analysis, Semantic Analysis, Sentiment Analysis, Recommendation Systems.
Q15. How can you kill a Hadoop job?
Ans: hadoop job -kill <jobID>.
Q16. I want to see all the jobs running in a Hadoop cluster. How can you do that?
Ans: Using the command hadoop job -list, which gives the list of jobs running in a Hadoop cluster.
Q17. Is it possible to copy files across multiple clusters? If yes, how can you accomplish this?
Ans: Yes, it is possible to copy files across multiple Hadoop clusters, and this is done using distributed copy. The distcp command is used for intra- or inter-cluster copying.
Q18. Which is the best operating system to run Hadoop?
Ans: Ubuntu, or Linux in general, is the most preferred operating system to run Hadoop. Windows can also be used to run Hadoop, but it will cause several problems and is not recommended.
Q19. What are the network requirements to run Hadoop?
SSH is required to run - it is used to launch server processes on the slave nodes.
A passwordless SSH connection is needed between the master, the secondary machines and all the slaves.
Q20. The mapred.output.compress property is set to true to ensure that all output files are compressed for efficient space usage on the Hadoop cluster. If, under a specific circumstance, a cluster user does not require compressed data for a job, what would you suggest he do?
Ans: If the user does not want to compress the data for a particular job, he should create his own configuration file and set the mapred.output.compress property to false. This configuration file must then be loaded as a resource into the job.
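One way to express that per-job override (the file name here is illustrative; only the property is taken from the question) is:

```xml
<!-- job-conf.xml: loaded as a resource into the job to override the cluster default -->
<property>
  <name>mapred.output.compress</name>
  <value>false</value>
</property>
```

Alternatively, the same override can typically be passed on the command line as -Dmapred.output.compress=false when the job uses the standard options parser.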
Q21. What is the best practice to deploy a secondary NameNode?
Ans: It is always better to deploy the secondary NameNode on a separate standalone machine. When the secondary NameNode is deployed on a separate machine, it does not interfere with the operations of the primary node.
Q22. How often should the NameNode be reformatted?
Ans: The NameNode should never be reformatted. Doing so will result in complete data loss. The NameNode is formatted only once at the beginning, when it creates the directory structure for the file system metadata and the namespace ID for the entire file system.
Q23. If Hadoop spawns 100 tasks for a job and one of the tasks fails, what does Hadoop do?
Ans: The task will be started again on a new TaskTracker, and if it fails more than four times, which is the default setting (the default value can be changed), the job will be killed.
Q24. How can you add and remove nodes from the Hadoop cluster?
To add new nodes to the HDFS cluster, the hostnames should be added to the slaves file, and then the DataNode and TaskTracker should be started on the new node.
To remove or decommission nodes from the HDFS cluster, the hostnames should be removed from the slaves file and -refreshNodes should be executed.
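For the decommissioning half, a commonly used (version-dependent) setup is an exclude file referenced from hdfs-site.xml; the file path below is an illustrative placeholder:

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/excludes</value>
</property>
```

After adding the hostnames to be decommissioned to the exclude file, running hadoop dfsadmin -refreshNodes tells the NameNode to begin moving their replicas elsewhere.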
Q25. You increase the replication level but observe that the data is under-replicated. What could have gone wrong?
Ans: Nothing has necessarily gone wrong. If there is a large volume of data, replication takes time proportional to the data size, because the cluster has to copy the data around, and it could take a few hours.
Q26. Explain the different configuration files and where they are located.
Ans: The configuration files are located in the "conf" subdirectory. Hadoop has three main configuration files: hdfs-site.xml, core-site.xml and mapred-site.xml.
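As a small illustration (the hostname and port are placeholders), core-site.xml typically carries the filesystem entry point:

```xml
<!-- core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:8020</value>
</property>
```

hdfs-site.xml then holds HDFS-specific settings (block size, replication), and mapred-site.xml holds MapReduce settings (JobTracker address, slot counts).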
Q27. Which operating system(s) are supported for production Hadoop deployment?
Ans: Linux is the main supported operating system for production deployment.
Q28. What is the role of the NameNode?
Ans: The NameNode is the "brain" of the Hadoop cluster, responsible for managing the distribution of blocks on the system based on the replication policy. The NameNode also supplies the specific block addresses for the data based on client requests.
Q29. What happens on the NameNode when a client tries to read a data file?
Ans: The NameNode looks up the information about the file in the edit log and then retrieves the remaining information from the filesystem memory snapshot.
Since the NameNode needs to support a large number of clients, the primary NameNode only sends back the information about the data location. The DataNode itself is responsible for the retrieval.
Q30. What are the hardware requirements for a Hadoop cluster (primary and secondary NameNodes and DataNodes)?
Ans: There are no strict requirements for DataNodes. However, the NameNodes require a specific amount of RAM to store the filesystem image in memory. Based on the design of the primary and secondary NameNode, the entire filesystem information is stored in memory; therefore, both NameNodes need enough memory to contain the entire filesystem image.
Q31. In which modes can Hadoop be deployed?
Ans: Hadoop can be deployed in stand-alone mode, pseudo-distributed mode or fully-distributed mode.
Hadoop was primarily designed to be deployed on a multi-node cluster. However, it can also be deployed on a single machine, as a single process, for testing purposes.
Q32. How would a Hadoop administrator deploy the various components of Hadoop in production?
Ans: Deploy the NameNode and JobTracker on the master node, and deploy DataNodes and TaskTrackers on multiple slave nodes.
There is a need for only one NameNode and JobTracker in the system. The number of DataNodes depends on the available hardware.
Q33. What is the best practice to deploy the secondary NameNode?
Ans: Deploy the secondary NameNode on a separate standalone machine. This way it will not interfere with primary NameNode operations. The secondary NameNode must have the same memory requirements as the primary NameNode.
Q34. Is there a standard procedure to deploy Hadoop?
Ans: No, there are some differences between the various distributions. However, they all require that Hadoop jars be installed on the machine.
There are some common requirements for all Hadoop distributions, but the specific procedures will differ for different vendors, since they all include some degree of proprietary software.
Q35. What is the role of the secondary NameNode?
Ans: The secondary NameNode performs the CPU-intensive operation of combining the edit logs and the current filesystem snapshot.
The secondary NameNode was separated out as a process because of these CPU-intensive operations and the additional requirement for metadata back-up.
Q36. What are the side effects of not running a secondary NameNode?
Ans: The cluster performance will degrade over time, since the edit log will grow bigger and bigger.
If the secondary NameNode is not running at all, the edit log will grow significantly and will slow the system down. Also, the system will go into safe mode for an extended time, since the NameNode needs to combine the edit log and the current filesystem checkpoint image.
Q37. What happens if a DataNode loses network connection for a few minutes?
Ans: The NameNode will detect that the DataNode is not responsive and will start replication of the data from the remaining replicas. When the DataNode comes back online, the extra replicas will be deleted.
The replication factor is actively maintained by the NameNode. The NameNode monitors the status of all DataNodes and keeps track of which blocks are located on each node. The moment a DataNode becomes unavailable, replication of the data from the existing replicas is triggered. However, if the DataNode comes back up, over-replicated data will be deleted. Note: the data might be deleted from the original DataNode.
Q38. What happens if one of the DataNodes has a much slower CPU?
Ans: The task execution will be as fast as the slowest worker. However, if speculative execution is enabled, the slowest worker will not have such a big impact.
Hadoop was specifically designed to work with commodity hardware, and speculative execution helps to offset the slow workers. Multiple instances of the same task are created, and the JobTracker takes the first result into consideration; the other instances of the task are killed.
Q39. What is speculative execution?
Ans: If speculative execution is enabled, the JobTracker will issue multiple instances of the same task on multiple nodes and will take the result of the task that finishes first. The other instances of the task will be killed.
Speculative execution is used to offset the impact of slow workers in the cluster. The JobTracker creates multiple instances of the same task and takes the result of the first successful task; the rest of the tasks are discarded.
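The feature is controlled per task type; a sketch using the classic Hadoop 1.x property names (which differ in later versions):

```xml
<!-- mapred-site.xml (property names from the Hadoop 1.x line) -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
</property>
```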
Q40. After increasing the replication level, I still see that data is under-replicated. What could be wrong?
Ans: Data replication takes time when there are large quantities of data. The Hadoop administrator should allow sufficient time for data replication.
Depending on the data size, replication will take some time. The Hadoop cluster still needs to copy data around, and if the data is large enough, it is not uncommon for replication to take from a few minutes to a few hours.
Q41. How many racks do you need to create a Hadoop cluster in order to make sure that the cluster operates reliably?
Ans: In order to ensure reliable operation, it is recommended to have at least 2 racks with rack placement configured.
Hadoop has a built-in rack awareness mechanism that allows data distribution between different racks based on the configuration.
Q42. Are there any special requirements for the NameNode?
Ans: Yes, the NameNode holds information about all files in the system, so it needs to be extra reliable.
The NameNode is a single point of failure. It needs to be extra reliable, and its metadata should be replicated in multiple locations. Note that the community is working on solving the single-point-of-failure problem with the NameNode.
Q43. If you have a file of 128 MB and the replication factor is set to 3, how many blocks can you find on the cluster corresponding to that file (assuming the default Apache and Cloudera configuration)?
Ans: Based on the configuration settings, the file will be divided into multiple blocks according to the default block size of 64 MB: 128M / 64M = 2 blocks. Each block will be replicated according to the replication factor setting (default 3): 2 * 3 = 6 blocks.
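The same arithmetic, written out (the function is purely illustrative):

```python
import math

# Physical blocks stored on the cluster = logical blocks * replication factor.
def physical_blocks(file_size_mb, block_size_mb=64, replication=3):
    return math.ceil(file_size_mb / block_size_mb) * replication

print(physical_blocks(128))  # 128/64 = 2 logical blocks, * 3 replicas = 6
```

Note that a partially filled last block still counts as a block, so a 130 MB file would occupy 3 logical blocks (9 physical replicas).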
Q44. What is distributed copy (distcp)?
Ans: Distcp is a Hadoop utility for launching MapReduce jobs to copy data. Its primary usage is for copying a large amount of data, e.g. hadoop distcp hdfs://namenode1/src hdfs://namenode2/dest.
One of the major challenges in the Hadoop environment is copying data across multiple clusters, and distcp allows multiple DataNodes to be leveraged for parallel copying of the data.
Q45. What is the replication factor?
Ans: The replication factor controls how many times each individual block is replicated.
Data is replicated in the Hadoop cluster based on the replication factor. A high replication factor guarantees data availability in the event of failure.
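A minimal hdfs-site.xml fragment setting the cluster-wide default factor (3 is the usual default; individual files can override it, e.g. with hadoop fs -setrep):

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```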
Q46. What daemons run on Master nodes?
Ans: NameNode, Secondary NameNode and JobTracker.
Hadoop is made up of five separate daemons, and each of these daemons runs in its own JVM. NameNode, Secondary NameNode and JobTracker run on master nodes; DataNode and TaskTracker run on each slave node.
Q47. What is rack awareness?
Ans: Rack awareness is the way in which the NameNode decides how to place blocks, based on the rack definitions.
Hadoop will try to minimize the network traffic between DataNodes within the same rack and will only contact remote racks if it has to. The NameNode is able to control this thanks to rack awareness.
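The rack definitions usually come from an administrator-supplied topology script. The sketch below is a hypothetical such script (the mapping, paths and rack names are made up); Hadoop invokes the script configured in topology.script.file.name (net.topology.script.file.name on newer versions) with one or more node addresses and expects one rack path per node on stdout:

```python
#!/usr/bin/env python
# Hypothetical rack-topology script for illustration only.
import sys

# Static address-to-rack mapping; a real script might parse a site-specific file.
RACK_MAP = {
    "10.1.1.11": "/dc1/rack1",
    "10.1.1.12": "/dc1/rack1",
    "10.1.2.11": "/dc1/rack2",
}

def rack_for(node):
    # Unknown nodes fall back to the default rack, mirroring Hadoop's behaviour.
    return RACK_MAP.get(node, "/default-rack")

if __name__ == "__main__":
    for node in sys.argv[1:]:
        print(rack_for(node))
```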
Q48. What is the role of the JobTracker in a Hadoop cluster?
Ans: The JobTracker is responsible for scheduling tasks on slave nodes, collecting results and retrying failed tasks.
The JobTracker is the main component of MapReduce execution. It manages the division of the job into smaller tasks, submits tasks to individual TaskTrackers, tracks the progress of the jobs and reports results back to the calling code.
How does the Hadoop cluster tolerate DataNode failures?
Ans: Since Hadoop is designed to run on commodity hardware, DataNode failures are expected. The NameNode keeps track of all available DataNodes and actively maintains the replication factor on all data.
The NameNode actively tracks the status of all DataNodes and acts immediately if a DataNode becomes non-responsive. The NameNode is the central "brain" of HDFS and starts replication of the data the moment a disconnect is detected.
Q49. What is the procedure for NameNode recovery?
Ans: A NameNode can be recovered in two ways: by starting a new NameNode from backup metadata, or by promoting the secondary NameNode to primary.
The NameNode recovery procedure is very important for ensuring the reliability of the data. It can be done either by starting a new NameNode from the backed-up metadata or by promoting the secondary NameNode to primary.
Q50. The Web UI shows that half of the DataNodes are in decommissioning mode. What does that mean? Is it safe to remove those nodes from the network?
Ans: This means that the NameNode is trying to retrieve the data from those DataNodes by moving the replicas to the remaining DataNodes. There is a possibility that data will be lost if the administrator removes those DataNodes before decommissioning has finished.
Due to the replication strategy, it is possible to lose some data if DataNodes are removed en masse prior to completing the decommissioning process. Decommissioning refers to the NameNode moving the replicas from those DataNodes to the remaining DataNodes.
Q51. What does the Hadoop administrator need to do after adding new DataNodes to the Hadoop cluster?
Ans: Since the new nodes will not have any data on them, the administrator needs to start the balancer to redistribute data evenly between all nodes.
The Hadoop cluster will discover new DataNodes automatically. However, in order to optimize cluster performance, it is recommended to start the rebalancer (hadoop balancer) to redistribute the data between DataNodes evenly.
Q52. If the Hadoop administrator needs to make a change, which configuration file does he need to change?
Ans: It depends on the nature of the change. Each node has its own set of configuration files, and they are not always the same on each node.
Each node in the Hadoop cluster has its own configuration files, so the change needs to be made in every relevant file. One of the reasons for this is that the configuration can be different for each node.
Q53. MapReduce jobs are failing on a cluster that was just restarted. They worked before the restart. What could be wrong?
Ans: The cluster is in safe mode. The administrator needs to wait for the NameNode to exit safe mode before submitting the jobs again.
This is a very common mistake by Hadoop administrators when there is no secondary NameNode on the cluster and the cluster has not been restarted in a long time. The NameNode will go into safe mode and combine the edit log and the current file system checkpoint image.