Top 24 Apache Hadoop YARN Interview Questions
Q1. Differentiate Between NFS, Hadoop NameNode And JournalNode?
HDFS is a write-once file system, so a user cannot update a file in place once it exists; it can only be read or written anew. However, in certain enterprise scenarios such as file uploading, file downloading, file browsing or data streaming, it is not possible to achieve all of this with standard HDFS. This is where a distributed file system protocol, Network File System (NFS), is used. NFS allows access to files on remote machines in much the same way that applications access the local file system.
The NameNode is the heart of the HDFS file system: it maintains the metadata and tracks where the file data is stored across the Hadoop cluster.
The Standby and Active NameNodes communicate with a group of lightweight nodes to keep their state synchronized. These are called JournalNodes.
Q2. How Do You Set Up HA For Resource Manager?
The Resource Manager is responsible for scheduling applications and tracking resources in a cluster. Prior to Hadoop 2.4, the Resource Manager could not be set up for HA and was a single point of failure in a YARN cluster.
Since Hadoop 2.4, the YARN Resource Manager can be set up for high availability. High availability of the Resource Manager is enabled through an Active/Standby architecture. At any point in time, one Resource Manager is active and one or more Resource Managers are in standby mode. If the active Resource Manager fails, one of the standby Resource Managers transitions to active mode.
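As a rough sketch of what such a setup involves, the Python snippet below renders the yarn-site.xml properties commonly used to enable Resource Manager HA in Hadoop 2.x. The property names are the ones used by Hadoop; the cluster id, the Resource Manager ids, the host names and the ZooKeeper address are placeholder assumptions, not values from this article.

```python
from xml.sax.saxutils import escape

# Placeholder values (assumptions): replace with the real hosts of your cluster.
HA_PROPERTIES = {
    "yarn.resourcemanager.ha.enabled": "true",
    "yarn.resourcemanager.cluster-id": "example-cluster",
    "yarn.resourcemanager.ha.rm-ids": "rm1,rm2",
    "yarn.resourcemanager.hostname.rm1": "master1.example.com",
    "yarn.resourcemanager.hostname.rm2": "master2.example.com",
    "yarn.resourcemanager.zk-address": "zk1.example.com:2181",
}


def render_properties(props):
    """Render a dict as Hadoop-style <property> elements."""
    lines = ["<configuration>"]
    for name, value in props.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Print a fragment that could be merged into conf/yarn-site.xml.
    print(render_properties(HA_PROPERTIES))
```

The rm-ids list names one active and one standby Resource Manager; which one is active at a given moment is decided at runtime, not in this file.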
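Q3. What Is The Difference Between Hadoop NameNode Federation, NFS And JournalNode?
HDFS federation separates the namespace and storage to improve scalability and isolation. NFS gives access to files on remote machines as if they were local, and JournalNodes keep the Active and Standby NameNodes synchronized (see Q1).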
Q4. What Is The Difference Between MapReduce 1 And MapReduce 2/YARN?
In MapReduce 1, Hadoop centralized all tasks in the Job Tracker, which both allocated resources and scheduled the jobs across the cluster. YARN decentralizes this to ease the pressure on the Job Tracker. The Resource Manager allocates resources across the nodes, Node Managers launch and monitor containers on each node, and a per-application Application Master manages and executes the job. YARN thereby allows parallel execution. This approach eases many Job Tracker problems, improves the ability to scale up and optimizes job performance. Additionally, YARN allows multiple applications to scale up in the distributed environment.
Q5. What Are The Additional Benefits YARN Brings To Hadoop?
Effective utilization of resources, as multiple applications can run in YARN, all sharing a common pool of resources. In Hadoop MapReduce there are separate slots for Map and Reduce tasks, whereas in YARN there is no fixed slot. The same container can be used for Map and Reduce tasks, leading to better utilization.
YARN is backward compatible, so existing MapReduce jobs can still run on it.
Using YARN, one can even run applications that are not based on the MapReduce model.
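A toy calculation can make the utilization argument concrete. The numbers below (4 map slots, 2 reduce slots, 6 pending map tasks) are made-up assumptions, and the model ignores memory and CPU sizing entirely.

```python
def running_tasks_with_slots(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Fixed slots: map tasks can only use map slots, reduce tasks only reduce slots."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def running_tasks_with_containers(containers, pending_maps, pending_reduces):
    """Containers: any free container can host either kind of task."""
    return min(containers, pending_maps + pending_reduces)


if __name__ == "__main__":
    # Map phase: six map tasks are waiting and no reduce task is ready yet.
    slots = running_tasks_with_slots(4, 2, pending_maps=6, pending_reduces=0)
    containers = running_tasks_with_containers(6, pending_maps=6, pending_reduces=0)
    print(f"fixed slots: {slots} tasks running")
    print(f"containers:  {containers} tasks running")
```

With the same six units of capacity, the fixed-slot node leaves its two reduce slots idle during the map phase, while the container model can use all six.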
Q6. What Is Apache Hadoop YARN?
YARN is a powerful and efficient feature rolled out as part of Hadoop 2.0. YARN is a large-scale distributed system for running big data applications.
Q7. Is YARN A Replacement Of Hadoop MapReduce?
YARN is not a replacement for Hadoop MapReduce; it is a more powerful and efficient technology that supports MapReduce and is also referred to as Hadoop 2.0 or MapReduce 2.
Q8. What Are The Key Components Of YARN?
The basic idea of YARN is to split the functionality of resource management and job scheduling/monitoring into separate daemons.
YARN consists of the following components:
Resource Manager - The Resource Manager is a global component or daemon, one per cluster, which manages the requests to and resources across the nodes of the cluster.
Node Manager - The Node Manager runs on each node of the cluster and is responsible for launching and monitoring containers and reporting their status back to the Resource Manager.
Application Master - The Application Master is a per-application component that is responsible for negotiating resource requirements with the Resource Manager and working with Node Managers to execute and monitor the tasks.
Container - A container in the YARN framework is a UNIX process running on the node that executes an application-specific process with a constrained set of resources (memory, CPU, etc.).
Q9. What Are The Core Changes In Hadoop 2.X?
There are many changes; the main ones are the removal of the single point of failure and the decentralization of Job Tracker power to the data nodes. The entire Job Tracker architecture changed.
Some of the main differences between Hadoop 1.X and 2.X are given below:
Single point of failure – Rectified.
Node limitation (4000 to unlimited) – Rectified.
Job Tracker bottleneck – Rectified.
MapReduce slots are changed from static to dynamic.
High availability – Available.
Supports both interactive and graph/iterative algorithms (1.X does not support them).
Allows other applications also to integrate with HDFS.
Q10. How Do You Set Up Resource Manager To Use Capacity Scheduler?
You can configure the Resource Manager to use the Capacity Scheduler by setting the value of the property 'yarn.resourcemanager.scheduler.class' to 'org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler' in the file 'conf/yarn-site.xml'.
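Q11. A User Mistakenly Deleted A File. How Does Hadoop Remove It From Its File System? Can You Roll It Back?
HDFS first renames the file and places it in the trash directory (the user's .Trash directory) for a configurable amount of time. While the file is in trash, its blocks are not freed and it can still be restored by moving it back out, so a mistaken delete can be rolled back during this window. After this time, the NameNode deletes the file from the HDFS namespace and its blocks are freed. The retention time is configured as fs.trash.interval (in minutes) in core-site.xml. A value of 0, which is the default in stock Apache Hadoop (some distributions ship a non-zero default), disables trash so that files are deleted immediately; set a positive value to keep deleted files in trash for that long.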
Q12. Steps To Upgrade Hadoop 1.X To Hadoop 2.X?
To upgrade 1.X to 2.X, don't upgrade directly. Simply download 2.X locally, then remove the old files from the 1.X installation. The upgrade takes some time.
The share folder is there, and it is important: share .. hadoop .. mapreduce .. lib.
Stop all processes.
Delete the old metadata info… from work/hadoop2data
Copy and rename the 1.X data into work/hadoop2.X first.
Don't format the NameNode during the upgrade.
hadoop namenode -upgrade // It will take a lot of time.
Don't close the previous terminal; open a new terminal.
hadoop namenode -rollback // Only if the upgrade has to be reverted.
Q13. Explain The Differences Between Hadoop 1.X And Hadoop 2.X?
In Hadoop 1.X, MapReduce is responsible for both processing and cluster management, whereas in Hadoop 2.X processing is taken care of by other processing models and YARN is responsible for cluster management.
Hadoop 2.X scales better than Hadoop 1.X, with close to 10000 nodes per cluster.
Hadoop 1.X has a single point of failure problem, and whenever the NameNode fails it has to be recovered manually. In Hadoop 2.X, however, the Standby NameNode overcomes the SPOF problem, and the cluster can be configured for automatic recovery whenever the NameNode fails.
Hadoop 1.X works on the concept of slots, whereas Hadoop 2.X works on the concept of containers and can also run generic tasks.
Q14. How Does Hadoop Determine The Distance Between Two Nodes?
The Hadoop admin writes a script, called the topology script, to determine the rack location of nodes. Its purpose is to let Hadoop know the distance between nodes when replicating data. Configure this script in core-site.xml:
<property>
<name>topology.script.file.name</name>
<value>core/rack-awareness.sh</value>
</property>
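In rack-awareness.sh, write the script that reports where the nodes are located. In Hadoop 2.X this property is named net.topology.script.file.name; topology.script.file.name is the older, deprecated name.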
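The topology script contract is simple: Hadoop calls the script with one or more node addresses as arguments and reads one rack path per address from standard output. The sketch below illustrates that contract in Python rather than shell; the address-to-rack table and the rack names are hypothetical, and a real script would usually load them from a file maintained by the admin.

```python
#!/usr/bin/env python3
import sys

# Hypothetical mapping from node address to rack path (assumption).
RACK_BY_HOST = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}
# Rack reported for any address that is not in the table.
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts):
    """Return the rack path for every host, in the same order as given."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # One rack path per argument, separated by spaces.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

Unknown addresses fall back to a default rack so that a missing table entry does not make the script fail.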
Q15. How Do You Set Up Resource Manager To Use Fair Scheduler?
You can configure the Resource Manager to use the FairScheduler by setting the value of the property 'yarn.resourcemanager.scheduler.class' to 'org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler' in the file 'conf/yarn-site.xml'.
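To check which scheduler a given configuration selects, the property can be read back out of the XML. The following is a minimal sketch using Python's standard XML parser; the sample text stands in for the contents of conf/yarn-site.xml, and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

SCHEDULER_KEY = "yarn.resourcemanager.scheduler.class"

# Sample content standing in for conf/yarn-site.xml (assumption).
SAMPLE_YARN_SITE = """
<configuration>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
</configuration>
"""


def configured_scheduler(xml_text):
    """Return the scheduler class named in the XML, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == SCHEDULER_KEY:
            return prop.findtext("value", "").strip()
    return None


if __name__ == "__main__":
    print(configured_scheduler(SAMPLE_YARN_SITE))
```

A result of None means the file does not set the property, in which case the Resource Manager falls back to its built-in default scheduler.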
Q16. What Are The Scheduling Policies Available In YARN?
The YARN scheduler is responsible for scheduling resources to user applications based on a defined scheduling policy. YARN provides three scheduling options - FIFO scheduler, Capacity scheduler and Fair scheduler.
FIFO Scheduler - The FIFO scheduler puts application requests in a queue and runs them in the order of submission.
Capacity Scheduler - The Capacity scheduler has a separate dedicated queue for smaller jobs and starts them as soon as they are submitted.
Fair Scheduler - The Fair scheduler dynamically balances and allocates resources between all the running jobs.
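The difference between the FIFO and Fair policies can be illustrated with a toy model. The sketch below is not YARN's implementation: it treats the cluster as a single pool of resource units, ignores queues and containers, and uses a simple max-min fair share as a stand-in for the Fair scheduler's balancing. The Capacity scheduler is not modeled.

```python
def fifo_allocate(demands, capacity):
    """Serve applications strictly in submission order."""
    allocation = []
    for demand in demands:
        granted = min(demand, capacity)
        allocation.append(granted)
        capacity -= granted
    return allocation


def fair_allocate(demands, capacity):
    """Max-min fair share: split capacity evenly, never exceeding a demand."""
    allocation = [0.0] * len(demands)
    active = [i for i, d in enumerate(demands) if d > 0]
    remaining = float(capacity)
    while active and remaining > 0:
        share = remaining / len(active)
        satisfied = [i for i in active if demands[i] <= share]
        if not satisfied:
            # Everyone left wants more than an equal share: split evenly.
            for i in active:
                allocation[i] = share
            break
        # Fully serve the small demands, then redistribute what is left.
        for i in satisfied:
            allocation[i] = float(demands[i])
            remaining -= demands[i]
        active = [i for i in active if i not in satisfied]
    return allocation


if __name__ == "__main__":
    demands = [80, 30, 10]  # resource units requested, in submission order
    print("FIFO:", fifo_allocate(demands, 100))
    print("Fair:", fair_allocate(demands, 100))
```

With 100 units and demands of 80, 30 and 10, FIFO serves the first application in full and leaves the last one with nothing, while the fair split satisfies the two small applications and gives the remainder to the large one.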
Q17. How Is The Distance Between Two Nodes Defined In Hadoop?
Measuring bandwidth is difficult in Hadoop, so the network is represented as a tree. The distance between nodes in the tree plays a vital role in forming a Hadoop cluster and is defined by the network topology and the Java interface DNSToSwitchMapping. The distance between two nodes is the sum of their distances to their closest common ancestor. The method getDistance(Node node1, Node node2) is used to calculate the distance between two nodes, with the assumption that the distance from a node to its parent node is always 1.
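The rule can be sketched in a few lines of Python. This is an illustration of the idea, not Hadoop's Java code: nodes are written as '/datacenter/rack/node' paths (an assumed format), and every step from a node to its parent counts as 1.

```python
def get_distance(path1, path2):
    """Distance between two nodes given as '/datacenter/rack/node' paths."""
    parts1 = path1.strip("/").split("/")
    parts2 = path2.strip("/").split("/")
    # Count the leading path components that both nodes share.
    common = 0
    for a, b in zip(parts1, parts2):
        if a != b:
            break
        common += 1
    # Steps from each node up to the closest common ancestor.
    return (len(parts1) - common) + (len(parts2) - common)


if __name__ == "__main__":
    print(get_distance("/d1/r1/n1", "/d1/r1/n1"))  # same node
    print(get_distance("/d1/r1/n1", "/d1/r1/n2"))  # same rack
    print(get_distance("/d1/r1/n1", "/d1/r2/n3"))  # different racks
    print(get_distance("/d1/r1/n1", "/d2/r3/n4"))  # different data centers
```

For the four cases above the distances are 0, 2, 4 and 6 respectively, which is why replicas on the same rack are considered closer than replicas on another rack.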
Q18. Is YARN A Replacement Of MapReduce?
YARN is a generic concept; it supports MapReduce, but it is not a replacement for MapReduce. You can develop many applications with the help of YARN. Spark, Drill and many more applications work on top of YARN.
Q19. What Are The Core Changes In Hadoop 2.0?
Hadoop 2.X provides an upgrade to Hadoop 1.X in terms of resource management, scheduling and the manner in which execution occurs. In Hadoop 2.X the cluster resource management capabilities work in isolation from the MapReduce-specific programming logic. This helps Hadoop to share resources dynamically between multiple parallel processing frameworks, such as Impala and the core MapReduce component. Hadoop 2.X allows workable and fine-grained resource configuration, leading to efficient and better cluster utilization so that the application can scale to process a larger number of jobs.
Q20. What Is YARN?
Apache YARN, which stands for 'Yet Another Resource Negotiator', is Hadoop's cluster resource management system.
YARN provides APIs for requesting and working with Hadoop's cluster resources. These APIs are usually used by components of Hadoop's distributed frameworks such as MapReduce, Spark and Tez, which are built on top of YARN. User applications typically do not use the YARN APIs directly. Instead, they use higher-level APIs provided by the framework (MapReduce, Spark, etc.), which hide the resource management details from the user.
Q21. How Can Native Libraries Be Included In YARN Jobs?
There are two methods to include native libraries in YARN jobs:
By setting -Djava.library.path on the command line; in this case, however, there are chances that the native libraries might not be loaded correctly, and errors are possible.
The better option to include native libraries is to set LD_LIBRARY_PATH in the .bashrc file.
Q22. What Are The Core Concepts/Processes In YARN?
Resource Manager: Equivalent to the Job Tracker.
Node Manager: Equivalent to the Task Tracker.
Application Master: Equivalent to jobs. Everything is an application in YARN; a job submitted by a client is an application.
Containers: Equivalent to slots.
YARN child: When you submit an application, the Application Master dynamically launches YARN child processes to run the Map and Reduce tasks.
If the Application Master fails, it is not a problem: the Resource Manager automatically restarts it.
Q23. What Are The Modules That Constitute The Apache Hadoop 2.0 Framework?
Hadoop 2.0 contains four important modules, of which three are inherited from Hadoop 1.0 and a new module, YARN, is added to it.
Hadoop Common - This module consists of all the basic utilities and libraries that are required by other modules.
HDFS - Hadoop Distributed File System, which stores huge volumes of data on commodity machines across the cluster.
MapReduce - Java-based programming model for data processing.
YARN - This is a new module introduced in Hadoop 2.0 for cluster resource management and job scheduling.
Q24. What Is Resource Manager In YARN?
The YARN Resource Manager is a global component or daemon, one per cluster, which manages the requests to and resources across the nodes of the cluster.
The Resource Manager has two main components - Scheduler and Applications Manager.
Scheduler - The scheduler is responsible for allocating resources to and starting applications based on the abstract notion of resource containers having a constrained set of resources.
Applications Manager - The Applications Manager is responsible for accepting job submissions, negotiating the first container for executing the application-specific Application Master, and providing the service for restarting the Application Master container on failure.
