SQOOP Interview Questions and Answers
Q1. What is Sqoop?
Ans: Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system to relational databases.
Q2. What is Sqoop metastore?
Ans: The Sqoop metastore is a shared metadata repository that lets remote users define and execute saved jobs created with the sqoop job command. The sqoop-site.xml file should be configured to connect to the metastore.
Q3. What are the two file formats supported by Sqoop for import?
Ans: Delimited text and Sequence Files.
Q4. What is the difference between Sqoop and DistCP command in Hadoop?
Ans: Both DistCp (Distributed Copy in Hadoop) and Sqoop transfer data in parallel. The only difference is that the DistCp command can transfer any kind of data from one Hadoop cluster to another, whereas Sqoop transfers data between an RDBMS and other components in the Hadoop ecosystem such as HBase, Hive, and HDFS.
Q5. Compare Sqoop and Flume
Ans: Sqoop vs Flume
Sqoop: Used for importing data from structured data sources such as an RDBMS. Flume: Used for moving bulk streaming data into HDFS.
Sqoop: Has a connector-based architecture. Flume: Has an agent-based architecture.
Sqoop: Data import is not event driven. Flume: Data load is event driven.
Sqoop: HDFS is the destination for imported data. Flume: Data flows into HDFS through one or more channels.
Q6. What do you mean by free-form import in Sqoop?
Ans: Sqoop can import data from a relational database using any SQL query, rather than only using table and column name parameters.
Q7. Does Apache Sqoop have a default database?
Ans: Yes, MySQL is the default database
Q8. How can you execute a free-form SQL query in Sqoop to import the rows sequentially?
Ans: This can be done using the -m 1 option in the Sqoop import command. It creates only one MapReduce task, which then imports the rows serially.
Q9. I have around 300 tables in a database. I want to import all of the tables from the database except the tables named Table298, Table123, and Table299. How can I do that without having to import the tables one at a time?
Ans: This can be done using the import-all-tables command in Sqoop and specifying the --exclude-tables option with it, as follows (connection values omitted):
sqoop import-all-tables --connect <jdbc-uri> --username <username> --password <password> --exclude-tables Table298,Table123,Table299
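Sqoop expects the --exclude-tables value as a single comma-separated list with no spaces, which is easy to get wrong when typing by hand. The following is a minimal Python sketch that only assembles and prints such a command; it does not run Sqoop. The host name, database name, and user name are hypothetical, and -P (prompt for the password) is used here instead of --password as an assumption about how credentials are supplied.

```python
import shlex


def import_all_tables_args(jdbc_url, username, excluded_tables):
    """Build the argument list for `sqoop import-all-tables`."""
    # --exclude-tables takes one comma-separated value; remove any stray
    # whitespace inside or around the table names before joining them.
    excluded = ",".join("".join(name.split()) for name in excluded_tables)
    return [
        "sqoop", "import-all-tables",
        "--connect", jdbc_url,
        "--username", username,
        "-P",  # prompt for the password instead of passing it inline
        "--exclude-tables", excluded,
    ]


# Hypothetical connection details, for illustration only.
args = import_all_tables_args(
    "jdbc:mysql://dbhost.example.com/sales",
    "etl_user",
    ["Table298", "Table123", "Table299"],
)
print(shlex.join(args))
```

The printed line can be pasted into a shell on a machine where Sqoop is installed.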
Q10. How can I import large objects (BLOB and CLOB objects) in Apache Sqoop?
Ans: The Apache Sqoop import command does not support direct import of BLOB and CLOB large objects. To import large objects in Sqoop, JDBC-based imports must be used, without the --direct argument to the import utility.
Q11. How will you list all the columns of a table using Apache Sqoop?
Ans: Unlike sqoop-list-tables and sqoop-list-databases, there is no direct command such as sqoop-list-columns to list all the columns. The indirect way of achieving this is to retrieve the column names of the desired table and write them to a file, which can then be viewed manually.
sqoop import -m 1 --connect 'jdbc:sqlserver://nameofmyserver;database=nameofmydatabase;username=DeZyre;password=mypassword' --query "SELECT column_name, DATA_TYPE FROM INFORMATION_SCHEMA.Columns WHERE table_name='mytableofinterest' AND \$CONDITIONS" --target-dir 'mytableofinterest_column_name'
Q12. The incoming value from HDFS for a particular column is NULL. How will you load that row into an RDBMS in which the columns are defined as NOT NULL?
Ans: Using the --input-null-string parameter, a default value can be specified so that the row gets inserted with the default value for the column that has a NULL value in HDFS.
Q13. What is the significance of using the --split-by clause for running parallel import tasks in Apache Sqoop?
Ans: The --split-by clause specifies the column of the table that is used to generate splits for data imports into the Hadoop cluster. It helps achieve better performance through greater parallelism: Apache Sqoop creates the splits based on the values present in the column named in the --split-by clause of the import command. If the --split-by clause is not specified, the primary key of the table is used to create the splits during data import. At times the primary key of the table may not have evenly distributed values between its minimum and maximum. In such cases, the --split-by clause can be used to specify some other column that has an even distribution of data, so that the splits are balanced and the data import is efficient.
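To see why an unevenly distributed split column hurts parallelism, the following Python sketch divides the range between a column's minimum and maximum into equal-width ranges, one per mapper, and counts how many rows fall into each. This is a simplified model written for illustration, not Sqoop's actual splitter; the function names and the sample values are assumptions.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive integer range [lo, hi] into contiguous,
    near-equal-width ranges, at most one per mapper."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if hi < lo:
        raise ValueError("hi must not be smaller than lo")
    total = hi - lo + 1
    count = min(num_mappers, total)
    base, extra = divmod(total, count)
    ranges = []
    start = lo
    for i in range(count):
        # The first `extra` ranges get one additional value each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def rows_per_split(values, ranges):
    """Count how many column values land in each range."""
    return [sum(1 for v in values if lo <= v <= hi) for lo, hi in ranges]


# Hypothetical skewed key column: ids 1..10 plus a single outlier at 1000.
skewed_ids = list(range(1, 11)) + [1000]
ranges = split_ranges(min(skewed_ids), max(skewed_ids), 4)
print(ranges)
print(rows_per_split(skewed_ids, ranges))
```

With these assumed values, ten of the eleven rows land in the first range, so one mapper would do nearly all the work. Choosing a more evenly distributed column for the splits avoids that imbalance.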
Q14. What is the default file format to import data using Apache Sqoop?
Ans: Sqoop allows data to be imported using two file formats:
Delimited Text File Format
This is the default file format for importing data using Sqoop. It can be explicitly specified using the --as-textfile argument to the import command in Sqoop. Passing this argument to the command produces a string-based representation of all the records in the output files, with delimiter characters between rows and columns.
Sequence File Format
This is a binary file format in which records are stored in custom record-specific data types that are exposed as Java classes. Sqoop automatically creates these data types and manifests them as Java classes.
Q15. What are the basic commands in Hadoop Sqoop and their uses?
Ans: The basic commands of Hadoop Sqoop are
codegen, create-hive-table, eval, export, help, import, import-all-tables, list-databases, list-tables, version.
Uses of the basic Hadoop Sqoop commands
codegen - It helps to generate code to interact with database records
create-hive-table - It helps to import a table definition into Hive
eval - It helps to evaluate a SQL statement and display the results
export - It helps to export an HDFS directory into a database table
help - It helps to list the available commands
import - It helps to import a table from a database to HDFS
import-all-tables - It helps to import all tables from a database to HDFS
list-databases - It helps to list the available databases on a server
list-tables - It helps to list the tables in a database
version - It helps to display the version information
For each Sqoop copy into HDFS, how many MapReduce jobs and tasks will be submitted?
Ans: By default, four map tasks are submitted for each Sqoop copy into HDFS, and no reduce tasks are scheduled.
Q16. You successfully imported a table into HBase using Apache Sqoop, but when you query the table you find that the number of rows is less than expected. What could be the likely reason?
Ans: If the imported data has rows that contain null values for all the columns, then those records were probably dropped during import, because HBase does not allow null values in all the columns of a record.
Q17. Explain the significance of using the --split-by clause in Apache Sqoop.
Ans: --split-by is a clause used to specify the column of the table that is used to generate splits for data imports while importing the data into the Hadoop cluster. It helps improve performance through greater parallelism. It also lets you specify a column that has an even distribution of data to create the splits, so that the data is imported efficiently.
Q18. If the source data gets updated every now and then, how will you synchronise the data in HDFS that is imported by Sqoop?
Ans: Data can be synchronised using the incremental parameter with data import.
The --incremental parameter can be used with one of two options:
append - If the table is continuously updated with new rows and increasing row id values, then incremental import with the append option should be used. The values of some of the columns are checked (the columns to be checked are specified using --check-column), and if a modified value is discovered for those columns, only a new row is inserted.
lastmodified - In this kind of incremental import, the source has a date column that is checked. Any records that have been updated after the last import, based on the last-modified column in the source, have their values updated.
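The difference between the two modes can be sketched in Python as a selection rule applied to the source rows. This models the behaviour described above and is not Sqoop's implementation. The column names id and updated_at, the sample rows, and the merge-by-key step in the lastmodified case are assumptions made for illustration.

```python
from datetime import datetime


def append_import(source_rows, last_value, check_column="id"):
    """append mode: pick up only rows whose check column is greater
    than the value recorded at the previous import."""
    return [row for row in source_rows if row[check_column] > last_value]


def lastmodified_import(existing_rows, source_rows, last_value,
                        key="id", check_column="updated_at"):
    """lastmodified mode: rows changed after the previous import replace
    the older copy with the same key; unchanged rows are kept as they are."""
    merged = {row[key]: row for row in existing_rows}
    for row in source_rows:
        if row[check_column] > last_value:
            merged[row[key]] = row
    return [merged[k] for k in sorted(merged)]


# Hypothetical data: row 2 was edited and row 3 was added after the last import.
already_imported = [
    {"id": 1, "name": "alpha", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "name": "beta", "updated_at": datetime(2024, 1, 1)},
]
source = [
    {"id": 1, "name": "alpha", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "name": "beta-v2", "updated_at": datetime(2024, 3, 1)},
    {"id": 3, "name": "gamma", "updated_at": datetime(2024, 3, 2)},
]

new_rows = append_import(source, last_value=2)
synced = lastmodified_import(already_imported, source, datetime(2024, 2, 1))
print([row["id"] for row in new_rows])
print([(row["id"], row["name"]) for row in synced])
```

In this sketch, append mode picks up only the new row, while lastmodified mode also refreshes the edited row. On the command line the previous import point is supplied with --last-value.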
Q19. How can Sqoop be used in Java programs?
Ans: The Sqoop jar is included in the classpath of the Java program. The required parameters are then passed to Sqoop programmatically, just as they would be for the CLI (command line interface), and the Sqoop.runTool() method is invoked in the Java code.
Q20. The command below is used to specify the connect string, which contains the hostname, to connect to MySQL on the local host with the database name test_db:
--connect jdbc:mysql://localhost/test_db
Is the above command the best way to specify the connect string if I want to use Apache Sqoop with a distributed Hadoop cluster?
Ans: When using Sqoop with a distributed Hadoop cluster, the URL should not be specified with localhost in the connect string, because the connect string is used on all the DataNodes in the Hadoop cluster. If the literal name localhost is given instead of the IP address or the full hostname, each node will connect to a different database on its own localhost. It is always recommended to specify a hostname that can be seen by all remote nodes.
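A simple pre-flight check can catch this mistake before a job is submitted. The Python sketch below extracts the host from a MySQL-style JDBC URL and flags names that resolve differently on each node. The helper names are hypothetical, and the parsing assumes the jdbc:<driver>://host[:port]/database form; connect strings in other formats, such as semicolon-separated SQL Server URLs, would need their own handling.

```python
from urllib.parse import urlparse

# Names that point at "this machine" and so mean something different on each node.
LOCAL_NAMES = {"localhost", "127.0.0.1", "::1"}


def jdbc_host(jdbc_url):
    """Return the host part of a jdbc:<driver>://host[:port]/db URL."""
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix):
        raise ValueError("not a JDBC URL: " + jdbc_url)
    # Drop the jdbc: prefix so the remainder parses like an ordinary URL.
    return urlparse(jdbc_url[len(prefix):]).hostname


def is_cluster_safe(jdbc_url):
    """True if the host is set and is not a loopback name."""
    host = jdbc_host(jdbc_url)
    return host is not None and host.lower() not in LOCAL_NAMES


# dbhost.example.com is a hypothetical host name used for illustration.
for url in ("jdbc:mysql://localhost/test_db",
            "jdbc:mysql://dbhost.example.com:3306/test_db"):
    print(url, "->", "ok" if is_cluster_safe(url) else "avoid on a cluster")
```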
Q21. I have around 500 tables in a database. I want to import all the tables from the database except the tables named Table498, Table323, and Table199. How can I do this without having to import the tables one by one?
Ans: This can be accomplished using the import-all-tables command in Sqoop and specifying the --exclude-tables option with it, as follows (connection values omitted):
sqoop import-all-tables --connect <jdbc-uri> --username <username> --password <password> --exclude-tables Table498,Table323,Table199
Q22. You use the --split-by clause but it still does not give optimal performance. How can you improve the performance further?
Ans: Use the --boundary-query clause. Generally, Sqoop uses the SQL query SELECT MIN(<split-by>), MAX(<split-by>) FROM <table name> to find the boundary values for creating splits. However, if this query is not optimal, the --boundary-query argument can be used to supply any arbitrary query that returns two numeric columns.
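As an illustration, the Python sketch below builds an import command that replaces the default MIN/MAX lookup with a custom boundary query. It only assembles and prints the arguments. The table orders, the split column order_id, the lookup table order_id_bounds, and the host name are all hypothetical; the idea shown is that the bounds could be read from a small precomputed table instead of scanning the large one.

```python
import shlex


def default_boundary_query(table, split_by):
    """The MIN/MAX form of boundary query described in the answer above."""
    return f"SELECT MIN({split_by}), MAX({split_by}) FROM {table}"


def import_args(jdbc_url, table, split_by, boundary_query=None, num_mappers=4):
    """Build the argument list for `sqoop import`, optionally with a
    custom boundary query."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--split-by", split_by,
        "--num-mappers", str(num_mappers),
    ]
    if boundary_query is not None:
        # The query must return two numeric columns: lower and upper bound.
        args += ["--boundary-query", boundary_query]
    return args


# Hypothetical names throughout; order_id_bounds is an assumed one-row
# table that stores the precomputed minimum and maximum order_id.
custom = "SELECT min_id, max_id FROM order_id_bounds"
cmd = import_args("jdbc:mysql://dbhost.example.com/sales", "orders",
                  "order_id", boundary_query=custom)
print(default_boundary_query("orders", "order_id"))
print(shlex.join(cmd))
```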
Q23. During a Sqoop import, you use the -m or --num-mappers clause to specify the number of mappers as 8 so that it can run eight parallel MapReduce tasks. However, Sqoop runs only four parallel MapReduce tasks. Why?
Ans: The Hadoop MapReduce cluster is configured to run a maximum of four parallel MapReduce tasks, so the Sqoop import can be configured with a number of parallel tasks less than or equal to four, but not more than four.
Q24. How can you see the list of stored jobs in sqoop metastore?
Ans: sqoop job --list
Q25. Give a Sqoop command to import data from all tables in the MySQL DB DB1.
Ans: sqoop import-all-tables --connect jdbc:mysql://host/DB1
Q26. Where can the metastore database be hosted?
Ans: The metastore database can be hosted anywhere within or outside of the Hadoop cluster.