
Apache Spark Interview Questions and Answers



Q1. What is Apache Spark?

Ans: Apache Spark is a fast cluster computing technology designed for fast computation. It builds on the Hadoop MapReduce model and extends that model to efficiently support more types of computation, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming. Apart from supporting all of these workloads in a single system, it reduces the management burden of maintaining separate tools.

Q2. What is SparkContext?

Ans: SparkContext is the entry point to Spark. Using the SparkContext you create RDDs, which provide various ways of transforming data.

Q3. Why is Spark faster than MapReduce?

Ans: There are a few important reasons why Spark is faster than MapReduce, and some of them are below:

There is no tight coupling in Spark, i.e., there is no mandatory rule that reduce must come after map.

Spark tries to keep the data in memory as much as possible.

In MapReduce, the intermediate data is stored in HDFS, so it takes longer to get the data back from a source; this is not the case with Spark.

Q4. Explain the Apache Spark Architecture.


An Apache Spark application consists of two programs, namely a driver program and a workers program.

A cluster manager sits in between to interact with these two kinds of cluster nodes. The SparkContext keeps in touch with the worker nodes with the help of the cluster manager.

The SparkContext is like a master and the Spark workers are like slaves.

Workers contain the executors that run the job. If any dependencies or arguments have to be passed, the SparkContext takes care of that. RDDs reside on the Spark executors.

You can also run Spark applications locally using a thread, and if you want to take advantage of distributed environments you can use S3, HDFS, or any other storage system.

Q5. What are the key features of Spark?


Allows integration with Hadoop and files stored in HDFS.

Spark has an interactive language shell, as it has an independent Scala (the language in which Spark is written) interpreter.

Spark consists of RDDs (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.

Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis, and graph processing.

Q6. What is Shark?

Ans: Most data users know only SQL and are not good at programming. Shark is a tool developed for people from a database background to access Scala MLlib capabilities through a Hive-like SQL interface. Shark helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries, and data.

Q7. On which platforms can Apache Spark run?

Ans: Spark can run on the following platforms:

YARN (Hadoop): Since YARN can handle any kind of workload, Spark can run on YARN. There are two modes of execution: one in which the Spark driver runs inside a container on a node, and a second in which the Spark driver runs on the client machine. This is the most common way of using Spark.

Apache Mesos: Mesos is an open-source, up-and-coming resource manager. Spark can run on Mesos.

EC2: If you do not want to manage the hardware yourself, you can run Spark on top of Amazon EC2. This makes Spark suitable for various organizations.

Standalone: If you have no resource manager installed in your organization, you can use the standalone mode. Basically, Spark provides its own resource manager. All you need to do is install Spark on all nodes in a cluster, inform each node about all the other nodes, and start the cluster. The nodes then start communicating with each other and run.

Q8. What are the various programming languages supported by Spark?

Ans: Though Spark is written in Scala, it lets users code in various languages such as:

Scala

Java

Python

R (Using SparkR)

SQL (Using Spark SQL)

Also, by piping the data through other commands, we should be able to use all kinds of programming languages or binaries.

Q9. Compare Spark vs Hadoop MapReduce

Spark vs Hadoop

Criteria    Hadoop MapReduce    Apache Spark

Memory    Does not leverage the memory of the Hadoop cluster to the maximum.    Lets you keep data in memory through the use of RDDs.

Disk usage    MapReduce is disk oriented.    Spark caches data in memory and ensures low latency.

Processing    Only batch processing is supported.    Supports real-time processing through Spark Streaming.

Installation    Is bound to Hadoop.    Is not bound to Hadoop.

Q10. What are actions and transformations?

Ans: Transformations create new RDDs from an existing RDD. Transformations are lazy and are not executed until you call an action.

Eg: map(), filter(), flatMap(), etc.

Actions return the results computed from an RDD.

Eg: reduce(), count(), collect(), etc.
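
The lazy behaviour described above can be illustrated without a Spark cluster. The sketch below is a toy model in plain Python, not the Spark API: `LazyDataset`, `square`, and `calls` are hypothetical names made up for this illustration. Transformations only record what should be done; the work runs when an action such as `collect()` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable collection of records
        self._steps = steps    # pending transformations, not yet executed

    # Transformations: return a new dataset and do no work.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    # Actions: run the recorded steps and return a result.
    def collect(self):
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def count(self):
        return len(self.collect())


calls = []  # records every time the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset([1, 2, 3, 4])
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)  # still 0: no transformation has run yet
result = evens_squared.collect()  # the action triggers the recorded steps
```

After `collect()`, `result` holds the even squares and `calls` shows that `square` ran once per input element, but only once the action was invoked.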

Q11. What are the various storage systems from which Spark can read data?

Ans: Spark has been designed to process data from various sources. It can process data stored in HDFS, Cassandra, EC2, Hive, HBase, and Alluxio (previously Tachyon). It can also read data from any system that supports a Hadoop data source.

Q12. List a few use cases where Spark outperforms Hadoop in processing.


Sensor Data Processing – Apache Spark's in-memory computing works best here, as data is retrieved and combined from different sources.

Spark is preferred over Hadoop for real-time querying of data.

Stream Processing – For processing logs and detecting fraud in live streams for alerts, Apache Spark is the best solution.

Q13. What is Spark Driver?

Ans: The Spark driver is the program that runs on the master node of the machine and declares transformations and actions on data RDDs. In simple terms, the driver in Spark creates the SparkContext, connected to a given Spark master. The driver also delivers the RDD graphs to the master, where the standalone cluster manager runs.

Q14. What are Accumulators?

Ans: Accumulators are write-only variables that are initialized once and sent to the workers. The workers update them based on the logic written, and the updates are sent back to the driver, which aggregates or processes them based on the logic.

Only the driver can access the accumulator's value. For tasks, accumulators are write-only. For example, an accumulator can be used to count the number of errors seen in an RDD across workers.
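
The error-counting example can be sketched as a toy model in plain Python. This is not Spark code: `run_task`, `partitions`, and the log lines are hypothetical, and the loop stands in for tasks running on separate workers. (In PySpark the corresponding real objects are created with `SparkContext.accumulator`, updated with `add()`, and read on the driver through `value`.)

```python
def run_task(lines):
    """Task side: count errors locally and ship only the update back."""
    local_errors = 0
    for line in lines:
        if line.startswith("ERROR"):
            local_errors += 1
    return local_errors


# Hypothetical log data, already split into one partition per worker.
partitions = [
    ["INFO start", "ERROR disk full"],
    ["ERROR timeout", "INFO retry", "ERROR timeout"],
]

# Driver side: merge the updates from every task into one total.
task_updates = [run_task(part) for part in partitions]
total_errors = sum(task_updates)
```

The tasks never read the running total; they only contribute to it, which is the write-only behaviour described above.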

Q15. What is Hive on Spark?

Ans: Hive contains significant support for Apache Spark, wherein Hive execution is configured to Spark:

hive> set spark.home=/location/to/sparkHome;

hive> set hive.execution.engine=spark;

Hive on Spark supports Spark on YARN mode by default.

Q16. What are Broadcast Variables?

Ans: Broadcast variables are read-only shared variables. Suppose there is a set of data that may have to be used multiple times by the workers at different phases; we can share all those variables with the workers from the driver, and every machine can read them.
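
As a rough illustration, the toy model below shares one read-only lookup table with several tasks. It is plain Python rather than Spark: `country_names`, `run_task`, and the sample codes are hypothetical, and `MappingProxyType` is used only to mimic the read-only nature of a broadcast value.

```python
from types import MappingProxyType

# Driver side: build the lookup table once and expose it read-only.
country_names = MappingProxyType({"IN": "India", "US": "United States"})


def run_task(partition, lookup):
    """Task side: read from the shared table, never modify it."""
    return [lookup.get(code, "unknown") for code in partition]


# Hypothetical partitions of country codes, one per worker.
partitions = [["IN", "US"], ["US", "FR"]]
results = [run_task(part, country_names) for part in partitions]
```

Every task sees the same table, and an attempt to assign into `country_names` raises a `TypeError`.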

Q17. What optimizations can a developer make while working with Spark?


1. Spark is memory intensive; whatever it does, it does in memory.

2. First, you can adjust how long Spark waits before it times out on each of the phases of data locality (data local –> process local –> node local –> rack local –> any).

3. Filter out data as early as possible. For caching, choose wisely from the various storage levels.

4. Tune the number of partitions in Spark.

Q18. What is Spark SQL?

Ans: Spark SQL is a module for structured data processing that lets us take advantage of SQL queries running on the datasets.

Q19. What is Spark Streaming?

Ans: Whenever data is flowing continuously and you want to process it as early as possible, you can take advantage of Spark Streaming. It is the API for stream processing of live data. Data can flow in from Kafka, Flume, TCP sockets, Kinesis, etc., and you can do complex processing on the data before pushing it to its destinations. Destinations can be file systems, databases, or dashboards.

Q20. What is Sliding Window?

Ans: In Spark Streaming, you have to specify the batch interval. For example, say your batch interval is 10 seconds: Spark will process whatever data it received in the last 10 seconds, i.e., the last batch interval. With a sliding window, you can specify how many of the last batches have to be processed, so you specify both the batch interval and how many batches you want to process. Apart from this, you can also specify when you want to process your last sliding window. For example, you may want to process the last 3 batches whenever there are 2 new batches. That is, you specify when to slide and how many batches have to be processed in that window.
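
The "last 3 batches every 2 new batches" example can be worked through in plain Python. This is a toy calculation, not Spark Streaming code: `sliding_windows` and the batch labels are hypothetical. In Spark Streaming the window length and slide interval are given as durations that are multiples of the batch interval, which is why both are derived from it here.

```python
def sliding_windows(batches, window_len, slide):
    """Every `slide` batches, emit a window holding the last `window_len` batches."""
    windows = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)  # the first window may be shorter
        windows.append(batches[start:end])
    return windows


batch_interval_s = 10                     # one batch every 10 seconds
batches = [f"b{i}" for i in range(1, 7)]  # six batches: b1 .. b6

windows = sliding_windows(batches, window_len=3, slide=2)

window_duration_s = 3 * batch_interval_s  # each window spans 30 seconds
slide_duration_s = 2 * batch_interval_s   # a window is emitted every 20 seconds
```

Consecutive windows overlap by one batch here, because the window is longer than the slide.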

Q21. What does MLlib do?

Ans: MLlib is a scalable machine learning library provided by Spark. It aims at making machine learning easy and scalable with common learning algorithms and use cases like clustering, regression, filtering, dimensionality reduction, and the like.

Q22. List the capabilities of Spark SQL.

Ans: Spark SQL is capable of:

Loading data from a variety of structured sources.

Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), for instance business intelligence tools like Tableau.

Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.

Q23. How can Spark be connected to Apache Mesos?

Ans: To connect Spark with Mesos:

Configure the Spark driver program to connect to Mesos. The Spark binary package should be in a location accessible by Mesos. (or)

Install Apache Spark in the same location as Apache Mesos and configure the property 'spark.mesos.executor.home' to point to the location where it is installed.

Q24. Is it possible to run Spark and Mesos along with Hadoop?

Ans: Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.

Q25. When running Spark applications, is it necessary to install Spark on all the nodes of a YARN cluster?

Ans: Spark need not be installed on every node when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.

Q26. What is the Catalyst framework?

Ans: The Catalyst framework is a new optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.

Q27. Name a few companies that use Apache Spark in production.

Ans: Pinterest, Conviva, Shopify, OpenTable

Q28. Why is BlinkDB used?

Ans: BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data; it renders query results marked with meaningful error bars. BlinkDB helps users balance query accuracy with response time.

Q29. What are the common mistakes developers make when running Spark applications?

Ans: Developers often make the mistake of:

Hitting the web service several times by using multiple clusters.

Running everything on the local node instead of distributing it.

Developers need to be careful with this, as Spark makes use of memory for processing.

Q30. What is the advantage of a Parquet file?

Ans: Parquet is a columnar file format that helps to:

Limit I/O operations

Consume less space

Fetch only the required columns.
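
Why a columnar layout limits I/O can be seen with a toy comparison in plain Python. This does not reproduce Parquet's actual on-disk format; `rows`, `columns`, and the field names are hypothetical, and "values read" is only a rough proxy for I/O.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.0},
    {"id": 3, "name": "c", "amount": 30.0},
]

# Column-oriented layout: one list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Query: total of the "amount" column.
total = sum(columns["amount"])

# Row layout must pass over every field of every record to answer it.
values_read_row = sum(len(row) for row in rows)

# Column layout reads only the one column the query needs.
values_read_columnar = len(columns["amount"])
```

With three columns, the columnar read touches a third of the values; the gap widens as tables get wider and queries stay narrow.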