
Top 100+ Apache Spark Interview Questions And Answers - May 26, 2020



Question 1. What Is Shark?

Answer :

Most data users know only SQL and are not good at programming. Shark is a tool, developed for people who come from a database background, to access Scala MLlib capabilities through a Hive-like SQL interface. The Shark tool lets data users run Hive on Spark, offering compatibility with the Hive metastore, queries and data.

Question 2. List Some Use Cases Where Spark Outperforms Hadoop In Processing?

Answer :

Sensor Data Processing – Apache Spark's in-memory computing works well here, as data is retrieved and combined from different sources.
Spark is preferred over Hadoop for real-time querying of data.
Stream Processing – For processing logs and detecting fraud in live streams for alerts, Apache Spark is the best solution.
Question 3. What Is A Sparse Vector?

Answer :

A sparse vector has two parallel arrays – one for indices and the other for values. These vectors are used for storing only the non-zero entries, to save space.
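The two-parallel-array layout can be sketched in a few lines of Python. This toy `SparseVector` class is a hypothetical helper that mirrors the idea, not MLlib's actual implementation:

```python
class SparseVector:
    """Toy sparse vector: one array for the indices of the non-zero
    entries and a parallel array for their values."""

    def __init__(self, size, indices, values):
        assert len(indices) == len(values)
        self.size = size
        self.indices = list(indices)   # positions of non-zero entries
        self.values = list(values)     # the non-zero entries themselves

    def to_dense(self):
        """Expand back to a dense list, filling the gaps with zeros."""
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense

# Stores only 2 entries instead of 5.
v = SparseVector(5, [0, 3], [1.0, 7.0])
```

Here only two index/value pairs are kept in memory, however long the dense form would be.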

Question 4. What Is RDD?

Answer :

RDDs (Resilient Distributed Datasets) are a basic abstraction in Apache Spark that represent the data entering the system in object format. RDDs are used for in-memory computations on large clusters, in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are:

Immutable – RDDs cannot be altered.

Resilient – If a node holding a partition fails, another node takes over the data.

Question 5. Explain About Transformations And Actions In The Context Of RDDs?

Answer :

Transformations are functions executed on demand, to produce a new RDD. All transformations are followed by actions. Some examples of transformations include map, filter and reduceByKey.

Actions are the results of RDD computations or transformations. After an action is executed, the data from the RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.

Question 6. What Are The Languages Supported By Apache Spark For Developing Big Data Applications?

Answer :

Scala, Java, Python, R and Clojure

Question 7. Can You Use Spark To Access And Analyse Data Stored In Cassandra Databases?

Answer :

Yes, it is possible if you use the Spark Cassandra Connector.

Question 8. Is It Possible To Run Apache Spark On Apache Mesos?

Answer :

Yes, Apache Spark can be run on hardware clusters managed by Mesos.

Question 9. Explain About The Different Cluster Managers In Apache Spark?

Answer :

The three different cluster managers supported in Apache Spark are:

Apache Mesos – Has rich resource scheduling capabilities and is well suited to run Spark along with other applications. It is advantageous when several users run interactive shells, because it scales down the CPU allocation between commands.
Hadoop YARN – Responsible for resource management in Hadoop; Spark can also run on top of it.
Standalone deployments – Well suited for new deployments that only run Spark and are easy to set up.
Question 10. How Can Spark Be Connected To Apache Mesos?

Answer :

To join Spark with Mesos:

Configure the Spark driver program to connect to Mesos. The Spark binary package should be in a location accessible by Mesos. (or)
Install Apache Spark in the same location as Apache Mesos and configure the property 'spark.mesos.executor.home' to point to the location where it is installed.
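As a configuration sketch of the second option, in spark-defaults.conf (the host name and path below are assumed examples, not values from the answer):

```
spark.master                 mesos://mesos-master:5050
spark.mesos.executor.home    /opt/spark
```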
Question 11. How Can You Minimize Data Transfers When Working With Spark?

Answer :

Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner.

The various ways in which data transfers can be minimized when working with Apache Spark are:

Using broadcast variables – Broadcast variables enhance the efficiency of joins between small and large RDDs.
Using accumulators – Accumulators help update the values of variables in parallel while executing.
The most common way is to avoid ByKey operations, repartition, or any other operations that trigger shuffles.
Question 12. Why Is There A Need For Broadcast Variables When Working With Apache Spark?

Answer :

These are read-only variables, present in an in-memory cache on every machine. When working with Spark, using broadcast variables eliminates the need to ship a copy of a variable for every task, so data can be processed faster. Broadcast variables help store a lookup table in memory, which enhances retrieval efficiency compared to an RDD lookup().

Question 13. Is It Possible To Run Spark And Mesos Along With Hadoop?

Answer :

Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.

Question 14. What Is Lineage Graph?

Answer :

The RDDs in Spark depend on one or more other RDDs. The representation of these dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever part of a persistent RDD is lost, the lost data can be recovered using the lineage graph information.

Question 15. How Can You Trigger Automatic Clean-ups In Spark To Handle Accumulated Metadata?

Answer :

You can trigger the clean-ups by setting the parameter 'spark.cleaner.ttl', or by dividing long-running jobs into different batches and writing the intermediary results to disk.
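As a configuration sketch, in spark-defaults.conf (the value is an assumed example, in seconds; note that this setting dates from older Spark releases, so check your version's documentation):

```
spark.cleaner.ttl    3600
```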

Question 16. Explain About The Major Libraries That Constitute The Spark Ecosystem?

Answer :

Spark MLlib – Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.

Spark Streaming – This library is used to process real-time streaming data.

Spark GraphX – Spark API for graph-parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc.

Spark SQL – Helps execute SQL-like queries on Spark data using standard visualization or BI tools.

Question 17. What Are The Benefits Of Using Spark With Apache Mesos?

Answer :

It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.

Question 18. What Is The Significance Of Sliding Window Operation?

Answer :

A sliding window controls the transmission of data packets between various computer networks. The Spark Streaming library provides windowed computations in which transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
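The mechanics can be mimicked in plain Python. This hypothetical helper treats each micro-batch as a list of numbers, with the window length and slide interval measured in batches rather than seconds:

```python
def windowed_sums(batches, window_len, slide):
    """Sum every sliding window of micro-batches, the way a windowed
    DStream combines the RDDs that fall inside the current window."""
    sums = []
    for end in range(window_len, len(batches) + 1, slide):
        window = batches[end - window_len:end]   # batches inside the window
        sums.append(sum(sum(batch) for batch in window))
    return sums

# Four micro-batches, a window of 2 batches, sliding by 1 batch.
result = windowed_sums([[1], [2], [3], [4]], window_len=2, slide=1)
```

Each output element covers the current window, and consecutive windows overlap whenever the slide interval is shorter than the window length.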

Question 19. What Is A DStream?

Answer :

A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represent a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume.

DStreams have two operations:

Transformations that produce a new DStream.
Output operations that write data to an external system.
Question 20. When Running Spark Applications, Is It Necessary To Install Spark On All The Nodes Of A YARN Cluster?

Answer :

Spark need not be installed when running a job on YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.

Question 21. What Is Catalyst Framework?

Answer :

The Catalyst framework is an optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.

Question 22. Name A Few Companies That Use Apache Spark In Production?

Answer :

Pinterest, Conviva, Shopify, OpenTable

Question 23. Which Spark Library Allows Reliable File Sharing At Memory Speed Across Different Cluster Frameworks?

Answer :

Tachyon.
Question 24. Why Is BlinkDB Used?

Answer :

BlinkDB is a query engine for executing interactive SQL queries on large volumes of data; it renders query results marked with meaningful error bars. BlinkDB helps users balance query accuracy with response time.

Question 25. How Can You Compare Hadoop And Spark In Terms Of Ease Of Use?

Answer :

Hadoop MapReduce requires programming in Java, which is difficult, although Pig and Hive make it considerably easier. Learning Pig and Hive syntax still takes time. Spark has interactive APIs for different languages like Java, Python or Scala and also includes Shark, i.e. Spark SQL, for SQL fans, making it comparatively easier to use than Hadoop.

Question 26. What Are The Common Mistakes Developers Make When Running Spark Applications?

Answer :

Developers often make the mistake of:

Hitting the web service several times by using multiple clusters.
Running everything on the local node instead of distributing it.

Developers need to be careful with this, as Spark uses memory for processing.
Question 27. What Is The Advantage Of A Parquet File?

Answer :

A Parquet file is a columnar-format file that helps to:

Limit I/O operations
Consume less space
Fetch only the required columns.
Question 28. What Are The Various Data Sources Available In SparkSQL?

Answer :

Parquet files
JSON datasets
Hive tables
Question 29. What Are The Key Features Of Apache Spark That You Like?

Answer :

Spark provides advanced analytic options like graph algorithms, machine learning, streaming data, etc.
It has built-in APIs in multiple languages like Java, Scala, Python and R.
It has good performance gains, as it helps run an application in a Hadoop cluster ten times faster on disk and a hundred times faster in memory.
Question 30. What Do You Understand By Pair Rdd?

Answer :

Special operations can be performed on RDDs in Spark using key/value pairs; such RDDs are referred to as pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that collects data based on each key, and a join() method that combines different RDDs together based on the elements having the same key.

Question 31. Explain About The Different Types Of Transformations On DStreams?

Answer :

Stateless transformations – Processing of a batch does not depend on the output of the previous batch.

Examples: map(), reduceByKey(), filter().

Stateful transformations – Processing of a batch depends on the intermediary results of the previous batch.

Examples: transformations that depend on sliding windows.

Question 32. Explain About The Popular Use Cases Of Apache Spark?

Answer :

Apache Spark is mainly used for:

Iterative machine learning.
Interactive data analytics and processing.
Stream processing.
Sensor data processing.
Question 33. Is Apache Spark A Good Fit For Reinforcement Learning?

Answer :

No. Apache Spark works well only for simple machine learning algorithms like clustering, regression and classification.

Question 34. What Is Spark Core?

Answer :

It has all the basic functionalities of Spark, like memory management, fault recovery, interacting with storage systems, scheduling tasks, etc.

Question 35. How Can You Remove The Elements With A Key Present In Any Other Rdd?

Answer :

Use the subtractByKey() function.

Question 36. What Is The Difference Between Persist() And Cache()?

Answer :

persist() allows the user to specify the storage level, whereas cache() uses the default storage level.

Question 37. What Are The Various Levels Of Persistence In Apache Spark?

Answer :

Apache Spark automatically persists the intermediary data from various shuffle operations, but it is often advised that users call the persist() method on an RDD if they plan to reuse it. Spark has several persistence levels to store the RDDs on disk or in memory, or as a combination of both, with different replication levels.

The various storage/persistence levels in Spark are:

MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP

Question 38. How Spark Handles Monitoring And Logging In Standalone Mode?

Answer :

Spark has a web-based user interface for monitoring the cluster in standalone mode that shows the cluster and job statistics. The log output for each job is written to the work directory of the slave nodes.

Question 39. Does Apache Spark Provide Check Pointing?

Answer :

Lineage graphs are always useful to recover RDDs from a failure, but this is generally time consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a REPLICATE flag to persist. However, the decision on which data to checkpoint is made by the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.

Question 40. How Can You Launch Spark Jobs Inside Hadoop MapReduce?

Answer :

Using SIMR (Spark in MapReduce), users can run any Spark job inside MapReduce without requiring any admin rights.

Question 41. How Spark Uses Akka?

Answer :

Spark uses Akka mainly for scheduling. After registering, all the workers request a task from the master, and the master simply assigns the task. Here Spark uses Akka for messaging between the workers and the masters.

Question 42. How Can You Achieve High Availability In Apache Spark?

Answer :

Implementing single-node recovery with the local file system.
Using standby masters with Apache ZooKeeper.
Question 43. Hadoop Uses Replication To Achieve Fault Tolerance. How Is This Achieved In Apache Spark?

Answer :

The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage. An RDD always has the information on how to be rebuilt from other datasets. If any partition of an RDD is lost due to failure, lineage helps rebuild only that particular lost partition.

Question 44. Explain About The Core Components Of A Distributed Spark Application?

Answer :

Driver: The process that runs the main() method of the program to create RDDs and perform transformations and actions on them.

Executor: The worker processes that run the individual tasks of a Spark job.

Cluster Manager: A pluggable component in Spark used to launch executors and drivers. The cluster manager allows Spark to run on top of other external managers like Apache Mesos or YARN.

Question 45. What Do You Understand By Lazy Evaluation?

Answer :

Spark is intelligent in the way it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them so that it does not forget, but it does nothing until asked for the final result.

When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data-processing workflow.

Question 46. Define A Worker Node?

Answer :

A node that can run the Spark application code in a cluster is called a worker node. A worker node can have more than one worker, which is configured by setting the SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker is started if the SPARK_WORKER_INSTANCES property is not defined.
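As a configuration sketch (the instance count is an assumed example):

```
# spark-env.sh
SPARK_WORKER_INSTANCES=2
```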

Question 47. What Do You Understand By SchemaRDD?

Answer :

An RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.

Question 48. What Are The Disadvantages Of Using Apache Spark Over Hadoop Mapreduce?

Answer :

Apache Spark does not scale well for compute-intensive jobs and consumes a large amount of system resources. Apache Spark's in-memory capability at times becomes a major roadblock for cost-efficient processing of big data. Also, Spark does not have its own file management system and hence needs to be integrated with other cloud-based data platforms or Apache Hadoop.

Question 49. Is It Necessary To Install Spark On All The Nodes Of A YARN Cluster While Running Apache Spark On YARN?

Answer :

No, it is not necessary because Apache Spark runs on top of YARN.

Question 50. What Do You Understand By Executor Memory In A Spark Application?

Answer :

Every Spark application has the same fixed heap size and fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the –executor-memory flag.

Every Spark application will have one executor on each worker node. The executor memory is basically a measure of how much memory of the worker node the application will utilize.
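For illustration, the flag can be passed on submission (the application class and jar name are invented placeholders):

```
spark-submit --executor-memory 4g --class com.example.App app.jar
```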

Question 51. What Does The Spark Engine Do?

Answer :

The Spark engine schedules, distributes and monitors the data application across the Spark cluster.

Question 52. What Makes Apache Spark Good At Low-latency Workloads Like Graph Processing And Machine Learning?

Answer :

Apache Spark stores data in-memory for faster model building and training. Machine learning algorithms require multiple iterations to generate an optimal resulting model, and similarly graph algorithms traverse all the nodes and edges.

These low-latency workloads that need multiple iterations can thus see increased performance. Less disk access and controlled network traffic make a huge difference when there is a lot of data to be processed.