Top 50 Apache Spark Interview Questions
Q1. Is It Possible To Run Spark And Mesos Along With Hadoop?
Yes, it is possible to run Spark and Mesos along with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.
Q2. List Some Use Cases Where Spark Outperforms Hadoop In Processing?
Sensor Data Processing – Apache Spark's in-memory computing works best here, as data is retrieved and combined from different sources.
Real-Time Querying – Spark is preferred over Hadoop for real-time querying of data.
Stream Processing – For processing logs and detecting fraud in live streams for alerts, Apache Spark is the best solution.
Q3. What Do You Understand By Lazy Evaluation?
Spark is deliberate in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them so it does not forget, but it does nothing until the final result is requested.
When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
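A minimal sketch of lazy evaluation in Scala, assuming a SparkContext named sc is already available; the data is hypothetical:

```scala
val numbers = sc.parallelize(1 to 1000)      // nothing is computed yet
val squares = numbers.map(n => n * n)        // map() only records the transformation
val evens   = squares.filter(_ % 2 == 0)     // still nothing is computed

// Only when an action is called does Spark actually build and run the job.
val count = evens.count()
println(s"Even squares: $count")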
Q4. How Can You Remove The Elements With A Key Present In Any Other Rdd?
Use the subtractByKey() function.
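A small illustrative sketch, assuming a SparkContext named sc; the sample data is hypothetical:

```scala
val inventory    = sc.parallelize(Seq(("apple", 3), ("banana", 5), ("cherry", 7)))
val discontinued = sc.parallelize(Seq(("banana", 1)))

// Keeps only the pairs whose key does not appear in the other RDD.
val remaining = inventory.subtractByKey(discontinued)
remaining.collect().foreach(println)   // (apple,3), (cherry,7)
```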
Q5. What Do You Understand By Schemardd?
An RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.
Q6. What Are The Common Mistakes Developers Make When Running Spark Applications?
Developers often make the mistake of:
Hitting the web service several times by using multiple clusters.
Running everything on the local node instead of distributing it.
Developers need to be careful with this, as Spark uses memory for processing.
Q7. What Is The Difference Between Persist() And Cache()?
persist() allows the user to specify the storage level, whereas cache() uses the default storage level.
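A brief sketch of the difference, assuming a SparkContext named sc; the file paths are hypothetical:

```scala
import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs:///logs/errors.log")
val access = sc.textFile("hdfs:///logs/access.log")

errors.cache()                                  // same as persist(StorageLevel.MEMORY_ONLY)
access.persist(StorageLevel.MEMORY_AND_DISK)    // persist() lets the caller choose the storage level
```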
Q8. How Can You Minimize Data Transfers When Working With Spark?
Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner.
The various ways in which data transfers can be minimized when working with Apache Spark are:
Using broadcast variables – Broadcast variables enhance the efficiency of joins between small and large RDDs (see the sketch after this list).
Using accumulators – Accumulators help update the values of variables in parallel while executing.
The most common way is to avoid ByKey operations, repartition, or any other operations that trigger shuffles.
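An illustrative sketch of a broadcast variable and an accumulator, assuming a SparkContext named sc; the lookup table and data are hypothetical, and longAccumulator is the Spark 2.x API:

```scala
val countryNames = Map("US" -> "United States", "IN" -> "India")
val broadcastMap = sc.broadcast(countryNames)          // shipped once per executor
val unknownCodes = sc.longAccumulator("unknownCodes")  // shared counter, hypothetical name

val visits = sc.parallelize(Seq(("US", 120), ("IN", 300), ("FR", 45)))

// Each task reads the broadcast copy instead of shipping the map with every task.
val named = visits.map { case (code, count) =>
  (broadcastMap.value.getOrElse(code, "Unknown"), count)
}
named.collect().foreach(println)

// Accumulators update shared counters in parallel from the executors.
visits.foreach { case (code, _) =>
  if (!broadcastMap.value.contains(code)) unknownCodes.add(1)
}
println(s"Codes without a lookup entry: ${unknownCodes.value}")
```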
Q9. Explain About The Major Libraries That Constitute The Spark Ecosystem?
Spark MLlib – Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
Spark Streaming – This library is used to process real-time streaming data.
Spark GraphX – Spark API for graph-parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, and so on.
Spark SQL – Helps execute SQL-like queries on Spark data using standard visualization or BI tools.
Q10. Explain About The Different Types Of Transformations On DStreams?
Stateless Transformations: Processing of the batch does not depend on the output of the previous batch.
Examples: map(), reduceByKey(), filter().
Stateful Transformations: Processing of the batch depends on the intermediary results of the previous batch.
Examples: Transformations that depend on sliding windows.
Q11. What Is The Significance Of Sliding Window Operation?
Sliding Window controls the transmission of data packets between various computer networks. The Spark Streaming library provides windowed computations in which the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
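A minimal sketch of a windowed computation, assuming a DStream of (word, 1) pairs named pairs already exists (for example, built from a socket or Kafka source); the window and slide durations are hypothetical:

```scala
import org.apache.spark.streaming.Seconds

val windowedWordCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // combine counts that fall inside the window
  Seconds(30),                 // window length
  Seconds(10)                  // slide interval: recompute every 10 seconds
)
windowedWordCounts.print()
```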
Q12. What Is A Dstream?
A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represent a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume.
DStreams have two operations:
Transformations that produce a new DStream.
Output operations that write data to an external system.
Q13. Does Apache Spark Provide Checkpointing?
Lineage graphs are always useful to recover RDDs from a failure, but this is usually time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a REPLICATE flag to persist. However, the decision on which data to checkpoint is made by the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
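A minimal sketch of RDD checkpointing, assuming a SparkContext named sc; the HDFS path and data are hypothetical:

```scala
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val processed = sc.parallelize(1 to 100).map(_ * 2)
processed.checkpoint()   // marks the RDD to be saved under the checkpoint directory
processed.count()        // the checkpoint is actually written when an action runs
```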
Q14. What Is Spark Core?
It has all the basic functionalities of Spark, like memory management, fault recovery, interacting with storage systems, scheduling tasks, and many others.
Q15. What Are The Benefits Of Using Spark With Apache Mesos?
It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.
Q16. Is It Possible To Run Apache Spark On Apache Mesos?
Yes, Apache Spark can be run on the hardware clusters managed by Mesos.
Q17. Why Is There A Need For Broadcast Variables When Working With Apache Spark?
These are read-only variables, present in an in-memory cache on every machine. When working with Spark, the use of broadcast variables eliminates the need to ship copies of a variable for every task, so data can be processed faster. Broadcast variables help store a lookup table in memory, which enhances retrieval efficiency compared to an RDD lookup().
Q18. Can You Use Spark To Access And Analyse Data Stored In Cassandra Databases?
Yes, it is possible if you use the Spark Cassandra Connector.
Q19. What Does The Spark Engine Do?
The Spark engine schedules, distributes and monitors the data application across the Spark cluster.
Q20. Explain About Transformations And Actions In The Context Of RDDs?
Transformations are functions executed on demand to produce a new RDD. All transformations are followed by actions. Some examples of transformations include map, filter and reduceByKey.
Actions are the results of RDD computations or transformations. After an action is performed, the data from the RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.
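A short sketch contrasting transformations and actions, assuming a SparkContext named sc; the sample words are hypothetical:

```scala
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "mesos"))

// Transformations: lazily describe new RDDs.
val pairs  = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)
val sparks = words.filter(_ == "spark")

// Actions: trigger the computation and return results to the driver.
println(counts.collect().mkString(", "))
println(words.first())
println(sparks.take(2).mkString(", "))
```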
Q21. Name A Few Companies That Use Apache Spark In Production?
Pinterest, Conviva, Shopify, OpenTable
Q22. What Are The Various Levels Of Persistence In Apache Spark?
Apache Spark automatically persists the intermediary data from various shuffle operations, but it is often suggested that users call the persist() method on an RDD if they plan to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory, or as a combination of both, with different replication levels.
The various storage/persistence levels in Spark are:
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP
Q23. Is Apache Spark A Good Fit For Reinforcement Learning?
No. Apache Spark works well only for simple machine learning algorithms like clustering, regression, and classification.
Q24. Is It Necessary To Install Spark On All The Nodes Of A Yarn Cluster While Running Apache Spark On Yarn?
No, it is not necessary because Apache Spark runs on top of YARN.
Q25. How Can You Launch Spark Jobs Inside Hadoop Mapreduce?
Using SIMR (Spark in MapReduce), users can run any Spark job inside MapReduce without requiring any admin rights.
Q26. Which Spark Library Allows Reliable File Sharing At Memory Speed Across Different Cluster Frameworks?
Tachyon
Q27. How Does Spark Use Akka?
Spark uses Akka basically for scheduling. All the workers request a task from the master after registering, and the master simply assigns the task. Here Spark uses Akka for messaging between the workers and the master.
Q28. How Does Spark Handle Monitoring And Logging In Standalone Mode?
Spark has a web-based user interface for monitoring the cluster in standalone mode that shows the cluster and job statistics. The log output for each job is written to the work directory of the slave nodes.
Q29. What Is Rdd?
RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark that represent the data entering the system in object format. RDDs are used for in-memory computations on large clusters, in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are:
Immutable – RDDs cannot be altered.
Resilient – If a node holding a partition fails, another node takes over the data.
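A minimal sketch of creating RDDs, assuming a SparkContext named sc; the file path is hypothetical:

```scala
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))
val fromFile       = sc.textFile("hdfs:///data/input.txt")

// RDDs are immutable: transformations return new RDDs instead of modifying the original.
val doubled = fromCollection.map(_ * 2)
println(doubled.collect().mkString(", "))
```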
Q30. What Are The Various Data Sources Available In Sparksql?
Parquet files
JSON Datasets
Hive tables
Q31. What Do You Understand By Pair Rdd?
Special operations can be performed on RDDs in Spark using key/value pairs, and such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that collects data based on each key and a join() method that combines different RDDs together, based on the elements having the same key.
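An illustrative sketch of pair RDD operations, assuming a SparkContext named sc; the sample data is hypothetical:

```scala
val sales  = sc.parallelize(Seq(("laptop", 2), ("phone", 5), ("laptop", 1)))
val prices = sc.parallelize(Seq(("laptop", 900.0), ("phone", 400.0)))

val totals = sales.reduceByKey(_ + _)   // aggregates the values for each key
val joined = totals.join(prices)        // combines the two RDDs on matching keys
joined.collect().foreach(println)       // e.g. (laptop,(3,900.0)), (phone,(5,400.0))
```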
Q32. How Can You Achieve High Availability In Apache Spark?
Implementing single node recovery with the local file system.
Using standby Masters with Apache ZooKeeper.
Q33. What Makes Apache Spark Good At Low-latency Workloads Like Graph Processing And Machine Learning?
Apache Spark stores data in-memory for faster model building and training. Machine learning algorithms require multiple iterations to generate an optimal resulting model, and similarly graph algorithms traverse all the nodes and edges.
These low-latency workloads that need multiple iterations benefit from increased performance. Less disk access and controlled network traffic make a huge difference when there is a lot of data to be processed.
Q34. How Can Spark Be Connected To Apache Mesos?
To connect Spark with Mesos:
Configure the Spark driver program to connect to Mesos. The Spark binary package should be in a location accessible by Mesos. (or)
Install Apache Spark in the same location as Apache Mesos and configure the property 'spark.mesos.executor.home' to point to the location where it is installed.
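A sketch of pointing a Spark driver at a Mesos master, following the first approach; the host, port and binary location are hypothetical placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://mesos-master.example.com:5050")                 // Mesos master URL
  .setAppName("SparkOnMesos")
  .set("spark.executor.uri", "hdfs:///frameworks/spark/spark-x.y.z.tgz")  // binary reachable by Mesos

val sc = new SparkContext(conf)
```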
Q35. What Is Catalyst Framework?
The Catalyst framework is a new optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.
Q36. What Is Shark?
Most data users know only SQL and are not good at programming. Shark is a tool, developed for people who come from a database background, to access Scala MLlib capabilities through a Hive-like SQL interface. The Shark tool helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries and data.
Q37. Why Is Blinkdb Used?
BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data that renders query results marked with meaningful error bars. BlinkDB helps users balance query accuracy with response time.
Q38. What Are The Disadvantages Of Using Apache Spark Over Hadoop Mapreduce?
Apache Spark does not scale well for compute-intensive jobs and consumes a large amount of system resources. Apache Spark's in-memory capability at times becomes a major roadblock for cost-efficient processing of big data. Also, Spark does not have its own file management system and hence needs to be integrated with other cloud-based data platforms or Apache Hadoop.
Q39. How Can You Trigger Automatic Clean-ups In Spark To Handle Accumulated Metadata?
You can trigger the clean-ups by setting the parameter 'spark.cleaner.ttl' or by dividing long-running jobs into different batches and writing the intermediary results to disk.
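A small sketch of setting the clean-up TTL on the Spark configuration; the 3600-second value and application name are hypothetical:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("LongRunningJob")
  .set("spark.cleaner.ttl", "3600")   // metadata older than this many seconds is forgotten
```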
Q40. Explain About The Different Cluster Managers In Apache Spark?
The three different cluster managers supported in Apache Spark are:
YARN
Apache Mesos – Has rich resource scheduling capabilities and is well suited to run Spark along with other applications. It is advantageous when several users run interactive shells because it scales down the CPU allocation between commands.
Standalone deployments – Well suited for new deployments which only run Spark and are easy to set up.
Q41. Explain About The Core Components Of A Distributed Spark Application?
Driver: The process that runs the main() method of the program to create RDDs and perform transformations and actions on them.
Executor: The worker processes that run the individual tasks of a Spark job.
Cluster Manager: A pluggable component in Spark, used to launch Executors and Drivers. The cluster manager allows Spark to run on top of external managers like Apache Mesos or YARN.
Q42. Explain About The Popular Use Cases Of Apache Spark?
Apache Spark is mainly used for:
Iterative machine learning.
Interactive data analytics and processing.
Stream processing.
Sensor data processing.
Q43. What Are The Languages Supported By Apache Spark For Developing Big Data Applications?
Scala, Java, Python, R and Clojure
Q44. Hadoop Uses Replication To Achieve Fault Tolerance. How Is This Achieved In Apache Spark?
The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage. An RDD always carries the information on how it was built from other datasets. If any partition of an RDD is lost due to failure, lineage helps rebuild only that particular lost partition.
Q45. Define A Worker Node?
A node that can run the Spark application code in a cluster can be called a worker node. A worker node can have more than one worker, which is configured by setting the SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker is started if the SPARK_WORKER_INSTANCES property is not defined.
Q46. What Do You Understand By Executor Memory In A Spark Application?
Every Spark application has the same fixed heap size and fixed number of cores for a Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag.
Every Spark application will have one executor on each worker node. The executor memory is basically a measure of how much memory of the worker node the application will utilize.
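A sketch of configuring executor memory programmatically; the 4g value and application name are hypothetical. The equivalent command-line form would be spark-submit --executor-memory 4g:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MemoryTunedApp")
  .set("spark.executor.memory", "4g")   // heap size given to each executor
```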
Q47. What Is A Sparse Vector?
A sparse vector has two parallel arrays, one for indices and the other for values. These vectors are used for storing non-zero entries to save space.
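A minimal sketch using the MLlib linear algebra API: a sparse vector of size 6 with non-zero values only at indices 1 and 4 (the values are hypothetical):

```scala
import org.apache.spark.mllib.linalg.Vectors

val sv = Vectors.sparse(6, Array(1, 4), Array(3.0, 5.5))
println(sv)   // (6,[1,4],[3.0,5.5])
```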
Q48. When Running Spark Applications, Is It Necessary To Install Spark On All The Nodes Of Yarn Cluster?
Spark need not be installed when running a job under YARN or Mesos because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
Q49. How Can You Compare Hadoop And Spark In Terms Of Ease Of Use?
Hadoop MapReduce requires programming in Java, which is difficult, although Pig and Hive make it considerably easier. Learning Pig and Hive syntax takes time. Spark has interactive APIs for different languages like Java, Python or Scala and also includes Shark, i.e. Spark SQL for SQL enthusiasts, making it comparatively easier to use than Hadoop.
Q50. What Are The Key Features Of Apache Spark That You Like?
Spark provides advanced analytic options like graph algorithms, machine learning, streaming data, and so forth.
It has built-in APIs in multiple languages like Java, Scala, Python and R.
It delivers significant performance gains, as it helps run an application in the Hadoop cluster ten times faster on disk and 100 times faster in memory.
