
Interview Questions For Apache Spark and Scala



Q1. What are the various levels of persistence in Apache Spark?

Ans: Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often advised that users call the persist() method on an RDD if they plan to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory, or as a combination of both, with different replication levels.

Q2. What is Shark?

Ans: Most data users know only SQL and are not good at programming. Shark is a tool, developed for people who come from a database background, to access Scala MLlib capabilities through a Hive-like SQL interface. The Shark tool helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries and data.

Q3. List some use cases in which Spark outperforms Hadoop in processing.


Sensor Data Processing – Apache Spark's in-memory computing works best here, as data is retrieved and combined from different sources.

Real-Time Querying – Spark is preferred over Hadoop for real-time querying of data.

Stream Processing – For processing logs and detecting fraud in live streams for alerts, Apache Spark is the best solution.

Q4. What is a Sparse Vector?

Ans: A sparse vector has two parallel arrays – one for indices and the other for values. These vectors are used for storing only the non-zero entries, to save space.
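
A sketch of the parallel-array idea in plain Scala (no Spark required; MLlib's Vectors.sparse stores the same size/indices/values triple):

```scala
// Only the non-zero entries of a length-6 vector are stored:
// two parallel arrays, one for positions and one for values.
val size    = 6
val indices = Array(1, 4)        // positions of the non-zero entries
val values  = Array(2.0, 5.0)    // the non-zero values themselves

// Reconstruct the dense form, purely for illustration.
val dense = Array.fill(size)(0.0)
for ((i, v) <- indices zip values) dense(i) = v
```

Storing two short arrays instead of one length-6 array is the space saving the answer describes; the gain grows with the proportion of zeros.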

Q5. What is RDD?

Ans: RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark that represent the data entering the system in object format. RDDs are used for in-memory computations on large clusters, in a fault-tolerant manner. RDDs are read-only, partitioned collections of records, which are:

Immutable – RDDs cannot be altered.

Resilient – If a node holding a partition fails, another node takes over the data.

Q6. Explain transformations and actions in the context of RDDs.

Ans: Transformations are functions executed on demand, to produce a new RDD. All transformations are followed by actions. Some examples of transformations include map, filter and reduceByKey.

Actions are the results of RDD computations or transformations. After an action is performed, the data from the RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.
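
The lazy-transformation/eager-action split can be mimicked with a plain Scala lazy view, no cluster required (the view merely stands in for an RDD here):

```scala
// Transformations on a lazy view are recorded, not executed —
// just as map/filter on an RDD build a plan rather than compute it.
val data        = (1 to 5).view              // stand-in for an RDD
val transformed = data.map(_ * 2).filter(_ > 4)  // still lazy

// Forcing a result plays the role of a Spark action (reduce/collect):
// only now is the whole pipeline evaluated.
val result = transformed.sum
```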

Q7. What are the languages supported by Apache Spark for developing big data applications?

Ans: Scala, Java, Python, R and Clojure

Q8. Can you use Spark to access and analyse data stored in Cassandra databases?

Ans: Yes, it is possible if you use the Spark Cassandra Connector.

Q9. Is it possible to run Apache Spark on Apache Mesos?

Ans: Yes, Apache Spark can be run on hardware clusters managed by Mesos.

Q10. Explain the different cluster managers in Apache Spark

Ans: The three different cluster managers supported in Apache Spark are:

Apache Mesos: Has rich resource scheduling capabilities and is well suited to run Spark along with other applications. It is advantageous when several users run interactive shells, because it scales down the CPU allocation between commands.

Standalone deployments: Well suited for new deployments, which only run Spark and are simple to set up.

Hadoop YARN: Responsible for resource management in Hadoop, and lets Spark run alongside other Hadoop workloads on the same cluster.

Q11. How can Spark be connected to Apache Mesos?

Ans: To connect Spark with Mesos:

Configure the Spark driver program to connect to Mesos. The Spark binary package should be in a location accessible by Mesos. (or)

Install Apache Spark in the same location as Apache Mesos, and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed.

Q12. How can you minimize data transfers when working with Spark?

Ans: Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:

Using Broadcast Variables – Broadcast variables enhance the efficiency of joins between small and large RDDs.

Using Accumulators – Accumulators help update the values of variables in parallel while executing.

The most common way is to avoid ByKey operations, repartition, or any other operations that trigger shuffles.

Q13. Why is there a need for broadcast variables when working with Apache Spark?

Ans: These are read-only variables, present in an in-memory cache on every machine. When working with Spark, using broadcast variables eliminates the need to ship copies of a variable for every task, so data can be processed faster. Broadcast variables help in storing a lookup table in memory, which enhances retrieval efficiency compared to an RDD lookup().
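
A minimal sketch of the lookup-table idea, using a plain Scala Map in place of a real sc.broadcast variable (illustration only — no cluster involved; the names and data are made up):

```scala
// The small side is shared with every "task" instead of being shipped
// per record or shuffled — the role sc.broadcast(...) plays in Spark.
val countryNames = Map("US" -> "United States", "IN" -> "India")  // small lookup table
val events       = Seq(("US", 3), ("IN", 5), ("US", 2))           // large dataset

// Map-side join: each record is enriched locally from the shared table,
// so no shuffle of the large dataset is needed.
val joined = events.map { case (code, n) => (countryNames(code), n) }
```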

Q14. Is it possible to run Spark and Mesos along with Hadoop?

Ans: Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.


Q15. What is a lineage graph?

Ans: The RDDs in Spark depend on one or more other RDDs. The representation of the dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever part of a persistent RDD is lost, the lost data can be recovered using the lineage graph information.

Q16. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?

Ans: You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’, or by dividing the long-running jobs into different batches and writing the intermediate results to disk.

Q17. Explain the main libraries that constitute the Spark Ecosystem.


Spark MLlib – Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.

Spark Streaming – This library is used to process real-time streaming data.

Spark GraphX – Spark API for graph-parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc.

Spark SQL – Helps execute SQL-like queries on Spark data using standard visualization or BI tools.

Q18. What are the advantages of using Spark with Apache Mesos?

Ans: It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.

Q19. What is the importance of the Sliding Window operation?

Ans: A sliding window controls the transmission of data packets between various computer networks. The Spark Streaming library provides windowed computations in which the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
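
The windowing idea can be sketched with plain Scala collections, each element standing in for one micro-batch's record count (an illustration of the concept, not Spark Streaming's actual API):

```scala
// Each element stands in for the record count of one micro-batch.
val batchCounts = Seq(1, 2, 3, 4, 5)

// Window length 3, slide interval 1: the batches falling inside each
// window position are combined (here, summed) to produce one result
// per slide — the shape of window(windowLength, slideInterval).
val windowedSums = batchCounts.sliding(3, 1).map(_.sum).toList
```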

Q20. What is a DStream?

Ans: A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represents a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams have two operations:

Transformations that produce a new DStream.

Output operations that write data to an external system.

Q21. When running Spark applications, is it necessary to install Spark on all the nodes of the YARN cluster?

Ans: Spark need not be installed when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.

Q22. What is the Catalyst framework?

Ans: Catalyst is an optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations, to build a faster processing system.

Q23. Name a few companies that use Apache Spark in production.

Ans: Pinterest, Conviva, Shopify, Open Table.

Q24. Which Spark library allows reliable file sharing at memory speed across different cluster frameworks?

Ans: Tachyon.

Q25. Why is BlinkDB used?

Ans: BlinkDB is a query engine for executing interactive SQL queries on large volumes of data; it renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time.

Q26. How can you compare Hadoop and Spark in terms of ease of use?

Ans: Hadoop MapReduce requires programming in Java, which is difficult, though Pig and Hive make it considerably easier. Learning Pig and Hive syntax takes time. Spark has interactive APIs for different languages like Java, Python or Scala, and also includes Shark, i.e. Spark SQL for SQL enthusiasts, making it comparatively easier to use than Hadoop.

Q27. What are the common mistakes developers make when running Spark applications?

Ans: Developers often make the mistake of:

Hitting the web service several times by using multiple clusters.

Running everything on the local node instead of distributing it.

Developers need to be careful with this, as Spark makes use of memory for processing.

Q28. What is the advantage of a Parquet file?

Ans: A Parquet file is a columnar-format file that helps to:

Limit I/O operations

Consume less space

Fetch only the required columns.

Q29. What are the various data sources available in Spark SQL?


Parquet files

JSON datasets

Hive tables

Q30. How does Spark use Hadoop?

Ans: Spark has its own cluster management for computation, and it mainly uses Hadoop for storage.

Q31. What are the key features of Apache Spark that you like?


Spark provides advanced analytic options like graph algorithms, machine learning, streaming data, etc.

It has built-in APIs in multiple languages like Java, Scala, Python and R.

It has good performance gains, as it helps run an application in the Hadoop cluster ten times faster on disk and 100 times faster in memory.

Q32. What do you understand by Pair RDD?

Ans: Special operations can be performed on RDDs in Spark using key/value pairs, and such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that collects data based on each key, and a join() method that combines different RDDs together, based on the elements having the same key.
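
The reduceByKey() and join() behaviours can be sketched on plain Scala pairs (no cluster; the keys and values are illustrative):

```scala
val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

// reduceByKey: group the values by key, then reduce each group (sum here).
val reduced = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// join: combine two pair collections on matching keys.
val other = Seq(("a", "x"), ("b", "y"))
val joined = for {
  (k1, v1) <- pairs
  (k2, v2) <- other
  if k1 == k2
} yield (k1, (v1, v2))
```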

Q33. Which one would you choose for a project – Hadoop MapReduce or Apache Spark?

Ans: The answer to this question depends on the given project scenario, since it is known that Spark makes use of memory instead of network and disk I/O. However, Spark uses a large amount of RAM and requires dedicated machines to produce effective results. So the decision to use Hadoop or Spark varies dynamically with the requirements of the project and the budget of the organization.

Q34. Explain the different types of transformations on DStreams.


Stateless Transformations – Processing of the batch does not depend on the output of the previous batch. Examples – map(), reduceByKey(), filter().

Stateful Transformations – Processing of the batch depends on the intermediary results of the previous batch. Examples – transformations that depend on sliding windows.

Q35. Explain the popular use cases of Apache Spark

Ans: Apache Spark is mainly used for:

Iterative machine learning.

Interactive data analytics and processing.

Stream processing.

Sensor data processing.

Q36. Is Apache Spark a good fit for Reinforcement Learning?

Ans: No. Apache Spark works well only for simple machine learning algorithms like clustering, regression and classification.

Q37. What is Spark Core?

Ans: It has all the basic functionalities of Spark, such as memory management, fault recovery, interacting with storage systems, scheduling tasks, etc.

Q38. How can you remove the elements with a key present in another RDD?

Ans: Use the subtractByKey() function.

Q39. What is the difference between persist() and cache()?

Ans: persist() allows the user to specify the storage level, whereas cache() uses the default storage level.

Q40. Explain what Scala is.

Ans: Scala is an object-functional programming and scripting language for general software applications, designed to express solutions in a concise way.

Q41. What is a ‘Scala set’? What are the methods through which set operations are expressed?

Ans: A Scala set is a collection of pairwise distinct elements of the same type. A Scala set does not contain any duplicate elements. There are two kinds of sets: mutable and immutable.
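
A quick illustration of these properties with the immutable Set:

```scala
// Duplicates are dropped on construction; elements are pairwise distinct.
val s = Set(1, 2, 2, 3)          // the duplicate 2 is kept only once
val t = Set(3, 4)

// Set operations are expressed as methods on the collection.
val unionSet     = s union t
val intersection = s intersect t
```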

Q42. What is a ‘Scala map’?

Ans: A Scala map is a collection of key/value pairs. Any value can be retrieved based on its key. Keys are unique in a Map, but values need not be unique.
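
A quick illustration (the names and values are made up):

```scala
// Keys are unique; values may repeat (30 appears twice here).
val ages = Map("alice" -> 30, "bob" -> 25, "carol" -> 30)

val bobsAge  = ages("bob")       // retrieve a value by its key
val maybeDan = ages.get("dan")   // safe lookup: returns an Option
```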

Q43. What are the advantages of Scala?


Less error-prone functional style

High maintainability and productivity

High scalability

High testability

Provides features for concurrent programming

Q44. In what ways is Scala better than other programming languages?


Its arrays use regular generics, while in other languages generics are bolted on as an afterthought and are completely separate from, yet have overlapping behaviour with, arrays.

Scala has immutable ‘val’ as a first-class language feature. The ‘val’ of Scala is similar to Java’s final variables. The contents may mutate, but the top-level reference is immutable.

Scala lets ‘if’ blocks, ‘for-yield’ loops, and code in braces return a value. This is more expressive, and removes the need for a separate ternary operator.

Scala has singleton objects rather than the traditional C++/Java/C# statics. It is a cleaner solution.

Persistent immutable collections are the default and are built into the standard library.

It has native tuples and concise code.

It has no boilerplate code.

Q45. What are the Scala variables?

Ans: Values and variables are the two forms that exist in Scala. A value is constant and cannot be changed once assigned; it is immutable. A regular variable, on the other hand, is mutable, and you can change its value.

The two kinds of variables are:

var myVar : Int = 0;

val myVal : Int = 1;

Q46. Mention the difference between an object and a class.

Ans: A class is a definition of a description. It defines a type in terms of methods and composition of other types. A class is a blueprint of an object. An object, on the other hand, is a singleton, an instance of a class which is unique. An anonymous class is created for every object in the code; it inherits from whatever classes you declared the object to implement.
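
A minimal sketch of the distinction (Counter and Registry are illustrative names):

```scala
// A class is a blueprint: it can be instantiated any number of times.
class Counter {
  private var n = 0
  def inc(): Int = { n += 1; n }
}

// An object is a singleton: exactly one instance, created for you.
object Registry {
  val name = "global"
}

val a = new Counter()
val b = new Counter()   // a and b are independent instances of the class
```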

Q47. What is tail recursion in Scala?

Ans: ‘Recursion’ is when a function calls itself; for example, a function ‘A’ calls function ‘B’, which calls function ‘C’. It is a technique used frequently in functional programming. For a recursion to be a tail recursion, the call back to the function must be the last operation the function performs.
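
A minimal sketch, with the compiler verifying the tail position via the @tailrec annotation:

```scala
import scala.annotation.tailrec

// The recursive call is the last operation performed, so the compiler
// can rewrite this as a loop; @tailrec fails compilation otherwise.
@tailrec
def factorial(n: Int, acc: BigInt = 1): BigInt =
  if (n <= 1) acc
  else factorial(n - 1, acc * n)   // tail position: nothing happens after the call
```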

Q48. What is a ‘Scala trait’?

Ans: ‘Traits’ are used to define object types specified by the signatures of the supported methods. Scala allows traits to be partially implemented, but traits may not have constructor parameters. A trait consists of method and field definitions; by mixing them into classes, they can be reused.
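
A minimal sketch of a partially implemented trait mixed into a class (Greeter and Person are illustrative names):

```scala
trait Greeter {
  def name: String                       // abstract: the mixer supplies it
  def greet: String = s"Hello, $name"    // concrete: reused by every mixer
}

// Mixing the trait in reuses greet; note the trait itself took no
// constructor parameters — the class provides the data.
class Person(val name: String) extends Greeter

val p = new Person("Ada")
```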

Q49. When should you use traits?

Ans: There is no specific rule for when you should use traits, but there are guidelines that you can consider.

If the behaviour will not be reused, then make it a concrete class. Anyhow, it is not a reusable behaviour.

In order to inherit from it in Java code, an abstract class can be used.

If efficiency is a concern, then lean towards using a class.

Make it a trait if it might be reused in multiple, unrelated classes. Only traits can be mixed into different parts of the class hierarchy.

You can use an abstract class if you need to distribute it in compiled form and expect outside groups to write classes inheriting from it.

Q50. What are Case Classes?

Ans: Case classes provide a recursive decomposition mechanism via pattern matching; they are regular classes which export their constructor parameters. The constructor parameters of case classes can be accessed directly and are treated as public values.
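
A minimal sketch (Point is an illustrative name):

```scala
// Constructor parameters become public fields, and the class can be
// decomposed through pattern matching.
case class Point(x: Int, y: Int)

def describe(p: Point): String = p match {
  case Point(0, 0) => "origin"
  case Point(x, 0) => s"on x-axis at $x"
  case Point(x, y) => s"($x, $y)"
}

val p = Point(3, 0)
val px = p.x            // direct access to a constructor parameter
```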

Q51. What is the use of tuples in Scala?

Ans: Scala tuples combine a fixed number of items together so that they can be passed around as a whole. A tuple is immutable and can hold objects of different types, unlike an array or a list.
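
A quick illustration:

```scala
// A tuple holds a fixed number of values of mixed types.
val record = ("spark", 3, true)   // (String, Int, Boolean)

val first = record._1             // access by position
val (word, count, flag) = record  // or destructure in one step
```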

Q52. What is function currying in Scala?

Ans: Currying is the technique of transforming a function that takes multiple arguments into a function that takes a single argument. Many of the same techniques found in languages like Haskell and LISP are supported by Scala. Function currying is one of the least used and most misunderstood of them.
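
A minimal sketch (add and addFive are illustrative names):

```scala
// A curried method takes its arguments one parameter list at a time.
def add(x: Int)(y: Int): Int = x + y

// Supplying only the first argument yields a new one-argument function.
val addFive: Int => Int = add(5)
val result = addFive(3)
```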

Q53. What are implicit parameters in Scala?

Ans: An implicit parameter is a way to allow the parameters of a method to be “found”. It is similar to default parameters, but it has a different mechanism for finding the “default” value. An implicit parameter is a parameter to a method or constructor that is marked as implicit. This means that if a parameter value is not supplied, the compiler will search for an “implicit” value defined within scope.
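
A minimal sketch (Config and label are illustrative names):

```scala
case class Config(prefix: String)

// The second parameter list is implicit: if no argument is passed,
// the compiler searches the scope for an implicit Config.
def label(id: Int)(implicit cfg: Config): String = s"${cfg.prefix}-$id"

implicit val defaultConfig: Config = Config("item")

val a = label(7)                    // compiler supplies defaultConfig
val b = label(7)(Config("custom"))  // an explicit argument still wins
```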

Q54. What is a closure in Scala?

Ans: A closure is a function whose return value depends on the value of variables declared outside the function.
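
A quick illustration:

```scala
// multiplier "closes over" factor, a variable declared outside it.
var factor = 3
val multiplier = (x: Int) => x * factor

val before = multiplier(10)   // uses factor = 3
factor = 5
val after = multiplier(10)    // the closure sees the updated variable
```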

Q55. What is a Monad in Scala?

Ans: A monad is an object that wraps another object. You pass the monad mini-programs, i.e. functions, to perform the data manipulation on the underlying object, instead of manipulating the object directly. The monad chooses how to apply the program to the underlying object.
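
The standard library's Option illustrates the idea (a sketch assuming Scala 2.13+ for toIntOption):

```scala
// Option wraps a value that may be absent. We pass it mini-programs via
// map/flatMap; the wrapper decides whether and how to apply them.
def parse(s: String): Option[Int] = s.toIntOption

val ok  = parse("21").map(_ * 2)    // applied to the wrapped value
val bad = parse("oops").map(_ * 2)  // skipped: there is nothing to apply it to
val sum = for { a <- parse("1"); b <- parse("2") } yield a + b
```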

Q56. What is a Scala anonymous function?

Ans: In source code, anonymous functions are called ‘function literals’, and at run time function literals are instantiated into objects called function values. Scala provides a relatively easy syntax for defining anonymous functions.

Q57. Explain Scala ‘higher-order’ functions.

Ans: Scala allows the definition of higher-order functions. These are functions that take other functions as parameters, or whose result is a function. In the following example, the apply() function takes another function ‘f’ and a value ‘v’ and applies f to v.
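
The snippet the answer refers to is missing from the text; a minimal reconstruction of the classic example:

```scala
// apply is higher-order: it takes a function f as a parameter
// and applies it to the value v.
def apply(f: Int => String, v: Int): String = f(v)

def layout(x: Int): String = s"[$x]"

val result = apply(layout, 10)
```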