Apache Spark Interview Questions and Answers
Q1. What is Apache Spark?
Apache Spark is a fast cluster computing framework. It builds on the ideas of the Hadoop MapReduce model and extends that model to efficiently support more types of computation, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. By supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Q2. What is SparkContext?
SparkContext is the entry point to Spark. Using SparkContext you create RDDs, which provide various ways of processing data.
Q3. Why is Spark faster than MapReduce?
There are a few important reasons why Spark is faster than MapReduce. Some of them are:
- There is no tight coupling in Spark, i.e., there is no mandatory rule that a reduce must come after a map.
- Spark tries to keep the data in memory as much as possible.
In MapReduce, intermediate data is written to disk (and to HDFS between chained jobs), so each stage takes longer to read its input back. Spark avoids much of this disk I/O.
Q4. Explain the Apache Spark Architecture.
- An Apache Spark application consists of two kinds of programs: a driver program and worker programs.
- A cluster manager sits between the driver and the worker nodes. SparkContext keeps in touch with the worker nodes with the help of the cluster manager.
- SparkContext acts as the master, and the Spark workers carry out the work it assigns.
- Workers contain the executors that run the job. If any dependencies or arguments have to be passed, SparkContext takes care of that. RDDs reside on the Spark executors.
- You can also run Spark applications locally using threads. If you want to take advantage of distributed environments, you can use S3, HDFS or any other storage system.
Q5. What are the key features of Spark?
- Allows integration with Hadoop and files stored in HDFS.
- Spark has an interactive language shell, as it has an independent interpreter for Scala (the language in which Spark is written).
- Spark consists of RDDs (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.
- Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis and graph processing.
Q6. What is Shark?
Most data users know only SQL and are not comfortable with programming. Shark is a tool developed for people from a database background, giving them access to Scala MLlib capabilities through a Hive-like SQL interface. Shark helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries and data.
Q7. On which platforms can Apache Spark run?
Spark can run on the following platforms:
- YARN (Hadoop): Since YARN can handle any kind of workload, Spark can run on it. There are two modes of execution: cluster mode, in which the Spark driver runs inside a container on a cluster node, and client mode, in which the Spark driver runs on the client machine. This is the most common way of using Spark.
- Apache Mesos: Mesos is an open-source resource manager. Spark can run on Mesos.
- EC2: If you do not want to manage the hardware yourself, you can run Spark on top of Amazon EC2. This makes Spark suitable for various organizations.
- Standalone: If you have no resource manager installed in your organization, you can use standalone mode, in which Spark provides its own resource manager. All you have to do is install Spark on all nodes in a cluster, inform each node about the other nodes and start the cluster. The nodes then start communicating with each other and run.
Q8. What are the various programming languages supported by Spark?
Though Spark is written in Scala, it lets the users code in various languages such as:
- Java
- Python (using PySpark)
- R (using SparkR)
- SQL (using Spark SQL)
Also, by piping data through external commands, you can use other programming languages or binaries.
Q9. Compare Spark vs Hadoop MapReduce
|Criteria|Hadoop MapReduce|Apache Spark|
|---|---|---|
|Memory|Does not make full use of the memory of the Hadoop cluster.|Keeps data in memory through RDDs.|
|Disk usage|Is disk oriented.|Caches data in memory, which keeps latency low.|
|Processing|Supports only batch processing.|Supports real-time processing through Spark Streaming.|
|Installation|Is bound to Hadoop.|Is not bound to Hadoop.|
Q10. What are actions and transformations?
Transformations create new RDDs from existing RDDs. They are lazy and are not executed until you call an action.
E.g., map(), filter(), flatMap().
Actions trigger the computation and return the results of an RDD.
E.g., reduce(), count(), collect().
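The lazy behaviour described above can be sketched in plain Python. This is a toy model, not Spark's API: the `LazyDataset` class and the `square` helper are hypothetical names made up for the illustration. Transformations only record what should be done, and the work happens when an action such as `collect()` or `count()` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, ops=()):
        self._source = source    # underlying iterable of records
        self._ops = tuple(ops)   # pending transformations, in order

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    # Internal helper: apply the recorded pipeline to every record.
    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: execute the pipeline and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every time the map function actually runs


def square(x):
    calls.append(x)
    return x * x


odd_squares = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
evaluated_before_action = len(calls)  # still 0: nothing has run yet
result = odd_squares.collect()        # the action triggers the computation
```

In this sketch `square` is not invoked until `collect()` runs, which mirrors how Spark defers transformations until an action needs a result.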
Q11. What are the various storages from which Spark can read data?
Spark has been designed to process data from various sources. It can process data stored in HDFS, Cassandra, Amazon S3, Hive, HBase, and Alluxio (previously Tachyon). It can also read data from any system that supports a Hadoop data source.
Q12. List some use cases where Spark outperforms Hadoop in processing.
- Sensor data processing: Apache Spark's in-memory computing works best here, as data is retrieved and combined from different sources.
- Real-time querying: Spark is preferred over Hadoop for real-time querying of data.
- Stream processing: For processing logs and detecting fraud in live streams for alerts, Apache Spark is well suited.
Q13. What is Spark Driver?
The Spark driver is the program that runs on the master node and declares transformations and actions on RDDs. In simple terms, the driver creates the SparkContext, connected to a given Spark master. The driver also delivers the RDD graphs to the master, where the standalone cluster manager runs.
Q14. What are Accumulators?
Accumulators are write-only variables that are initialized once on the driver and sent to the workers. The workers update them according to the logic you write, and the updates are sent back to the driver, which aggregates them.
Only the driver can read an accumulator's value. For tasks, accumulators are write-only. For example, an accumulator can be used to count the number of errors seen in an RDD across workers.
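The error-counting example can be modelled in plain Python. This is only a sketch of the idea and does not use Spark: `run_task` and the sample log lines are made up for the illustration. Each task only adds to its own local counter, and the driver side merges the partial counts and is the only place where the total is read.

```python
def run_task(partition):
    """Worker side: only adds to a local counter, never reads the global total."""
    local_errors = 0
    for line in partition:
        if "ERROR" in line:
            local_errors += 1
    return local_errors


# Hypothetical log lines, split into three partitions.
partitions = [
    ["INFO start", "ERROR disk full"],
    ["ERROR timeout", "ERROR retry failed", "INFO done"],
    ["INFO idle"],
]

# Driver side: merge the per-task updates into a single value.
error_count = 0
for partition in partitions:
    error_count += run_task(partition)
```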
Q15. What is Hive on Spark?
Hive contains significant support for Apache Spark, wherein Hive execution is configured to Spark:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.
Q16. What are Broadcast Variables?
Broadcast variables are read-only shared variables. Suppose there is a set of data that has to be used multiple times by the workers at different phases. The driver can share that data with the workers once, and every machine can read it.
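As a rough illustration of the idea (plain Python, not Spark's broadcast API), the sketch below shares one read-only lookup table with every task. The names `country_names` and `run_task` and the sample codes are assumptions made up for the example, and `MappingProxyType` is used here only to make the shared table read-only.

```python
from types import MappingProxyType

# Driver side: build the lookup table once and expose it read-only.
country_names = MappingProxyType(
    {"IN": "India", "US": "United States", "DE": "Germany"}
)


def run_task(partition, lookup):
    """Worker side: reads the shared table but cannot modify it."""
    return [lookup.get(code, "unknown") for code in partition]


# Hypothetical partitions of country codes processed by separate tasks.
partitions = [["IN", "US"], ["DE", "FR"]]
results = [run_task(partition, country_names) for partition in partitions]

# Any attempt to write to the shared table fails.
try:
    country_names["FR"] = "France"
    write_allowed = True
except TypeError:
    write_allowed = False
```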
Q17. What optimizations can a developer make while working with Spark?
1. Spark is memory intensive: whatever it does, it does in memory.
2. You can adjust how long Spark waits before it times out at each level of data locality (process local -> node local -> rack local -> any), for example through the spark.locality.wait setting.
3. Filter out data as early as possible. For caching, choose wisely from the various storage levels.
4. Tune the number of partitions in Spark.
Q18. What is Spark SQL?
Spark SQL is a module for structured data processing that lets you run SQL queries on datasets.
Q19. What is Spark Streaming?
Whenever data is flowing continuously and you want to process it as early as possible, you can take advantage of Spark Streaming. It is the API for stream processing of live data. Data can flow in from Kafka, Flume, TCP sockets, Kinesis and so on, and you can do complex processing on the data before pushing it to its destination. Destinations can be file systems, databases or dashboards.
Q20. What is Sliding Window?
In Spark Streaming, you have to specify the batch interval. For example, if your batch interval is 10 seconds, Spark processes whatever data it received in the last 10 seconds, i.e., in the last batch interval.
With a sliding window, you can specify how many of the most recent batches have to be processed together. You can also specify how often the window is evaluated. For example, you may want to process the last 3 batches every time 2 new batches have arrived. In other words, you choose when the window slides and how many batches are processed in that window.
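The "last 3 batches every 2 new batches" example can be sketched in plain Python. This is a toy model and not the Spark Streaming API: the function `sliding_windows`, its parameter names and the sample batches are assumptions chosen for the illustration.

```python
def sliding_windows(batches, window_length, slide_interval):
    """Yield (batches_seen, window) each time `slide_interval` new batches have arrived.

    `window` holds at most the last `window_length` batches.
    """
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        yield end, batches[start:end]


# Six hypothetical batches, one per batch interval.
batches = [[1, 2], [3], [4, 5], [6], [7, 8], [9]]

# Window of 3 batches, evaluated after every 2 new batches.
window_sums = [
    (end, sum(sum(batch) for batch in window))
    for end, window in sliding_windows(batches, window_length=3, slide_interval=2)
]
```

The first window contains only two batches because only two have arrived at that point. Later windows overlap with the previous one by a single batch, since the window is longer than the slide interval.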
Q21. What does MLlib do?
MLlib is the scalable machine learning library provided by Spark. It aims to make machine learning easy and scalable, with common learning algorithms and use cases such as clustering, regression, collaborative filtering, dimensionality reduction, and the like.
Q22. List the functions of Spark SQL.
Spark SQL is capable of:
- Loading data from a variety of structured sources.
- Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau.
- Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.
Q23. How can Spark be connected to Apache Mesos?
To connect Spark with Mesos:
- Configure the Spark driver program to connect to Mesos. The Spark binary package should be in a location accessible by Mesos. (or)
- Install Apache Spark in the same location as Apache Mesos and configure the property 'spark.mesos.executor.home' to point to the location where it is installed.
Q24. Is it possible to run Spark and Mesos along with Hadoop?
Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.
Q25. When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?
Spark need not be installed on every node when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
Q26. What is Catalyst framework?
Catalyst is the optimization framework in Spark SQL. It allows Spark to automatically transform SQL queries, and lets new optimizations be added, to build a faster processing system.
Q27. Name a few companies that use Apache Spark in production.
Pinterest, Conviva, Shopify, OpenTable
Q28. Why is BlinkDB used?
BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time.
Q29. What are the common mistakes developers make when running Spark applications?
Developers often make the mistake of:
- Hitting the web service several times by using multiple clusters.
- Running everything on the local node instead of distributing it.
Developers need to be careful with this, as Spark makes use of memory for processing.
Q30. What is the advantage of a Parquet file?
A Parquet file is a columnar format file that helps to:
- Limit I/O operations.
- Consume less space.
- Fetch only the required columns.
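Why a columnar layout lets a reader fetch only the required columns can be shown with a small plain-Python model. This does not read or write the real Parquet format; the sample records and the names `rows`, `columns` and `values_read` are invented for the illustration.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "alice", "score": 10},
    {"id": 2, "name": "bob", "score": 20},
    {"id": 3, "name": "carol", "score": 30},
]

# Column-oriented layout: each column's values are stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query that needs only "score" touches one column instead of every field.
total_score = sum(columns["score"])

# Number of values read in each layout for that query.
values_read = {
    "row_layout": sum(len(row) for row in rows),  # every field of every row
    "column_layout": len(columns["score"]),       # only the requested column
}
```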
January 29, 2019