Top MapReduce Interview Questions and Answers
by sonia, on May 24, 2017 3:35:41 PM
Q1a. What is MapReduce?
Ans: MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
What is MapReduce?
Often described as the core of Hadoop, MapReduce is a programming framework for processing large data sets (big data) across thousands of servers in a Hadoop cluster. The concept is similar to other cluster scale-out data processing systems. The term MapReduce refers to the two distinct tasks that a Hadoop program performs.
First is the map() job, which converts a set of data into another set by breaking individual elements down into key/value pairs (tuples). Then the reduce() job comes into play: it takes the output of the map (the tuples) as input and combines them into a smaller set of tuples. As the name suggests, the map job always runs before the reduce job.
Q1b. What is Hadoop MapReduce?
Ans: The Hadoop MapReduce framework is used to process large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step map and reduce process.
Q2. How does Hadoop MapReduce work?
Ans: Take word counting as an example: during the map phase MapReduce counts the words in each document, while in the reduce phase it aggregates those counts across the entire collection of documents. During the map phase the input data is divided into splits, which are analyzed by map tasks running in parallel across the Hadoop framework.
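The word-count flow described above can be imitated in a few lines of plain Python. This is only a single-process sketch of the idea, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and each input string stands in for one input split.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group values by key and sort by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def word_count(documents):
    intermediate = []
    for doc in documents:  # each document stands in for one input split
        intermediate.extend(map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate))

counts = word_count(["the quick brown fox", "the lazy dog", "the fox"])
```

In a real cluster the map calls would run in parallel on different nodes; here they run one after another in a loop.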
Q3. Explain what shuffling is in MapReduce.
Ans: The shuffle is the process by which the system sorts the map outputs and transfers them to the reducers as inputs.
Q4. Explain what the Distributed Cache is in the MapReduce framework.
Ans: Distributed Cache is an important feature provided by the MapReduce framework. DistributedCache is used when you want to share files across all nodes in a Hadoop cluster. The files could be executable JAR files or simple properties files.
Q5. Explain what NameNode is in Hadoop.
Ans: NameNode in Hadoop is the node where Hadoop stores all the file location information in HDFS (Hadoop Distributed File System). In other words, NameNode is the centerpiece of an HDFS file system. It keeps a record of all the files in the file system and tracks the file data across the cluster or multiple machines.
Q6. Explain what JobTracker is in Hadoop. What actions does Hadoop follow?
Ans: In Hadoop, JobTracker is used for submitting and tracking MapReduce jobs. JobTracker runs in its own JVM process.
Hadoop performs the following actions:
- The client application submits jobs to the JobTracker
- JobTracker communicates with the NameNode to determine the data location
- JobTracker locates TaskTracker nodes near the data or with available slots
- It submits the work to the chosen TaskTracker nodes
- When a task fails, the JobTracker is notified and decides what to do next
- The TaskTracker nodes are monitored by JobTracker
Q7. Explain what a heartbeat is in HDFS.
Ans: A heartbeat is a signal sent from a DataNode to the NameNode, and from a TaskTracker to the JobTracker. If the NameNode or JobTracker stops receiving the signal, it assumes there is a problem with that DataNode or TaskTracker.
Q8. Explain what combiners are and when you should use a combiner in a MapReduce job.
Ans: Combiners are used to increase the efficiency of a MapReduce program. A combiner reduces the amount of data that needs to be transferred across to the reducers. If the operation performed is commutative and associative, you can use your reducer code as a combiner. The execution of the combiner is not guaranteed in Hadoop.
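To see why a combiner helps, the following plain-Python sketch compares how many records would cross to the reducers with and without local pre-aggregation. It is an illustration only: `map_words`, `combine` and `reduce_sum` are hypothetical helper names, and summation is chosen because it is commutative and associative.

```python
from collections import Counter

def map_words(split):
    """Emit (word, 1) for every word in one split."""
    return [(word, 1) for word in split.split()]

def combine(pairs):
    """Pre-aggregate one mapper's output locally, using the same sum as the reducer."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def reduce_sum(pairs):
    """Final aggregation over everything that reached the reducers."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

splits = ["a b a a", "b b a"]
without_combiner = [pair for s in splits for pair in map_words(s)]
with_combiner = [pair for s in splits for pair in combine(map_words(s))]
```

Both paths give the same final counts, but fewer records are transferred when the combiner runs. Because the result is identical either way, the job stays correct even if the framework skips the combiner.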
Q9. What happens when a DataNode fails?
Ans: When a DataNode fails
- JobTracker and NameNode detect the failure
- All tasks on the failed node are re-scheduled
- NameNode replicates the user's data to another node
Q10. Explain what Speculative Execution is.
Ans: During Speculative Execution in Hadoop, a certain number of duplicate tasks are launched. Multiple copies of the same map or reduce task can be executed on different slave nodes. In simple words, if a particular node is taking a long time to complete a task, Hadoop will create a duplicate task on another node. The copy that finishes first is retained and the copies that do not finish first are killed.
Q11. Explain what the basic parameters of a Mapper are.
Ans: The basic parameters of a Mapper are
- LongWritable and Text (input key and value)
- Text and IntWritable (output key and value)
Q12. Explain what the function of the MapReduce partitioner is.
Ans: The function of the MapReduce partitioner is to make sure that all the values of a single key go to the same reducer, which eventually helps distribute the map output evenly over the reducers.
Q13. Explain what the difference is between an Input Split and an HDFS Block.
Ans: The logical division of data is known as a Split, while the physical division of data is known as an HDFS Block.
Q14. Explain what happens in TextInputFormat.
Ans: In TextInputFormat, each line in the text file is a record. The value is the content of the line, while the key is the byte offset of the line. For instance, Key: LongWritable, Value: Text.
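The "byte offset as key, line as value" idea can be sketched in plain Python. This is not Hadoop's implementation, just an illustration of the record shape; `text_records` is a hypothetical helper name.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line without its terminator.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)

records = list(text_records(b"hello world\nfoo\nbar baz\n"))
# records == [(0, "hello world"), (12, "foo"), (16, "bar baz")]
```

The offsets advance by the full length of each line including its newline, which is why the second record starts at 12 rather than 11.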
Q15. Mention the main configuration parameters that the user needs to specify to run a MapReduce job.
Ans: The user of the MapReduce framework needs to specify
- Job’s input locations in the distributed file system
- Job’s output location in the distributed file system
- Input format
- Output format
- Class containing the map function
- Class containing the reduce function
- JAR file containing the mapper, reducer and driver classes
Q16. Explain what WebDAV is in Hadoop.
Ans: WebDAV is a set of extensions to HTTP that supports editing and updating files. On most operating systems WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.
Q17. Explain what Sqoop is in Hadoop.
Ans: Sqoop is a tool used to transfer data between a relational database management system (RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS like MySQL or Oracle into HDFS, as well as exported from HDFS files to an RDBMS.
Q18. Explain how JobTracker schedules a task.
Ans: The TaskTracker sends heartbeat messages to the JobTracker, usually every few seconds, to confirm that the TaskTracker is active and functioning. The message also informs the JobTracker about the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
Q19. Explain what SequenceFileInputFormat is.
Ans: SequenceFileInputFormat is used for reading sequence files. A sequence file is a specific compressed binary file format which is optimized for passing data from the output of one MapReduce job to the input of some other MapReduce job.
Q20. Explain what conf.setMapperClass does.
Ans: conf.setMapperClass sets the mapper class and everything related to the map job, such as reading data and generating key-value pairs out of the mapper.
Q21. Explain what is Hadoop?
Ans: It is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides enormous processing power and massive storage for any type of data.
Q22. Mention the difference between an RDBMS and Hadoop.
|RDBMS||Hadoop|
|RDBMS is a relational database management system||Hadoop is a node-based flat structure|
|It is used for OLTP processing||It is currently used for analytical and big data processing|
|In RDBMS, the database cluster uses the same data files stored in shared storage||In Hadoop, the data can be stored independently in each processing node|
|You need to preprocess data before storing it||You don't need to preprocess data before storing it|
Q23. Mention Hadoop core components?
Ans: Hadoop core components include,
Q24. What is NameNode in Hadoop?
Ans: NameNode in Hadoop is where Hadoop stores all the file location information in HDFS. It is the master node on which job tracker runs and consists of metadata.
Q25. Mention what are the data components used by Hadoop?
Ans: Data components used by Hadoop are
Q26. Mention what is the data storage component used by Hadoop?
Ans: The data storage component used by Hadoop is HBase.
Q27. Mention what are the most common input formats defined in Hadoop?
Ans: The most common input formats defined in Hadoop are;
Q28. In Hadoop, what is InputSplit?
Ans: InputSplit divides input files into chunks and assigns each split to a mapper for processing.
Q29. For a Hadoop job, how will you write a custom partitioner?
Ans: To write a custom partitioner for a Hadoop job, follow these steps
- Create a new class that extends the Partitioner class
- Override the getPartition method
- In the wrapper that runs the MapReduce job, add the custom partitioner to the job by using the setPartitionerClass method, or add the custom partitioner to the job as a config file
Q30. For a job in Hadoop, is it possible to change the number of mappers to be created?
Ans: No, it is not possible to change the number of mappers to be created. The number of mappers is determined by the number of input splits.
Q31. Explain what is a sequence file in Hadoop?
Ans: A sequence file is used to store binary key/value pairs. Unlike a regular compressed file, a sequence file supports splitting even when the data inside the file is compressed.
Q32. What happens to the JobTracker when the NameNode is down?
Ans: NameNode is the single point of failure in HDFS, so when the NameNode is down your cluster becomes unavailable.
Q33. Explain how indexing is done in HDFS.
Ans: Hadoop has a unique way of indexing. Once the data is stored as per the block size, HDFS keeps storing the last part of the data, which says where the next part of the data will be.
Q34. Explain is it possible to search for files using wildcards?
Ans: Yes, it is possible to search for files using wildcards.
Q35. List out Hadoop’s three configuration files?
Ans:The three configuration files are
Q36. Explain how you can check whether the NameNode is working besides using the jps command.
Ans: Besides using the jps command, to check whether the NameNode is working you can also use
Q37. Explain what is “map” and what is "reducer" in Hadoop?
Ans: In Hadoop, a map is a phase in HDFS query solving. A map reads data from an input location, and outputs a key value pair according to the input type.
In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.
Q38. Which file controls reporting in Hadoop?
Ans: In Hadoop, the hadoop-metrics.properties file controls reporting.
Q39. List the network requirements for using Hadoop.
Ans: The network requirements for using Hadoop are:
- Password-less SSH connection
- Secure Shell (SSH) for launching server processes
Q40. Mention what is rack awareness?
Ans: Rack awareness is the way in which the namenode determines on how to place blocks based on the rack definitions.
Q41. Explain what a TaskTracker is in Hadoop.
Ans: A TaskTracker in Hadoop is a slave node daemon in the cluster that accepts tasks from a JobTracker. It also sends heartbeat messages to the JobTracker every few seconds to confirm that the TaskTracker is still alive.
Q42. Mention what daemons run on a master node and slave nodes?
- The daemon that runs on the master node is "NameNode"
- The daemons that run on each slave node are "TaskTracker" and "DataNode"
Q43. Explain how can you debug Hadoop code?
Ans: The popular methods for debugging Hadoop code are:
- By using web interface provided by Hadoop framework
- By using Counters
Q44. Explain what storage and compute nodes are.
- The storage node is the machine or computer where your file system resides to store the processing data
- The compute node is the computer or machine where your actual business logic is executed
Q45. Mention the use of the Context object.
Ans: The Context object enables the mapper to interact with the rest of the Hadoop system. It includes configuration data for the job, as well as interfaces which allow it to emit output.
Q46. Mention the next step after Mapper or MapTask.
Ans: After the Mapper or MapTask, the output of the Mapper is sorted and partitions are created for the output.
Q47. Mention the default partitioner in Hadoop.
Ans: In Hadoop, the default partitioner is a "Hash" Partitioner.
Q48. Explain what is the purpose of RecordReader in Hadoop?
Ans: In Hadoop, the RecordReader loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper.
Q49. Explain how is data partitioned before it is sent to the reducer if no custom partitioner is defined in Hadoop?
Ans: If no custom partitioner is defined in Hadoop, then a default partitioner computes a hash value for the key and assigns the partition based on the result.
Q50. Explain what happens when Hadoop spawns 50 tasks for a job and one of the tasks fails.
Ans: Hadoop restarts the failed task on some other TaskTracker. Only if the task fails more times than the defined limit is the job as a whole marked as failed.
Q51. Mention what is the best way to copy files between HDFS clusters?
Ans: The best way to copy files between HDFS clusters is by using multiple nodes and the distcp command, so the workload is shared.
Q52. Mention what is the difference between HDFS and NAS?
Ans: HDFS data blocks are distributed across local drives of all machines in a cluster while NAS data is stored on dedicated hardware.
Q53. Mention how Hadoop is different from other data processing tools?
Ans: In Hadoop, you can increase or decrease the number of mappers without worrying about the volume of data to be processed.
Q54. Mention what the JobConf class does.
Ans: The JobConf class separates different jobs running on the same cluster. It handles job-level settings such as declaring a job in a real environment.
Q55. Mention the Hadoop MapReduce API contract for a key and value class.
Ans: For a key and value class, the Hadoop MapReduce API contract has two parts
- The value must implement the org.apache.hadoop.io.Writable interface
- The key must implement the org.apache.hadoop.io.WritableComparable interface
Q56. Mention what are the three modes in which Hadoop can be run?
Ans: The three modes in which Hadoop can be run are
- Pseudo distributed mode
- Standalone (local) mode
- Fully distributed mode
Q57. Mention what the text input format does.
Ans: The text input format treats each line of the input as a record. The value is the whole line of text, while the key identifies the line by its byte offset in the file. The mapper receives the value as a Text parameter and the key as a LongWritable parameter.
Q58. Mention how many InputSplits the Hadoop framework makes for files of 64 KB, 65 MB and 127 MB.
Ans: Assuming a 64 MB block size, Hadoop will make 5 splits
- 1 split for the 64 KB file
- 2 splits for the 65 MB file
- 2 splits for the 127 MB file
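The arithmetic behind these counts is a ceiling division of file size by split size. The sketch below assumes a 64 MB split size and uses a hypothetical helper `num_splits`. It deliberately ignores implementation details: a real input format may, for example, fold a small trailing remainder into the previous split, so actual counts can differ from this simple model.

```python
import math

KB = 1024
MB = 1024 * KB
SPLIT_SIZE = 64 * MB  # assumed split size, equal to an assumed 64 MB block size

def num_splits(file_size, split_size=SPLIT_SIZE):
    """Number of splits for a non-empty file under plain ceiling division."""
    return math.ceil(file_size / split_size)

file_sizes = [64 * KB, 65 * MB, 127 * MB]
splits_per_file = [num_splits(size) for size in file_sizes]
total_splits = sum(splits_per_file)
```

With these inputs `splits_per_file` is `[1, 2, 2]`, which adds up to the 5 splits given in the answer.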
Q59. Mention what is distributed cache in Hadoop?
Ans: Distributed cache in Hadoop is a facility provided by the MapReduce framework. It is used to cache files at the time of job execution. The framework copies the necessary files to the slave node before the execution of any task at that node.
Q60. Explain how the Hadoop classpath plays a vital role in stopping or starting Hadoop daemons.
Ans: The classpath consists of a list of directories containing the jar files needed to stop or start daemons.
Q61. Compare MapReduce and Spark.
|Criteria||MapReduce||Spark|
|Standalone mode||Needs Hadoop||Can work independently|
|Ease of use||Needs extensive Java programs||APIs for Python, Java, & Scala|
|Versatility||Not optimized for real-time & machine learning applications||Real-time & machine learning applications|
Q62. Can a MapReduce program be written in any language other than Java?
Ans: Yes, MapReduce programs can be written in many programming languages: Java, R, C++, and scripting languages (Python, PHP). Any language able to read from stdin, write to stdout, and parse tab and newline characters should work. Hadoop Streaming (a Hadoop utility) allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
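As an illustration of the stdin/stdout contract, here is a minimal Python sketch of a streaming-style word count. The `mapper` and `reducer` functions take any iterable of text lines, so they can be tried locally. In an actual streaming job each would typically live in its own script, reading `sys.stdin` and printing its results; that wiring is an assumption here and is not shown.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum counts per word; expects lines already sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation: sorted() stands in for the framework's sort between the phases.
output = list(reducer(sorted(mapper(["b a", "a"]))))
```

The reducer relies on its input being sorted by key, which is why the local simulation sorts the mapper output before passing it on.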
Q63. Illustrate a simple example of the working of MapReduce.
Ans: Let's take a simple example to understand the functioning of MapReduce. In real-world projects and applications this becomes far more elaborate and complex, as the data we deal with in Hadoop and MapReduce is extensive and massive.
Assume you have five files, and each file contains two columns that form key/value pairs: a city name and its recorded temperature. Here, the name of the city is the key and the temperature is the value.
San Francisco, 22
Los Angeles, 15
Los Angeles, 16
It is important to note that each file may contain data for the same city multiple times. Now, out of this data, we need to calculate the maximum temperature for each city across these five files. As explained, the MapReduce framework will divide the work into five map tasks; each map task will process one of the five files and return the maximum temperature for each city.
(San Francisco, 22)(Los Angeles, 16)(Vancouver, 30)(London, 25)
Similarly, the other four mappers process the other four files and produce intermediate results, for instance like below.
(San Francisco, 32)(Los Angeles, 2)(Vancouver, 8)(London, 27)
(San Francisco, 29)(Los Angeles, 19)(Vancouver, 28)(London, 12)
(San Francisco, 18)(Los Angeles, 24)(Vancouver, 36)(London, 10)
(San Francisco, 30)(Los Angeles, 11)(Vancouver, 12)(London, 5)
These intermediate results are then passed to the reduce job, where the input from all files is combined to output a single value for each city. The final results here would be:
(San Francisco, 32)(Los Angeles, 24)(Vancouver, 36)(London, 27)
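The reduce step over the five intermediate results listed above can be sketched in plain Python. This is a single-process illustration rather than Hadoop code, and `reduce_max` is a hypothetical helper name.

```python
# The five intermediate mapper outputs from the example, one list per input file.
mapper_outputs = [
    [("San Francisco", 22), ("Los Angeles", 16), ("Vancouver", 30), ("London", 25)],
    [("San Francisco", 32), ("Los Angeles", 2), ("Vancouver", 8), ("London", 27)],
    [("San Francisco", 29), ("Los Angeles", 19), ("Vancouver", 28), ("London", 12)],
    [("San Francisco", 18), ("Los Angeles", 24), ("Vancouver", 36), ("London", 10)],
    [("San Francisco", 30), ("Los Angeles", 11), ("Vancouver", 12), ("London", 5)],
]

def reduce_max(all_outputs):
    """Keep the highest temperature seen for each city across all mapper outputs."""
    result = {}
    for output in all_outputs:
        for city, temp in output:
            result[city] = max(temp, result.get(city, temp))
    return result

final = reduce_max(mapper_outputs)
```

Taking the maximum is commutative and associative, so the same function could also serve as a combiner on each mapper's output.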
Q64. What are the main components of MapReduce Job?
Ans: Main Driver Class: providing job configuration parameters
Mapper Class: must extend org.apache.hadoop.mapreduce.Mapper class and performs execution of map() method
Reducer Class: must extend org.apache.hadoop.mapreduce.Reducer class
Q65. What is Shuffling and Sorting in MapReduce?
Ans: Shuffling and Sorting are two major processes that operate simultaneously between the mapper and the reducer.
Shuffling is the process of transferring data from the Mapper to the Reducer. It is a mandatory operation for reducers to proceed with their jobs, as the output of the shuffling process serves as input for the reduce tasks.
In MapReduce, the output key-value pairs between the map and reduce phases (after the mapper) are automatically sorted by key before moving to the Reducer. This feature is helpful in programs where you need sorting at some stage. It also saves the programmer's overall time.
Q66. What is Partitioner and its usage?
Ans: Partitioner is yet another important phase that controls the partitioning of the intermediate map output keys using a hash function. The process of partitioning determines to which reducer a key-value pair (of the map output) is sent. The number of partitions is equal to the total number of reduce tasks for the job.
HashPartitioner is the default class available in Hadoop, which implements the following function:
int getPartition(K key, V value, int numReduceTasks)
The function returns the partition number, where numReduceTasks is the fixed number of reducers.
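A hash partitioner is commonly described as masking off the sign bit of the key's hash code and taking the remainder modulo the number of reducers. The Python sketch below follows that description. Two assumptions apply: the hash used is a re-implementation of Java's String.hashCode() for plain ASCII keys, and real Hadoop key types such as Text compute their hash differently, so actual partition numbers will not match. Both function names are hypothetical.

```python
def java_string_hash(s):
    """Replicate Java's String.hashCode() as a signed 32-bit integer (ASCII keys assumed)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep only the low 32 bits
    return h - 0x100000000 if h >= 0x80000000 else h

def get_partition(key, num_reduce_tasks):
    """Clear the sign bit, then take the remainder to pick a reducer index."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks

partition = get_partition("ab", 4)
```

Because the result depends only on the key and the number of reducers, every occurrence of a key lands on the same reducer, which is the property the partitioner has to guarantee.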
Q67. What is Identity Mapper and Chain Mapper?
Ans: Identity Mapper is the default Mapper class provided by Hadoop. When no other Mapper class is defined, IdentityMapper is executed. It only writes the input data to the output and does not perform any computations or calculations on the input data.
The class name is org.apache.hadoop.mapred.lib.IdentityMapper.
Chain Mapper is an implementation of the Mapper class that runs a chain of Mapper classes within a single map task. The output from the first mapper becomes the input for the second mapper, the second mapper's output becomes the input for the third mapper, and so on until the last mapper.
The class name is org.apache.hadoop.mapreduce.lib.chain.ChainMapper.
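The chaining idea, where each mapper consumes the pairs emitted by the previous one, can be sketched in plain Python as function composition. This only illustrates the data flow and is not the Hadoop class. The helper `chain` and the three example mappers (`tokenize`, `lowercase`, `drop_short`) are hypothetical.

```python
def chain(*mappers):
    """Compose mappers so each one receives the (key, value) pairs the previous one emitted."""
    def run(key, value):
        pairs = [(key, value)]
        for mapper in mappers:
            pairs = [out for k, v in pairs for out in mapper(k, v)]
        return pairs
    return run

def tokenize(key, value):
    """Split a line into (word, 1) pairs."""
    return [(word, 1) for word in value.split()]

def lowercase(key, value):
    """Normalize the key to lower case."""
    return [(key.lower(), value)]

def drop_short(key, value):
    """Filter out keys of two characters or fewer."""
    return [(key, value)] if len(key) > 2 else []

pipeline = chain(tokenize, lowercase, drop_short)
result = pipeline(0, "The Map of a Job")
```

Here `result` is `[("the", 1), ("map", 1), ("job", 1)]`. An identity mapper in this model would simply be a function returning `[(key, value)]` unchanged.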
Q68. What main configuration parameters are specified in MapReduce?
Ans: MapReduce programmers need to specify the following configuration parameters to perform the map and reduce jobs:
- The input location of the job in HDFS.
- The output location of the job in HDFS.
- The input’s and output’s format.
- The classes containing map and reduce functions, respectively.
- The .jar file for mapper, reducer and driver classes.
Q69. Name Job control options specified by MapReduce.
Ans: Since this framework supports chained operations, wherein the output of one map job serves as the input for another, job controls are needed to govern these complex operations.
The various job control options are:
Job.submit() : to submit the job to the cluster and immediately return
Job.waitForCompletion(boolean) : to submit the job to the cluster and wait for its completion
Q70. What is InputFormat in Hadoop?
Ans: InputFormat, another important feature in MapReduce programming, defines the input specifications for a job. It performs the following functions:
- Validates the input specification of the job.
- Splits the input file(s) into logical instances called InputSplits. Each of these splits is then assigned to an individual Mapper.
- Provides an implementation of RecordReader to extract input records from the above instances for further Mapper processing.
Q71. What is the difference between HDFS block and InputSplit?
Ans: An HDFS block splits data into physical divisions, while InputSplit in MapReduce splits input files logically.
InputSplit is used to control the number of mappers, and the size of splits is user defined. In contrast, the HDFS block size is set by the cluster configuration and defaults to 64 MB in Hadoop 1.x, i.e. for 1 GB of data there will be 1 GB / 64 MB = 16 blocks. However, if the input split size is not defined by the user, it takes the HDFS block size.
Q72. What is Text Input Format?
Ans: It is the default InputFormat for plain text files, and it can also read compressed input files such as those with a .gz extension. In TextInputFormat, files are broken into lines, wherein the key is the position in the file and the value is the line of text. Programmers can write their own InputFormat.
The hierarchy is:
Q73. Explain job scheduling through JobTracker.
Ans: JobTracker communicates with NameNode to identify the data location and submits the work to a TaskTracker node. The TaskTracker plays a major role, as it notifies the JobTracker of any job failure. It also acts as the heartbeat reporter, reassuring the JobTracker that it is still alive. The JobTracker is then responsible for the follow-up action: it may resubmit the job, mark a specific record as unreliable, or blacklist the TaskTracker.
Q74. What is SequenceFileInputFormat?
Ans: SequenceFileInputFormat is an input format for reading sequence files, a compressed binary file format; it extends FileInputFormat. It passes data between the output-input phases of MapReduce jobs (from the output of one MapReduce job to the input of another).
Q75. How to set mappers and reducers for Hadoop jobs?
Ans: Users can configure JobConf variable to set number of mappers and reducers.
Q76. Explain JobConf in MapReduce.
Ans: It is the primary interface for defining a map-reduce job in Hadoop for job execution. JobConf specifies the Mapper, Combiner, Partitioner, Reducer, InputFormat and OutputFormat implementations and other advanced job facets like Comparators.
Q77. What is a MapReduce Combiner?
Ans: Also known as a semi-reducer, the Combiner is an optional class that combines the map output records that share the same key. The main function of a combiner is to accept inputs from the Map class and pass the resulting key-value pairs to the Reducer class.
Q78. What is RecordReader in MapReduce?
Ans: RecordReader is used to read key/value pairs from the InputSplit by converting the byte-oriented view of the input into a record-oriented view for the Mapper.
Q79. Define Writable data types in MapReduce.
Ans: Hadoop reads and writes data in a serialized form through the Writable interface. The Writable interface has several implementing classes, such as Text (storing String data), IntWritable, LongWritable, FloatWritable and BooleanWritable. Users are free to define their own Writable classes as well.
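The core idea of a Writable is a value that knows how to serialize itself to bytes and read itself back. The Python sketch below models this for a 32-bit integer using a 4-byte big-endian encoding, which is the layout Java's DataOutput.writeInt produces. The class `IntWritableSketch` and its method names are hypothetical stand-ins, not the Hadoop API.

```python
import struct

class IntWritableSketch:
    """Hypothetical stand-in for an int Writable with a 4-byte big-endian encoding."""

    def __init__(self, value=0):
        self.value = value

    def write(self):
        """Serialize the wrapped int to exactly four bytes."""
        return struct.pack(">i", self.value)

    def read_fields(self, data):
        """Restore the wrapped int from four serialized bytes."""
        (self.value,) = struct.unpack(">i", data)
        return self

encoded = IntWritableSketch(42).write()
decoded = IntWritableSketch().read_fields(encoded).value
```

A user-defined Writable follows the same pattern, writing each of its fields in a fixed order and reading them back in that same order.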
Q80. What is OutputCommitter?
Ans: OutputCommitter describes the commit of MapReduce task output. FileOutputCommitter is the default class available for OutputCommitter in MapReduce. It performs the following operations:
- Creates a temporary output directory for the job during initialization.
- Cleans up the job by removing the temporary output directory after job completion.
- Sets up the task temporary output.
- Identifies whether a task needs commit. The commit is applied if required.
- JobSetup, JobCleanup and TaskCleanup are important tasks during output commit.
Q81. What is a “map” in Hadoop?
Ans: In Hadoop, a map is a phase in HDFS query solving. A map reads data from an input location, and outputs a key value pair according to the input type.
Q82. What is a “reducer” in Hadoop?
Ans: In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.
Q83. What are the parameters of mappers and reducers?
Ans: The four parameters for mappers are:
- LongWritable (input)
- Text (input)
- Text (intermediate output)
- IntWritable (intermediate output)
The four parameters for reducers are:
- Text (intermediate output)
- IntWritable (intermediate output)
- Text (final output)
- IntWritable (final output)
Q84. What are the key differences between Pig vs MapReduce?
Ans: Pig is a data flow language; its key focus is managing the flow of data from the input source to the output store. As part of managing this data flow, it moves data by feeding it to process 1, then taking the output and feeding it to process 2. Its core features are preventing execution of subsequent stages if a previous stage fails, managing temporary storage of data and, most importantly, compressing and rearranging processing steps for faster processing. While this can be done for any kind of processing task, Pig is written specifically for managing the data flow of MapReduce-type jobs. Most if not all jobs in Pig are MapReduce jobs or data movement jobs. Pig allows custom functions to be added for use in processing; some default ones are ordering, grouping, distinct, count etc.
MapReduce, on the other hand, is a data processing paradigm. It is a framework for application developers to write code in so that it is easily scaled to petabytes of data; this creates a separation between the developer who writes the application and the developer who scales it. Not all applications can be migrated to MapReduce, but a good few can, ranging from complex ones like k-means to simple ones like counting uniques in a dataset.
Q85. How to set which framework would be used to run mapreduce program?
Ans: mapreduce.framework.name. it can be
Q86. What platform and Java version is required to run Hadoop?
Ans: Java 1.6.x or higher versions are good for Hadoop, preferably from Sun. Linux and Windows are the supported operating systems for Hadoop, but BSD, Mac OS/X and Solaris are also known to work.