Top Hadoop interview questions and answers for freshers and experienced
What is Hadoop ?
Answer : Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The project itself includes a variety of other complementary additions.
Questions : 1 :: What is Hadoop framework?
Hadoop is an open-source framework written in Java by the Apache Software Foundation. The framework is used to write software applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
Questions : 2 :: On What concept the Hadoop framework works?
It works on MapReduce, which was devised by Google.
Questions : 3 :: What is MapReduce ?
MapReduce is an algorithm or concept for processing huge amounts of data in a faster way. As its name suggests, it can be divided into Map and Reduce. A MapReduce job usually splits the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner.
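The map/reduce flow described above can be sketched in plain Python. This is not the Hadoop API, just a minimal word-count simulation of the concept: map emits intermediate pairs, which are then grouped by key and reduced.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit an intermediate (word, 1) pair for every word.
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    # Group the intermediate pairs by key, then sum each group --
    # the role the Reduce step plays in MapReduce.
    pairs.sort(key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["hadoop map reduce", "map reduce"]))
print(counts)  # {'hadoop': 1, 'map': 2, 'reduce': 2}
```

In real Hadoop the two phases run as parallel tasks on different machines; here they run sequentially only to show the data flow.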
Questions : 4 :: What is compute and Storage nodes?
Compute Node: the computer or machine where your actual business logic is executed. Storage Node: the computer or machine where your file system resides to store the data being processed. In most cases, the compute node and the storage node are the same machine.
Questions : 5 :: How does master slave architecture in the Hadoop?
The MapReduce framework consists of a single master JobTracker and multiple slaves; each cluster node has one TaskTracker. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing any failed tasks. The slaves execute the tasks as directed by the master.
Questions : 6 :: What does a Hadoop application look like, or what are its basic components?
Minimally, a Hadoop application has the following components: an input location for the data, an output location for the processed data, a map task, a reduce task, and the job configuration.
Questions : 7 :: Explain how input and output data format of the Hadoop framework?
The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
Questions : 8 :: What are the restriction to the key and value class ?
The key and value classes have to be serializable by the framework; to make them serializable, Hadoop provides a Writable interface. Additionally, the key of the Map output must be comparable, since it is sorted, so key classes have to implement one more interface, WritableComparable.
Questions : 9 :: Which interface needs to be implemented to create a Mapper and Reducer for Hadoop?
In the older org.apache.hadoop.mapred API, the Mapper and Reducer interfaces must be implemented; in the newer org.apache.hadoop.mapreduce API, the Mapper and Reducer classes are extended. Maps are the individual tasks that transform input records into intermediate records; the transformed intermediate records do not need to be of the same type as the input records.
Questions : 11 :: What is the InputSplit in MapReduce?
An InputSplit is a logical representation of a unit (a chunk) of input work for a map task; e.g., a filename and a byte range within that file to process, or a row set in a text file.
Questions : 12 :: What is the InputFormat ?
The InputFormat is responsible for enumerating (itemising) the InputSplits and producing a RecordReader, which turns those logical work units into actual physical input records.
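The split-enumeration half of that job can be sketched in plain Python. This is a simplification, not the Hadoop InputFormat API: it only shows how a file's byte range is carved into (offset, length) chunks that each become one unit of map work.

```python
def compute_splits(file_size, split_size):
    # Enumerate (offset, length) byte ranges -- the role an InputFormat
    # plays when it turns an input file into logical InputSplits.
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

# A 300-byte file with 128-byte splits yields three units of map work;
# only the last split is smaller than the split size:
print(compute_splits(300, 128))  # [(0, 128), (128, 128), (256, 44)]
```

A RecordReader would then open the file, seek to each offset, and parse the bytes into key/value records for the map task.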
Questions : 13 :: Where do you specify the Mapper Implementation?
Generally, the mapper implementation is specified in the Job itself, e.g., via Job.setMapperClass(ClassName).
Questions : 14 :: How Mapper is instantiated in a running job?
The Mapper itself is instantiated in the running job, and is passed a MapContext object which it can use to configure itself.
Questions : 15 :: Which are the methods in the Mapper interface?
The Mapper contains a run() method, which calls its setup() method once, then calls the map() method for each input record, and finally calls its cleanup() method. All of the above methods can be overridden.
Questions : 16 :: What happens if you don't override the Mapper methods and keep them as they are?
If you do not override any methods (leaving even map() as-is), the Mapper acts as the identity function, emitting each input record as a separate output.
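The lifecycle from the two answers above can be mimicked in plain Python. This is a sketch, not Hadoop's actual Mapper class: run() drives setup() once, map() per record, and cleanup() once, and the default map() is the identity function.

```python
class Mapper:
    # Plain-Python mimic of Hadoop's Mapper call order.
    def setup(self):
        pass  # called once before any records

    def map(self, key, value, context):
        # Default behaviour: identity -- emit the record unchanged,
        # which is what Hadoop does when map() is not overridden.
        context.append((key, value))

    def cleanup(self):
        pass  # called once after the last record

    def run(self, records, context):
        self.setup()
        for key, value in records:
            self.map(key, value, context)
        self.cleanup()

out = []  # a plain list stands in for the Context's output channel
Mapper().run([(0, "a"), (7, "b")], out)
print(out)  # [(0, 'a'), (7, 'b')]
```

A real mapper would subclass this and override map() (and optionally setup()/cleanup()) to implement its business logic.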
Questions : 17 :: What is the use of Context object?
The Context object allows the mapper to interact with the rest of the Hadoop system. It includes configuration data for the job, as well as interfaces which allow it to emit output.
Questions : 18 :: How can you add the arbitrary key-value pairs in your mapper?
You can set arbitrary (key, value) pairs of configuration data in your Job, e.g. with Job.getConfiguration().set("myKey", "myVal"), and then retrieve this data in your mapper with Context.getConfiguration().get("myKey").
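Conceptually, Hadoop's Configuration is a string-to-string map that is shipped to every task. The plain-Python sketch below (a dict standing in for the Configuration object; `myKey`/`myVal` are just example names) shows the set-then-retrieve pattern:

```python
# A dict stands in for Hadoop's Configuration object, which the
# framework serializes and distributes to every map and reduce task.
conf = {}
conf["myKey"] = "myVal"  # like Job.getConfiguration().set("myKey", "myVal")

def mapper(key, value, conf):
    # Inside the task, the value is read back, as with
    # Context.getConfiguration().get("myKey") in Hadoop.
    tag = conf.get("myKey", "")
    return (tag, value)

print(mapper(0, "data", conf))  # ('myVal', 'data')
```

This is how small per-job parameters (thresholds, field names, flags) are typically passed to mappers; large side data would instead use the distributed cache.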
Questions : 19 :: How does the Mapper's run() method work?
Mapper.run() calls map(KeyInType, ValInType, Context) for each key/value pair in the InputSplit for that task.
Questions : 20 :: What is the next step after the Mapper or MapTask?
The outputs of the Mapper are sorted, and partitions are created for the output. The number of partitions depends on the number of reducers.
Questions : 21 :: How can we control which keys go to a specific reducer?
Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
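The routing a Partitioner performs can be sketched in plain Python. This mirrors the idea behind Hadoop's default HashPartitioner (`(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`); a simple deterministic string hash stands in for Java's hashCode().

```python
def hash_partition(key, num_reducers):
    # Deterministic stand-in for key.hashCode(): a Java-style
    # 31-multiplier rolling hash, masked to stay non-negative.
    h = 0
    for ch in str(key):
        h = (31 * h + ord(ch)) & 0x7FFFFFFF
    return h % num_reducers

# The crucial property: every occurrence of the same key, from any
# mapper, is routed to the same reducer partition.
print(hash_partition("hadoop", 4) == hash_partition("hadoop", 4))  # True
```

A custom Partitioner replaces this function with domain-specific logic, e.g. routing by a prefix of the key so related keys land in the same output file.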
Questions : 22 :: What is the use of Combiner?
It is an optional component or class, which can be specified via Job.setCombinerClass(ClassName), to perform local aggregation of the intermediate outputs; this helps cut down the amount of data transferred from the Mapper to the Reducer.
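The saving a combiner buys can be shown with a plain-Python sketch (not the Hadoop API): aggregating one mapper's (word, 1) pairs locally before they cross the network shrinks the shuffle volume.

```python
from collections import defaultdict

def combine(map_output):
    # Local aggregation on a single mapper's output -- the job a
    # combiner class does before the shuffle in Hadoop.
    totals = defaultdict(int)
    for key, value in map_output:
        totals[key] += value
    return sorted(totals.items())

raw = [("map", 1), ("map", 1), ("map", 1), ("reduce", 1)]
print(combine(raw))  # [('map', 3), ('reduce', 1)] -- 2 pairs shuffled instead of 4
```

Note that Hadoop may run the combiner zero, one, or many times, so it only works for operations (like summing) where combining partial results does not change the final answer.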
Questions : 23 :: How many maps are there in a particular Job?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files. Generally it is around 10-100 maps per node. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Questions : 24 :: What is the Reducer used for?
A Reducer reduces a set of intermediate values which share a key to a (usually smaller) set of values. The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).
Questions : 25 :: Explain the core methods of the Reducer?
The API of Reducer is very similar to that of Mapper: there is a run() method that receives a Context containing the job's configuration, as well as setup(), reduce(), and cleanup() methods that can be overridden.
Questions : 26 :: What are the primary phases of the Reducer?
Shuffle, Sort and Reduce
Questions : 27 :: Explain the shuffle?
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
Questions : 28 :: Explain the Reducer's Sort phase?
The framework groups Reducer inputs by key (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously: while map outputs are being fetched, they are merged.
Questions : 29 :: Explain the Reducer's reduce phase?
In this phase the reduce(MapOutKeyType, Iterable<MapOutValType>, Context) method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via Context.write(ReduceOutKeyType, ReduceOutValType).
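The three reducer phases from the answers above can be sketched together in plain Python (a simulation of the concept, not the Hadoop API): fetched map-output partitions are merged, ordered by key, and the reduce function is called once per key with all of its values grouped.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_sort_reduce(partitions, reduce_fn):
    # Shuffle: merge the partitions fetched from each mapper.
    # Sort: order the merged pairs by key.
    merged = sorted((kv for part in partitions for kv in part),
                    key=itemgetter(0))
    # Reduce: call reduce_fn once per key with its grouped values.
    return {key: reduce_fn([v for _, v in group])
            for key, group in groupby(merged, key=itemgetter(0))}

# Two mappers emitted the same key "a"; the reducer sees its values grouped:
result = shuffle_sort_reduce([[("a", 1), ("b", 2)], [("a", 3)]], sum)
print(result)  # {'a': 4, 'b': 2}
```

In Hadoop the grouping works because keys were sorted within each map output, so the fetched partitions can be merge-sorted rather than fully re-sorted.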
Questions : 30 :: How many Reducers should be configured?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapreduce.tasktracker.reduce.tasks.maximum). With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes finish their first round of reduces and launch a second wave, which gives much better load balancing.
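The arithmetic behind that rule of thumb, for a hypothetical 10-node cluster with 2 reduce slots per node:

```python
def suggested_reducers(nodes, slots_per_node, factor=0.95):
    # factor 0.95: every reduce launches in the first wave;
    # factor 1.75: faster nodes run a second wave, improving balance.
    return int(factor * nodes * slots_per_node)

print(suggested_reducers(10, 2))        # 19 (single wave)
print(suggested_reducers(10, 2, 1.75))  # 35 (two waves)
```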
Questions : 31 :: Is it possible for a Job to have 0 reducers?
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
Questions : 32 :: What happens if the number of reducers is 0?
In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them to the FileSystem.
Questions : 33 :: What is the JobTracker and what does it perform in a Hadoop cluster?
JobTracker is a daemon service which submits and tracks MapReduce tasks in a Hadoop cluster. It runs in its own JVM process, usually on a separate machine, and each slave node is configured with the JobTracker's node location.
Questions : 34 :: How is a task scheduled by the JobTracker?
The TaskTrackers send heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker stays up to date with where in the cluster work can be delegated.
Questions : 35 :: How many instances of Tasktracker run on a Hadoop cluster?
There is one TaskTracker daemon process for each slave node in the Hadoop cluster.
Questions : 36 :: What are the two main parts of the Hadoop framework?
Hadoop consists of two main parts: the Hadoop Distributed File System (HDFS), a distributed file system with high throughput, and Hadoop MapReduce, a software framework for processing large data sets.
Questions : 37 :: Explain the use of TaskTracker in the Hadoop cluster?
A TaskTracker is a slave node in the cluster which accepts tasks from the JobTracker, such as Map, Reduce, or Shuffle operations. The TaskTracker also runs in its own JVM process, and is configured with a set of slots that indicate the number of tasks it can accept.
Questions : 38 :: What do you mean by TaskInstance?
Task instances are the actual MapReduce tasks which run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the TaskTracker itself.
Questions : 39 :: How many daemon processes run on a Hadoop cluster?
Hadoop comprises five separate daemons, each running in its own JVM.
The following three daemons run on master nodes: NameNode, which stores and maintains the metadata for HDFS; Secondary NameNode, which performs housekeeping functions for the NameNode; and JobTracker, which manages MapReduce jobs and distributes individual tasks to the machines running TaskTrackers. The following two daemons run on each slave node: DataNode, which stores the actual HDFS data blocks, and TaskTracker, which is responsible for instantiating and monitoring individual Map and Reduce tasks.
Questions : 40 :: How many maximum JVM can run on a slave node?
One or multiple Task Instances can run on each slave node. Each Task Instance runs as a separate JVM process. The number of Task Instances can be controlled by configuration; typically it is set according to the number of CPU cores and the amount of memory on the node.
Questions : 41 :: What is NAS?
It is a kind of file system where the data resides on one centralized machine, and all the cluster members read and write data from that shared storage. For MapReduce workloads this is not as efficient as HDFS, which distributes the data across the cluster.
Questions : 42 :: How does HDFS differ from NAS?
Following are the differences between HDFS and NAS. In HDFS, data blocks are distributed across the local drives of all machines in the cluster, whereas in NAS data is stored on dedicated hardware. HDFS is designed to work with the MapReduce paradigm, where computation is moved to the data; NAS is not suitable for MapReduce since data and computation are stored separately. HDFS runs on a cluster of commodity machines and provides fault tolerance through replication, while NAS relies on dedicated hardware.
Questions : 43 :: How does a NameNode handle the failure of the data nodes?
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. The NameNode detects a DataNode failure through the absence of heartbeats, and re-replicates the blocks that were hosted on the failed node to other DataNodes.
Questions : 44 :: Where the Mapper's Intermediate data will be stored?
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator; the intermediate data is cleaned up after the job completes.
Questions : 45 :: What is the use of Combiners in the Hadoop framework?
Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on individual mapper nodes, which helps reduce the amount of data that needs to be transferred across to the reducers.
Questions : 46 :: What is the Hadoop MapReduce API contract for a key and value Class?
The Key must implement the org.apache.hadoop.io.WritableComparable interface. The Value must implement the org.apache.hadoop.io.Writable interface.
Questions : 47 :: What is the meaning of speculative execution in Hadoop? Why is it important?
Speculative execution is a way of coping with individual machine performance. In large clusters, where hundreds or thousands of machines are involved, some machines may not perform as fast as others. Hadoop therefore runs speculative duplicate copies of slow-running tasks on other nodes; whichever copy finishes first is used, and the remaining copies are killed.
Questions : 48 :: When are the reducers started in a MapReduce job?
In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available, but the reduce method itself is invoked only after all mappers have finished.
Questions : 49 :: What is HDFS ? How it is different from traditional file systems?
HDFS, the Hadoop Distributed File System, is responsible for storing huge amounts of data on the cluster. It is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but is highly fault-tolerant and designed to be deployed on low-cost hardware.
Questions : 50 :: What is HDFS Block size? How is it different from traditional file system block size?
In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size, far larger than the block size of a traditional file system (typically a few kilobytes). Each block is replicated multiple times; the default is to replicate each block three times.
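The arithmetic is straightforward: a file occupies ceil(size / block_size) blocks, and only the last block may be smaller than the block size. A quick Python check with the 128 MB default:

```python
import math

def num_blocks(file_size, block_size=128 * 1024 * 1024):
    # An HDFS file is stored as ceil(size / block_size) blocks;
    # only the last block may be partially filled.
    return math.ceil(file_size / block_size)

one_gib = 1024 ** 3
print(num_blocks(one_gib))      # 8 blocks of 128 MB each
print(num_blocks(one_gib) * 3)  # 24 block replicas at the default replication factor of 3
```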
Questions : 51 :: What is a NameNode? How many instances of NameNode run on a Hadoop cluster?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept; it does not store the data of these files itself. There is only one NameNode running in a standard Hadoop cluster.
Questions : 52 :: How the Client communicates with HDFS?
Client communication with HDFS happens using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add, copy, move, or delete a file on HDFS. The NameNode responds with a list of the relevant DataNode servers where the data lives, and the client then talks to those DataNodes directly.
Questions : 53 :: How the HDFS Blocks are replicated?
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance, and the block size and replication factor are configurable per file.