Posts

Showing posts with the label interview-questions-on-hadoop

Featured Post

14 Top Data Pipeline Key Terms Explained

Image
 Here are some key terms commonly used in data pipelines 1. Data Sources Definition: Points where data originates (e.g., databases, APIs, files, IoT devices). Examples: Relational databases (PostgreSQL, MySQL), APIs, cloud storage (S3), streaming data (Kafka), and on-premise systems. 2. Data Ingestion Definition: The process of importing or collecting raw data from various sources into a system for processing or storage. Methods: Batch ingestion, real-time/streaming ingestion. 3. Data Transformation Definition: Modifying, cleaning, or enriching data to make it usable for analysis or storage. Examples: Data cleaning (removing duplicates, fixing missing values). Data enrichment (joining with other data sources). ETL (Extract, Transform, Load). ELT (Extract, Load, Transform). 4. Data Storage Definition: Locations where data is stored after ingestion and transformation. Types: Data Lakes: Store raw, unstructured, or semi-structured data (e.g., S3, Azure Data Lake). Data Warehous...

Top 100 Hadoop Complex interview questions (Part 4 of 4)

Image
Hadoop framework is most popular in data analytics and data related projects. I have given here my 4th set of questions for you to read quickly. 1) What is MapReduce? Ans) It is a framework or a programming model that is used for processing large data sets over clusters of computers using distributed programming. 2). What are ‘maps’ and ‘reduces’? Ans). ‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS. ‘Map’ is responsible to read data from input location, and based on the input type, it will generate a key-value pair, that is, an intermediate output in the local machine. ’Reducer’ is responsible to process the intermediate output received from the mapper and generate the final output. 3). What are the four basic parameters of a mapper? Ans) The four basic parameters of a mapper are LongWritable, text, text, and IntWritable. The first two represent input parameters and the second two represent intermediate output parameters. 4). What are the four basic parame...

Top 100 Hadoop Complex Interview Questions (Part 3 of 4)

Image
These are complex Hadoop interview questions. This is my 3rd set of questions useful for your interviews (3 of 4).      1). What are the features of Standalone (local) mode? Ans). In stand-alone mode there are no daemons, everything runs on a single JVM. It has no DFS and utilizes the local file system. Stand-alone mode is suitable only for running MapReduce programs during development. It is one of the least used environments. 2). What are the features of Pseudo mode? Ans). The pseudo mode is used both for development and in the QA environment. In the Pseudo mode, all the daemons run on the same machine. 3). Can we call VMs as pseudos? Ans). No, VMs are not pseudos because VM is something different and pseudo is very specific to Hadoop. 4). What are the features of Fully Distributed mode? Ans). The fully Distributed mode is used in the production environment, where we have ‘n’ number of machines forming a Hadoop cluster. Hadoop daemons run on a cluster of mac...

Top 100 Hadoop Complex Interview Questions (Part 2 of 4)

Image
I am giving a series of Hadoop interview questions. This is my 2nd set of questions. You can get quick benefits by reading these questions from start to end. 1). If a data Node is full how it’s identified? Ans). When data is stored in a data node, then the metadata of that data will be stored in the Namenode. So Namenode will identify if the data node is full. 2). If data nodes increase, then do we need to upgrade Namenode? Ans). While installing the Hadoop system, Namenode is determined based on the size of the clusters. Most of the time, we do not need to upgrade the Namenode because it does not store the actual data, but just the metadata, so such a requirement rarely arise. 3). Are job tracker and task trackers present in separate machines? Ans). Yes, job tracker and task tracker are present in different machines. The reason is job tracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted. 4). When we send a da...

Top 100 Hadoop Complex Interview Questions (Part 1 of 4)

Image
The below list is complex interview questions as part of Hadoop tutorial (part 1 of 4) you can go through these questions quickly. 1. What is BIG DATA? Ans). Big Data is nothing but an assortment of such a huge and complex data that it becomes very tedious to capture, store, process, retrieve and analyze it with the help of on-hand database management tools or traditional data processing techniques. 2. Can you give some examples of Big Data? Ans). There are many real-life examples of Big Data! Facebook is generating 500+ terabytes of data per day, NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day, a jet airline collects 10 terabytes of sensor data for every 30 minutes of flying time. All these are a day to day examples of Big Data! 3. Can you give a detailed overview of the Big Data being generated by Facebook?   Ans). As of December 31, 2012, there are 1.06 billion monthly active users on Facebook and 680 million mobile users. On an avera...

Big data: Quiz-1 Hadoop Top Interview Questions

Image
In this post, I have given a Quiz on Big data with answers. This is part-1 set of questions for your quick reference. Photo credit: Srini Q.1) How Hadoop achieve scaling in terms of storage? A.By increasing the hard disk capacity of the machine B.By increasing the RAM capacity of the machine C.By increasing both the hard disk and RAM capacity of the machine D.By increasing the hard disk capacity of the machine and by adding more machine Q.2) How fault tolerance with respect to data is achieved in Hadoop? A.By breaking the data into smaller blocks and distributing these smaller blocks into several machines B.By adding extra nodes. C.By breaking the data into smaller blocks and copying each block several times, and distributing these replicas across several machines. By doing this Hadoop makes sure even if the machines are failed the replica is present in some other machine D.None of these Q.3) In what all parameters Hadoop scales up? A. Storage only B. Performan...