Posts

Showing posts with the label Hadoop

Featured Post

14 Top Data Pipeline Key Terms Explained

Here are some key terms commonly used in data pipelines.
1. Data Sources
Definition: Points where data originates (e.g., databases, APIs, files, IoT devices).
Examples: Relational databases (PostgreSQL, MySQL), APIs, cloud storage (S3), streaming data (Kafka), and on-premise systems.
2. Data Ingestion
Definition: The process of importing or collecting raw data from various sources into a system for processing or storage.
Methods: Batch ingestion, real-time/streaming ingestion.
3. Data Transformation
Definition: Modifying, cleaning, or enriching data to make it usable for analysis or storage.
Examples: Data cleaning (removing duplicates, fixing missing values). Data enrichment (joining with other data sources). ETL (Extract, Transform, Load). ELT (Extract, Load, Transform).
4. Data Storage
Definition: Locations where data is stored after ingestion and transformation.
Types: Data Lakes: Store raw, unstructured, or semi-structured data (e.g., S3, Azure Data Lake). Data Warehous...
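
The ingestion, transformation, and storage terms above can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the CSV sample, the `orders` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw data standing in for a data source (file, API export, ...).
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
2,bob,
3,carol,7.25
"""

def extract(raw: str) -> list[dict]:
    """Ingestion (batch): read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: list[dict]) -> list[tuple]:
    """Transformation: drop duplicate rows and fill missing amounts with 0.0."""
    seen = set()
    cleaned = []
    for row in rows:
        key = (row["order_id"], row["customer"], row["amount"])
        if key in seen:
            continue  # duplicate row, skip it
        seen.add(key)
        amount = float(row["amount"]) if row["amount"] else 0.0
        cleaned.append((int(row["order_id"]), row["customer"], amount))
    return cleaned

def load(rows: list[tuple]) -> sqlite3.Connection:
    """Storage: write the cleaned rows into a table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

# Extract, Transform, Load (ETL) in that order.
conn = load(transform(extract(RAW_CSV)))
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())
```

Swapping the order of `transform` and `load` (loading the raw rows first and cleaning them inside the database) would turn the same sketch into ELT.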

6 Exclusive Differences Between Structured and Unstructured data

Here's a basic interview question for big data engineers. It counts as basic because many bachelor's degree programs now offer courses on big data, yet for a beginner, understanding data is a little tricky, so interviewers stress this point. Don't worry, I have simplified it so that you get a clear concept. I share here a total of six differences between the two. In today's world, we have a lot of data, and much of that data is in an unstructured format.
Structured Data: The major data format is text, which can be string or numeric. Dates are also supported. The data model is fixed before inserting the data. Data is stored in the form of a table, making it easy to search. Not easy to scale. Version is maintained as a column in the table. Transaction management and concurrency are easy to support.
Unstructured Data: The data format can be anything from text to images, audio to videos. The data model cannot be fixed since the nature of the data can change. Consider a tweet message that could be text foll...

Here is Hadoop MapReduce DataFlow Tutorial

Here are the six stages of MapReduce. MapReduce is critical for your data processing needs. Traditionally, the whole file had to be read at once and then divided manually, which is not convenient. Hadoop instead provides the facility to read files, regardless of their size, line by line, using the offset as the key and the line as the value.
MapReduce Dataflow Quick Tutorial
1. Dataflow Diagram
2. MapReduce Stages
MapReduce receives input and processes it. Here are the six stages of processing. They are helpful for your interviews and projects.
MapReduce Stage-1: Take the file as input for processing. Any file consists of a group of lines. These lines contain key-value pairs of data. The whole file can be read with this method.
MapReduce Stage-2: In the next step, the file is in "splitting" mode. This mode divides the file into key-value pairs of data. Here the key is the offset and the value is the data on the line. Each line will be read individually so there...
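
The splitting stage described above can be illustrated with a short Python sketch. It mimics, in plain Python rather than Hadoop code, how a text file becomes a series of (key, value) records in which the key is the byte offset of each line and the value is the line's text. The function name `split_into_records` and the sample data are hypothetical.

```python
from typing import Iterator, Tuple

def split_into_records(data: bytes) -> Iterator[Tuple[int, str]]:
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line without its newline.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)

sample = b"deer bear river\ncar car river\ndeer car bear\n"
for key, value in split_into_records(sample):
    print(key, value)
```

Because each record carries its own offset, lines can be handed to different map tasks without any of them having to read the whole file first.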

5 HBase Vs. RDBMS Top Functional Differences

Here are the differences between RDBMS and HBase. In the big data context, HBase has a lot of benefits over an RDBMS. The differences listed below help explain why HBase is popular on the Hadoop (or big data) platform.
Random Accessing: HBase handles a large amount of data that is stored in a distributed manner in a column-oriented format, while an RDBMS is systematic storage of a database that cannot support random access to the database.
Database Rules: An RDBMS strictly follows Codd's 12 rules, with fixed schemas and row-oriented storage, and also follows ACID properties. HBase follows BASE properties and implements complex queries. Secondary indexes, complex inner and outer joins, count, sum, sort, group, and page and table data are all easily accessible in an RDBMS.
Storage: From small to medium storage application there is the use of RDBMS that provides the solution with MySQ...

Hadoop fs (File System) Commands List

The Hadoop HDFS file system commands are given in this post. They are useful for your projects and interviews.
HDFS File System Commands
hadoop fs -cmd <args>
Here cmd is a specific file command and <args> is a variable number of arguments.
The List of Commands
cat
hadoop fs -cat FILE [FILE …]
Displays the files' content. For reading compressed files, use the text command instead.
chgrp
hadoop fs -chgrp [-R] GROUP PATH [PATH …]
Changes the group association for files and directories. The -R option applies the change recursively. The user must be the files' owner or a superuser.
chmod
hadoop fs -chmod [-R] MODE[,MODE …] PATH [PATH …]
Changes the permissions of files and directories. Like its Unix equivalent, MODE can be a 3-digit octal mode, or {augo}+/-{rwxX}. The -R option applies the change recursively. The user must be the files' owner or a superuser.
chown
hadoop fs -chown [-R] [OWNER][:[GROUP]] PATH [PATH…]
Changes the ownership of files and di...
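
As a small illustration of the `hadoop fs -cmd <args>` pattern above, the following Python sketch builds the argument lists for three of the listed commands without running them. The helper name `hadoop_fs`, the paths, and the group name `analysts` are hypothetical; only the `cat`, `chgrp`, and `chmod` commands and the `-R` option come from the list above.

```python
import shlex
from typing import List

def hadoop_fs(cmd: str, *args: str) -> List[str]:
    """Build the argument list for `hadoop fs -cmd <args>` (nothing is executed)."""
    return ["hadoop", "fs", f"-{cmd}", *args]

# Hypothetical paths and group name, for illustration only.
commands = [
    hadoop_fs("cat", "/data/example.txt"),
    hadoop_fs("chgrp", "-R", "analysts", "/data/reports"),
    hadoop_fs("chmod", "-R", "750", "/data/reports"),
]

for argv in commands:
    # shlex.join renders the list as a shell-safe command line.
    print(shlex.join(argv))
```

On a machine with a Hadoop client installed, each list could be passed to `subprocess.run(argv, check=True)` to actually execute the command.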

Hadoop Vs RDBMS Real Differences

Hadoop comes into the picture to process large volumes of unstructured data. Structured data is already taken care of by traditional databases.
Traditional databases. Traditional relational databases have been able to store massive data sets for a long time. An Oracle 10g database can store over 8 petabytes, while for many years DB2 databases have been capable of storing well over 500 petabytes. Of course, this is all theoretical. No customer has an Oracle or DB2 database that approaches sizes even close to that. Why? Because the speed, or velocity, at which data can be loaded and queries can be executed approaches zero well before then. Similarly, all traditional relational databases can store any variety of data as text or binary large objects. The problem is that large volumes of unstructured data cannot be moved fast enough to enable rapid search and retrieval.
Hadoop Processing. Running constant and predictable workloads is what your existing data warehouse ha...

11 Top PIG Interview Questions

Here are the top Pig interview questions. They are useful for your projects and interviews.
1). What is Pig? Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig’s infrastructure layer consists of a compiler that produces a sequence of MapReduce programs.
2). What is the difference between logical and physical plans? Pig goes through several steps when a Pig Latin script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the physical operators that are needed to execute the script.
3). Does ‘ILLUSTRATE’ run an MR job? No, ILLUSTRATE will not launch any MR job; it will pull the internal data. On the console, illustr...

AWS EMR Vs. Hadoop: 5 Top Differences

With Amazon Elastic MapReduce (Amazon EMR), you can analyze and process vast amounts of data. It distributes the computational work across a cluster of virtual servers running in the Amazon cloud. The cluster is managed by Hadoop, an open-source framework.
Amazon EMR (Elastic MapReduce): The Unique Features
Amazon EMR has made enhancements to Hadoop and other open-source applications to work seamlessly with AWS. For instance, Hadoop clusters running on Amazon EMR use EC2 instances as virtual Linux servers for the master and slave nodes, Amazon S3 for bulk storage of input and output data, and CloudWatch to monitor cluster performance and raise alarms. Also, you can move data into and out of DynamoDB using Amazon EMR and Hive. All of this is orchestrated by Amazon EMR control software that launches and manages the Hadoop cluster. This process is called an Amazon EMR cluster.
What does Hadoop do? Hadoop uses a distributed processing architecture called MapReduce, in which a task ma...

How Hadoop is Better for Legacy data

Here is an interview question on legacy data. You all know that a lot of data is available on legacy systems. You can use Hadoop to process the data for useful insights.
1. How should we be thinking about migrating data from legacy systems? Treat legacy data as you would any other complex data type. HDFS acts as an active archive, enabling you to cost-effectively store data in any form for as long as you like and access it when you wish to explore the data. And with the latest generation of data wrangling and ETL tools, you can transform, enrich, and blend that legacy data with other, newer data types to gain a unique perspective on what’s happening across your business.
2. What are your thoughts on getting combined insights from the existing data warehouse and Hadoop? Typically one of the starter use cases for moving relational data off a warehouse and into Hadoop is active archiving. This is the opportunity to take data that might have otherwise gone to the archive and k...

The Ultimate Cheat Sheet On Hadoop

The Hadoop cheat sheet below gives the top 20 frequently asked questions to test your Hadoop knowledge. Try finding your own answers, then compare them with the answers given here.
Question #1
You have written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which of the following interfaces is most likely to reduce the amount of intermediate data transferred across the network?
A. Writable
B. WritableComparable
C. InputFormat
D. OutputFormat
E. Combiner
F. Partitioner
Ans: e
Question #2
Where is the Hive metastore stored by default?
A. In HDFS
B. On the client machine in the form of a flat file.
C. On the client machine in a Derby database
D. In lib directory of HADOOP_HOME, and requires HADOOP_CLASSPATH to be modif...
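
To see why the combiner is the answer to Question #1, here is a small plain-Python simulation (not Hadoop API code). It counts how many intermediate key-value pairs would leave the mappers with and without map-side combining. The two-mapper word-count input and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, int]

def map_words(lines: List[str]) -> List[Pair]:
    """Mapper: emit (word, 1) for every word in this mapper's input."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs: Iterable[Pair]) -> List[Pair]:
    """Combiner: pre-aggregate one mapper's output before it is sent on."""
    totals: Counter = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())

def reduce_counts(pairs: Iterable[Pair]) -> Dict[str, int]:
    """Reducer: sum all counts per key."""
    totals: Counter = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Each inner list is the input of one (hypothetical) mapper.
mapper_inputs = [["a a a b", "a b"], ["b b c", "a c c"]]

without_combiner = [p for lines in mapper_inputs for p in map_words(lines)]
with_combiner = [p for lines in mapper_inputs for p in combine(map_words(lines))]

print(len(without_combiner), len(with_combiner))
print(reduce_counts(without_combiner) == reduce_counts(with_combiner))
```

The final counts are identical in both cases because summing is associative and commutative; a combiner is only safe for operations with that property.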

Big data: Quiz-2 Hadoop Top Interview Questions

I hope you enjoyed my previous post. This is the second set of questions exclusively for big data engineers. Read QUIZ-1.
Q.1) You have submitted a job on an input file which has 400 input splits in HDFS. How many map tasks will run?
A. At most 400.
B. At least 400
C. Between 400 and 1200.
D. Between 100 and 400.
Ans: c
QUESTION 2
What is not true about LocalJobRunner mode? Choose two
A. It requires JobTracker up and running.
B. It runs Mapper and Reducer in one single process
C. It stores output in local file system
D. It allows use of Distributed Cache.
Ans: d,a
QUESTION 3
What is the command you will use to run a driver named “SalesAnalysis” whose compiled code is available in a jar file “SalesAnalytics.jar” with input data in directory “/sales/data” and output in a directory “/sales/analytics”?
A. hadoopfs -jar SalesAnalytics.jar SalesAnalysis -input /sales/data -output /sales/analysis
B. hadoo...

Top Key Architecture Components in HIVE

Hadoop Hive has 5 architectural components:
Shell: allows interactive queries, like a MySQL shell connected to a database
– Also supports web and JDBC clients
Driver: session handles, fetch, execute
Compiler: parse, plan, optimize
Execution engine: DAG of stages (M/R, HDFS, or metadata)
Metastore: schema, location in HDFS, SerDe
Data Model of Hive:
Tables
– Typed columns (int, float, string, date, boolean)
– Also list and map types (for JSON-like data)
Partitions
– e.g., to range-partition tables by date
Buckets
– Hash partitions within ranges (useful for sampling, join optimization)
Hive Metastore:
Database: namespace containing a set of tables
Holds table definitions (column types, physical layout)
Partition data
Uses JPOX ORM for implementation; can be stored in Derby, MySQL, and many other relational databases
Physical Layout of Hive:
Warehouse directory in HDFS
– e.g., /home/hive/warehouse
Tables stored in subdirectories of warehouse
– Partitions, buc...
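
The idea of partitions by date with hash buckets inside each partition can be sketched in plain Python. This is only an illustration of the concept under stated assumptions: CRC32 is used as a stand-in hash and is not Hive's actual hash function, and the column names `event_date` and `user_id`, the bucket count, and the sample rows are all hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_BUCKETS = 4  # hypothetical bucket count

def bucket_for(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Pick a bucket as hash(key) mod num_buckets.

    CRC32 is a stand-in here, NOT the hash function Hive itself uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def layout(rows: List[dict]) -> Dict[Tuple[str, int], List[dict]]:
    """Group rows first by partition value (date), then by bucket (hash of user_id)."""
    placed = defaultdict(list)
    for row in rows:
        placed[(row["event_date"], bucket_for(row["user_id"]))].append(row)
    return dict(placed)

rows = [
    {"event_date": "2024-01-01", "user_id": "u1", "clicks": 3},
    {"event_date": "2024-01-01", "user_id": "u2", "clicks": 1},
    {"event_date": "2024-01-01", "user_id": "u1", "clicks": 5},
    {"event_date": "2024-01-02", "user_id": "u1", "clicks": 2},
]

for (date, bucket), members in sorted(layout(rows).items()):
    print(date, bucket, [m["user_id"] for m in members])
```

Because the same key always lands in the same bucket, reading a single bucket gives a consistent sample of keys, and two tables bucketed the same way on the join key can be joined bucket by bucket.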

What is MapR? Here are Top Features to Use in Data Analytics

In the following post I have given information about MapR and its popular features. MapR is a San Jose, California-based enterprise software company that develops and sells Apache Hadoop-derived software. The company contributes to Apache Hadoop projects such as HBase, Pig (programming language), Apache Hive, and Apache ZooKeeper. Its platform covers Apache Hadoop and Apache Spark, a distributed file system, a multi-model database management system, and event stream processing, combining analytics in real-time with operational applications. Its technology runs on both commodity hardware and public cloud computing services. MapR was selected by Amazon to provide an upgraded version of Amazon’s Elastic MapReduce (EMR) service. MapR has also been selected by Google as a technology partner. MapR was able to break the MinuteSort speed record on Google’s compute platform. MapR offers 3 editions of its product, known as M3, M5 and M7. M3 is a free version of the M5 arti...

Top requirements for successful MapReduce jobs

The following are required for your MapReduce jobs to succeed:
The mapper must be able to ingest the input and process the input record, sending forward the records that can be passed to the reduce task or to the final output directly, if no reduce step is required.
The reducer must be able to accept the key and value groups that passed through the mapper, and generate the final output of this MapReduce step.
The job must be configured with the location and type of the input data, the mapper class to use, the number of reduce tasks required, and the reducer class and I/O types.
The TaskTracker service will actually run your map and reduce tasks, and the JobTracker service will distribute the tasks and their input split to the various trackers.
The cluster must be configured with the nodes that will run the TaskTrackers, and with the number of TaskTrackers to run per node.
The TaskTrackers need to be configured with the JVM parameters, includ...
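
The mapper, reducer, and job configuration requirements above can be sketched as a tiny in-memory simulation. This is plain Python, not the Hadoop API: `run_job`, `word_mapper`, and `sum_reducer` are hypothetical names, CRC32 is an assumed stand-in for the partitioning hash, and the three input records are made up for illustration.

```python
import zlib
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

def word_mapper(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: ingest one input record and send (word, 1) pairs forward."""
    for word in line.split():
        yield word.lower(), 1

def sum_reducer(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Reducer: accept a key with its group of values and emit the final output."""
    yield key, sum(values)

def run_job(records: List[Tuple[int, str]],
            mapper: Callable,
            reducer: Optional[Callable] = None,
            num_reducers: int = 1) -> List[List[Tuple[str, int]]]:
    """Run the configured mapper and reducer; returns one output list per reduce task."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    if reducer is None or num_reducers == 0:
        return [mapped]  # map-only job: mapper output is the final output
    # Shuffle: route each key to one reduce task and group its values.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped:
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[index][key].append(value)
    outputs = []
    for groups in partitions:
        part = []
        for key in sorted(groups):
            part.extend(reducer(key, groups[key]))
        outputs.append(part)
    return outputs

records = [(0, "Deer Bear River"), (16, "Car Car River"), (30, "Deer Car Bear")]
parts = run_job(records, word_mapper, sum_reducer, num_reducers=2)
totals = dict(pair for part in parts for pair in part)
print(totals)
```

Calling `run_job(records, word_mapper)` without a reducer shows the map-only case, where the mapper's records go straight to the final output.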

Top features of Apache Avro in Hadoop eco-System

Avro defines a data format designed to support data-intensive applications, and provides support for this format in a variety of programming languages. The Hadoop ecosystem includes a new binary data serialization system, Avro. Avro provides:
· Rich data structures.
· A compact, fast, binary data format.
· A container file, to store persistent data.
· Remote procedure call (RPC).
· Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.
Its functionality is similar to that of other marshaling systems such as Thrift, Protocol Buffers, and so on. The main differentiators of Avro...
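
To give a feel for what "compact binary data format" means, here is a hand-rolled Python sketch of how the Avro specification encodes `long` values (variable-length zig-zag coding) and strings (a `long` byte length followed by UTF-8 bytes). It is an illustration only and does not use the Avro library; the function names are hypothetical.

```python
def zigzag(n: int) -> int:
    """Map a signed 64-bit value to an unsigned one so small magnitudes stay small."""
    return (n << 1) ^ (n >> 63)

def encode_long(n: int) -> bytes:
    """Encode a long as zig-zag followed by a base-128 variable-length integer."""
    z = zigzag(n)
    out = bytearray()
    while True:
        byte = z & 0x7F          # lowest 7 bits
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    """Encode a string as its UTF-8 byte length (a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

for value in (0, -1, 1, -64, 64):
    print(value, encode_long(value).hex())
print(encode_string("foo").hex())
```

Small numbers, whether positive or negative, take a single byte, which is one reason Avro data files tend to be smaller than text formats carrying the same values.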