Top Hadoop Architecture Interview Questions

The hadoop.apache.org website defines Hadoop as "a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models." That is the philosophy in a nutshell: provide a framework that is simple to use, scales easily, and delivers fault tolerance and high availability for production use.

The idea is to use existing low-cost hardware to build a powerful system that can process petabytes of data very efficiently and quickly.

More: Top selected Hadoop Interview Questions

Hadoop achieves this by storing the data locally on its DataNodes and processing it locally as well. All of this is coordinated by the NameNode, which is the brain of the Hadoop system. Client applications contact the NameNode for file metadata and block locations, and then read and write the data directly from and to the DataNodes.
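To make that division of labor concrete, here is a minimal sketch using Hadoop's Java FileSystem API in which a client asks the NameNode where a file's blocks live. The path /data/sales/2024/transactions.csv is a placeholder chosen for illustration, not something from this article; the cluster connection simply comes from whatever Hadoop configuration is on the classpath.

Code Example (Java):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationDemo {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster described by the local Hadoop configuration files.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path -- replace with a file that exists in your cluster.
        Path file = new Path("/data/sales/2024/transactions.csv");
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers this metadata query; the data itself never flows through it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}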

Hadoop has two main components: the Hadoop Distributed File System (HDFS) and a framework for processing large amounts of data in parallel using the MapReduce paradigm.
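As a quick illustration of the MapReduce side, below is the classic word count job, sketched with the standard org.apache.hadoop.mapreduce API. It is a minimal example rather than a production job: the mapper emits (word, 1) pairs for each line of its input split, and the reducer sums the counts per word.

Code Example (Java):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split processed on the local node.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this would typically be launched with hadoop jar wordcount.jar WordCount /input /output, with both paths living in HDFS.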

HDFS


HDFS is a distributed file system layer that sits on top of the operating system's native file system. For example, HDFS can be installed on top of ext3, ext4, or XFS file systems on a Linux distribution such as Ubuntu.

It provides redundant storage for massive amounts of data using cheap, unreliable hardware. At load time, data is distributed across all the nodes, which helps with efficient MapReduce processing. By design, HDFS performs better with a small number of large (multi-gigabyte) files than with a large number of small files.

Files are "write once, read multiple times." Append support is available in later releases, but HDFS is meant for large, streaming reads, not random access. High sustained throughput is favored over low latency.
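Here is a small sketch of that access pattern with the Java FileSystem API: create a file once, stream the content in, then read it back sequentially from the beginning. The path /tmp/events.log is just a placeholder for this example.

Code Example (Java):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/events.log"); // hypothetical path

        // Write once: create the file and stream the content in sequentially.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("event-1\nevent-2\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many times: open the file and stream through it from the beginning.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}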

Files in HDFS are stored as blocks and replicated for redundancy and reliability. By default, each block is replicated three times across DataNodes, so three copies of every block are maintained. The block size is also much larger than in other file systems. For example, NTFS (for Windows) uses a default block (cluster) size of 4 KB, and Linux ext3 also defaults to 4 KB.

Compare that with the default block size of 64 MB that HDFS uses.
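If you want to see those numbers for yourself, a short sketch like the following (again using the Java FileSystem API) prints the block size and replication factor recorded for an existing file, plus the cluster defaults that new files pick up. The path shown is a placeholder.

Code Example (Java):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockAndReplicationInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/logs/access.log"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // Per-file values: each file records its own block size and replication factor.
        System.out.println("Block size (bytes):  " + status.getBlockSize());
        System.out.println("Replication factor:  " + status.getReplication());

        // Cluster-wide defaults that new files use unless overridden at create time.
        System.out.println("Default block size:  " + fs.getDefaultBlockSize(file));
        System.out.println("Default replication: " + fs.getDefaultReplication(file));
        fs.close();
    }
}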


NameNode



The NameNode (the "brain" of the system) stores the file system metadata and coordinates all access to HDFS. The metadata is kept in the NameNode's RAM so that block-location lookups are fast and the NameNode's response time stays low.

This configuration provides simple, centralized management, but it also makes the NameNode a single point of failure (SPOF) for HDFS. In earlier versions, a Secondary NameNode only checkpointed the metadata (merging the edit log into the file system image) to speed up recovery; it did not take over on failure. Current versions support an Active/Passive configuration with a Hot Standby NameNode that takes over all NameNode functions automatically, without any user intervention, eliminating the SPOF and providing NameNode redundancy.
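For a feel of what HA looks like from the client side, here is a hedged sketch expressed through Hadoop's Configuration API. The nameservice mycluster, the NameNode IDs nn1/nn2, and the host names are placeholders invented for this example; in practice these values normally live in hdfs-site.xml rather than in code.

Code Example (Java):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Logical nameservice for the NameNode pair; all names below are placeholders.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // Client-side proxy that retries against the standby when the active NameNode fails.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The client talks to whichever NameNode is currently active; failover is transparent.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Root exists: " + fs.exists(new Path("/")));
        fs.close();
    }
}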

Since the metadata is stored in the NameNode's RAM and each file entry (with its block locations) takes up some space, a large number of small files produces far more entries, and therefore consumes far more RAM, than a small number of large files holding the same amount of data.

Also, a file smaller than the block size (64 MB by default) still gets its own block and its own metadata entries on the NameNode, even though it does not fill the block; that's why it's preferable to use HDFS for large files rather than many small files.
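A rough back-of-the-envelope sketch makes the point. A commonly cited approximation is that each file and block object costs on the order of 150 bytes of NameNode heap; that figure is an estimate, not an exact number, but it shows how quickly small files add up compared to one large file holding the same data.

Code Example (Java):

public class NameNodeMemoryEstimate {
    // Commonly cited approximation: roughly 150 bytes of NameNode heap per
    // file or block object. This is an assumption for illustration only.
    private static final long BYTES_PER_OBJECT = 150;
    private static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default block size

    static long estimate(long files, long bytesPerFile) {
        // Round up to whole blocks; even a tiny file occupies one block entry.
        long blocksPerFile = Math.max(1, (bytesPerFile + BLOCK_SIZE - 1) / BLOCK_SIZE);
        long objects = files * (1 + blocksPerFile); // one file entry plus its block entries
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // The same 1 GB of data stored two different ways:
        long oneLargeFile  = estimate(1, 1024L * 1024 * 1024); // 1 file of 1 GB -> 16 blocks
        long manySmallOnes = estimate(1024, 1024L * 1024);     // 1024 files of 1 MB -> 1024 blocks
        System.out.println("1 x 1 GB file    : ~" + oneLargeFile  + " bytes of NameNode RAM");
        System.out.println("1024 x 1 MB files: ~" + manySmallOnes + " bytes of NameNode RAM");
    }
}

Under these assumptions, the same gigabyte of data costs roughly a hundred times more NameNode memory when split into 1 MB files than when stored as a single large file.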
