Posts

Showing posts with the label legacy

Featured Post

14 Top Data Pipeline Key Terms Explained

Here are some key terms commonly used in data pipelines:

1. Data Sources
Definition: Points where data originates (e.g., databases, APIs, files, IoT devices).
Examples: Relational databases (PostgreSQL, MySQL), APIs, cloud storage (S3), streaming data (Kafka), and on-premise systems.

2. Data Ingestion
Definition: The process of importing or collecting raw data from various sources into a system for processing or storage.
Methods: Batch ingestion, real-time/streaming ingestion.

3. Data Transformation
Definition: Modifying, cleaning, or enriching data to make it usable for analysis or storage.
Examples:
   Data cleaning (removing duplicates, fixing missing values).
   Data enrichment (joining with other data sources).
   ETL (Extract, Transform, Load).
   ELT (Extract, Load, Transform).

4. Data Storage
Definition: Locations where data is stored after ingestion and transformation.
Types:
   Data Lakes: Store raw, unstructured, or semi-structured data (e.g., S3, Azure Data Lake).
   Data Warehous...
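To make the ingestion and transformation terms concrete, here is a minimal Python sketch of a batch pipeline. The record fields (`id`, `customer`, `amount`), the `regions` lookup, and all function names are hypothetical and chosen only for illustration; a real pipeline would read from one of the sources listed above rather than an in-memory list.

```python
def ingest_batch(source_rows):
    """Batch ingestion: collect all raw records from a source in one pass."""
    return list(source_rows)


def clean(rows):
    """Data cleaning: drop duplicate ids and fill missing amounts with 0.0."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        if row["id"] in seen_ids:
            continue  # duplicate record, skip it
        seen_ids.add(row["id"])
        amount = row["amount"] if row["amount"] is not None else 0.0
        cleaned.append({**row, "amount": amount})
    return cleaned


def enrich(rows, regions):
    """Data enrichment: join each record with a second data source."""
    return [
        {**row, "region": regions.get(row["customer"], "unknown")}
        for row in rows
    ]


# Hypothetical raw data: one duplicate and one missing value.
raw = [
    {"id": 1, "customer": "a", "amount": 10.0},
    {"id": 1, "customer": "a", "amount": 10.0},
    {"id": 2, "customer": "b", "amount": None},
]
regions = {"a": "EU"}

# ETL order: extract (ingest), transform (clean + enrich), then load.
warehouse = enrich(clean(ingest_batch(raw)), regions)
```

In an ELT arrangement the raw rows would be loaded into storage first and the `clean` and `enrich` steps would run afterwards inside the storage system.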

How Hadoop Is Better for Legacy Data

Here are interview questions on legacy data. A lot of data still lives on legacy systems, and you can use Hadoop to process that data for useful insights.

1. How should we be thinking about migrating data from legacy systems?
Treat legacy data as you would any other complex data type. HDFS acts as an active archive, enabling you to cost-effectively store data in any form for as long as you like and access it when you wish to explore the data. And with the latest generation of data wrangling and ETL tools, you can transform, enrich, and blend that legacy data with other, newer data types to gain a unique perspective on what’s happening across your business.

2. What are your thoughts on getting combined insights from the existing data warehouse and Hadoop?
Typically, one of the starter use cases for moving relational data off a warehouse and into Hadoop is active archiving. This is the opportunity to take data that might have otherwise gone to the archive and k...
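As a rough illustration of "blending" legacy data with newer data types (the first answer above), the following Python sketch parses fixed-width legacy records and joins them with JSON events. The record layout, field names, and sample values are all assumptions made up for this example; they do not come from any particular legacy system or Hadoop tool.

```python
import json

# Hypothetical fixed-width layout: (field name, start offset, end offset).
LAYOUT = [("customer_id", 0, 6), ("name", 6, 16), ("balance_cents", 16, 24)]


def parse_legacy_record(line):
    """Slice one fixed-width line into a dict according to LAYOUT."""
    record = {name: line[start:end].strip() for name, start, end in LAYOUT}
    record["balance_cents"] = int(record["balance_cents"])
    return record


def blend(legacy_lines, event_lines):
    """Attach each customer's JSON event actions to their legacy record."""
    actions_by_customer = {}
    for raw_event in event_lines:
        event = json.loads(raw_event)
        actions_by_customer.setdefault(event["customer_id"], []).append(
            event["action"]
        )
    blended = []
    for line in legacy_lines:
        record = parse_legacy_record(line)
        record["recent_actions"] = actions_by_customer.get(
            record["customer_id"], []
        )
        blended.append(record)
    return blended


# Made-up sample data for both the legacy export and the newer event feed.
legacy_lines = [
    "000042Ada Smith 00012500",
    "000077Bob Jones 00000300",
]
event_lines = [
    '{"customer_id": "000042", "action": "login"}',
    '{"customer_id": "000042", "action": "purchase"}',
]

blended = blend(legacy_lines, event_lines)
```

At Hadoop scale the same parse-then-join idea would typically run in a distributed engine over files stored in HDFS rather than over in-memory lists.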