Python: How to Work With Various File Formats

This post shows Python functions for reading and parsing common file formats: text, CSV, JSON, Excel, PDF, XML, images, Zip archives, SQLite databases, and YAML.

Python Reading Files


import pandas as pd
import json
import xml.etree.ElementTree as ET
from PIL import Image
import pytesseract
import PyPDF2
from zipfile import ZipFile
import sqlite3
import yaml

Reading Text Files


# Read text file (.txt)
def read_text_file(file_path):
    with open(file_path, 'r') as file:
        text = file.read()
    return text
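
Reading the whole file at once is fine for small files. For large ones, a minimal sketch like the following reads one line at a time and states the encoding explicitly. The helper name iter_text_lines and the UTF-8 default are assumptions for illustration, not part of the original function.

```python
import os
import tempfile


def iter_text_lines(file_path, encoding="utf-8"):
    """Yield the lines of a text file one at a time, without trailing newlines."""
    with open(file_path, "r", encoding=encoding) as file:
        for line in file:
            yield line.rstrip("\n")


# Small demo on a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")
    demo_lines = list(iter_text_lines(demo_path))
```

Because the function is a generator, only one line is held in memory at a time until you collect the results yourself.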

Reading CSV Files


# Read CSV file (.csv)
def read_csv_file(file_path):
    df = pd.read_csv(file_path)
    return df
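
If pandas is not available, the standard library's csv module can read the same file. The sketch below is one possible alternative, not a replacement for the function above; the name read_csv_rows is hypothetical, and it assumes the first row of the file holds the column names.

```python
import csv
import os
import tempfile


def read_csv_rows(file_path, encoding="utf-8"):
    """Return the CSV rows as a list of dicts keyed by the header row."""
    # newline="" lets the csv module handle line endings itself
    with open(file_path, "r", newline="", encoding=encoding) as file:
        return list(csv.DictReader(file))


# Small demo on a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        f.write("name,age\nAda,36\nLinus,54\n")
    demo_rows = read_csv_rows(demo_path)
```

Unlike pandas, csv.DictReader does not infer types, so every value comes back as a string.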


Reading JSON Files


# Read JSON file (.json)
def read_json_file(file_path):
    with open(file_path, 'r') as file:
        json_data = json.load(file)
    return json_data
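
json.load raises json.JSONDecodeError when the file is not valid JSON. Here is a hedged sketch of one way to report where the problem is; the wrapper name read_json_file_checked and the wording of the error message are assumptions for illustration.

```python
import json
import os
import tempfile


def read_json_file_checked(file_path):
    """Load a JSON file and report the file and position if parsing fails."""
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            return json.load(file)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"{file_path}: invalid JSON at line {err.lineno}, "
            f"column {err.colno}: {err.msg}"
        ) from err


# Small demo: one valid file and one truncated file
with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "good.json")
    bad_path = os.path.join(tmp_dir, "bad.json")
    with open(good_path, "w", encoding="utf-8") as f:
        f.write('{"id": 1, "tags": ["a", "b"]}')
    with open(bad_path, "w", encoding="utf-8") as f:
        f.write('{"id": 1,')  # truncated on purpose

    good_data = read_json_file_checked(good_path)
    try:
        read_json_file_checked(bad_path)
        bad_message = None
    except ValueError as err:
        bad_message = str(err)
```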

Reading Excel Files


# Read Excel file (.xlsx, .xls)
def read_excel_file(file_path):
    df = pd.read_excel(file_path)
    return df

Reading PDF files


# Read PDF file (.pdf)
def read_pdf_file(file_path):
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text


Reading XML Files


# Read XML file (.xml)
def read_xml_file(file_path):
    tree = ET.parse(file_path)
    root = tree.getroot()
    return root
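
The function above returns the root element, so the caller still has to walk the tree. The sketch below shows one way to do that. The catalog/book/title structure is made-up sample data, and ET.fromstring is used only so the example runs without a file on disk; it returns a root element just like tree.getroot().

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document
sample_xml = """<catalog>
  <book id="b1"><title>Python Basics</title></book>
  <book id="b2"><title>Data Pipelines</title></book>
</catalog>"""

root = ET.fromstring(sample_xml)

# Collect (id attribute, title text) for every <book> element in the tree
books = [(book.get("id"), book.findtext("title")) for book in root.iter("book")]
```

Here get reads an attribute, findtext reads the text of a child element, and iter visits all matching elements at any depth.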


Reading Image Files


# Extract text from an image file (.jpg, .png, etc.) using OCR
def read_image_file(file_path):
    with Image.open(file_path) as image:
        text = pytesseract.image_to_string(image)
    return text

Reading Zip Files


# List the files inside a Zip archive (.zip)
def read_compressed_file(file_path):
    with ZipFile(file_path, 'r') as zip_file:
        files = zip_file.namelist()
    return files
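
Listing the names is often only the first step. As a rough sketch, a single member can be read straight from the archive without extracting it to disk. The helper name read_zip_member is hypothetical, and the demo builds a small archive in memory so that it runs on its own.

```python
import io
from zipfile import ZipFile


def read_zip_member(zip_source, member_name, encoding="utf-8"):
    """Return the decoded text of one file stored inside a Zip archive."""
    with ZipFile(zip_source, "r") as zip_file:
        return zip_file.read(member_name).decode(encoding)


# Build a small archive in memory for the demo
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("notes/readme.txt", "hello from inside the zip")
buffer.seek(0)

member_text = read_zip_member(buffer, "notes/readme.txt")
```

ZipFile only understands the Zip format. For .tar.gz archives, the standard library's tarfile module is the usual choice.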


Reading SQLite Files


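# Read SQLite database file (.db)
def read_sqlite_file(file_path):
    conn = sqlite3.connect(file_path)
    try:
        cursor = conn.cursor()
        # "table_name" is a placeholder; replace it with a table in your database
        cursor.execute("SELECT * FROM table_name")
        data = cursor.fetchall()
    finally:
        conn.close()  # always release the connection
    return data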
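
The query above uses table_name as a placeholder, so you need to know which tables the database actually contains. One way to find out, sketched below, is to query SQLite's built-in sqlite_master catalog. The helper name list_sqlite_tables and the demo tables (users, orders) are assumptions for illustration.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def list_sqlite_tables(file_path):
    """Return the names of all tables in a SQLite database file."""
    # closing() makes sure the connection is closed when the block ends
    with closing(sqlite3.connect(file_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Small demo on a temporary database
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_db = os.path.join(tmp_dir, "demo.db")
    with closing(sqlite3.connect(demo_db)) as conn:
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
        conn.commit()
    demo_tables = list_sqlite_tables(demo_db)
```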

Reading YAML Files


# Read YAML file (.yaml)
def read_yaml_file(file_path):
    with open(file_path, 'r') as file:
        yaml_data = yaml.safe_load(file)
    return yaml_data

# Usage examples
txt_file = "/path/to/text/file.txt"
txt_data = read_text_file(txt_file)

csv_file = "/path/to/csv/file.csv"
csv_dataframe = read_csv_file(csv_file)

json_file = "/path/to/json/file.json"
json_data = read_json_file(json_file)

excel_file = "/path/to/excel/file.xlsx"
excel_dataframe = read_excel_file(excel_file)

pdf_file = "/path/to/pdf/file.pdf"
pdf_text = read_pdf_file(pdf_file)

xml_file = "/path/to/xml/file.xml"
xml_data = read_xml_file(xml_file)

image_file = "/path/to/image/file.jpg"
image_text = read_image_file(image_file)

zip_file = "/path/to/compressed/file.zip"
compressed_files = read_compressed_file(zip_file)

sqlite_file = "/path/to/sqlite/file.db"
sqlite_data = read_sqlite_file(sqlite_file)

yaml_file = "/path/to/yaml/file.yaml"
yaml_data = read_yaml_file(yaml_file)
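
When the file type is not known in advance, the reader can be chosen from the file extension. The following is only a sketch of that idea: pick_reader is a hypothetical helper, and the demo uses stand-in lambdas where a real script would register functions such as read_text_file or read_csv_file.

```python
from pathlib import Path


def pick_reader(file_path, readers):
    """Return the reader registered for the file's extension (case-insensitive)."""
    suffix = Path(file_path).suffix.lower()
    try:
        return readers[suffix]
    except KeyError:
        raise ValueError(f"No reader registered for {suffix!r}") from None


# Stand-in readers so the sketch runs on its own
demo_readers = {
    ".txt": lambda path: f"text reader called for {path}",
    ".csv": lambda path: f"csv reader called for {path}",
}

chosen = pick_reader("/path/to/REPORT.CSV", demo_readers)

try:
    pick_reader("/path/to/file.pdf", demo_readers)
    unknown_error = None
except ValueError as err:
    unknown_error = str(err)
```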


Note that sqlite3, json, zipfile, and xml.etree.ElementTree are part of Python's standard library and need no installation. The other libraries must be installed before running the code, for example with pip: pandas (plus openpyxl for reading .xlsx files), PyPDF2, Pillow (imported as PIL), pytesseract, and PyYAML (imported as yaml). pytesseract is only a wrapper, so the Tesseract OCR engine itself must also be installed on your system.

Conclusion


Being able to read different file formats is an essential Python skill because it lets developers work with a diverse range of data sources.
