
10 Exclusive Steps You Need for Web Scraping

Here are ten Python techniques for cleaning scraped data. Scraped text carries unwanted hidden noise, so as part of cleaning, work through these ten steps on your data.

10 Steps for Web Scraping

Clean data is the prime input for text analytics projects. After cleaning, you can feed the text to machine learning or deep learning systems.
  1. Removing HTML tags
  2. Tokenization
  3. Removing unnecessary tokens and stop-words
  4. Handling contractions
  5. Correcting spelling errors
  6. Stemming
  7. Lemmatization
  8. Tagging
  9. Chunking
  10. Parsing

10 Techniques to Clean Text in Python


1. Removing HTML tags

Unstructured text gathered through web or screen scraping (data from web pages, blogs, and online repositories) contains a lot of noise.

HTML tags, JavaScript, and iframe markup typically add little value for understanding and analyzing text. Our purpose here is to strip the HTML tags and other markup noise.
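A minimal sketch of tag stripping using only the standard library's html.parser (real projects often use BeautifulSoup instead; the class name here is our own):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only text content, skipping <script>/<style>/<iframe> bodies."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside tags whose content we discard

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style", "iframe"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style", "iframe") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_html(html: str) -> str:
    parser = TagStripper()
    parser.feed(html)
    # Collapse runs of whitespace left behind by removed tags
    return " ".join(" ".join(parser.parts).split())

print(strip_html("<p>Hello <b>world</b><script>var x=1;</script></p>"))
# Hello world
```

Note that script and style bodies are dropped entirely, not just their tags, since their contents are code rather than readable text.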


2. Tokenization

  • Tokens are the minimal independent textual components, each with definite syntax and semantics. A paragraph or document contains several elements, which you can break down further into clauses, phrases, and words. 
  • The most popular techniques are sentence and word tokenization. You can use these to break a text document (or corpus) into sentences, and each sentence into words. 
  • Thus, tokenization is the process of splitting textual data into smaller, more meaningful components called tokens.
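The two levels of tokenization above can be sketched with plain regular expressions (a naive stand-in for library tokenizers such as NLTK's; the splitting rules here are simplistic and would miss abbreviations):

```python
import re

def sent_tokenize(text):
    # Split after sentence-ending punctuation followed by whitespace (naive)
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(sentence):
    # Words (with optional internal apostrophe) and standalone punctuation
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "Tokenization splits text. It produces tokens!"
print(sent_tokenize(text))
# ['Tokenization splits text.', 'It produces tokens!']
print(word_tokenize("Hello, world."))
# ['Hello', ',', 'world', '.']
```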


Python is popular for text analytics. Below, you will find the various cleaning techniques used in text analytics projects.
Text Analytics in Python 


3. Removing Unnecessary Stop Words

Stopwords have little or no significance and are usually removed from text during processing. They tend to be the most frequent tokens when you aggregate a corpus by token frequency. Words like "a," "the," and "and" are stopwords.
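A minimal sketch of stop-word filtering; the stopword set below is a tiny illustrative sample (libraries like NLTK ship much larger curated lists):

```python
# Tiny illustrative stopword set; real lists contain a few hundred words
STOPWORDS = {"a", "an", "the", "and", "or", "is", "of", "to", "in"}

def remove_stopwords(tokens):
    # Drop any token whose lower-cased form is a known stopword
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "cat", "sat", "on", "the", "mat"]))
# ['cat', 'sat', 'on', 'mat']
```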


4. Handling contractions

The best examples of contractions are you'll and it's. Expanding them back to you will and it is normalizes the text before analysis.
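A sketch of contraction expansion via a lookup table; the map below is deliberately small, and real projects use a much larger table (this simple version also lower-cases the replacement):

```python
import re

# Small illustrative contraction map
CONTRACTIONS = {
    "you'll": "you will",
    "it's": "it is",
    "don't": "do not",
    "can't": "cannot",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    # Replace each matched contraction with its expansion from the map
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("it's late, you'll miss it"))
# it is late, you will miss it
```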

5. Correcting spelling errors

This step auto-corrects spelling errors, much like a Google search automatically corrects the spelling in your query.
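A rough sketch of spelling correction using the standard library's difflib to find the closest word in a known vocabulary (the vocabulary below is illustrative; a real spell-checker uses a full dictionary and frequency statistics):

```python
import difflib

# Illustrative vocabulary; a real checker would load a full word list
VOCAB = ["python", "pandas", "tokenization", "scraping", "cleaning"]

def correct(word, vocab=VOCAB):
    # Return the closest known word above a similarity cutoff, else the word itself
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=0.8)
    return matches[0] if matches else word

print(correct("pythn"))      # python
print(correct("scrapping"))  # scraping
```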

6. Stemming

Stemming reduces words to their root form by chopping off suffixes, for example reducing running and runs to run. The Snowball stemmer is a popular algorithm for this, and the result is not always a real dictionary word.
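A toy suffix-stripping stemmer (a naive stand-in for Porter/Snowball stemmers; note how "runn" shows that stems need not be real words):

```python
def simple_stem(word):
    """Very naive suffix stripper, for illustration only."""
    for suffix in ("ing", "edly", "ed", "es", "s"):
        # Only strip if a reasonable stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([simple_stem(w) for w in ["running", "jumped", "cats", "run"]])
# ['runn', 'jump', 'cat', 'run']
```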


7. Lemmatization

Lemmatization brings words to their root form (the lemma) based on context, so unlike stemming, the result is always a meaningful dictionary word.
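In its simplest form, lemmatization can be sketched as a dictionary lookup (the map below is a tiny illustrative sample; real systems use WordNet-backed lemmatizers such as NLTK's, or spaCy models):

```python
# Tiny illustrative lemma dictionary
LEMMAS = {
    "ran": "run", "running": "run",
    "better": "good",
    "mice": "mouse",
    "was": "be", "is": "be",
}

def lemmatize(word):
    # Fall back to the lower-cased word when no lemma is known
    return LEMMAS.get(word.lower(), word.lower())

print([lemmatize(w) for w in ["Running", "mice", "was", "cat"]])
# ['run', 'mouse', 'be', 'cat']
```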


8. Tagging

Tagging groups particular words under a tag. The most common form is part-of-speech (POS) tagging, which labels each word as a noun, verb, adjective, and so on.
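A toy rule-based POS tagger (the rules and tag names are invented for illustration; real taggers such as NLTK's `pos_tag` are trained on corpora):

```python
import re

def naive_pos_tag(tokens):
    """Assign rough part-of-speech tags with hand-written rules."""
    tags = []
    for tok in tokens:
        if re.fullmatch(r"\d+(\.\d+)?", tok):
            tags.append((tok, "NUM"))
        elif tok.lower() in {"the", "a", "an"}:
            tags.append((tok, "DET"))
        elif tok.endswith("ly"):
            tags.append((tok, "ADV"))
        elif tok.endswith(("ing", "ed")):
            tags.append((tok, "VERB"))
        else:
            tags.append((tok, "NOUN"))  # default fallback
    return tags

print(naive_pos_tag(["The", "dog", "barked", "loudly"]))
# [('The', 'DET'), ('dog', 'NOUN'), ('barked', 'VERB'), ('loudly', 'ADV')]
```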


9. Chunking

Chunking constructs phrases from tagged words such as verbs, nouns, and adjectives, for example grouping them into noun phrases. Check out this post on Data Chunking.
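A minimal sketch of noun-phrase chunking over already-tagged tokens (the tag names follow the toy tagger idea above and are illustrative; real chunkers use grammar patterns, e.g. NLTK's `RegexpParser`):

```python
def noun_phrase_chunks(tagged):
    """Group runs of DET/ADJ/NOUN tokens into simple noun-phrase chunks."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in ("DET", "ADJ", "NOUN"):
            current.append((word, tag))
        else:
            # Emit the run only if it actually contains a noun
            if any(t == "NOUN" for _, t in current):
                chunks.append(" ".join(w for w, _ in current))
            current = []
    if any(t == "NOUN" for _, t in current):
        chunks.append(" ".join(w for w, _ in current))
    return chunks

tagged = [("The", "DET"), ("dog", "NOUN"), ("chased", "VERB"),
          ("a", "DET"), ("red", "ADJ"), ("ball", "NOUN")]
print(noun_phrase_chunks(tagged))
# ['The dog', 'a red ball']
```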


10. Parsing

The data is passed through a set of syntax rules, and the output is then fed to machine learning systems. The syntax rules vary from project to project.
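The idea of validating tagged text against project-specific syntax rules can be sketched as a toy grammar check (the rules and tag names here are invented for illustration; real parsers build full parse trees):

```python
# Illustrative grammar: each rule is an allowed tag sequence for a sentence
GRAMMAR = [
    ["DET", "NOUN", "VERB"],         # e.g. "The dog barked"
    ["DET", "NOUN", "VERB", "ADV"],  # e.g. "The dog barked loudly"
]

def is_valid(tags):
    # A tag sequence is valid if it matches any rule exactly
    return any(tags == rule for rule in GRAMMAR)

print(is_valid(["DET", "NOUN", "VERB", "ADV"]))  # True
print(is_valid(["VERB", "DET"]))                 # False
```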
