How to Handle Spaces in PySpark DataFrame Column Names

- July 30, 2023

In PySpark, you can run SQL queries against CSV data by loading the file into a DataFrame and registering it as a temporary view. Column names that contain spaces cause trouble in those queries, because Spark SQL does not read an unquoted name with a space as a single identifier. The fix is to wrap such names in backticks.

Reading a CSV File into a DataFrame

Here is the PySpark code that reads a CSV file into a DataFrame and queries it through a temporary view.

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder \
    .appName("PySpark Tutorial") \
    .getOrCreate()

# Read the CSV file into the DataFrame df
data_path = '/content/Test1.csv'
df = spark.read.csv(data_path, header=True, inferSchema=True)

# Register the DataFrame as a temporary view
df.createOrReplaceTempView("temp_table")

# Query the temporary view with SQL
spark.sql("select * from temp_table").show()

Output:

+----------+-----+-----+-----+
|       si1|year1|62.08| 62.4|
|       si1|year2|75.94|76.75|
|       si2|year1|68.26|72.95|
|       si2|year2|85.49| 75.8|
|       si3|year1|75.08|79.84|
|       si3|year2|54.98|87.72|
|       si4|year1|50.03|66.85|
|       si4|year2|71.26|69.77|
|       si5|year1|52.74|76.27|
|       si5|year2|50.39|68.58|
|       si6|year1|74.86| 60.8|
|       si6|year2|58.29|62.38|
|       si7|year1|63.95|74.51|
|       si7|year2|66.69|56.92|
+----------+-----+-----+-----+

Fix for Spaces in the Column Name

Suppose the CSV has a column named "Student ID". Because the name contains a space, wrap it in backticks in the SQL query. Aliasing it (here as sid) gives the result a space-free name.

spark.sql("select `Student ID` as sid from temp_table").show()

Output:

+---+
|sid|
+---+
|si1|
|si1|
|si2|
|si2|
|si3|
|si3|
|si4|
|si4|
|si5|
|si5|
|si6|
|si6|
|si7|
|si7|
+---+
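
When queries are assembled in code, the backtick quoting can be applied by a small helper rather than typed by hand. The following is a minimal sketch in plain Python. The function names quote_column and build_select are hypothetical, not part of PySpark, and the helper assumes Spark SQL's convention that a backtick inside a quoted name is escaped by doubling it.

```python
def quote_column(name):
    """Wrap a column name in backticks so Spark SQL reads it as one identifier."""
    # Assumption: a literal backtick inside the name is escaped by doubling it.
    return "`" + name.replace("`", "``") + "`"


def build_select(columns, table):
    """Build a SELECT statement with every column name backtick-quoted."""
    quoted = ", ".join(quote_column(c) for c in columns)
    return f"select {quoted} from {table}"


# Build the query text for the "Student ID" column of the temporary view.
query = build_select(["Student ID"], "temp_table")
print(query)  # select `Student ID` from temp_table
```

With an active session, the resulting string can be passed to spark.sql(query).show() in the same way as the handwritten query above.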
