How to read and write Parquet files in PySpark

This recipe helps you read and write Parquet files in PySpark

Recipe Objective - How to read and write Parquet files in PySpark?

Apache Parquet is a columnar file format that provides optimizations to speed up queries. It is a more efficient file format than CSV or JSON and is supported by many data processing systems, including multiple frameworks in the Hadoop ecosystem. It offers efficient data compression and encoding schemes with enhanced performance for handling complex data in bulk. Spark SQL supports both reading and writing Parquet files, automatically captures the schema of the original data, and reduces data storage by roughly 75% on average. Apache Spark supports the Parquet file format out of the box, so no additional dependency libraries are needed. Apache Parquet also reduces input and output operations and consumes less space.
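As a quick illustration of these points, here is a minimal sketch (not part of the recipe code below, and using a hypothetical output path "/tmp/output/demo.parquet") showing that Spark stores the schema inside the Parquet file and can read back only the columns a query needs:

# Minimal sketch: schema capture and column pruning with Parquet
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Parquet Demo").getOrCreate()

# A small sample dataframe
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

# Writing to Parquet stores the schema alongside the data
df.write.mode("overwrite").parquet("/tmp/output/demo.parquet")

# Reading back recovers the schema from the file metadata;
# selecting a single column only scans that column's data
spark.read.parquet("/tmp/output/demo.parquet").select("letter").show()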


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains the Parquet file format and its advantages, and demonstrates reading and writing data as a dataframe in Parquet file format in PySpark.

Implementing reading and writing Parquet files in PySpark in Databricks

# Importing packages
import pyspark
from pyspark.sql import SparkSession


The PySpark SQL package is imported into the environment to read and write data as a dataframe in Parquet file format in PySpark.

# Implementing Parquet file format in PySpark
spark = SparkSession.builder.appName("PySpark Read Parquet").getOrCreate()

Sampledata = [("Ram ", "", "sharma", "36636", "M", 4000),
              ("Shyam ", "Aggarwal", "", "40288", "M", 5000),
              ("Tushar ", "", "Garg", "42114", "M", 5000),
              ("Sarita ", "Kumar", "Jain", "39192", "F", 5000),
              ("Simran", "Gupta", "Brown", "", "F", -2)]

Samplecolumns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]

# Creating dataframe
dataframe = spark.createDataFrame(Sampledata, Samplecolumns)

# Writing dataframe as a Parquet file
dataframe.write.mode("overwrite").parquet("/tmp/output/Samplepeople.parquet")

# Reading the Parquet file back into a dataframe
ParDataFrame1 = spark.read.parquet("/tmp/output/Samplepeople.parquet")
ParDataFrame1.createOrReplaceTempView("ParquetTable")
ParDataFrame1.printSchema()
ParDataFrame1.show(truncate = False)


The "Sampledata" value is defined with sample values input. The "Samplecolumns" is defined with sample values to be used as a column in the dataframe. Further, the "dataframe" value creates a data frame with columns "firstname", "middlename", "lastname", "dob", "gender" and "salary". Further, the parquet dataframe is read using "spark.read.parquet()" function. Finally, the parquet file is written using "dataframe.write.mode().parquet()" selecting "overwrite" as the mode.

