How to read and write Parquet files in PySpark

This recipe helps you read and write Parquet files in PySpark

Recipe Objective - How to read and write Parquet files in PySpark?

Apache Parquet is a columnar file format that provides optimizations to speed up queries. It is a more efficient file format than CSV or JSON and is supported by many data processing systems, including multiple frameworks in the Hadoop ecosystem. It offers efficient data compression and encoding schemes with enhanced performance for handling complex data in bulk. Spark SQL supports both reading and writing Parquet files, automatically captures the schema of the original data, and reduces data storage by roughly 75% on average. Apache Spark supports the Parquet file format out of the box, so no additional dependency libraries are needed. Apache Parquet also reduces input and output operations and consumes less space.
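As a quick illustration of these points, here is a minimal sketch (not part of the recipe code below, and using a hypothetical output path "/tmp/output/demo.parquet") showing that Spark stores the schema inside the Parquet file and can read back only the columns a query needs:

# Minimal sketch: schema capture and column pruning with Parquet
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Parquet Demo").getOrCreate()

# A small sample dataframe
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

# Writing to Parquet stores the schema alongside the data
df.write.mode("overwrite").parquet("/tmp/output/demo.parquet")

# Reading back recovers the schema from the file metadata;
# selecting a single column only scans that column's data
spark.read.parquet("/tmp/output/demo.parquet").select("letter").show()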


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains the Parquet file format and its advantages, and demonstrates reading and writing data as a dataframe in Parquet file format in PySpark.

Implementing reading and writing Parquet files in PySpark in Databricks

# Importing packages
import pyspark
from pyspark.sql import SparkSession


The PySpark SQL package is imported into the environment to read and write data as a dataframe in Parquet file format in PySpark.

# Implementing Parquet file format in PySpark
spark = SparkSession.builder.appName("PySpark Read Parquet").getOrCreate()

Sampledata = [("Ram ", "", "sharma", "36636", "M", 4000),
              ("Shyam ", "Aggarwal", "", "40288", "M", 5000),
              ("Tushar ", "", "Garg", "42114", "M", 5000),
              ("Sarita ", "Kumar", "Jain", "39192", "F", 5000),
              ("Simran", "Gupta", "Brown", "", "F", -2)]

Samplecolumns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]

# Creating dataframe
dataframe = spark.createDataFrame(Sampledata, Samplecolumns)

# Writing dataframe as a Parquet file
dataframe.write.mode("overwrite").parquet("/tmp/output/Samplepeople.parquet")

# Reading the Parquet file back into a dataframe
ParDataFrame1 = spark.read.parquet("/tmp/output/Samplepeople.parquet")
ParDataFrame1.createOrReplaceTempView("ParquetTable")
ParDataFrame1.printSchema()
ParDataFrame1.show(truncate = False)


The "Sampledata" value is defined with sample values input. The "Samplecolumns" is defined with sample values to be used as a column in the dataframe. Further, the "dataframe" value creates a data frame with columns "firstname", "middlename", "lastname", "dob", "gender" and "salary". Further, the parquet dataframe is read using "spark.read.parquet()" function. Finally, the parquet file is written using "dataframe.write.mode().parquet()" selecting "overwrite" as the mode.

