Spark Adaptive Query Execution: Performance Optimization using PySpark

AQE with Spark 3.x

Spark SQL is one of the most important components of Apache Spark. It powers both SQL queries and the DataFrame API. At its core is the Catalyst optimizer.

The Catalyst optimizer improves both developer productivity and the performance of the queries developers write. It automatically transforms relational queries so that they execute more efficiently, using techniques such as filtering, indexing, and ensuring that joins across data sources are performed in the most efficient order.

In Apache Spark 2.x, however, once the execution plan for Spark SQL joins has been created from estimates about the data, it cannot adapt to what the data actually looks like at runtime. The Catalyst optimizer is not intelligent enough to handle such scenarios.

The volume of data for ingestion or migration may not be the same every time.

The default number of shuffle partitions in Spark is 200 (spark.sql.shuffle.partitions), which causes a lot of pain during development. If the number is set too low, each partition becomes large and may spill to disk; if it is set too high, the many small partitions lead to an increase in I/O.
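Because the right partition count depends on how much data is shuffled, a fixed value rarely fits every run. Below is a minimal sketch of a common sizing rule of thumb. The helper names and the target of roughly 128 MB per partition are illustrative assumptions, not Spark defaults, and `spark` is assumed to be an existing SparkSession.

```python
import math

# Assumed target size per shuffle partition (a rule of thumb, not a Spark default).
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def suggest_shuffle_partitions(shuffle_bytes, target_bytes=TARGET_PARTITION_BYTES):
    """Return a partition count that keeps each shuffle partition near target_bytes."""
    return max(1, math.ceil(shuffle_bytes / target_bytes))


def set_shuffle_partitions(spark, shuffle_bytes):
    """Apply the suggestion to an existing SparkSession passed in by the caller."""
    n = suggest_shuffle_partitions(shuffle_bytes)
    # Overrides the default of 200 for subsequent queries in this session.
    spark.conf.set("spark.sql.shuffle.partitions", str(n))
    return n
```

For a 10 GiB shuffle this suggests 80 partitions. The catch is that the shuffle size has to be known or guessed up front, and it changes whenever the input volume changes.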

To improve performance, Spark needed a way to build the execution plan dynamically and to adapt between different SQL join strategies…

Spark 3.x addressed this by introducing a configurable feature called Adaptive Query Execution (AQE).

Spark 3.x collects statistics about the real size of the data in the shuffle files, so that before the next stage runs it can re-optimize the logical plan to dynamically:

1) switch join strategies

2) coalesce the number of shuffle partitions

3) optimize skew joins.
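The three behaviours above are controlled by Spark SQL configuration keys. The sketch below sets them explicitly on a session. The dictionary and function names are hypothetical, and `spark` is assumed to be an existing SparkSession on Spark 3.x.

```python
# Spark SQL configuration keys behind the AQE features listed above.
AQE_FLAGS = {
    # Master switch; runtime join-strategy switching happens under this flag.
    "spark.sql.adaptive.enabled": "true",
    # Merge small shuffle partitions once a stage's real sizes are known.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split oversized partitions when a join is skewed.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def enable_aqe(spark, flags=AQE_FLAGS):
    """Set each AQE flag on an existing SparkSession and return the session."""
    for key, value in flags.items():
        spark.conf.set(key, value)
    return spark
```

Join switching has no separate on/off flag in this sketch: with AQE on, a sort-merge join can be replaced by a broadcast join when the runtime size of one side falls below spark.sql.autoBroadcastJoinThreshold.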

Note: AQE is enabled by default from Spark 3.2.0 onward; in Spark 3.0 and 3.1 it is available but turned off by default. So, to see the difference between Spark 2.x-style planning and AQE on a Spark 3.x cluster, first disable AQE and then follow the next steps…
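One way to get the "before" baseline is to switch AQE off temporarily and restore the previous value afterwards. This is a sketch only: the context manager name `aqe` is an assumption, and `spark` is assumed to be an existing SparkSession.

```python
from contextlib import contextmanager

AQE_KEY = "spark.sql.adaptive.enabled"


@contextmanager
def aqe(spark, enabled):
    """Temporarily force AQE on or off, restoring the prior setting on exit."""
    previous = spark.conf.get(AQE_KEY)
    spark.conf.set(AQE_KEY, str(enabled).lower())  # "true" or "false"
    try:
        yield spark
    finally:
        spark.conf.set(AQE_KEY, previous)
```

A baseline run would then look like `with aqe(spark, False): ...`. It is safest to build the DataFrame and trigger the action inside the `with` block so the query is planned under that setting.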

Don’t worry, we will show all the details in the code below.

Let’s dive into a practical demo to see how this actually works.

Number of spark jobs, number of partitions, and execution plan before AQE.

Number of spark jobs, number of partitions, and execution plan after AQE.
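The partition count and execution plan for each run can be checked with standard DataFrame calls; the number of jobs is easiest to read from the Spark UI. The sketch below is not the dataset from the original demo: `build_demo_query` is a hypothetical join plus aggregation that forces a shuffle, and `spark` is assumed to be an existing SparkSession.

```python
def inspect_query(df):
    """Print the physical plan and return the number of output partitions."""
    # With AQE enabled, the printed plan is rooted at an AdaptiveSparkPlan node.
    df.explain()
    # Accessing .rdd may trigger execution of the shuffle stages.
    return df.rdd.getNumPartitions()


def build_demo_query(spark, rows=1_000_000):
    """Hypothetical join + aggregation; stands in for the original demo data."""
    left = spark.range(rows).withColumnRenamed("id", "key")
    right = spark.range(rows // 10).withColumnRenamed("id", "key")
    return left.join(right, "key").groupBy("key").count()
```

Calling `inspect_query(build_demo_query(spark))` once with AQE off and once with it on should show the difference: without AQE the aggregation typically keeps the configured shuffle partition count, while with AQE small partitions can be coalesced into fewer.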

Execution timing:

Before enabling AQE
After enabling AQE
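To reproduce a timing comparison like this, the action itself has to be timed, since Spark transformations are lazy. A minimal sketch follows; the helper name `timed` is an assumption.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start
```

For example, `rows, seconds = timed(lambda: df.count())`, run once with AQE disabled and once with it enabled. Single runs are noisy, so repeating each measurement a few times gives a fairer comparison.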

That’s all. I hope this blog helped you understand Adaptive Query Execution. Thank you!


Sairamdgr8 -- An Aspiring Full Stack Data Engineer

Data Engineer @ AWS | SPARK | PYSPARK | SPARK SQL | enthusiast about #DataScience #ML Enthusiastic#NLP#DeepLearning #OpenCV-Face Recognition #ML deployment