Working with Complex JSON Data Structures Using PySpark - Series 3

--

In this blog, we are going to discuss working with a JSON file and removing some unwanted data from it using PySpark.

The input and output of this solution code are …

INPUT >> OUTPUT

Before diving into the code, I want to discuss some key findings that I struggled with for a good amount of time before finding workarounds.

Key points:

1. The JSON file contains unwanted entries such as `"messages": []` (a blank array) and `"success": true` (a boolean value), which prevent us from reading the JSON data with Spark's multiline option.

2. We therefore have to read it as a plain text file and do some cleansing using a UDF (see the first sketch after this list).

The dataset before and after cleansing looks like this…

before and after cleansing

3. Now the entire dataset sits in a single string-typed column, split across multiple rows, so we need to collapse it into a single row and convert it to a dict.

4. Once the data is converted to a dict, we can create a custom schema for the existing dataset and transform it into a normalized, clean DataFrame (see the second sketch further below).
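Since these points are the crux of the workaround, here is a minimal sketch of steps 1 and 2: reading the raw file as plain text and blanking out the offending entries with a UDF. The file name `response.json` and the matched keys (`messages`, `success`) are assumptions based on the snippet quoted in point 1, not the actual shared dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("json-cleansing").getOrCreate()

# Read the raw file as plain text: Spark gives one row per line,
# in a single string column named "value".
# "response.json" is a hypothetical path standing in for the shared dataset.
raw_df = spark.read.text("response.json")

def clean_line(line):
    # Blank out the entries that break spark.read.json with multiLine=True,
    # e.g. '"messages": [],' (a blank array) and '"success": true' (a boolean).
    stripped = line.strip()
    if stripped.startswith('"messages"') or stripped.startswith('"success"'):
        return ""
    return line

clean_udf = udf(clean_line, StringType())
cleansed_df = raw_df.withColumn("value", clean_udf("value"))
```

Blanking the lines (rather than dropping the rows) keeps the remaining line order intact for the stitching step in the second sketch.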

Note: I have shared the dataset so that more people have a chance to come up with a better solution than mine… 😃

Let’s dive into the code!

This code might look clumsy, but it serves the purpose.
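In that spirit, here is a hedged sketch of steps 3 and 4, continuing from the cleansing sketch above: collapsing the multi-row string column into a single JSON document, parsing it into a dict, and reading it back through a custom schema. Pulling the lines to the driver with `collect()` assumes the file is small enough to fit in memory, and the schema fields `records`, `id`, and `name` are placeholders for the real dataset's fields:

```python
import json
import re

from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Pull the cleansed lines back to the driver and stitch them into one
# JSON document (assumes the file fits in driver memory).
lines = [row.value for row in cleansed_df.collect()]
document = "\n".join(lines)

# Blanking the unwanted entries can leave trailing commas behind, so
# strip any comma that directly precedes a closing brace or bracket.
document = re.sub(r",\s*([}\]])", r"\1", document)

# The whole dataset is now a single Python dict instead of many string rows.
payload = json.loads(document)

# Hypothetical schema -- replace "records", "id", and "name" with the
# actual field names from the shared dataset.
schema = StructType([
    StructField("records", ArrayType(StructType([
        StructField("id", StringType()),
        StructField("name", StringType()),
    ])))
])

df = spark.createDataFrame([payload], schema=schema)

# Explode the nested array into a normalized, flat DataFrame.
normalized_df = (
    df.select(explode(col("records")).alias("r"))
      .select("r.id", "r.name")
)
normalized_df.show(truncate=False)
```

With the real field names substituted in, `normalized_df` is the flat, clean DataFrame described in point 4.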

Note: If anyone has a better approach to generalizing this code, I'd be happy to embed it in my script.

That’s all for now… Happy learning!

Please clap and subscribe/follow my profile… and don’t forget to comment!

--


Sairamdgr8 -- An Aspiring Full Stack Data Engineer
