Working with Complex JSON Data Structures Using PySpark - Series 3

--

In this blog, we are going to discuss working with a JSON file and removing some unwanted data from it using PySpark.

The input and output of this solution code are …

INPUT >> OUTPUT

Before diving into the code, I want to discuss some key findings that I struggled with for a good amount of time before finding workarounds.

Key points:

1. The JSON file contains unwanted entries such as `"messages": []` (a blank array) and `"success": true` (a boolean value), which prevent us from reading the JSON data with Spark's multiline option.

2. We therefore have to read it as a plain text file and do some cleansing using a UDF (see the first sketch after this list).

The dataset before and after cleansing looks like this…

before and after cleansing

3. Now the entire dataset sits in a single string-typed column, split across multiple rows, so we need to collapse it into a single row and convert it to a dict.

4. Once the data is converted to a dict, we can create a custom schema for the existing dataset and transform it into a normalized, clean DataFrame (see the second sketch further below).
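Since these points are the crux of the workaround, here is a minimal sketch of steps 1 and 2: reading the raw file as plain text and blanking out the offending entries with a UDF. The file name `response.json` and the matched keys (`messages`, `success`) are assumptions based on the snippet quoted in point 1, not the actual shared dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("json-cleansing").getOrCreate()

# Read the raw file as plain text: Spark gives one row per line,
# in a single string column named "value".
# "response.json" is a hypothetical path standing in for the shared dataset.
raw_df = spark.read.text("response.json")

def clean_line(line):
    # Blank out the entries that break spark.read.json with multiLine=True,
    # e.g. '"messages": [],' (a blank array) and '"success": true' (a boolean).
    stripped = line.strip()
    if stripped.startswith('"messages"') or stripped.startswith('"success"'):
        return ""
    return line

clean_udf = udf(clean_line, StringType())
cleansed_df = raw_df.withColumn("value", clean_udf("value"))
```

Blanking the lines (rather than dropping the rows) keeps the remaining line order intact for the stitching step in the second sketch.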

Note: I have shared the dataset so that more people have a chance to come up with a better solution than mine… 😃

Let’s dive into the code!

This code might look clumsy, but it serves the purpose.
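In that spirit, here is a hedged sketch of steps 3 and 4, continuing from the cleansing sketch above: collapsing the multi-row string column into a single JSON document, parsing it into a dict, and reading it back through a custom schema. Pulling the lines to the driver with `collect()` assumes the file is small enough to fit in memory, and the schema fields `records`, `id`, and `name` are placeholders for the real dataset's fields:

```python
import json
import re

from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Pull the cleansed lines back to the driver and stitch them into one
# JSON document (assumes the file fits in driver memory).
lines = [row.value for row in cleansed_df.collect()]
document = "\n".join(lines)

# Blanking the unwanted entries can leave trailing commas behind, so
# strip any comma that directly precedes a closing brace or bracket.
document = re.sub(r",\s*([}\]])", r"\1", document)

# The whole dataset is now a single Python dict instead of many string rows.
payload = json.loads(document)

# Hypothetical schema -- replace "records", "id", and "name" with the
# actual field names from the shared dataset.
schema = StructType([
    StructField("records", ArrayType(StructType([
        StructField("id", StringType()),
        StructField("name", StringType()),
    ])))
])

df = spark.createDataFrame([payload], schema=schema)

# Explode the nested array into a normalized, flat DataFrame.
normalized_df = (
    df.select(explode(col("records")).alias("r"))
      .select("r.id", "r.name")
)
normalized_df.show(truncate=False)
```

With the real field names substituted in, `normalized_df` is the flat, clean DataFrame described in point 4.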

Note: If anyone has a better approach to generalizing this code, I'd be happy to embed it in my script.

That’s all for now… Happy learning!

Please clap and subscribe/follow my profile… and don’t forget to comment!

--


Sairamdgr8 -- An Aspiring Full Stack Data Engineer
