Working with JSON data structures using PySpark: Series 4

--

In this blog, we are going to discuss working with JSON files: normalizing JSON into a structured row-columnar format using PySpark.

The input and output of this solution code are …

Input and final output DataFrames.

Before diving into the code, here are some key findings I wanted to discuss.

  1. Whenever you import unstructured data, first understand the structure and data types of what you are working with. Using the printSchema method, we can inspect the schema and its data types.
  2. When each JSON record occupies a separate multi-line block, it can be read by setting the multiline option to true in PySpark.
  3. When values are nested under the root, we can traverse to them using the dot (“.”) operator.
  4. JSON data may also contain arrays that need further normalization; using getItem, we can access the exact element we need.

Note: I have shared the dataset so that more people have a chance to come up with a better solution than mine… 😃

Let’s dive into the code!

This code might look clumsy, but it serves the purpose.

Note: If anyone has a better approach to generalizing this code, I’d be happy to embed it in my script.

That’s all for now… Happy learning!

Please clap and subscribe/follow my profile… Don’t forget to comment…

--


Sairamdgr8 -- An Aspiring Full Stack Data Engineer

Data Engineer @ AWS | SPARK | PYSPARK | SPARK SQL | enthusiast about #DataScience #ML #NLP #DeepLearning #OpenCV-Face Recognition #ML deployment