Building a Schema Validation Project with PySpark

--

A mini OOP data integrity/governance project

In this blog we are going to discuss schema validation of source data using PySpark.

This mini project checks whether the source schema matches a metaschema that we define. Keeping the schema intact up front leaves no room for failures in the data transformations further downstream.

The project can be extended further with ETL transformations and other dependency checks, such as validating database connections.

This module can also be used for data integrity and data governance purposes. :-)

Let's get into the code structure.

First, download the project from Git and save it locally.

I assume everyone has either Python or Anaconda installed on their system. Set up environment variables for Python, Spark, and Hadoop.

Create a virtual environment or set up a new project in PyCharm or any other IDE that supports Python.

Install the dependencies listed in the requirements.txt file.

Note: in my case the project directory path looks like this:
C:\Users\Hp X360\Python_Spark\Schema_building_project
This will vary with your local setup, so check the path before running the code:
C:\Users\<username>\<some folder path>\Schema_building_project

This is the tree structure of the project, consisting of several modules. Let's get into the details of this project.

Project structure

Schema_building_project >> Parent directory for the project

  1. src >> root directory for source code
  2. src >> data >> <filename> >> This folder contains the source data location path and the target location path.
    a) <filename>: in my case the data source name is students.
    b) source_data >> source file location
    c) target >> once the file is processed, it is written to this target location folder.
  3. metadata >> this folder contains the metaschema of the source file.
    a) This file contains the column names, datatypes, and the source and target locations (a sample layout is sketched just after this list).
    Note: this file is defined only once, when a new source is added. The source file is processed based on this metaschema.
  4. utilites >> this folder contains the various reusable pieces of functionality used to build the project.
    We shouldn't place the entire project code at a single level. Instead, we should split out the code that is reusable across the project in order to maintain coding standards.
    a) utilites >> json_read_schema >> this file contains the code to read the JSON data, i.e. the metaschema for the corresponding source file.
    b) utilites >> utility_reader >> this file contains the SparkSession initialization call. In the future we can add other dependency code here to maintain project standards (a rough sketch of both utilities follows this list).
  5. src_main >> this is the main script from which we run the project.
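
For reference, the metaschema for the students source could look something like the JSON below. The key names (source_location, target_location, columns, column_name, datatype) are my own assumption of a reasonable layout, not necessarily the exact fields used in the repository's students_schema.json.

{
  "source_name": "students",
  "source_location": "src/data/students/source_data/",
  "target_location": "src/data/students/target/",
  "columns": [
    {"column_name": "id", "datatype": "integer"},
    {"column_name": "name", "datatype": "string"},
    {"column_name": "age", "datatype": "integer"}
  ]
}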
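
And as a rough illustration of what the two utility modules might contain, here is a minimal sketch: one helper that builds (or reuses) a SparkSession and one that loads the metaschema JSON. The function names get_spark_session and read_json_schema are placeholders of mine, not necessarily what the repository uses.

import json

from pyspark.sql import SparkSession


def get_spark_session(app_name="schema_validation"):
    # Reuse the active SparkSession if there is one, otherwise create a new one
    return SparkSession.builder.appName(app_name).getOrCreate()


def read_json_schema(meta_config_filename):
    # Load the metaschema (column names, datatypes, source/target paths) from disk
    with open(meta_config_filename, "r") as f:
        return json.load(f)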

Once the project is ready, run the code from the terminal. The command should look like this:

spark-submit src/src_main.py --metaConfigFilename=src//metadata//students_schema.json

Command line Execution
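
To give an idea of how src_main.py might consume that metaConfigFilename argument, here is a minimal, self-contained sketch, assuming the metaschema layout from the earlier sample and a CSV source file with a header row; the actual script in the repository may differ.

import argparse
import json

from pyspark.sql import SparkSession


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--metaConfigFilename", required=True)
    args = parser.parse_args()

    # Load the metaschema JSON (layout assumed as in the sample shown earlier)
    with open(args.metaConfigFilename) as f:
        meta = json.load(f)

    spark = SparkSession.builder.appName("schema_validation").getOrCreate()

    # Read the source file from the location defined in the metaschema
    # (CSV with a header row is an assumption here)
    source_df = spark.read.csv(meta["source_location"], header=True)

    expected_cols = [c["column_name"] for c in meta["columns"]]
    if sorted(source_df.columns) == sorted(expected_cols):
        # Columns match the metaschema, so write the data to the target location
        source_df.write.mode("overwrite").csv(meta["target_location"], header=True)
        print("Schema validation passed: data written to target")
    else:
        print("Schema validation failed: source columns do not match the metaschema")


if __name__ == "__main__":
    main()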

Now let's check how the code works.

a) Successful scenario, where the source columns exactly match the metaschema.

Successful scenario

b) Failure scenario, where the source columns do not match the metaschema.
Here we have an extra column from the source, Salary, which is not present in our metaschema.

Failure scenario
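
For the failure case, a simple way to report exactly which columns differ is a set comparison between the source DataFrame's columns and the metaschema columns. The snippet below just illustrates that idea with example values; it is not the repository's exact code.

expected_cols = {"id", "name", "age"}           # example columns from the metaschema
source_cols = {"id", "name", "age", "Salary"}   # the source carries an extra Salary column

extra_columns = source_cols - expected_cols     # present in the source but not in the metaschema
missing_columns = expected_cols - source_cols   # defined in the metaschema but missing from the source

if extra_columns or missing_columns:
    print(f"Schema mismatch. Extra columns: {extra_columns}, missing columns: {missing_columns}")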

Multiple functionalities can be added on top of this existing project.

The code might look clumsy, but it serves the purpose.

Note: if anyone has a better approach to generalizing this code, I'd be happy to embed it in my script.

That’s all for now… Happy learning!

