Driving PySpark shell scripts for cloud-agnostic data pipelines

--

PySpark shell scripts for cloud-agnostic data pipelines

In this blog we are going to discuss working with shell scripts for PySpark code.

This builds on the previous Schema Building Project (link given below), a mini ETL project in which the data from one layer is taken as the source for another layer's data transformations.

Schema building Project code:-

In this project we have two modules: src_layer (the source layer) and etl_layer (the extract-transform-load layer).

a) src_layer: This layer consists of multiple supporting files that read the source data and write it to a target location.

b) etl_layer: This layer takes the src_layer target data as its input. Here we apply multiple data transformations; I have included only two of them, but the list can be extended further.

Since the etl_layer depends on the src_layer, the code needs to run sequentially, one module after the other.

If the src_layer fails, there is no point in running the etl_layer. Running the code manually one after another and checking the status is fine for experimentation, but in real projects we need a script to execute this flow…

Having a shell script run the spark-submit commands one after another lets us execute the code sequentially and stop early when an earlier step fails, as sketched below.
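
A minimal sketch of the idea (the script and file names here are placeholders, not the exact files from the project):

#!/bin/bash
# Run the source layer first and capture its exit status.
spark-submit src_layer/src_main.py

if [ $? -ne 0 ]; then
    echo "src_layer failed -- skipping etl_layer"
    exit 1
fi

# This runs only when the source layer succeeded.
spark-submit etl_layer/etl_main.py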

Cloud-agnostic data pipeline concept:- Organizations using enterprise clouds such as AWS, Azure, GCP, etc. all rely on pipelining concepts that run Spark code sequentially, one job after another. Sometimes this makes the pipeline complex and fuzzy…

Hence, organizing the data pipelines with this shell-script approach helps keep them easier to read and maintain.

Here I have tried to showcase the implementation of the cloud-agnostic data pipeline concept as a data flow diagram.

In this data flow diagram I am deploying the shell script code on the local system using WSL (Windows Subsystem for Linux).

Project code structure:-

In this project structure we have two shell scripts: setup.sh and runStudents.sh.

a) setup.sh:- This script helps set up the Spark dependencies on the local system.

b) runStudents.sh:- This script runs the actual spark-submit commands to execute the source and ETL layers sequentially.

I have given the code reference for this project below. Kindly check the Windows user path and system variables for your setup.

Before running the shell scripts on Windows, kindly have WSL installed on the system.
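
On recent Windows 10/11 builds, WSL can usually be installed with a single command from an elevated PowerShell or Command Prompt:

wsl --install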

Once installed, download the code from Git, go to the respective location, and start running the shell scripts.
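
The steps look roughly like this (the repository URL and folder name are placeholders; substitute the actual project link):

git clone <repo-url>                # placeholder: the project's Git repository
cd <project-folder>                 # placeholder: the cloned project directory
chmod +x setup.sh runStudents.sh    # make the scripts executable
./setup.sh                          # one-time Spark environment setup
./runStudents.sh                    # run the source and ETL layers sequentially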

a) setup.sh file:
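
A rough sketch of what such a setup script can look like on WSL/Ubuntu (the Java version, packages, and paths below are assumptions, not the exact contents of the project's setup.sh):

#!/bin/bash
# Install Java and pip, which Spark and PySpark need.
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk python3-pip

# Installing PySpark via pip also brings a bundled Spark distribution,
# so spark-submit becomes available from the pip scripts directory.
pip3 install pyspark

# Point JAVA_HOME at the installed JDK for the current shell session.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin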

b) spark-submit code for a particular entity:-
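
As an illustration, a spark-submit call for a single entity might look like this (the master setting, job name, and script path are placeholders for whatever the project actually uses):

spark-submit \
    --master local[*] \
    --name students_src_load \
    src_layer/students_src.py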

Let's check the shell script execution snippets.

We can add multiple further functionalities to this existing project.

This code might look clumsy, but it serves the purpose.

Note:- If anyone has a better approach to generalizing this code, I would be happy to embed it in my script.

That’s all for now…Happy Learning….

Please clap and Subscribe/follow my profile…Don’t forget to Comment…
