Acing Apache Spark Scenario-based Question Series-4 using PySpark Dataframes

Sairamdgr8 -- An Aspiring Full Stack Data Engineer

2 min read · Mar 11, 2023

Understanding the famous employee analytical questions from SQL, solved in a Pythonic way using PySpark

In this scenario, we are going to find the highest salary in each department, the second-highest salary, and the overall highest salary, some of them under constraints, using PySpark.

Questions:-

Given a DataFrame (df) with the columns Empid, EmpName, Salary, and Dept:

Find the highest salary among all employees.
Find the highest salary without using aggregations or a group by clause.
Find the Dept that has the highest salary.
Find the second-highest salary in each Dept.

Let's try to depict the input and the required output DataFrames:-

Let's check the code…

This code might look clumsy, but it serves the purpose.

Note:- If anyone has a better approach to generalizing this code, I am happy to embed it in my script.

That’s all for now… Happy learning!

