Acing Apache Spark Scenario-Based Question Series-4 Using PySpark DataFrames
Understanding the famous employee analytics questions from SQL, solved in a Pythonic way using PySpark
In this scenario, we are going to discuss the highest salary of an employee in each department, the second highest salary, and the overall highest salary, along with some constraints, using PySpark.
Questions:
We have a dataframe (df) with the columns Empid, EmpName, Salary, and Dept.
- Find the highest salary among all employees
- Find the highest salary without using aggregations or a group by clause
- Find the Dept which has the highest salary
- Find the second highest salary by Dept
Let's depict the input and required output dataframes:
Let's check the code…
This code might look clumsy, but it serves the purpose.
Note: If anyone has a better approach to generalizing this code, I would be happy to include it in my script.
That’s all for now. Happy learning!