Acing Apache Spark DataFrame Interview Questions Series-2 using PySpark with UDF

Working with Apache Spark User Defined Functions (UDFs)

Introduction:-

In this article we are going to discuss the usage of UDFs with Apache Spark DataFrames.

Before getting into action, let's cover some theory.

UDFs:-

UDFs are used to extend the functionality of the framework: you write a function once and can then reuse it on several DataFrames and in SQL expressions. For example, suppose you wanted to convert the first letter of every word in a sentence to capital case; if no built-in function covered this, you could create it as a UDF and reuse it on as many DataFrames as needed.

Before you create any UDF, do a proper search to check whether the function you need is already available among Spark's built-in functions.

Spark's built-in functions cover many common operations, and more are added with every release, so it is best to check before reinventing the wheel.

Let's get into the topic…

The scenario: we have a DataFrame in which each person has a performance rating (performanceidx) and a yearly salary. We also have a reference DataFrame that tells us how much of a salary hike should be given for each rating, so that we can calculate each employee's hiked salary for the year.

Put simply, the input and required output DataFrames relate like this:

Input df + reference df = output df

Let's do this using a UDF…

This code might look clumsy, but it serves the purpose.

Note:- If anyone has a better approach to generalizing this code, I'd be happy to embed it in my script.

That's all for now… Happy Learning…

Please do clap and subscribe to my profile… Don't forget to comment…


Sairamdgr8 -- An Aspiring Full Stack Data Engineer

Data Engineer @ AWS | Spark | PySpark | Spark SQL | Enthusiastic about #DataScience #ML #NLP #DeepLearning #OpenCV Face Recognition #ML deployment