Acing Apache Spark RDD Interview Questions Series-1 using PySpark

In this blog we are going to discuss an Apache Spark RDD interview scenario using PySpark.

Before discussing the interview question, let us build a basic understanding of RDDs.

RDD is a core API of Apache Spark. Before the arrival of the DataFrame API in Spark 2.x, Apache Spark worked primarily on RDDs.

Even when we work with the DataFrame API of Spark, the underlying abstraction is still the RDD.

What is RDD ?

Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data in stable storage or other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.

Types of RDD?

ParallelCollectionRDD, CoGroupedRDD, HadoopRDD, MapPartitionsRDD, CoalescedRDD, ShuffledRDD, PipedRDD, PairRDD, DoubleRDD, SequenceFileRDD

Advantages of RDD?

In-Memory, Lazy Evaluation, Immutable and Read-only, Cacheable or Persistent, Partitioned, Parallel, Fault Tolerant, Location Stickiness, Typed
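Two of these properties, lazy evaluation and caching, are easy to see in a short sketch. Here `sc` is assumed to be an existing SparkContext, and the Spark calls live inside a function so the helpers stay usable even without a Spark installation:

```python
def square(x):
    """Pure transformation logic, kept separate so it is easy to test."""
    return x * x

def lazy_and_cached(sc):
    nums = sc.parallelize(range(1, 6))   # in-memory, partitioned collection
    squares = nums.map(square)           # lazy: nothing is computed yet
    squares.cache()                      # mark for caching on first use
    return squares.sum()                 # the action triggers computation
```

Calling `nums.map(square)` builds only a lineage graph; work happens when an action such as `sum()` or `collect()` runs.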

Limitations?

There is no specific number that limits the usage of RDDs. We can use as many RDDs as we require; it depends only on the available memory and disk.

In practice, however, raw RDDs are not preferable when working with high volumes of data.

Interview-based Scenario Question:

  1. Read the input test file (pipe-delimited) provided as a Spark RDD.
  2. Remove the header record from the RDD.
  3. Calculate Final_Price:
    Final_Price = (Size * Price_SQ_FT)
  4. Save the final RDD as a text file with three fields.
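The four steps above can be sketched as a minimal PySpark job. The file paths and the column layout (`Street_ID|Location|Size|Price_SQ_FT`) are assumptions, so adjust them to the actual test file; the pure helpers are kept separate from the Spark plumbing so the parsing logic is easy to verify on its own:

```python
def is_header(line):
    """Step 2 helper: identify the header record by its first field
    (assumed column name -- change if your header differs)."""
    return line.startswith("Street_ID")

def to_output(line):
    """Steps 3-4 helper: parse one pipe-delimited record and build the
    three-field output line Street_ID|Location|Final_Price."""
    street_id, location, size, price_sq_ft = line.split("|")
    final_price = float(size) * float(price_sq_ft)   # Size * Price_SQ_FT
    return "|".join([street_id, location, str(final_price)])

def run_job(input_path, output_path):
    # Spark is imported lazily so the helpers above load without it.
    from pyspark import SparkContext
    sc = SparkContext(appName="FinalPrice")
    (sc.textFile(input_path)                 # 1. read the file as an RDD
       .filter(lambda l: not is_header(l))   # 2. remove the header record
       .map(to_output)                       # 3. calculate Final_Price
       .saveAsTextFile(output_path))         # 4. save three fields as text
    sc.stop()

# Usage (hypothetical paths): run_job("realestate.txt", "final_price_out")
```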

Let's jump into the coding part quickly…

Some interesting things I found while coding…

The printed output of a Python list and an RDD can look the same, but their functionality is different, so we need to be careful when doing join operations on RDDs.

That is, a list cannot be combined with an RDD, even when both result sets look identical.
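A short sketch of this pitfall (the data values are made up, `sc` is an existing SparkContext, and the Spark calls sit inside a function so the snippet also loads without a Spark installation):

```python
# Hypothetical (street_id, price) pairs -- the values are made up.
records = [("132842", 250.0), ("134364", 232.0)]

def join_demo(sc):
    rdd = sc.parallelize(records)     # distributed pair RDD
    collected = rdd.collect()         # back to a plain Python list
    # `collected` prints exactly like `records`, but it is a list again:
    # a list has no join(), map(), or filter() transformations.
    other = sc.parallelize([("132842", 1500)])
    return rdd.join(other).collect()  # join() needs an RDD on both sides
    # rdd.join(collected) would fail: a list is not an RDD
```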

That's all for now…

I hope you now have a basic idea of RDDs and their role in Apache Spark.

Thanks for reading, cheers!
