Monday 23 October 2017

Difference between DataFrame and DataSet in Spark 2.0

The purpose of this blog post is to list as many differences as possible between DataFrame (DF) and Dataset (DS).

As of Spark 2.0, Datasets in Apache Spark are an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. A Dataset is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java.

A DataFrame is an alias for Dataset[Row], a collection of generic objects where a Row is a generic, untyped JVM object.

DataFrame -> Dataset[Row] - here Row is an untyped, generic JVM object; this untyped API is faster and well suited to interactive analysis.
Dataset[T] -> the typed API, optimized for data engineering tasks.
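The relationship between the two types can be sketched in a few lines of Scala. This is a minimal illustration, assuming Spark 2.x or later with spark-sql on the classpath and a local master; the Person case class, the DataFrameVsDataset object and the sample values are hypothetical names made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Long)

object DataFrameVsDataset {
  // Builds two sample records as an untyped DataFrame.
  def people(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Ann", 31L), ("Bob", 45L)).toDF("name", "age")
  }

  // Untyped access: the column is looked up by its string name at runtime.
  def agesFromRows(df: DataFrame): Seq[Long] =
    df.collect().toSeq.map(row => row.getAs[Long]("age"))

  // Typed access: `age` is a case class field, checked by the compiler.
  def agesFromObjects(ds: Dataset[Person]): Seq[Long] =
    ds.collect().toSeq.map(_.age)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("df-vs-ds")
      .master("local[*]") // assumption: local run, no cluster
      .getOrCreate()
    import spark.implicits._

    val df: DataFrame = people(spark)
    val alias: Dataset[Row] = df            // compiles: DataFrame is a type alias for Dataset[Row]
    val ds: Dataset[Person] = df.as[Person] // attach the case class as the element type

    println(agesFromRows(alias))
    println(agesFromObjects(ds))
    spark.stop()
  }
}
```

Both helpers return the same values; the difference is only in when a wrong field name would be noticed.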

Static-typing and runtime type-safety
The Dataset API is the more restrictive of the two. Because Dataset operations are expressed as lambda functions over typed JVM objects, any mismatch of typed parameters is detected at compile time. Analysis errors, such as referring to a field that does not exist, are also caught at compile time when using Datasets, which saves developer time and cost.
If you want a higher degree of type safety at compile time, want typed JVM objects, and want to take advantage of Catalyst optimization and Tungsten's efficient code generation, use Dataset.

Disadvantage of DataFrame compared with Dataset: lack of type safety. As a developer, I do not find the DataFrame API developer-friendly. Referring to attributes by string name means there is no compile-time safety, so things can fail at runtime. The API also looks less programmatic and more SQL-like.
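The runtime-versus-compile-time difference described above can be sketched as follows. This is only an illustration, assuming Spark 2.x or later with spark-sql on the classpath; Employee, TypeSafetyDemo and the misspelled column name "agee" are hypothetical and exist only for this example.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.Try

// Hypothetical record type, used only for this illustration.
case class Employee(name: String, age: Long)

object TypeSafetyDemo {
  // Returns true when selecting `column` by string name is rejected by Spark
  // with an AnalysisException. The compiler cannot catch this: it is just a String.
  def selectFails(spark: SparkSession, column: String): Boolean = {
    import spark.implicits._
    val df = Seq(Employee("Ann", 31L)).toDF()
    Try(df.select(column)).failed.toOption.exists(_.isInstanceOf[AnalysisException])
  }

  // Typed version: the lambda works on Employee objects, so field names are
  // checked by the Scala compiler before the job ever runs.
  def agesPlusOne(spark: SparkSession): Seq[Long] = {
    import spark.implicits._
    val ds = Seq(Employee("Ann", 31L)).toDS()
    // ds.map(_.agee + 1) // would not compile: agee is not a member of Employee
    ds.map(_.age + 1).collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("type-safety")
      .master("local[*]") // assumption: local run, no cluster
      .getOrCreate()

    println(selectFails(spark, "agee")) // misspelled column, detected only at runtime
    println(selectFails(spark, "age"))  // valid column
    println(agesPlusOne(spark))
    spark.stop()
  }
}
```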

Advantage of Dataset over DataFrame:
It has an additional feature: Encoders.
Encoders act as the interface between JVM objects and Spark's internal Tungsten binary format, which can be stored off-heap.
Encoders generate bytecode to interact with this binary data and provide on-demand access to individual attributes without having to deserialize an entire object.
A case class is used to define the structure of the data schema in a Dataset, which makes Datasets very easy to work with. The names of the attributes in the case class are mapped directly to the field names in the Dataset. It feels like working with an RDD, but underneath it works the same way as a DataFrame.
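The mapping from case class attributes to Dataset field names can be seen by deriving an encoder explicitly. This is a small sketch, assuming Spark 2.x or later with spark-sql on the classpath; Trade, EncoderDemo and the sample values are hypothetical. In everyday code the encoder is usually supplied implicitly by importing spark.implicits._ rather than passed by hand as it is here.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Trade(symbol: String, quantity: Int)

object EncoderDemo {
  // Derive the encoder for the case class explicitly.
  val tradeEncoder: Encoder[Trade] = Encoders.product[Trade]

  // The encoder's schema takes its field names from the case class attributes.
  def fieldNames: Seq[String] = tradeEncoder.schema.fieldNames.toSeq

  // Build a Dataset by passing the encoder by hand instead of implicitly.
  def trades(spark: SparkSession): Dataset[Trade] =
    spark.createDataset(Seq(Trade("AAA", 10), Trade("BBB", 5)))(tradeEncoder)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("encoders")
      .master("local[*]") // assumption: local run, no cluster
      .getOrCreate()

    println(fieldNames)                   // names come from the case class
    println(trades(spark).columns.toSeq)  // the Dataset exposes the same names as columns
    trades(spark).printSchema()
    spark.stop()
  }
}
```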

I hope this is helpful. If you have any additional points, please comment.


