Monday 23 October 2017

Difference between DataFrame and DataSet in Spark 2.0

The purpose of this blog is to list the main differences between DataFrame (DF) and Dataset (DS).

From Spark version 2.0, Datasets in Apache Spark are an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. A Dataset is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java.

DataFrame is an alias for a collection of generic objects, Dataset[Row], where a Row is a generic untyped JVM object.

DataFrame -> Dataset[Row] - here Row is an untyped, generic JVM object; faster and suitable for interactive analysis.
Dataset[T] -> the typed API, optimized for data engineering tasks.
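A minimal sketch in Scala of this difference, assuming Spark 2.x running in local mode; the Person case class and the sample data are hypothetical and used only for illustration:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical case class used only to illustrate the typed API
case class Person(name: String, age: Int)

object DfVsDsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DfVsDsExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Alice", 30), Person("Bob", 25))

    // DataFrame = Dataset[Row]: rows are untyped, columns are referenced by name
    val df: DataFrame = people.toDF()
    df.filter("age > 26").show()

    // Dataset[Person]: every element is a strongly-typed JVM object
    val ds: Dataset[Person] = people.toDS()
    ds.filter(_.age > 26).show()

    spark.stop()
  }
}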

Static typing and runtime type safety
Dataset is the most restrictive. Since Dataset APIs are expressed as lambda functions over typed JVM objects, any mismatch of typed parameters is detected at compile time. Analysis errors can also be detected at compile time when using Datasets, saving developer time and cost.
If you want a higher degree of type safety at compile time, want typed JVM objects, want to take advantage of Catalyst optimization, and want to benefit from Tungsten's efficient code generation, use Dataset.

DataFrame disadvantage over Dataset: lack of type safety. As a developer, I do not like using DataFrame because it is not developer friendly: referring to an attribute by its String name means no compile-time safety, so things can fail at runtime. The API also reads less like program code and more like SQL.
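A small sketch of this difference, assuming Spark 2.x in local mode; the Employee case class and column names are hypothetical:

import org.apache.spark.sql.SparkSession

// Hypothetical case class for illustration
case class Employee(name: String, salary: Double)

object TypeSafetyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TypeSafetyExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Employee("Alice", 1000.0), Employee("Bob", 2000.0)).toDF()
    val ds = df.as[Employee]

    // DataFrame: the column is just a String, so a typo like df.select("salry")
    // still compiles and only fails with an AnalysisException when the job runs.
    df.select("salary").show()

    // Dataset: fields are resolved by the Scala compiler, so ds.map(_.salry)
    // would be rejected at compile time; the lambda below is fully type-checked.
    ds.map(e => e.salary * 1.1).show()

    spark.stop()
  }
}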

Advantage of Dataset over DataFrame:
It has an additional feature: Encoders.
Encoders act as an interface between JVM objects and Spark's off-heap custom binary memory format.
Encoders generate byte code to interact with off-heap data and provide on-demand access to individual attributes without having to deserialize an entire object.
A case class is used to define the structure of the data schema in a Dataset. Using a case class, it is very easy to work with a Dataset: the attribute names in the case class map directly to the field names in the Dataset. It feels like working with an RDD, but underneath it works the same as a DataFrame.
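A short sketch of encoders and the case-class-to-column mapping, assuming Spark 2.x in local mode; the Order case class and column names are hypothetical:

import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical case class; its field names must match the DataFrame column names
case class Order(orderId: Long, amount: Double)

object EncoderExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("EncoderExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The encoder derived from the case class carries the schema that Spark uses
    // for its off-heap binary (Tungsten) representation.
    println(Encoders.product[Order].schema)

    // The columns orderId and amount are mapped to the case class fields by name.
    val ordersDf = Seq((1L, 10.5), (2L, 99.0)).toDF("orderId", "amount")
    val ordersDs = ordersDf.as[Order]

    ordersDs.map(o => o.amount * 2).show()

    spark.stop()
  }
}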

I hope this is helpful. If you have any additional points, please comment.


Thursday 12 October 2017

Apache Ambari 2.5.0.3 installation on AWS RHEL7.

Purpose
The purpose of this blog is to explain how to install the Apache Ambari server and how to configure and install the services. Apache Ambari is a monitoring and management tool for the Hadoop ecosystem, through which we can install a Hadoop and Spark cluster.

Prerequisites
You should have at least two AWS machines with internal IPs and hostnames as below, and you have to pick one machine as the master for installing the Ambari server; it must have a public IP.
Operating System: Red Hat Enterprise Linux 7

111.22.3.44        ip-111-22-3-44.ap-south-1.compute.internal
111.22.33.444      ip-111-22-33-444.ap-south-1.compute.internal

Step 1 : Install the following packages on both machines.
sudo yum install wget
sudo yum install rpm
sudo yum install nano
sudo yum install curl
 
Step 2 : Configure passwordless SSH between these machines by following the steps below.
Copy the .pem file to the .ssh folder of your user on the master.
Change the file permissions using the command below.
chmod 400 ~/.ssh/aws-key-file.pem
Copy the .pem file to the second machine as well.
scp -i ~/.ssh/aws-key-file.pem ~/.ssh/aws-key-file.pem ec2-user@111.22.33.444:~/.ssh/
Log in to the second machine and run the commands below:
ssh -i ~/.ssh/aws-key-file.pem ec2-user@111.22.33.444
cd .ssh
mv aws-key-file.pem id_rsa
chmod 400 id_rsa
Log out from the second machine and execute the same commands on the master:
cd .ssh
mv aws-key-file.pem id_rsa
chmod 400 id_rsa
By now you should be able to log in from the master to the second machine without a password using the command below.
ssh 111.22.33.444
 
Step 3 : Add the entries below (based on your machines' private IPs) to the hosts file on both machines:
sudo nano /etc/hosts
111.22.3.44        ip-111-22-3-44.ap-south-1.compute.internal
111.22.33.444      ip-111-22-33-444.ap-south-1.compute.internal
 
Step 4 : Update umask using the commands below:
Open the profile file:
sudo nano /etc/profile
Set umask as below, and remove the other options in the if/else statement so that only this line remains:
umask 022
 
Step 5 : Disable SELinux using the commands below:
Open the file and change "enforcing" to "disabled" (SELINUX=disabled); this change only takes effect after a reboot. On RHEL 7, /etc/sysconfig/selinux is a symlink to /etc/selinux/config, so editing either file is enough.
sudo nano /etc/sysconfig/selinux
sudo nano /etc/selinux/config

sudo getenforce   (checks the SELinux status after the restart; it should print Disabled)

Step 6 : Install and disable firewalld.

sudo yum install firewalld
sudo service firewalld stop
sudo systemctl disable firewalld
sudo systemctl status firewalld
 
 
Step 7 : Enable ntpd
sudo yum install ntp
sudo systemctl enable ntpd
sudo systemctl start ntpd
sudo systemctl is-enabled ntpd

Step 8 : Reboot Machines 
sudo reboot

Starting installation
Run the command below on both machines to avoid a missing-library issue in the RHEL 7 repo for Ambari.
sudo yum-config-manager --enable rhui-REGION-rhel-server-optional
You can find the issue details here: https://community.hortonworks.com/questions/96763/hdp-26-ambari-install-fails-on-rhel-7-on-libtirpc.html

Adding the Ambari public repo
sudo wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.5.0.3/ambari.repo -O /etc/yum.repos.d/ambari.repo
 
Install Ambari server
sudo yum install ambari-server

Ambari server setup 
sudo ambari-server setup
Accept all the default options.

Start Ambari server. 
 sudo ambari-server start

Installing, Configuring, and Deploying an HDP Cluster
Access the Ambari server using a browser on the Ambari server host at port 8080.
Configure, install, and deploy HDP using Ambari by following the link below.
https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.2.1/bk_Installing_HDP_AMB/content/ch_Deploy_and_Configure_a_HDP_Cluster.html

Hope this helps.
Cheers.

Monday 2 October 2017

Spark streaming with Kafka - SubscribePattern Example

Spark Kafka streaming example
Please find below a Spark Kafka streaming example using SubscribePattern.
With SubscribePattern you can subscribe to topics using a regex. In this example I have specified the regex 'topic.*', so the program subscribes to all topics that match this regex, for example topicA, topicB, etc.
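The embedded snippet from the original post is not reproduced here; the following is a minimal sketch of a SubscribePattern consumer, assuming Spark 2.x with the spark-streaming-kafka-0-10 connector, a Kafka broker at localhost:9092, and an illustrative consumer group id (the broker address, group id, and batch interval are assumptions, not part of the original post):

import java.util.regex.Pattern

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.SubscribePattern

object SubscribePatternExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SubscribePatternExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",    // assumed broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "subscribe-pattern-example",  // illustrative group id
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Subscribe to every topic whose name matches the regex "topic.*"
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      SubscribePattern[String, String](Pattern.compile("topic.*"), kafkaParams)
    )

    // Print (topic, value) pairs for each micro-batch
    stream.map(record => (record.topic(), record.value())).print()

    ssc.start()
    ssc.awaitTermination()
  }
}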


Python and packages offline Installation in ubuntu machine.

https://www.linkedin.com/pulse/python-packages-offline-installation-ubuntu-machine-sijo-jose/