Fun to code.
Friday, 1 June 2018
Spark Streaming backpressure - avoiding the trade-off between Spark Streaming batch interval and batch processing time.
In a Spark Streaming application, batches of data should be processed as fast as they are generated. Ideally, the batch processing time should be close to the batch interval, but consistently below it. This is one of the major requirements of a production-ready Spark Streaming application.
If the batch processing time is greater than the batch interval, data piles up in the system and it will eventually run out of resources. If, on the other hand, the batch processing time is far below the batch interval, cluster resources are wasted.
According to the Spark documentation:
"A good approach to figuring out the right batch size for your application is to test it with a conservative batch interval (say, 5-10 seconds) and a low data rate".
However, this approach has its own drawbacks. For a long-running Spark application that stays in production for months, things change over time. Sometimes message characteristics such as message size change, causing the processing time for the same number of messages to vary. Sometimes a multi-tenant cluster becomes busy during the daytime, when other big data applications such as Impala, Hive, or MapReduce jobs compete for shared resources such as CPU, memory, network, and disk I/O.
Backpressure comes to the rescue! Backpressure was a highly demanded feature that allows the ingestion rate to be set dynamically and automatically, based on the previous micro-batch processing times. This feedback loop makes it possible to adapt to the fluctuating nature of a streaming workload.
Backpressure can be enabled by setting the configuration parameter:
sparkConf.set("spark.streaming.backpressure.enabled", "true")
What about the rate of the first micro-batch? Since there is no previous micro-batch processing time available, there is no basis for estimating the optimal rate the application should use. According to SPARK-18580, this has been resolved: currently, `spark.streaming.kafka.maxRatePerPartition` is used as the initial rate when backpressure is enabled.
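For reference, here is a minimal sketch (in Scala) of enabling backpressure on a direct Kafka stream. The topic name, bootstrap servers, group id, and rate value are placeholders, not taken from the original post:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

// Placeholder configuration values, for illustration only.
val sparkConf = new SparkConf()
  .setAppName("BackpressureExample")
  // Let Spark adapt the ingestion rate to previous batch processing times.
  .set("spark.streaming.backpressure.enabled", "true")
  // Per-partition upper bound; also used as the initial rate when backpressure is on.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val ssc = new StreamingContext(sparkConf, Seconds(10)) // 10-second batch interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "backpressure-demo"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
)

stream.map(record => record.value).count().print()

ssc.start()
ssc.awaitTermination()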
Thursday, 3 May 2018
Differences between PYTHON and JAVA.
In this blog, I am trying to list out the differences between Python and Java. Python and Java are two very different programming languages.
* Dynamic vs Static Typing
In Python, you can declare a variable without a type; Python infers the type of the variable from the value assigned to it.
For example:
x = 1 # here the type of variable 'x' is int.
x = "hello" # here the type of variable 'x' is str.
# in both cases, we do not specify the type of the variable.
Java forces you to declare the type of a variable when you first define it. A language is statically typed if the type of a variable is known at compile time.
* Braces vs Indentation
Python uses indentation to separate code into blocks.
Java uses curly braces to define the beginning and end of code blocks.
* Portability
To run Python programs, you need an interpreter on the target machine that turns Python source code into instructions your particular operating system can understand.
Java is platform independent: Java code is compiled into bytecode, which can be executed by a platform-specific JVM on any device.
* Verbosity and Simplicity
Java syntax is verbose; Python syntax is clean and simple.
* Semicolon
Python statements do not need a semicolon to end them.
If you miss a semicolon in Java, the compiler throws an error.
* Comments
JAVA:
//This is a single-line comment
/*This is a multiline comment
Yes it is*/
PYTHON:
#This is a comment
Friday, 9 February 2018
Overview of BDB predictive Workbench
BDB Predictive Analytics Workbench is a statistical analysis and data mining solution that enables you to build predictive models to discover hidden insights and relationships in your data, from which you can make predictions about future events. It is a data scientist workbench with R, Python, and Spark capabilities. Using it, we can train different machine learning models and evaluate them on the basis of accuracy and various other parameters. If the model accuracy is good enough, we can deploy the model on top of actual data, which will give you the predictions. This is a pluggable feature of the BizViz platform. Predictive analysis supports different data sources such as BizViz data services (MySQL, MSSQL, Oracle, and Cassandra) and files (CSV, TSV, TXT).
It supports most of the popular statistical algorithms, such as association, classification, clustering, outlier detection, regression, and time series forecasting (R implementation). For big data processing, data sources like HDFS and Cassandra are supported; the big data processing layer is an Apache Spark cluster using Spark MLlib.
Here you can watch a demo of BDB predictive workbench:
Friday, 17 November 2017
Apache Spark window functions and user-defined function (UDF) example.
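The code embedded in the original post is not reproduced here. As a rough illustration only, a window function combined with a user-defined function in Spark (Scala) could look like the sketch below; the column names and sample data are made up for this example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("WindowAndUdfExample").getOrCreate()
import spark.implicits._

// Made-up sales data: (department, employee, salary).
val df = Seq(
  ("sales", "alice", 4000),
  ("sales", "bob", 3500),
  ("hr", "carol", 3000)
).toDF("dept", "name", "salary")

// Window function: rank employees by salary within each department.
val byDept = Window.partitionBy("dept").orderBy($"salary".desc)
val ranked = df.withColumn("rank", rank().over(byDept))

// User-defined function: flag salaries above a (hypothetical) threshold.
val isHighPaid = udf((salary: Int) => salary > 3200)
ranked.withColumn("highPaid", isHighPaid($"salary")).show()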
Friday, 10 November 2017
ML/Recommendation Engine using BDB Predictive Workbench
Case Study: Bizviz Predictive Workbench
Our client is a food franchise company that delivers meals to kids and families wherever they learn and play. Each kitchen is an independently owned and operated franchise, managed by a local business person who manages the production and distribution of nutritious meals for children and families within their community.
To suggest products to a user from a pool of products while ordering, we created a recommendation engine that takes feedback about products from other customers and suggests the most relevant products for each customer.
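The post does not name the algorithm used; as one possible illustration, a feedback-based recommender built with collaborative filtering (ALS) in Spark MLlib could be sketched as follows, with made-up customer and product ids:

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ProductRecommendations").getOrCreate()
import spark.implicits._

// Made-up feedback data: customer ratings for products.
val ratings = Seq(
  (1, 101, 5.0f), (1, 102, 3.0f),
  (2, 101, 4.0f), (2, 103, 2.0f),
  (3, 102, 4.0f), (3, 103, 5.0f)
).toDF("customerId", "productId", "rating")

// Collaborative filtering with ALS (an assumed choice, not stated in the post).
val als = new ALS()
  .setUserCol("customerId")
  .setItemCol("productId")
  .setRatingCol("rating")

val model = als.fit(ratings)
// Recommend the top 3 products for each customer.
model.recommendForAllUsers(3).show(false)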
For more details:-
Thursday, 9 November 2017
Kafka partitions Explained.
As we all know, Kafka topics are divided into a number of partitions; a topic can have one or more partitions. Partitioning is how parallelism is implemented in Kafka: producers can write data to different partitions in parallel, and consumers in a consumer group can read from them in parallel. Each consumer in a consumer group is bound to particular partitions, and Kafka always gives a single partition's data to only one consumer thread. So the degree of parallelism depends on the number of partitions and the number of consumers in a consumer group. You cannot have more consumers than the total number of partitions; if you add more consumers, the extra consumers will be idle. Of course, you can start with fewer consumers than partitions and add consumers later. When you add a new consumer, the group coordinator will assign particular partitions to it.
On the producer side, you can produce data to a topic without worrying about partitions; by default Kafka uses the default partitioner. The producer config property partitioner.class sets the partitioner, and by default it is set to org.apache.kafka.clients.producer.internals.DefaultPartitioner. If the record has a key, the default partitioner partitions using the hash of the record key. For example, if you want to send a particular customer's data to a particular partition, you can use a unique id as the record key for each customer. Kafka uses the hash of this key to pick the topic partition for that customer, so the same partition always holds that customer's data. If the record has no key, the default partitioner distributes records round-robin. You can define a custom partitioner as well.
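As an illustration, a minimal Scala producer that keys records by customer id might look like the sketch below; the broker address, topic name, and record values are placeholders:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder broker
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// With the customer id as the record key, the default partitioner hashes the key,
// so all records for the same customer land in the same partition.
val record = new ProducerRecord[String, String]("customer-events", "customer-42", "order placed")
producer.send(record)
producer.close()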
Tuesday, 7 November 2017
Machine Learning through example.
What is machine learning?
The purpose of this blog is to explain machine learning as simply as possible using a simple example. Our aim is to create a system that answers the question: is a given drink wine or beer? The question-answering system we are going to build is called a model, and this model is created via a process called training.
What is Training?
In machine learning, the goal of training is to create an accurate model that answers our questions correctly most of the time. In order to train a model, we need to collect data. There are many aspects of drinks we could collect data on; for simplicity, here we will collect just two: colour and alcohol percentage. We hope we can separate the two types of drinks based on these two factors alone. We call these factors features. The quality and quantity of the data you gather will directly determine how good your predictive model can be. At this point we can collect some training data: create a table with three columns, namely colour, alcohol %, and beer or wine.
Data preparation.
The next step is data preparation: we load our data into a suitable place and prepare it for use. We can use visualization techniques to check for data imbalance or to find anomalies in the data. For example, if you have collected more data points for beer than for wine, the model is going to be heavily biased towards beer. Make sure the order of the data is random. We also need to split our data into two parts, preferably 80-20: the first part is used for training and the second part for evaluating the model. In this step we may have to apply many other data preparation techniques to clean the data, such as normalization, duplicate detection, outlier detection, and converting text values to their numeric equivalents (some algorithms accept only numeric values).
Selecting appropriate model.
The next step is choosing a model. There are many models that researchers have created over the years. Some are suited for image data, some for numerical data, and some for text-based data. In our case we have just two features, so we can use a simple linear model.
Training the model.
Now we can move on to training. In this step we incrementally use our data to improve the model's ability to predict whether a given drink is beer or wine. When training starts, the model draws a random line through the data. Then, as each step of training progresses, the line moves step by step closer to an ideal separation of wine and beer. Once training is complete, it is time to evaluate the model. Evaluation lets us test the model against data that has never been used for training. Once you are done with the evaluation, you may find that you can improve the model further. We can do this by tuning some of the parameters we implicitly assumed during training; one example of such a parameter is the number of iterations.
Deploy the model.
The final step is to deploy our model. We can finally use it to predict whether a given drink is beer or wine.
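As a concrete illustration only (the post itself does not name any library), the beer-vs-wine pipeline above could be sketched with Spark MLlib in Scala, using made-up feature values:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BeerOrWine").getOrCreate()
import spark.implicits._

// Made-up training data: (colour encoded as a number, alcohol %, label: 0.0 = beer, 1.0 = wine).
val drinks = Seq(
  (0.2, 4.5, 0.0), (0.3, 5.0, 0.0), (0.25, 4.8, 0.0), (0.35, 6.0, 0.0), (0.3, 5.5, 0.0),
  (0.8, 12.5, 1.0), (0.9, 13.0, 1.0), (0.85, 12.0, 1.0), (0.75, 11.5, 1.0), (0.95, 14.0, 1.0)
).toDF("colour", "alcohol", "label")

// Combine the two features into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("colour", "alcohol"))
  .setOutputCol("features")
val data = assembler.transform(drinks)

// 80-20 split for training and evaluation, as described above.
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

// A simple linear model: logistic regression.
val model = new LogisticRegression().setMaxIter(10).fit(train)

// Evaluate on the held-out data.
val predictions = model.transform(test)
val auc = new BinaryClassificationEvaluator().evaluate(predictions)
println(s"Area under ROC: $auc")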
Cheers
Hope this would be helpful.
Monday, 23 October 2017
Difference between DataFrame and DataSet in Spark 2.0
The purpose of this blog is to list out the main differences between DataFrame (DF) and Dataset (DS).
From Spark version 2.0, Datasets in Apache Spark are an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. A Dataset is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java.
A DataFrame is an alias for a collection of generic objects, Dataset[Row], where Row is a generic untyped JVM object.
DataFrame -> Dataset[Row]: here Row is an untyped, generic JVM object; it is fast and suitable for interactive analysis.
Dataset[T] -> the typed API is optimized for data engineering tasks.
Static-typing and runtime type-safety
Dataset is the most restrictive. Since Dataset APIs are all expressed as lambda functions and typed JVM objects, any mismatch of typed parameters is detected at compile time. Analysis errors can also be detected at compile time when using Datasets, saving developer time and cost.
If you want a higher degree of type safety at compile time, want typed JVM objects, and want to take advantage of Catalyst optimization and benefit from Tungsten's efficient code generation, use Dataset.
The disadvantage of DataFrame compared to Dataset is the lack of type safety. As a developer, I do not like using DataFrame because it does not feel developer friendly: referring to an attribute by its String name means no compile-time safety, and things can fail at runtime. The API also feels less programmatic and more SQL-like.
Advantages of Dataset over DataFrame:
It has an additional feature: Encoders.
Encoders act as an interface between JVM objects and Spark's off-heap custom binary memory format.
Encoders generate bytecode to interact with off-heap data and provide on-demand access to individual attributes without having to deserialize an entire object.
A case class is used to define the structure of the data schema in a Dataset. Using a case class, it is very easy to work with Datasets: the attribute names in the case class map directly to the field names in the Dataset. It feels like working with an RDD, but underneath it works the same way as a DataFrame.
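For illustration, a minimal sketch of the typed versus untyped APIs, using a hypothetical Person case class and input file, might look like this:

import org.apache.spark.sql.SparkSession

// Strongly-typed schema for the Dataset.
case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("DfVsDs").getOrCreate()
import spark.implicits._

// DataFrame: a Dataset[Row]; columns are referred to by string names.
val df = spark.read.json("people.json")   // hypothetical input file
// df.select("nmae")   // a typo like this compiles fine but fails only at runtime

// Dataset[Person]: the case class fields map directly to the columns,
// and an implicit Encoder handles serialization to Spark's internal format.
val ds = df.as[Person]
ds.filter(p => p.age > 21)   // p.nmae would be caught at compile time
  .map(p => p.name.toUpperCase)
  .show()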
I hope this is helpful. If any of you have additional points, please comment.