Friday 17 November 2017

Apache Spark window functions and User defined function example.

Friday 10 November 2017

ML/Recommendation Engine using BDB Predictive Workbench


Case Study: Bizviz Predictive Workbench

Our client is a food franchise company that delivers meals to kids and families wherever they learn and play. Each kitchen is an independently owned and operated franchise, run by a local business person who manages the production and distribution of nutritious meals for children and families within their community.
To suggest products to a user from a pool of products while they are ordering, we created a recommendation engine that takes feedback about each product from other customers and suggests the most relevant product for that customer.

Thursday 9 November 2017

Kafka partitions Explained.

As we all know, Kafka topics are divided into a number of partitions. A topic has one or more partitions. Partitioning is how Kafka implements parallelism: producers can write data to different partitions in parallel, and the consumers in a consumer group can read them in parallel. Within a consumer group, each partition is assigned to exactly one consumer, so Kafka always gives a single partition's data to one consumer thread; a single consumer may own several partitions. The degree of parallelism therefore depends on the number of partitions and the number of consumers in the consumer group. Having more consumers than partitions does not help: the extra consumers will sit idle. Of course, you can start with fewer consumers than partitions and add consumers later. When a new consumer joins, the group coordinator rebalances the group and reassigns partitions so that the new consumer gets its share.
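To make the partition-to-consumer relationship concrete, here is a minimal sketch in plain Python. It is not Kafka's actual assignment strategy; the function name assign_partitions and the consumer names are made up for illustration. It only shows the rule described above: every partition has exactly one owner in the group, and consumers beyond the partition count get nothing.

```python
def assign_partitions(partitions, consumers):
    """Hypothetical round-robin assignment: each partition gets exactly one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2]

# Fewer consumers than partitions: one consumer owns more than one partition.
two = assign_partitions(partitions, ["c1", "c2"])

# More consumers than partitions: the extra consumer stays idle.
four = assign_partitions(partitions, ["c1", "c2", "c3", "c4"])
idle = [consumer for consumer, owned in four.items() if not owned]

print(two)
print(four)
print("idle consumers:", idle)
```

With three partitions, the fourth consumer ends up with an empty list, which is the "idle consumer" case from the paragraph above.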

On the producer side, you can produce data to a topic without worrying about partitions. The producer config property partitioner.class sets the partitioner; by default it is set to org.apache.kafka.clients.producer.internals.DefaultPartitioner. If a record has a key, the default partitioner chooses the partition from a hash of the record key. For example, if you want all of a particular customer's data to go to the same partition, you can use a unique id for each customer as the record key. Kafka uses the hash of this key to pick the topic partition, so all records with the same key land in the same partition (as long as the number of partitions does not change). Note that one partition can still hold data for many customers. If the record has no key, the default partitioner distributes records round-robin. You can define a custom partitioner as well.
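The idea behind key-based partitioning can be sketched in a few lines of Python. This is an illustration only, not Kafka's implementation: the class name SketchPartitioner is hypothetical, and zlib.crc32 is used as a stand-in hash (the real DefaultPartitioner uses its own hash function, so actual partition numbers will differ).

```python
import itertools
import zlib


class SketchPartitioner:
    """Hypothetical partitioner: hash the key if present, else go round-robin."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = itertools.count()

    def partition(self, key):
        if key is not None:
            # Stand-in hash (assumption): same key always maps to the same partition.
            return zlib.crc32(key.encode("utf-8")) % self.num_partitions
        # No key: spread records evenly over the partitions.
        return next(self._counter) % self.num_partitions


partitioner = SketchPartitioner(num_partitions=3)

first = partitioner.partition("customer-42")
second = partitioner.partition("customer-42")
keyless = [partitioner.partition(None) for _ in range(4)]

print(first == second)  # records with the same key stay together
print(keyless)          # keyless records rotate through the partitions
```

Because the partition is computed as hash modulo the partition count, changing the number of partitions changes where existing keys land.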

Tuesday 7 November 2017

Machine Learning through example.

What is machine learning?

The purpose of this blog is to explain machine learning as simply as possible using a simple example. Our aim is to create a system that answers the question: is a given drink wine or beer? The question answering system that we are going to build is called a model, and this model is created via a process called training.

What is Training?

In machine learning, the goal of training is to create an accurate model that answers our questions correctly most of the time. In order to train a model we need to collect data. There are many aspects of drinks that we could collect data on. For simplicity, here we will collect two aspects of each drink: colour and alcohol percentage. We hope we can separate the two types of drinks based on these two factors alone. We call these factors features. The quality and quantity of the data you gather will directly determine how good your predictive model will be. At this point we can collect some training data and create a table with three columns, namely colour, alcohol %, and beer or wine.

Data preparation.

The next step is data preparation. We load our data into a suitable place and prepare it for use. We can use visualization techniques to check for data imbalance or to find anomalies in the data. For example, if we have collected more data points for beer than for wine, our model is going to be heavily biased towards beer. Make sure the order of the data is random. We also need to split our data into two parts, preferably 80-20. The first part we will use for training and the second part we will use for evaluating our model. In this step we may have to apply many other data preparation techniques in order to clean our data, such as normalization, duplicate detection, finding outliers, and converting text values to their numeric equivalents (some algorithms accept only numeric values).
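The shuffle and the 80-20 split can be sketched with the Python standard library alone. The rows below are made-up placeholder values, not real measurements, and the helper name shuffle_and_split is hypothetical; colour is assumed to be already converted to a number.

```python
import random


def shuffle_and_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows, then cut it into training and evaluation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder rows (assumption): (colour as a number, alcohol %, label)
data = [
    (0.30, 4.5, "beer"), (0.40, 5.0, "beer"), (0.50, 6.0, "beer"),
    (0.35, 4.0, "beer"), (0.45, 5.5, "beer"),
    (0.80, 12.0, "wine"), (0.70, 13.5, "wine"), (0.90, 14.0, "wine"),
    (0.75, 12.5, "wine"), (0.85, 13.0, "wine"),
]

train_rows, eval_rows = shuffle_and_split(data)
print(len(train_rows), "rows for training,", len(eval_rows), "rows for evaluation")
```

Shuffling before cutting matters here because the placeholder rows are grouped by label; without it the evaluation part would contain only wine.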

Selecting appropriate model.

The next step is choosing a model. There are many models that researchers have created over the years. Some are suited to image data, some to numerical data, and some to text-based data. In our case we have just two features, so we can use a simple linear model.

Training the model.

Now we can move on to training. In this step we incrementally use our training data to improve the ability of our model to predict whether a given drink is beer or wine. When we first start training, the model draws a random line through the data. Then, as each step of the training progresses, the line moves step by step closer to the ideal separation of wine and beer. Once training is complete, it's time to evaluate the model. Evaluation allows us to test our model against data that has never been used for training. Once you are done with evaluation, you may see that you can improve your model further. We can do this by tuning some of the parameters that we implicitly assumed during training. One example of such a parameter is the number of iterations.
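As a rough illustration of training and evaluation, here is a minimal perceptron-style linear classifier in plain Python. It is one possible linear model, not necessarily the one behind this post. All data values are made up, the function names are hypothetical, and for simplicity the line starts from zero weights rather than a random position. The iterations argument is the tunable parameter mentioned above.

```python
def scale(colour, alcohol):
    # Simple normalization (assumption): bring alcohol % into roughly the 0-1 range.
    return (colour, alcohol / 15.0)


def score(weights, bias, features):
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict(weights, bias, features):
    return "wine" if score(weights, bias, features) > 0 else "beer"


def train(rows, iterations=200, learning_rate=0.1):
    """Nudge the separating line every time a training row is misclassified."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(iterations):
        for colour, alcohol, label in rows:
            x = scale(colour, alcohol)
            target = 1 if label == "wine" else -1
            if target * score(weights, bias, x) <= 0:  # wrong side of the line
                weights = [w + learning_rate * target * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * target
    return weights, bias


def accuracy(weights, bias, rows):
    hits = sum(predict(weights, bias, scale(c, a)) == label
               for c, a, label in rows)
    return hits / len(rows)


# Made-up placeholder rows: (colour as a number, alcohol %, label)
train_rows = [
    (0.30, 4.5, "beer"), (0.40, 5.0, "beer"), (0.50, 6.0, "beer"),
    (0.35, 4.0, "beer"),
    (0.80, 12.0, "wine"), (0.70, 13.5, "wine"), (0.90, 14.0, "wine"),
    (0.75, 12.5, "wine"),
]
eval_rows = [(0.45, 5.5, "beer"), (0.85, 13.0, "wine")]  # never used in training

weights, bias = train(train_rows)
train_accuracy = accuracy(weights, bias, train_rows)
eval_accuracy = accuracy(weights, bias, eval_rows)
print("training accuracy:", train_accuracy)
print("evaluation accuracy:", eval_accuracy)

# Using the trained model on a new drink (hypothetical values).
print(predict(weights, bias, scale(0.6, 11.0)))
```

The placeholder data is cleanly separable, so the line settles well within the default number of iterations; with real, noisier data you would compare evaluation accuracy across different settings instead.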

Deploy the model.

The final step is to deploy our model. We can finally use our model to predict whether a given drink is beer or wine.

Cheers
Hope this is helpful.

Python and packages offline Installation in ubuntu machine.

https://www.linkedin.com/pulse/python-packages-offline-installation-ubuntu-machine-sijo-jose/