Friday, August 25, 2017

Unlocking Kafka Exactly-Once Semantics

How to implement "exactly-once semantics" on the Kafka side. Before that, you have to check your Kafka version:

If it is the Kafka 0.11 release (or later), you can enable the exactly-once semantics feature. Prior to 0.11, Kafka supported at-least-once delivery, which can cause duplicate results.

Exactly-once semantics is implemented using two features: idempotent producers and transaction management.
How to enable Exactly-once Semantics

In the producer config, add/modify the properties below:
enable.idempotence=true
acks=all
retries=Integer.MAX_VALUE
max.in.flight.requests.per.connection=1
transactional.id=<some unique id>

In the consumer config, add/modify the property below:
isolation.level=<read_committed or read_uncommitted>  (use read_committed to read only committed transactional messages)
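
For illustration, here is a minimal sketch of a transactional produce/consume round trip in Python. It assumes the confluent-kafka client is installed (pip install confluent-kafka), a broker at localhost:9092, and a topic named demo-topic; adjust these to your setup.

# Sketch: idempotent, transactional producer plus a read_committed consumer.
# Assumptions: confluent-kafka installed, broker at localhost:9092, topic "demo-topic".
from confluent_kafka import Producer, Consumer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,
    "acks": "all",
    "transactional.id": "demo-tx-1",          # must be unique per producer instance
})
producer.init_transactions()
producer.begin_transaction()
producer.produce("demo-topic", key="k1", value="v1")
producer.commit_transaction()                  # call abort_transaction() on failure instead

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-group",
    "isolation.level": "read_committed",       # skip messages from aborted transactions
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["demo-topic"])
msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()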



References:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging

https://cwiki.apache.org/confluence/display/KAFKA/KIP-129%3A+Streams+Exactly-Once+Semantics


https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/

Tuesday, February 7, 2017

Hadoop multi-node cluster setup on Ubuntu using Apache Ambari

1. Pre-requisites

# set up passwordless SSH


# turn off iptables
https://help.ubuntu.com/community/IptablesHowTo#Configuration%20on%20startup

# disable SELinux

# disable IPv6


2. Prepare for setup

cd /etc/apt/sources.list.d

 wget http://public-repo-1.hortonworks.com/ambari/ubuntu14/2.x/updates/2.2.2.0/ambari.list 

  apt-key adv --recv-keys --keyserver keyserver.ubuntu.com B9733A7A07513CAD
  apt-get update
  apt-get install ambari-server


ambari-server setup

go with the defaults

ambari-server start


Access the Ambari server using its IP or localhost on port 8080

# How to completely uninstall Ambari/Hadoop

apt-get remove ambari-server
apt-get remove ambari-agent
apt-get purge postgresql

rm -rf /usr/lib/ambari-*
rm -rf /etc/ambari-*
rm -rf /var/log/ambari-*

http://www.yourtechchick.com/hadoop/how-to-completely-remove-and-uninstall-hdp-components-hadoop-uninstall-on-linux-system/

https://community.hortonworks.com/questions/1110/how-to-completely-remove-uninstall-ambari-and-hdp.html

Important points

# all servers should have every server's host entry in /etc/hosts
# remove the localhost details

# Start with a basic setup (HDFS and YARN + MapReduce) first


Monday, October 17, 2016

Spark setup and installation on Windows


1. Download R
https://cran.r-project.org/mirrors.html
2. Download Scala
http://www.scala-lang.org/download/2.10.6.html
3. JDK 1.7+
4. Download Python

5. After downloading and extracting Spark, download the Hadoop binary winutils.exe. Even though Spark runs independently of Hadoop, on Windows there is a bug: Spark searches for winutils.exe (which is needed for Hadoop) and throws an error if it is missing.

I downloaded the file from the link below:

http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe

I created a folder named winutils in C:\, created a bin directory inside it, and placed winutils.exe there. The file location is as follows:

C:\winutils\bin\winutils.exe

6. Set up environment variables
Press WIN + R to open the Run dialog and enter sysdm.cpl.
Then click the Advanced tab, then Environment Variables. Click New under user variables and add the following:

variable name HADOOP_HOME with value C:\winutils
variable name SPARK_HOME with value C:\spark

7. Install Jupyter
https://jupyter.readthedocs.io/en/latest/install.html

8. run jupyter notebook
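
A quick way to sanity-check step 6 from inside a notebook is sketched below; it assumes the optional findspark package (pip install findspark) and that Spark was extracted to C:\spark as above.

# Sanity check for the HADOOP_HOME / SPARK_HOME setup from step 6.
# Assumptions: findspark is installed and Spark is extracted to C:\spark.
import os
print(os.environ.get("HADOOP_HOME"))    # expect C:\winutils
print(os.environ.get("SPARK_HOME"))     # expect C:\spark

import findspark
findspark.init()                         # puts SPARK_HOME's Python libs on sys.path

from pyspark import SparkContext
sc = SparkContext(master="local[*]", appName="smoke-test")
print(sc.parallelize(range(1000)).count())   # should print 1000
sc.stop()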

*************************

Continue reading:
http://spark.apache.org/docs/latest/programming-guide.html#overview

************************


Troubleshooting steps

Wrong (Python 2 style tuple-unpacking lambda, not valid in Python 3) => trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
Right (Python 3) => testErr = labelsAndPredictions.filter(lambda v: v[0] != v[1]).count() / float(testData.count())

model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},impurity='gini', maxDepth=5, maxBins=32)
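
Putting the corrected lambda and the trainClassifier call together, a complete version following the standard MLlib pattern could look like this (the data path is a placeholder for any LIBSVM-format file):

# End-to-end sketch of the DecisionTree workflow referenced above.
# Assumption: "data/sample_libsvm_data.txt" is a placeholder LIBSVM dataset path.
from pyspark import SparkContext
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="decision-tree-example")
data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
trainingData, testData = data.randomSplit([0.7, 0.3])

model = DecisionTree.trainClassifier(trainingData, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

# Python 3: index into the (label, prediction) tuple instead of unpacking it
predictions = model.predict(testData.map(lambda lp: lp.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda v: v[0] != v[1]).count() / float(testData.count())
print("Test Error = " + str(testErr))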



Error
[1] "The data can be downloaded from: http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv "

  java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org

This usually means the spark-csv package was not loaded; launching SparkR/spark-shell with, for example, --packages com.databricks:spark-csv_2.10:1.5.0 typically resolves it.

Thursday, September 1, 2016

Hive LLAP (Live Long and Process)

hive> set hive.llap.execution.mode=none;

To switch into LLAP mode, use the following command:

hive> set hive.llap.execution.mode=all;

LLAP is an optional daemon process running on multiple nodes that provides the following:

- Caching and data reuse across queries, with compressed columnar data in memory

- Multi-threaded execution, including reads with predicate pushdown and hash joins

- High-throughput I/O using an async I/O elevator with a dedicated thread and core per disk

- Granular column-level security across applications

Tuesday, May 17, 2016

Shipping BigData Science Tools in Docker Container





Docker has made life very easy for data scientists. Using Docker, we can install data science tools very easily without the hassle of configuration.

This blog is dedicated to setting up a Hadoop/Spark environment using Docker images. But before going there, let me introduce Docker first. The section is divided into the following steps:

1. Docker concepts
2. How to install Docker on Windows
3. Install Spark using Docker images

Let's start with Docker.

Virtual machines take a long time to boot up and require lots of packages and dependencies. Linux Containers (LXC) solve this problem by enabling multiple isolated environments to run on a single machine.
For more info about LXC, please refer to the wiki page: https://en.wikipedia.org/wiki/LXC

LXC is the heart/base of Docker.






For more info on Docker, please refer to "docker.training".

To install Docker on a Windows machine, please refer to the Docker documentation below:
https://docs.docker.com/windows/step_one/

Once Docker is up and running fine, please follow the steps below to pull and run the Spark image:
1. Click on "Docker Quickstart Terminal"
2. Go to the bash shell (type "bash" and hit the return key)
3. In the Docker terminal, run the following command:
docker run -i -t -h sandbox sequenceiq/spark:1.2.0 /etc/bootstrap.sh -bash
This will take time, as Docker first looks for the Spark image on the local machine and then starts downloading it from Docker Hub.
Please wait until all image layers are downloaded and extracted properly without any error.
On successful execution of the docker run command, you will see the following messages:
$ docker run -i -t -h sandbox sequenceiq/spark:1.2.0 /etc/bootstrap.sh -bash
/
Starting sshd:                                             [  OK  ]
Starting namenodes on [sandbox]
sandbox: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-sandbox.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-sandbox.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-sandbox.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-sandbox.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-sandbox.out

Testing the Spark setup

4. cd /usr/local/spark
5. run "./bin/spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1"
6. scala> sc.parallelize(1 to 1000).count()

7. run "./bin/spark-submit  --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 1g --executor-memory 1g --executor-cores 1 ./lib/spark-examples-1.2.0-hadoop2.4.0.jar


Cloudera QuickStart with Docker image

Please follow the steps at the link below:

https://hub.docker.com/r/cloudera/quickstart/