Tuesday, February 7, 2017

Hadoop multinode cluster setup on Ubuntu using Apache Ambari

1. Prerequisites

# set up passwordless SSH

# turn off iptables

# disable SELinux

# disable IPv6

2. Prepare for setup

cd /etc/apt/sources.list.d

wget http://public-repo-1.hortonworks.com/ambari/ubuntu14/2.x/updates/

apt-key adv --recv-keys --keyserver keyserver.ubuntu.com B9733A7A07513CAD
apt-get update
apt-get install ambari-server

ambari-server setup

Accept the default options when prompted.

ambari-server start

Access the Ambari server in a browser using the server's IP address or localhost, on port 8080 (for example localhost:8080).
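If the page does not load, it can help to check whether anything is listening on port 8080 at all. The following is a minimal sketch, not part of Ambari; the function name is_port_open is made up for illustration, and the host and port are the ones mentioned above.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


# "localhost" and 8080 come from the text above; use the server's IP
# address instead when checking from another machine.
print("Ambari UI reachable:", is_port_open("localhost", 8080))
```

A False result only says the port is not reachable; the cause could be a stopped ambari-server, a firewall rule, or a wrong address.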

# How to completely uninstall Ambari/Hadoop

apt-get remove ambari-server
apt-get remove ambari-agent
apt-get purge postgresql

rm -rf /usr/lib/ambari-*
rm -rf /etc/ambari-*
rm -rf /var/log/ambari-*



Important points

# every server should have a host entry for all servers in /etc/hosts
# remove the localhost details

# start with the basic setup (HDFS and YARN + MapReduce) first
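The /etc/hosts point above is easy to get wrong on one node out of many, so a small check can be run on each server. This is only a sketch: the function name missing_host_entries and the hostnames in the sample are hypothetical and should be replaced with the real cluster names.

```python
def missing_host_entries(hosts_text, hostnames):
    """Return the hostnames that have no entry in the given hosts-file text."""
    known = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        known.update(fields[1:])  # first field is the IP, the rest are names
    return [name for name in hostnames if name not in known]


# Sample hosts-file content with made-up addresses and names.
sample = """
127.0.0.1     localhost
192.168.1.10  master1.example.com master1
192.168.1.11  worker1.example.com worker1  # first worker
"""
print(missing_host_entries(sample, ["master1.example.com", "worker2.example.com"]))
# prints ['worker2.example.com']
```

On a real node, pass the contents of /etc/hosts instead of the sample string.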


Monday, October 17, 2016

Spark Setup and installation on Windows

1. Download R
2. Download Scala
3. Install JDK 1.7 or later
4. Download Python

5. After saving the files, I downloaded the Hadoop binary winutils.exe. Even though Spark runs independently of Hadoop, a bug makes it search for winutils.exe, which Hadoop needs, and throw an error when the file is missing.

I downloaded the file from the link below.


I created a folder named winutils in C:\, created a bin directory inside it, and placed the winutils.exe file there. The file location is as follows.


6. Set up environment variables
Press WIN + R to open the Run dialog and enter sysdm.cpl.
I then clicked on the Advanced tab and then on Environment Variables. I clicked New under the user variables and added the following:

variable name HADOOP_HOME with its value set to C:\winutils
variable name SPARK_HOME with its value set to C:\spark

7. Install Jupyter

8. Run Jupyter Notebook
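If Spark still complains about winutils.exe, the variables from step 6 can be checked from Python, for example in the first notebook cell. This is a sketch under the assumption that winutils.exe sits in a bin directory under HADOOP_HOME, as described in step 5; the function name check_spark_env is made up for illustration.

```python
import os


def check_spark_env(env=None):
    """Return a list of problems found in the Spark-on-Windows variables."""
    env = os.environ if env is None else env
    problems = []
    for name in ("HADOOP_HOME", "SPARK_HOME"):
        if not env.get(name):
            problems.append(f"{name} is not set")
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        # Assumed layout from step 5: %HADOOP_HOME%\bin\winutils.exe
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not os.path.isfile(winutils):
            problems.append(f"winutils.exe not found at {winutils}")
    return problems


for problem in check_spark_env() or ["no problems found"]:
    print(problem)
```

Newly added variables are only visible to programs started after they were saved, so restart the terminal or notebook server before checking.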



Troubleshooting steps

Wrong (Python 2 only, a syntax error in Python 3) => trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
Right (Python 3) => testErr = labelsAndPredictions.filter(lambda v: v[0] != v[1]).count() / float(testData.count())
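The difference can be reproduced without Spark. The sketch below uses a plain Python list as a stand-in for the RDD of (label, prediction) pairs; the sample values are made up.

```python
# Plain-Python stand-in for the RDD of (label, prediction) pairs.
labels_and_preds = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Python 2 form: tuple parameter unpacking was removed in Python 3
# (PEP 3113), so compiling this lambda raises SyntaxError.
try:
    compile("lambda (v, p): v != p", "<python2-style>", "eval")
    old_form_compiles = True
except SyntaxError:
    old_form_compiles = False

# Python 3 form: take the pair as one argument and index into it.
mismatches = list(filter(lambda vp: vp[0] != vp[1], labels_and_preds))
error_rate = len(mismatches) / float(len(labels_and_preds))

print(old_form_compiles)  # False
print(error_rate)         # 0.5
```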

model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', maxDepth=5, maxBins=32)

[1] "The data can be downloaded from: http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv "

  java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
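This error usually means the spark-csv package is not on the classpath. On Spark 1.x one common fix is to start the shell with the --packages option so the package is fetched at launch. The helper below only builds that command line; its name is made up, and the Scala and package versions are assumptions that must match the local Spark build.

```python
def spark_csv_launch_args(command="sparkR", scala_version="2.11",
                          package_version="1.5.0"):
    """Build a launch command that pulls in spark-csv via --packages."""
    # Coordinate format is groupId:artifactId:version; the versions here
    # are assumed defaults, not values taken from the post.
    coordinate = f"com.databricks:spark-csv_{scala_version}:{package_version}"
    return [command, "--packages", coordinate]


print(" ".join(spark_csv_launch_args()))
# prints: sparkR --packages com.databricks:spark-csv_2.11:1.5.0
```

The same option works for pyspark, spark-shell and spark-submit by passing a different command name.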

Thursday, September 1, 2016

Hive LLAP (Live Long and Process)

hive> set hive.llap.execution.mode=none;

To switch to LLAP mode, use the following command:

hive> set hive.llap.execution.mode=all;

LLAP is an optional daemon process running on multiple nodes that provides the following:

- Caching and data reuse across queries with compressed columnar data in-memory

- Multi-threaded execution including reads with predicate pushdown and hash join

- High throughput IO using Async IO Elevator with dedicated thread and core per disk

- Granular column level security across applications

Tuesday, May 17, 2016

Shipping Big Data Science Tools in a Docker Container

Docker has made life much easier for data scientists. Using Docker, we can install data science
tools very easily, without the hassle of configuration.

This blog post is dedicated to installing a Hadoop/Spark environment using Docker images.
But before going there, let me introduce Docker first. The post is divided into the following sections:

1. Docker concept
2. How to install Docker on Windows
3. Install Spark using Docker images

Let's start with Docker.

Virtual machines take a long time to boot and require many packages and dependencies. Linux Containers (LXC) solve this problem by enabling
multiple isolated environments to run on a single machine.
For more information about LXC, please refer to the wiki page https://en.wikipedia.org/wiki/LXC

LXC was the original base of Docker.

For more information on Docker, please refer to "docker.training".

To install Docker on a Windows machine, please refer to the Docker documentation below.

Once Docker is up and running, please follow the steps below to install and run Jupyter Notebook.
1. Click on "Docker Quickstart Terminal"
2. Go to the bash shell (type "bash" and hit the return key)
3. In the Docker terminal, run the following command
docker run -i -t -h sandbox sequenceiq/spark:1.2.0 /etc/bootstrap.sh -bash
This will take time, as Docker first looks for the Spark image on the local machine and then starts downloading the images from Docker Hub.
Please wait until all images are downloaded and extracted without any error.
On successful execution of the docker run command you will see the following messages:
$ docker run -i -t -h sandbox sequenceiq/spark:1.2.0 /etc/bootstrap.sh -bash
Starting sshd:                                             [  OK  ]
Starting namenodes on [sandbox]
sandbox: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-nameno
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-data
Starting secondary namenodes [] starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-ro
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanage
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nod

Testing the Spark setup

4. cd /usr/local/spark
5. Run "./bin/spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1"
6. scala> sc.parallelize(1 to 1000).count()

7. Run "./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 1g --executor-memory 1g --executor-cores 1 ./lib/spark-examples-1.2.0-hadoop2.4.0.jar"
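For readers wondering what the SparkPi job in step 7 computes: it estimates pi with a Monte Carlo method, throwing random points at a square and counting how many fall inside the inscribed circle. The sketch below shows the same idea in plain Python on a single machine; it is not the Spark example's source, and the function name estimate_pi is made up.

```python
import random


def estimate_pi(samples, seed=None):
    """Estimate pi from the share of random points inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        # Pick a point uniformly in the square [-1, 1] x [-1, 1].
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Circle area / square area = pi / 4.
    return 4.0 * inside / samples


print(estimate_pi(100_000, seed=42))
```

Spark distributes the sampling loop across executors and sums the counts, which is why the job is a convenient smoke test for a new cluster.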

Cloudera QuickStart with Docker image

Please follow the steps below.