Tuesday, May 17, 2016

Shipping BigData Science Tools in Docker Containers

Docker has made life much easier for data scientists. With Docker, we can install
data science tools without the hassle of manual configuration.

This post is dedicated to setting up a Hadoop/Spark environment using Docker images.
But before going there, let me introduce Docker first. The post is divided into the following sections:

1. Docker concept
2. How to install Docker on Windows
3. Install Spark using Docker images

Let's start with Docker

A virtual machine takes a long time to boot because it starts a full guest operating system, and it needs lots of packages and dependencies to do so. Linux Containers (LXC) solve this problem by enabling
multiple isolated environments to run on a single machine while sharing the host kernel.
For more info about LXC, please refer to the wiki page https://en.wikipedia.org/wiki/LXC

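LXC was the heart/base of Docker: early Docker releases ran containers through LXC, and since version 0.9 Docker uses its own libcontainer library by default, built on the same kernel features.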

For more info on Docker, please refer to "docker.training"

How to install Docker on a Windows machine: please refer to the Docker documentation.
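After installation, a quick sanity check can save some head-scratching later. The snippet below is a minimal sketch, assuming it is run from a bash prompt such as the one in the Docker Quickstart Terminal. The function name check_docker is made up for this post; "docker version" is a standard Docker CLI command that fails when the daemon cannot be reached.

```shell
# check_docker is a hypothetical helper name, not part of Docker itself.
# It reports whether the docker client is on PATH and can reach the daemon.
check_docker() {
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker client not found on PATH"
        return 1
    fi
    # "docker version" asks both the client and the daemon for their versions.
    if docker version >/dev/null 2>&1; then
        echo "docker is up"
    else
        echo "docker daemon not reachable"
        return 1
    fi
}

# Print the status without aborting the shell if the check fails.
check_docker || true
```

If the message is anything other than "docker is up", fix the installation before moving on to the Spark steps.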

Once Docker is up and running, please follow the steps below to install and run Spark
1. Click on "Docker Quickstart Terminal"
2. Go to the bash shell (type "bash" and hit the return key)
3. In the Docker terminal, run the following command
docker run -i -t -h sandbox sequenceiq/spark:1.2.0 /etc/bootstrap.sh -bash
This will take some time: Docker first looks for the Spark image on the local machine and, if it is not there, starts downloading it from Docker Hub.
Please wait until all image layers are downloaded and extracted without any error.
On successful execution of the docker run command you will see the following messages
$ docker run -i -t -h sandbox sequenceiq/spark:1.2.0 /etc/bootstrap.sh -bash
Starting sshd:                                             [  OK  ]
Starting namenodes on [sandbox]
sandbox: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-nameno
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-data
Starting secondary namenodes [] starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-ro
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanage
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nod
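If you would rather separate the long download from the container start in step 3, the image can be fetched ahead of time. This is a minimal sketch, assuming the same image name and tag as in the docker run command above; "docker pull" and "docker images" are standard Docker CLI commands, and the IMAGE variable is only a convenience introduced here.

```shell
# Image name and tag copied from the docker run command in step 3.
IMAGE="sequenceiq/spark:1.2.0"

if command -v docker >/dev/null 2>&1; then
    # Download the image so that a later "docker run" starts without waiting.
    docker pull "$IMAGE"
    # List local copies of the image to confirm the download finished.
    docker images sequenceiq/spark
else
    echo "docker not found on PATH; open the Docker Quickstart Terminal first"
fi
```

Once the image shows up in the "docker images" listing, the docker run command from step 3 should go straight to the bootstrap messages shown above.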

Testing the Spark setup

4. cd /usr/local/spark
5. run "./bin/spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1"
6. scala> sc.parallelize(1 to 1000).count()
   The count should come back as 1000.

7. run "./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 1g --executor-memory 1g --executor-cores 1 ./lib/spark-examples-1.2.0-hadoop2.4.0.jar"
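In yarn-cluster mode the driver runs inside the cluster, so the SparkPi result ends up in the YARN logs rather than on your screen. As a sketch of one way to see the result directly, the snippet below swaps in yarn-client mode (a change from step 7, not the command used above) and filters the output for the "Pi is roughly" line that the SparkPi example prints. The helper name extract_pi is made up for this post, and the snippet assumes it is run from /usr/local/spark inside the container.

```shell
# extract_pi is a hypothetical helper: it keeps only the result line
# that the SparkPi example prints ("Pi is roughly ...").
extract_pi() {
    grep "Pi is roughly"
}

# Only attempt the run when spark-submit exists in the current directory.
if [ -x ./bin/spark-submit ]; then
    # yarn-client keeps the driver local, so its output reaches this terminal.
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn-client --driver-memory 1g --executor-memory 1g \
        --executor-cores 1 ./lib/spark-examples-1.2.0-hadoop2.4.0.jar 2>&1 \
        | extract_pi || echo "no Pi line found in the output"
fi
```

A single line starting with "Pi is roughly" means the job ran end to end on YARN.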

Cloudera QuickStart with Docker image

Please follow the steps below.