Bhupendra Mishra: 09/18/15

Apache Oozie Workflow scheduler for Hadoop

- Oozie is a workflow scheduler system for managing Apache Hadoop jobs
- Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions
- Oozie coordinator jobs are recurrent Oozie workflow jobs triggered by time and data availability

The programs in a workflow depend on each other to achieve a common goal

Oozie acts as a middleman between the user and Hadoop. The user submits a job to Oozie, and Oozie executes it on Hadoop via a launcher job and then returns the results.

The user has to design a workflow.xml with a set of actions arranged in a DAG (Directed Acyclic Graph), as in the sample workflow.xml further below
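As a rough illustration of what "arranged in a DAG" means, the sketch below parses a trimmed-down workflow definition and checks that every transition points at an existing node and that no node can reach itself again. It is only a sketch under assumptions: the embedded XML, the kill message and the helper names build_graph and is_dag are made up for this example, only start, action, kill and end nodes are handled, and Oozie runs its own validation when a workflow is submitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down workflow used only for this illustration.
WORKFLOW_XML = """
<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="wordcount"/>
    <action name="wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>wordcount failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
"""

# ElementTree prefixes every tag with the namespace in braces.
NS = "{uri:oozie:workflow:0.2}"


def build_graph(xml_text):
    """Return {node_name: [names of the nodes it can transition to]}."""
    root = ET.fromstring(xml_text)
    graph = {}
    for node in root:
        tag = node.tag.replace(NS, "")
        if tag == "start":
            graph["start"] = [node.get("to")]
        elif tag == "action":
            # An action leaves through its <ok> or <error> transition.
            graph[node.get("name")] = [
                child.get("to") for child in node
                if child.tag in (NS + "ok", NS + "error")
            ]
        elif tag in ("kill", "end"):
            graph[node.get("name")] = []  # terminal nodes
    return graph


def is_dag(graph):
    """True if all transition targets exist and the graph has no cycle."""
    visiting, done = set(), set()

    def visit(name):
        if name not in graph:
            return False  # transition to an undefined node
        if name in done:
            return True
        if name in visiting:
            return False  # we came back to a node on the current path
        visiting.add(name)
        ok = all(visit(target) for target in graph[name])
        visiting.discard(name)
        done.add(name)
        return ok

    return all(visit(name) for name in graph)


graph = build_graph(WORKFLOW_XML)
print(graph)
print(is_dag(graph))
```

A workflow whose action pointed back at itself or at a node that is not defined would make is_dag return False.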

Three components are required to run a simple MapReduce workflow

1. Properties file: job.properties
Sample job.properties (adjust the nameNode and jobTracker host and port to match your cluster)

nameNode=hdfs://localhost.localdomain:50070
jobTracker=localhost.localdomain:50030
queueName=default
examplesRoot=map-reduce
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}
inputDir=input-data
outputDir=map-reduce

2. workflow.xml: this file defines the workflow for the particular job as a set of actions
Sample workflow.xml

<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.2">
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <!-- on failure, go to the kill node so the job is marked as killed -->
        <error to='kill'/>
    </action>
    <kill name='kill'>
        <message>${wf:errorCode("wordcount")}</message>
    </kill>
    <end name='end'/>
</workflow-app>

3. Libraries
lib/: this is a directory in HDFS that contains the libraries used in the workflow, for example .jar and shared object (.so) files. At run time, the Oozie server picks up the contents of this directory and deploys them on the actual compute nodes using the Hadoop distributed cache. When submitting a job from the local system, this lib directory has to be copied manually to HDFS before the workflow can be run
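Coming back to the job.properties sample, it may help to see what oozie.wf.application.path expands to once the ${...} references are filled in. The sketch below does a naive textual substitution over the sample values. It is an approximation under assumptions, not how Oozie itself evaluates properties: the helper names parse_properties and resolve are made up, and the user name cloudera is assumed here only because the submit command later in this post reads job.properties from /home/cloudera.

```python
import re

# The sample job.properties from above, copied as-is.
JOB_PROPERTIES = """\
nameNode=hdfs://localhost.localdomain:50070
jobTracker=localhost.localdomain:50030
queueName=default
examplesRoot=map-reduce
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}
inputDir=input-data
outputDir=map-reduce
"""

VARIABLE = re.compile(r"\$\{([^}]+)\}")


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def resolve(value, props, max_passes=10):
    """Replace ${name} references with values from props.

    Unknown names are left untouched. The number of passes is bounded
    so that a self-referencing property cannot loop forever.
    """
    for _ in range(max_passes):
        expanded = VARIABLE.sub(
            lambda m: props.get(m.group(1), m.group(0)), value
        )
        if expanded == value:
            break
        value = expanded
    return value


props = parse_properties(JOB_PROPERTIES)
props["user.name"] = "cloudera"  # assumption: the submitting user
print(resolve(props["oozie.wf.application.path"], props))
# prints hdfs://localhost.localdomain:50070/user/cloudera/map-reduce
```

The resolved path is the HDFS directory in which Oozie expects to find workflow.xml and lib/, so the workflow directory has to be uploaded to exactly that location.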

Commands to submit and execute a MapReduce program via Oozie

1. Create a directory to store all your workflow components (job.properties, workflow.xml and libraries). Inside this directory, create a lib directory

# mkdir mapred
# mkdir mapred/lib

Please note that this workflow directory and its contents are created on the local file system, and they must be copied to HDFS once they are ready

2. Create the files workflow.xml and job.properties as follows

# cd mapred
# vi workflow.xml

<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.2">
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <!-- on failure, go to the kill node so the job is marked as killed -->
        <error to='kill'/>
    </action>
    <kill name='kill'>
        <message>${wf:errorCode("wordcount")}</message>
    </kill>
    <end name='end'/>
</workflow-app>

# vi job.properties

nameNode=hdfs://localhost.localdomain:50070
jobTracker=localhost.localdomain:50030
queueName=default
examplesRoot=map-reduce
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}
inputDir=input-data
outputDir=map-reduce

5. Copy the wordcount.jar file into the workflow lib directory. The directory structure should then look like this

# cd mapred

job.properties
workflow.xml
lib/
lib/wordcount.jar

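6. Copy the workflow directory to HDFS. The HDFS destination must match oozie.wf.application.path in job.properties (here /user/${user.name}/map-reduce)

# hadoop fs -copyFromLocal mapred/ map-reduce

Please note that the job.properties file is always read from the local file system, so it does not need to be copied to HDFS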

7. Run the workflow using the command below

# oozie job -oozie http://localhost:5001/oozie -config /home/cloudera/mapred/job.properties -run

Once the workflow is submitted, the Oozie server returns the workflow ID, which can be used for monitoring and debugging

Notes on the parameters

-config specifies the location of your job.properties file, which is local to your machine

-oozie specifies the URL of the Oozie server
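For anyone who wants to script the submit-then-monitor step, here is a minimal sketch that builds the command lines and pulls the workflow ID out of the CLI output. It rests on a few assumptions: the commands use the single-dash -oozie form from the Oozie CLI documentation, the CLI is assumed to print the ID on a line of the form "job: <id>", the sample ID is invented, and the helper names are made up for this example. Nothing is executed against a real server here.

```python
import re
import shlex


def build_submit_command(oozie_url, properties_path):
    """Argument list for submitting and starting a workflow."""
    return ["oozie", "job", "-oozie", oozie_url,
            "-config", properties_path, "-run"]


def build_info_command(oozie_url, job_id):
    """Argument list for querying the status of a submitted workflow."""
    return ["oozie", "job", "-oozie", oozie_url, "-info", job_id]


def parse_job_id(cli_output):
    """Return the ID from a 'job: <id>' line, or None if there is none.

    Assumption: the submit command reports the new workflow this way.
    """
    match = re.search(r"^job:\s*(\S+)", cli_output, re.MULTILINE)
    return match.group(1) if match else None


OOZIE_URL = "http://localhost:5001/oozie"  # URL used in this post
submit = build_submit_command(OOZIE_URL, "/home/cloudera/mapred/job.properties")
print(shlex.join(submit))

# Invented output, standing in for what the server would return.
sample_output = "job: 0000001-150918120000000-oozie-oozi-W\n"
job_id = parse_job_id(sample_output)
print(shlex.join(build_info_command(OOZIE_URL, job_id)))
```

The argument lists can be passed to subprocess.run on a machine where the oozie client is installed; keeping them as lists avoids shell quoting problems with paths.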

Reference: https://cwiki.apache.org/confluence/display/OOZIE/Map+Reduce+Cookbook
