Apache Oozie Workflow scheduler for Hadoop
- Oozie is a workflow scheduler system for managing Apache Hadoop jobs.
- Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
- Oozie coordinator jobs are recurrent Oozie workflow jobs triggered by time and data availability.
The programs in a workflow depend on one another to achieve a common goal.
Oozie acts as a middleman between the user and Hadoop. The user submits a job to Oozie, and Oozie executes it on Hadoop via a launcher job and then returns the results.
The user has to design a workflow.xml containing a set of actions arranged in a DAG (Directed Acyclic Graph), as in the sample workflow.xml below.
Three components are required to run a simple MapReduce workflow:
1. Properties file: job.properties
Sample job.properties
# Adjust the hosts and ports to match your cluster's NameNode and JobTracker addresses
nameNode=hdfs://localhost.localdomain:50070
jobTracker=localhost.localdomain:50030
queueName=default
examplesRoot=map-reduce
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}
inputDir=input-data
outputDir=map-reduce
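To make the `${...}` references in job.properties more concrete, here is a minimal Python sketch of how such references expand. It is only an illustration, not Oozie's own resolution logic, which runs inside Oozie itself. The helper names `parse_properties` and `resolve` are hypothetical, and the value `cloudera` for `user.name` is an assumption; in practice it is the name of the submitting user.

```python
import re

SAMPLE = """\
nameNode=hdfs://localhost.localdomain:50070
jobTracker=localhost.localdomain:50030
queueName=default
examplesRoot=map-reduce
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}
inputDir=input-data
outputDir=map-reduce
"""

_REF = re.compile(r"\$\{([^}]+)\}")


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def resolve(value, props, depth=0):
    """Expand ${name} references from props; unknown names are left untouched."""
    if depth > 10:
        raise ValueError("too many nested references")

    def substitute(match):
        name = match.group(1)
        if name not in props:
            return match.group(0)
        return resolve(props[name], props, depth + 1)

    return _REF.sub(substitute, value)


props = parse_properties(SAMPLE)
props["user.name"] = "cloudera"  # assumption: the submitting user's name
app_path = resolve(props["oozie.wf.application.path"], props)
print(app_path)
```

With these sample values the application path expands to a directory named map-reduce under the user's HDFS home directory, which is where Oozie will look for workflow.xml.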
2. workflow.xml: this file defines the workflow for a particular job as a set of actions.
Sample workflow.xml
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.2">
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <error to='end'/>
    </action>
    <kill name='kill'>
        <message>${wf:errorCode("wordcount")}</message>
    </kill>
    <end name='end'/>
</workflow-app>
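To illustrate what "actions arranged in a DAG" means for this file, the following Python sketch reads a trimmed copy of the sample workflow, lists each node's transitions, and checks that they contain no cycle. It is an illustration under stated assumptions: the helper names `local_name`, `transitions` and `has_cycle` are hypothetical, only start, action, kill and end nodes are handled, and this is not how Oozie validates workflows on submission.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample workflow (configuration properties omitted).
WORKFLOW = """\
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.2">
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to='end'/>
        <error to='end'/>
    </action>
    <kill name='kill'>
        <message>${wf:errorCode("wordcount")}</message>
    </kill>
    <end name='end'/>
</workflow-app>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def transitions(xml_text):
    """Map each node name to the list of node names it can move to."""
    root = ET.fromstring(xml_text)
    graph = {}
    for node in root:
        kind = local_name(node.tag)
        if kind == "start":
            graph["start"] = [node.get("to")]
        elif kind == "action":
            graph[node.get("name")] = [
                child.get("to")
                for child in node
                if local_name(child.tag) in ("ok", "error")
            ]
        elif kind in ("kill", "end"):
            graph[node.get("name")] = []
    return graph


def has_cycle(graph):
    """Depth-first search: revisiting a node on the current path is a cycle."""
    done, on_path = set(), set()

    def visit(name):
        if name in on_path:
            return True
        if name in done:
            return False
        on_path.add(name)
        cyclic = any(visit(nxt) for nxt in graph.get(name, []))
        on_path.discard(name)
        done.add(name)
        return cyclic

    return any(visit(name) for name in graph)


graph = transitions(WORKFLOW)
print(graph)
print("cycle:", has_cycle(graph))
```

For the sample, the start node leads to the wordcount action, and both its ok and error transitions lead to end, so the graph is acyclic.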
3. Libraries
lib/ is a directory in HDFS that contains the libraries used in the workflow, for example .jar and shared object (.so) files. At run time, the Oozie server picks up the contents of this directory and deploys them on the compute nodes using the Hadoop distributed cache. When submitting a job from the local system, this lib directory has to be copied manually to HDFS before the workflow can run.
Commands to submit and execute a MapReduce program via Oozie
1. Create a directory to store all your workflow components (job.properties, workflow.xml and libraries). Inside this directory, create a lib directory.
# mkdir mapred
# mkdir mapred/lib
Please note that this workflow directory and its contents are created on the local file system and must be copied to HDFS once they are ready.
2. Create the files workflow.xml and job.properties as follows:
# cd mapred
# vi workflow.xml
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.2">
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <error to='end'/>
    </action>
    <kill name='kill'>
        <message>${wf:errorCode("wordcount")}</message>
    </kill>
    <end name='end'/>
</workflow-app>
# vi job.properties
nameNode=hdfs://localhost.localdomain:50070
jobTracker=localhost.localdomain:50030
queueName=default
examplesRoot=map-reduce
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}
inputDir=input-data
outputDir=map-reduce
3. Copy the wordcount.jar file into the workflow lib directory. The directory structure should then look like this:
# cd mapred
job.properties
workflow.xml
lib/
lib/wordcount.jar
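Before copying anything to HDFS, it can help to confirm that the local directory really has this layout. The Python sketch below does that check; the function name `missing_components` is hypothetical, and the demo builds a throwaway directory with placeholder files rather than touching a real workflow.

```python
from pathlib import Path
import tempfile


def missing_components(workflow_dir):
    """Return the expected workflow components that are absent."""
    root = Path(workflow_dir)
    missing = []
    for name in ("job.properties", "workflow.xml"):
        if not (root / name).is_file():
            missing.append(name)
    lib = root / "lib"
    if not lib.is_dir():
        missing.append("lib/")
    elif not any(lib.glob("*.jar")):
        missing.append("lib/*.jar")
    return missing


# Demo on a temporary directory with placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    app = Path(tmp) / "mapred"
    (app / "lib").mkdir(parents=True)
    (app / "workflow.xml").write_text("<workflow-app/>")
    (app / "job.properties").write_text("queueName=default\n")
    before = missing_components(app)   # the jar has not been copied yet
    (app / "lib" / "wordcount.jar").write_bytes(b"")
    after = missing_components(app)    # layout is now complete

print(before)
print(after)
```

An empty list means the two files, the lib directory and at least one jar are in place.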
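4. Copy the workflow directory to HDFS:
# hadoop fs -copyFromLocal mapred/ mapred
Please note that the job.properties file is always read from the local file system and need not be copied to HDFS. The HDFS directory you copy to must be the one that oozie.wf.application.path points at; with the sample job.properties that path ends in ${examplesRoot}, so set examplesRoot=mapred for the two to match.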
5. Run the workflow using the command below:
# oozie job -oozie http://localhost:5001/oozie -config /home/cloudera/mapred/job.properties -run
Once the workflow is submitted, the Oozie server returns the workflow ID, which can be used for monitoring and debugging.
Notes on the parameters:
- The -config option specifies the location of your job.properties file, which is local to your machine.
- The -oozie option specifies the location of the Oozie server.
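The submit step and the follow-up status check can be sketched in Python as below. The sketch only builds the command lines and does not run them. It uses the `-info` sub-option of `oozie job` to query a job by ID. The helper names are hypothetical, the job ID in the demo is made up, and the assumption that the CLI reports the new ID on a line starting with `job:` should be checked against your Oozie version.

```python
import shlex


def submit_command(oozie_url, properties_path):
    """Build the argument list that submits and starts a workflow."""
    return ["oozie", "job", "-oozie", oozie_url,
            "-config", properties_path, "-run"]


def info_command(oozie_url, job_id):
    """Build the argument list that asks the server for a job's status."""
    return ["oozie", "job", "-oozie", oozie_url, "-info", job_id]


def parse_job_id(cli_output):
    """Extract the ID from a 'job: <id>' line (assumed output format)."""
    for line in cli_output.splitlines():
        if line.startswith("job:"):
            return line.split(":", 1)[1].strip()
    return None


OOZIE_URL = "http://localhost:5001/oozie"
submit = submit_command(OOZIE_URL, "/home/cloudera/mapred/job.properties")
print(shlex.join(submit))

# Hypothetical CLI output, used only to show the parsing step.
job_id = parse_job_id("job: 0000001-000000000000000-oozie-oozi-W\n")
print(shlex.join(info_command(OOZIE_URL, job_id)))
```

To actually run these, the lists could be passed to `subprocess.run` on a machine where the oozie client is installed.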
Reference: https://cwiki.apache.org/confluence/display/OOZIE/Map+Reduce+Cookbook