Monday, February 16, 2015

Hadoop ecosystem design in an organization and things to consider


Hadoop has become the de-facto platform for storing and processing large amounts of data and has found widespread applications. In the Hadoop ecosystem, you can store your data in one of the storage managers (for example, HDFS, HBase, Solr, etc.) and then use a processing framework to process the stored data. Hadoop first shipped with only one processing framework: MapReduce. Today, there are many other open source tools in the Hadoop ecosystem that can be used to process data in Hadoop; a few common tools include the following Apache projects: Hive, Pig, Spark, Cascading, Crunch, Tez, and Drill, along with Impala and Presto. Some of these frameworks are built on top of each other. For example, you can write queries in Hive that can run on MapReduce or Tez. Another example currently under development is the ability to run Hive queries on Spark.

Amidst all of these options, two key questions arise for Hadoop users:

Which processing frameworks are most commonly used?
How do I choose which framework(s) to use for my specific use case?
This post will help you answer both of these questions, giving you enough context to make an educated decision regarding the best processing framework for your specific use case.

Categories of processing frameworks

One can broadly classify processing frameworks in Hadoop into the following six categories:

General-purpose processing frameworks — These frameworks allow users to process data in Hadoop using a low-level API. Although these are all batch frameworks, they follow different programming models. Examples include MapReduce and Spark.
Abstraction frameworks — These frameworks allow users to process data using a higher level abstraction. These can be API-based — for example, Crunch and Cascading, or based on a custom DSL, such as Pig. These are typically built on top of a general-purpose processing framework.
SQL frameworks — These frameworks enable querying data in Hadoop using SQL. These can be built on top of a general-purpose framework, such as Hive, or as a stand-alone, special-purpose framework, such as Impala. Technically, SQL frameworks can be considered abstraction frameworks. However, given their high demand and slew of options available in this category, it makes sense to classify SQL frameworks as their own category.
Graph processing frameworks — These frameworks enable graph processing capabilities on Hadoop. They can be built on top of a general-purpose framework, such as Giraph, or as a stand-alone, special-purpose framework, such as GraphLab.
Machine learning frameworks — These frameworks enable machine learning analysis on Hadoop data. These can also be built on top of a general-purpose framework, such as MLlib (on Spark), or as a stand-alone, special-purpose framework, such as Oryx.
Real-time/streaming frameworks — These frameworks provide near real-time processing (several hundred milliseconds to few seconds latency) for data in the Hadoop ecosystem. They can be built on top of a generic framework, such as Spark Streaming (on Spark), or as a stand-alone, special-purpose framework, such as Storm.
The diagram below organizes common processing frameworks in the Hadoop ecosystem by classifying them into the six categories.



As you can see, some of these frameworks build on top of a general-purpose processing framework, while others don’t. Examples of frameworks that do not build on top of a general-purpose framework include Impala, Drill, and GraphLab. We’ll use the term special-purpose frameworks to refer to them from here on.

Note that there is another way to distinguish processing frameworks: based on their architecture. Frameworks that have active components, like a server (e.g., Hive), can be considered engines, while others that do not have an active component can simply be considered libraries (e.g., MLlib). (This distinction, however, does not impact end users; users who need a solid machine learning framework usually don’t care whether it’s architecturally considered a library or an engine.)

Now comes the million dollar question: which framework(s) should you use?

The answer depends on two major factors:

Your use case
The expertise/experience present in your organization
To decide, you should first pick the category of framework(s) you need, and then choose from the particular frameworks available within those categories. The next section should help you decide which processing framework(s) to use.

When to use each processing framework

General-purpose processing frameworks: You always need a general-purpose framework for your cluster. This is because all of the other kinds of frameworks only solve a specific use case (e.g., graph processing, machine learning, etc.), and by themselves, they are not sufficient for handling the variety of processing needs likely at your organization. Moreover, many of the other frameworks rely on general-purpose frameworks. Even the special-purpose frameworks that don’t build upon general-purpose frameworks often rely on bits and pieces of them.

The common frameworks in this category are MapReduce, Spark, and Tez, and newer frameworks, such as Apache Flink, are now emerging. As of today, MapReduce is installed on almost every cluster. Other general-purpose frameworks rely on bits and pieces from the MapReduce stack, like Input/Output formats. You can still use other frameworks like Tez or Spark, though, without having MapReduce installed on your cluster.

So, the question is: which of the general-purpose processing frameworks should you use? MapReduce is the most mature; however, it is arguably the slowest. Spark and Tez are both DAG frameworks and don’t have the overhead of always running a Map followed by a Reduce job; both are more flexible than MapReduce. Spark is one of the most popular projects in the Hadoop ecosystem and has a lot of traction. Many regard it as the successor to MapReduce, and I encourage you to use Spark over MapReduce wherever possible.

Notably, MapReduce and Spark have different APIs; this means that, unless you are using an abstraction framework, if you migrate from MapReduce to Spark, you’ll have to rewrite your jobs in Spark. It’s also worth noting that even though Spark is a general-purpose engine with other abstraction frameworks built upon it, it also provides high-level processing APIs. So in this way, the Spark API can also be seen as an abstraction framework itself. Consequently, the amount of time and code required for writing a Spark job is usually much less than for writing an equivalent MapReduce job.

At this point, Tez is best suited as a framework to build abstraction frameworks, instead of building applications using its API.

The important thing to note is that just because you have a general-purpose processing framework installed on your cluster doesn’t mean you have to write all of your processing jobs using that framework’s API. In fact, it is recommended to use abstraction frameworks (e.g., Pig, Crunch, Cascading) or SQL frameworks (e.g., Hive and Impala) for writing processing jobs wherever possible (there are two exceptions to this rule, as discussed in the next section).

Abstraction and SQL frameworks: Abstraction frameworks (e.g., Pig, Crunch, and Cascading) and SQL frameworks (e.g., Hive and Impala) reduce the amount of time spent writing jobs directly for the general-purpose frameworks.

Abstraction frameworks: As shown in the above diagram, Pig is an abstraction framework that can run on MapReduce, Spark, or Tez. Apache Crunch provides a higher level API that can be used to run MapReduce or Spark jobs. Cascading is another API based abstraction framework that can run on MapReduce or Tez.
SQL frameworks: As far as SQL engines go, Hive can run on top of MapReduce or Tez, and work is being done to make Hive run on Spark. There are several special-purpose SQL engines aimed at faster SQL, including Impala, Presto, and Apache Drill.
Key points on the benefits of using an abstraction or SQL framework:

You can save a lot of time by not having to implement common processing tasks using the low-level APIs of general-purpose frameworks.
You can change underlying general-purpose processing frameworks (as needed and applicable). Coding directly on the framework means you would have to re-write your jobs if you decided to change frameworks. Using an abstraction or SQL framework that builds upon a generic framework abstracts that away.
Running a job on an abstraction or SQL framework adds only a small percentage of overhead compared with an equivalent job written directly in the general-purpose framework. Also, running a query on a special-purpose processing framework (e.g., Impala or Presto for SQL) is much faster than running an equivalent MapReduce job, because these frameworks use a completely different execution model, built for running fast SQL queries.
Two exceptions where you should use a general-purpose framework:

If you have certain information about the data (i.e., metadata) that can’t be expressed and taken advantage of in an abstraction or SQL framework. For example, suppose your data set is partitioned or sorted in a particular way that you cannot express when creating a logical data set in an abstraction or SQL framework, yet making use of that partitioning/sorting metadata in your job can speed up the processing. In such a case, it makes sense to program directly against the low-level API of a general-purpose processing framework: for a job that runs over and over again, the time saved at run time more than pays for the extra development time.
If your use case is particularly suited to a general-purpose framework. This is usually a small percentage of use cases where the analysis is very complex and can’t be easily expressed in a DSL like SQL or Pig Latin. In these cases, Crunch and Cascading should be considered, but oftentimes you might just have to directly program using a general-purpose processing framework.
Once you have decided on using an abstraction or SQL framework, which particular framework you use usually depends on the expertise and experience you have in-house.

Graph, machine learning, and real-time/streaming frameworks

There is usually no need to convince users to adopt graph, machine learning, and real-time/streaming frameworks. If a specific use case is important to you, you will likely need to use a framework that solves that use case.

Graph frameworks

Giraph, GraphX, and GraphLab are popular graph processing frameworks.

Apache Giraph is a library that runs on top of MapReduce.
GraphX is a library for graph processing on Spark.
GraphLab was a stand-alone, special-purpose graph processing framework that can now also handle tabular data.
Machine-learning frameworks

Mahout, MLlib, Oryx, and H2O are commonly used machine learning frameworks.

Mahout is a library on top of MapReduce, although there are plans to make Mahout work on Spark.
MLlib is a machine learning library for Spark.
Oryx and H2O are stand-alone, special-purpose machine learning engines.
Real-time/streaming frameworks

For near real-time analysis of data, Spark Streaming and Storm + Trident are commonly used frameworks.

Spark Streaming is a library for doing micro-batch streaming analysis, built on top of Spark.
Apache Storm is a special-purpose, distributed, real-time computation engine with Trident used as an abstraction engine on top of it.
Conclusion

The Hadoop ecosystem has evolved to the point where using MapReduce is no longer the only way to query data in Hadoop. With the breadth of options now available, it can be tough to choose which framework to use for processing your Hadoop data.

Most users adopt more than one framework for processing their Hadoop data, and this makes having resource management in your Hadoop cluster extremely important. A common data pipeline begins with ingestion; followed by ETL, which is done by a general-purpose engine, an abstraction engine, or a combination thereof; followed by one of the many special-purpose engines for doing low-latency SQL, machine learning, or graph processing.

I hope this post helps when you’re deciding which processing framework(s) to use. Happy Hadooping!

openstack and hadoop

http://web.stackiq.com/blog/adding-value-not-complexity?utm_campaign=Blog&utm_source=hs_email&utm_medium=email&utm_content=15481429&_hsenc=p2ANqtz-8LgLz5OTKOeVpM287Ob_Muec9nPPCWc_nJZq_iD4AaGzipXeFC7g56RqEwRK3llJFwMWcLnGidza-DR_x1Zq0B5tSz6rbp4pfBp6B3j92-RzLSsGk&_hsmi=15481429

MapReduce Program using Eclipse and Hadoop 2.6.0

Prerequisites:

- A single-node Hadoop installation should be up and running

If yours is not ready yet, please follow my previous post on single-node Hadoop setup

Hadoop Single Node Setup

- Download and install Eclipse from the location below if it is not already installed

wget "http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/kepler/SR2/eclipse-jee-kepler-SR2-linux-gtk-x86_64.tar.gz"

bhupendra@ubuntu:/home/hduser/eclipse$ ./eclipse
Step 1:
Start Eclipse and create a new Java project as shown below


Step 2: Create the WordCount driver class


Step 3: Replace the autogenerated contents of WordCount.java with the code below

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class WordCount extends Configured implements Tool{
      public int run(String[] args) throws Exception
      {
            //creating a JobConf object and assigning a job name for identification purposes
            JobConf conf = new JobConf(getConf(), WordCount.class);
            conf.setJobName("WordCount");

            //Setting configuration object with the Data Type of output Key and Value
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            //Providing the mapper and reducer class names
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);
            //We will pass 2 arguments at run time: the input path and the output path
            Path inp = new Path(args[0]);
            Path out = new Path(args[1]);
            //the hdfs input and output directory to be fetched from the command line
            FileInputFormat.addInputPath(conf, inp);
            FileOutputFormat.setOutputPath(conf, out);

            JobClient.runJob(conf);
            return 0;
      }
   
      public static void main(String[] args) throws Exception
      {
            // this main function will call the run method defined above.
            int res = ToolRunner.run(new Configuration(), new WordCount(), args);
            System.exit(res);
      }
}
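The driver above reads args[0] and args[1] without checking them, so running the jar with a missing path ends in an ArrayIndexOutOfBoundsException. The following is a minimal plain-Java sketch of a guard that could be called at the top of run(); the class name DriverArgsSketch, the method validate, and the message texts are hypothetical and not part of the original code.

```java
import java.util.Optional;

// Hypothetical helper (not part of the original driver): checks the two
// command-line arguments before any job configuration is created.
class DriverArgsSketch {

    // Returns a problem description, or an empty Optional if the arguments look usable.
    static Optional<String> validate(String[] args) {
        if (args.length != 2) {
            return Optional.of("Usage: WordCount <input path> <output path>");
        }
        if (args[0].trim().isEmpty() || args[1].trim().isEmpty()) {
            return Optional.of("Input and output paths must not be empty");
        }
        return Optional.empty();
    }

    public static void main(String[] ignored) {
        // One argument is missing, so a usage message is reported.
        System.out.println(validate(new String[] {"input"}));
        // Both paths are present, so no problem is reported.
        System.out.println(validate(new String[] {"input", "output"}));
    }
}
```

Inside run(), a non-empty result could be printed to System.err and answered with a non-zero return value instead of submitting the job.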


Step 4: Create a new WordCountMapper class

Replace its contents with the code below

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
      //hadoop supported data types
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
   
      //map method that tokenizes each line and emits the initial key-value pairs;
      //after all lines are converted into key-value pairs, the reducer is called.
      public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
            //taking one line at a time from input file and tokenizing the same
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
       
            //iterating through all the words available in that line and forming the key-value pair
            while (tokenizer.hasMoreTokens())
            {
               word.set(tokenizer.nextToken());
               //sending to the output collector, which in turn passes the pair on to the reducer
               output.collect(word, one);
            }
       }
}
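One consequence of the mapper above is worth seeing in isolation: StringTokenizer with its default delimiters splits on whitespace only, so punctuation stays attached to a word and case is preserved. "Hadoop," and "hadoop" are therefore counted as different words. The plain-Java sketch below (no Hadoop classes) shows this; the normalize helper is a hypothetical cleanup step added for illustration and is not part of the original mapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Plain-Java illustration; TokenizerDemo and normalize are hypothetical names.
class TokenizerDemo {

    // Splits a line the same way WordCountMapper does: on whitespace only.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // Optional cleanup (an assumption, not in the original mapper): lower-case the
    // token and strip leading and trailing characters that are not letters or digits.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT)
                    .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
    }

    public static void main(String[] args) {
        for (String token : tokenize("Hadoop, hadoop and  HADOOP!")) {
            System.out.println(token + " -> " + normalize(token));
        }
    }
}
```

If such a cleanup is wanted in the real job, it could be applied to tokenizer.nextToken() before word.set(...); whether that is desirable depends on how you want words to be counted.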


Step 5: Create the WordCountReducer class
and replace its contents with the code below

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
      //reduce method accepts the key-value pairs from the mappers, aggregates them by key, and produces the final output
      public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
            int sum = 0;
            /*iterates through all the values available for a key, adds them together, and emits the
            key with the sum of its values as the final result*/
            while (values.hasNext())
            {
                 sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
      }
}
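To see how the mapper and reducer above fit together, the data flow can be imitated in a single process without any Hadoop classes. The sketch below is only an illustration under that assumption: the class WordCountSimulation and its three methods are hypothetical, and a sorted in-memory map stands in for the shuffle that Hadoop performs between the map and reduce phases.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Single-process imitation of the word count data flow; no Hadoop involved.
class WordCountSimulation {

    // "Map" phase: emit a (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                pairs.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
            }
        }
        return pairs;
    }

    // "Shuffle" phase: group all values by key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" phase: sum the values of each key, as WordCountReducer does.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "be quick");
        // prints {be=3, not=1, or=1, quick=1, to=2}
        System.out.println(reduce(shuffle(map(lines))));
    }
}
```

On a real cluster the map calls run in parallel across input splits and the grouping happens over the network, but the per-key contract seen by the reducer is the same as in this sketch.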

Step 6: If the above Java classes have no syntax errors, the corresponding class files will be generated automatically, as follows:



Step 7:
Additionally, before Step 6, we have to add dependencies by adding external libraries from Hadoop.
Follow the screenshots below and add the external JARs from these paths
(in my case, /usr/local/hadoop-2.6.0/share/hadoop/common and /usr/local/hadoop-2.6.0/share/hadoop/mapreduce):

{HADOOP_HOME}/share/hadoop/common
{HADOOP_HOME}/share/hadoop/common/lib
{HADOOP_HOME}/share/hadoop/mapreduce
{HADOOP_HOME}/share/hadoop/yarn
{HADOOP_HOME}/share/hadoop/hdfs


Step 8:
Now click on the Run menu and click Run Configurations. Click the New Configuration button and fill in the Name, Project Name, and Main Class as shown in the screenshots.






Step 9:
Now right-click on the project and select Export. Under Java, select Runnable JAR file.
In Launch configuration, select the config you created in Step 8 (WordCountConfig).
Select an export destination (let's say the desktop).
Under Library handling, select Extract required libraries into generated JAR and click Finish.




Step 10:

- Switch to hduser: $ sudo su hduser

- Remove the temp files so that all required Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager, and Secondary NameNode) start gracefully.


The temp file location is determined by the hadoop.tmp.dir property defined in the Hadoop configuration file core-site.xml.

Step 11: Format the NameNode using the command below

# hadoop namenode -format

The output will look something like the below.

Step 12: Start the daemons and check whether each required daemon has started gracefully. Refer to the screen and commands below.

Please note: if the temp files are still there and have not been removed, a few of the daemons will not start properly.

Step 13:
Make an HDFS directory (Note: HDFS directories live inside HDFS, not on the local file system, so they are not listed by a plain ls in the terminal and are not visible in the File Browser) -  hadoop dfs -mkdir -p /usr/local/hadoop-2.6.0/input
Copy the sample input text file into this HDFS directory -   hadoop dfs -copyFromLocal /home/bhupendra/workspace/sample1.txt /usr/local/hadoop-2.6.0/input
Change to the directory that contains the jar file and run the WordCount program. NOTE: Don't create the output directory yourself; Hadoop creates it, and the job fails if it already exists, so give a new output directory every time you run the example.
hadoop jar wordcount.jar /usr/local/hadoop-2.6.0/input /usr/local/hadoop-2.6.0/output
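Because every run needs an output directory that does not exist yet, one simple convention is to append a timestamp to a base path. The sketch below only builds the path string and prints the resulting command; the class OutputPathSketch, the method freshOutputPath, and the timestamp pattern are assumptions made for illustration, not something Hadoop requires.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: derives a new output directory name for each run.
class OutputPathSketch {

    // Pattern chosen for illustration; any suffix that is unique per run works.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");

    // Appends a timestamp to the base path so the directory does not exist yet.
    static String freshOutputPath(String base, LocalDateTime now) {
        return base + "-" + STAMP.format(now);
    }

    public static void main(String[] args) {
        String out = freshOutputPath("/usr/local/hadoop-2.6.0/output", LocalDateTime.now());
        System.out.println("hadoop jar wordcount.jar /usr/local/hadoop-2.6.0/input " + out);
    }
}
```

Two runs started within the same second would still collide, so treat this as a convenience for manual runs rather than a guarantee.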

<< I will fix the above issue later; it prevented me from running the hadoop commands that create the directory and copy the file from the local file system to the HDFS directory "input".
To run the program, I created the input directory with the usual mkdir command and copied the file with the cp command. >>

Error:
hadoop fs -ls
15/01/30 17:03:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ls: `.': No such file or directory
Fix 1
The error ls: `.': No such file or directory appears because there is no home directory on HDFS for your current user. Try
hadoop fs -mkdir -p /user/[current login user]
Then hadoop fs -ls will work.
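The reason behind this fix is that a relative HDFS path (including the implicit "." used by a bare hadoop fs -ls) is resolved against the user's HDFS home directory, which by default is /user/ followed by the user name. The plain-Java sketch below mimics that resolution rule for illustration only; HdfsPathSketch, homeDirectory, and resolve are hypothetical names and not Hadoop APIs, and the /user prefix is the default convention assumed here.

```java
// Hypothetical illustration of how relative HDFS paths map to absolute ones.
class HdfsPathSketch {

    // Default HDFS home directory convention: /user/<username>.
    static String homeDirectory(String user) {
        return "/user/" + user;
    }

    // Absolute paths are kept as they are; relative paths are placed under the home directory.
    static String resolve(String user, String path) {
        if (path.startsWith("/")) {
            return path;
        }
        String home = homeDirectory(user);
        if (path.isEmpty() || path.equals(".")) {
            return home;
        }
        return home + "/" + path;
    }

    public static void main(String[] args) {
        System.out.println(resolve("hduser", "."));      // /user/hduser
        System.out.println(resolve("hduser", "input"));  // /user/hduser/input
        System.out.println(resolve("hduser", "/tmp"));   // /tmp
    }
}
```

So until /user/hduser exists, any command that relies on a relative path has nothing to resolve against, which is what the error message reports.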
Fix 2
Go to the Hadoop conf path
hduser@ubuntu:/usr/local/hadoop-2.6.0/etc/hadoop
Open hadoop-env.sh (vi hadoop-env.sh) and add the following lines

export HADOOP_PREFIX=/usr/local/hadoop-2.6.0
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop

< Please note that environment variables set in hadoop-env.sh override those set in the .bashrc file. Hence it is mandatory to add the above lines to hadoop-env.sh. >

Step 14:
Run the job using the command below
hadoop jar WordCount.jar /usr/local/hadoop-2.6.0/input /usr/local/hadoop-2.6.0/output


Step 15:
Browse the Hadoop GUI

http://localhost:50070/dfshealth.html#tab-overview

Step 16:
Browse the output file 
http://localhost:50070/explorer.html#/usr/local/hadoop-2.6.0/output

Step 17:
Stop all daemons when you are done with the job
http://localhost:8088/cluster

hduser@ubuntu:/usr/local/hadoop-2.6.0/etc/hadoop$ hadoop fs -ls hdfs://localhost:54310
Found 1 items
drwxr-xr-x   - hduser supergroup          0 2015-07-17 10:42 hdfs://localhost:54310/user/hduser/input
hduser@ubuntu:/usr/local/hadoop-2.6.0/etc/hadoop$ 

Hadoop Single Node Setup

System requirement:


1. mkdir /usr/local/hadoop-2.6.0

2. cd /usr/local/hadoop-2.6.0

3. wget http://mirror.metrocast.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz

4. tar -xzvf hadoop-2.6.0.tar.gz

5. Add a new group and user
 
     $ groupadd hadoop
     $ useradd -g hadoop hduser
To change a user's primary group:
     $ usermod -g primarygrpname username
To change a user's secondary group:
     $ usermod -G secondarygrpname username

6. Install ssh-server
  $ apt-get install openssh-server

7. Generate an SSH key
$ su - hduser
$ ssh-keygen -t rsa
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ ssh hduser@localhost






  • Disabling IPv6
  • Open config file: sudo gedit /etc/sysctl.conf
  • Add these 3 settings at the end of the file:
        #disable ipv6
        net.ipv6.conf.all.disable_ipv6 = 1
        net.ipv6.conf.default.disable_ipv6 = 1
        net.ipv6.conf.lo.disable_ipv6 = 1
  • After adding these lines, reload the kernel settings with sudo sysctl -p

    • Configuring the Hadoop configuration files
            Change directory using cd /usr/local/hadoop-2.6.0/etc/hadoop
       
        $ vi hdfs-site.xml
         
    <configuration>
    <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
    </property>
    </configuration>
    • Create directory as below
    • Format file system 
    • cd /usr/local/hadoop-2.6.0/bin
    • ./hadoop namenode -format

    • Go to sbin and start all daemons
    • cd  /usr/local/hadoop-2.6.0/sbin
    • $ ./start-all.sh
    • To check whether all daemons are running
    • $ jps

    If any daemon doesn't start, start them manually
        hadoop-daemon.sh start namenode
        hadoop-daemon.sh start datanode
        yarn-daemon.sh start resourcemanager
        yarn-daemon.sh start nodemanager
        mr-jobhistory-daemon.sh start historyserver

        Hadoop Web Interfaces.
            Namenode - http://localhost:50070/
            Secondary Namenode - http://localhost:50090
            Most important is jps. Use jps to check which daemons are running.


    $chown -R hduser:hadoop /usr/local/hadoop-2.6.0
    $chmod +x -R /usr/local/hadoop-2.6.0
    Setting Global Variable
    $ vi /home/hduser/.bashrc
    export HADOOP_PREFIX=/usr/local/hadoop-2.6.0
    export HADOOP_HOME=/usr/local/hadoop-2.6.0
    export HADOOP_MAPRED_HOME=${HADOOP_HOME}
    export HADOOP_COMMON_HOME=${HADOOP_HOME}
    export HADOOP_HDFS_HOME=${HADOOP_HOME}
    export YARN_HOME=${HADOOP_HOME}
    export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
    # Native Path
    export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
    #Java path
    export JAVA_HOME='/usr/lib/jvm/java-7-oracle'
    # Add Hadoop bin/ directory to PATH

    export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin

    $vi /home/hduser/.profile
    export HADOOP_PREFIX=/usr/local/hadoop-2.6.0
    export HADOOP_HOME=/usr/local/hadoop-2.6.0
    export HADOOP_MAPRED_HOME=${HADOOP_HOME}
    export HADOOP_COMMON_HOME=${HADOOP_HOME}
    export HADOOP_HDFS_HOME=${HADOOP_HOME}
    export YARN_HOME=${HADOOP_HOME}
    export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
    # Native Path
    export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
    #Java path
    export JAVA_HOME='/usr/lib/jvm/java-7-oracle'
    # Add Hadoop bin/ directory to PATH

    export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin

    $ vi /usr/local/hadoop-2.6.0/etc/hadoop/hadoop-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle



    $ vi yarn-site.xml
    <configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    </property>
    <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    </configuration>

    $vi core-site.xml
    <configuration>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/usr/local/hadoop-2.6.0/tmp</value>
      <description>A base for other temporary directories.</description>
    </property>

    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
      <description>The name of the default file system.  A URI whose
      scheme and authority determine the FileSystem implementation.  The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class.  The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
    </property>

    </configuration>
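The description of fs.default.name above says that the URI's scheme selects the FileSystem implementation property (fs.SCHEME.impl) and that its authority supplies the host and port. As a rough illustration, the plain-Java sketch below takes the configured value apart with java.net.URI; the class DefaultFsUri and the method implProperty are hypothetical names, and no Hadoop classes are used.

```java
import java.net.URI;

// Illustration only: splits the configured default file system URI into its parts.
class DefaultFsUri {

    // Builds the property name that selects the FileSystem implementation for the scheme.
    static String implProperty(URI uri) {
        return "fs." + uri.getScheme() + ".impl";
    }

    public static void main(String[] args) {
        URI uri = URI.create("hdfs://localhost:54310");
        System.out.println("scheme    = " + uri.getScheme());     // hdfs
        System.out.println("authority = " + uri.getAuthority());  // localhost:54310
        System.out.println("host      = " + uri.getHost());       // localhost
        System.out.println("port      = " + uri.getPort());       // 54310
        System.out.println("property  = " + implProperty(uri));   // fs.hdfs.impl
    }
}
```

This is also why the listing command shown earlier can address the cluster as hdfs://localhost:54310: the same host and port are read from this value.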

    $  vi mapred-site.xml
    <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs
      at.  If "local", then jobs are run in-process as a single map
      and reduce task.
      </description>
    </property>

    </configuration>

    $ vi hdfs-site.xml
    mkdir -p $HADOOP_HOME/yarn_data/hdfs/datanode
    mkdir -p $HADOOP_HOME/yarn_data/hdfs/namenode

    Offline Image Viewer Guide

    -rw-r--r-- 1 hduser hadoop  100722 Oct  7 20:49 fsimage_0000000000000008804
    -rw-r--r-- 1 hduser hadoop      62 Oct  7 20:49 fsimage_0000000000000008804.md5
    drwxrwxr-x 3 hduser hduser    4096 Oct  8 22:49 ..
    -rw-r--r-- 1 hduser hadoop  100722 Oct  8 22:49 fsimage_0000000000000008805
    -rw-r--r-- 1 hduser hadoop      62 Oct  8 22:49 fsimage_0000000000000008805.md5
    -rw-rw-r-- 1 hduser hduser     202 Oct  8 22:49 VERSION
    -rw-r--r-- 1 hduser hadoop       5 Oct  8 22:49 seen_txid
    -rw-r--r-- 1 hduser hadoop 1048576 Oct  8 22:49 edits_inprogress_0000000000000008806
    drwxrwxr-x 2 hduser hduser   12288 Oct  8 22:49 .
    hduser@ubuntu:/usr/local/hadoop-2.6.0/tmp/dfs/name/current$ cat fsimage_0000000000000008805.md5
    929bde84fb1432baba3228dc78b3b6d8 *fsimage_0000000000000008805
    hduser@ubuntu:/usr/local/hadoop-2.6.0/tmp/dfs/name/current$ hdfs oiv -i fsimage_0000000000000008805
    15/10/08 23:02:31 INFO offlineImageViewer.FSImageHandler: Loading 2 strings
    15/10/08 23:02:31 INFO offlineImageViewer.FSImageHandler: Loading 1273 inodes.
    15/10/08 23:02:31 INFO offlineImageViewer.FSImageHandler: Loading inode references
    15/10/08 23:02:31 INFO offlineImageViewer.FSImageHandler: Loaded 0 inode references
    15/10/08 23:02:31 INFO offlineImageViewer.FSImageHandler: Loading inode directory section
    15/10/08 23:02:31 INFO offlineImageViewer.FSImageHandler: Loaded 164 directories
    15/10/08 23:02:31 INFO offlineImageViewer.WebImageViewer: WebImageViewer started. Listening on /127.0.0.1:5978. Press Ctrl+C to stop the viewer.
    15/10/08 23:04:27 INFO offlineImageViewer.FSImageHandler: 200 method=GET op=GETFILESTATUS target=/user/hduser
    15/10/08 23:04:27 INFO offlineImageViewer.FSImageHandler: 200 method=GET op=LISTSTATUS target=/user/hduser
    15/10/08 23:04:51 INFO offlineImageViewer.FSImageHandler: 200 method=GET op=GETFILESTATUS target=/user/hduser
    15/10/08 23:04:51 INFO offlineImageViewer.FSImageHandler: 200 method=GET op=LISTSTATUS target=/user/hduser
    15/10/08 23:05:41 INFO offlineImageViewer.FSImageHandler: 200 method=GET op=GETFILESTATUS target=/user/hduser
    15/10/08 23:05:42 INFO offlineImageViewer.FSImageHandler: 200 method=GET op=LISTSTATUS target=/user/hduser
    15/10/08 23:05:42 INFO offlineImageViewer.FSImageHandler: 200 method=GET op=LISTSTATUS target=/user/hduser/input
    15/10/08 23:05:42 INFO offlineImageViewer.FSImageHandler: 200 method=GET op=LISTSTATUS target=/user/hduser/input1
    15/10/08 23:05:42 INFO offlineImageViewer.FSImageHandler: 200 method=GET op=LISTSTATUS target=/user/hduser/input2
    15/10/08 23:05:42 INFO offlineImageViewer.FSImageHandler: 200 method=GET op=LISTSTATUS target=/user/hduser/input3
    15/10/08 23:06:47 INFO offlineImageViewer.FSImageHandler: 200 method=GET op=LISTSTATUS target=/
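
The .md5 file printed in the session above holds a hex digest followed by the image file name. A minimal sketch of how such a digest could be checked in plain Java follows; it assumes the digest is a standard MD5 over the file's bytes, and the class Md5Check with its methods is hypothetical rather than part of Hadoop. The demo hashes a short in-memory string so that it runs without an fsimage at hand.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for comparing data against a line from a .md5 file.
class Md5Check {

    // Computes the MD5 digest of the data as a lower-case hex string.
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Extracts the digest from a line such as "<hex digest> *<file name>".
    static String expectedDigest(String md5Line) {
        return md5Line.trim().split("\\s+")[0];
    }

    // True if the data hashes to the digest recorded in the .md5 line.
    static boolean matches(byte[] data, String md5Line) throws NoSuchAlgorithmException {
        return md5Hex(data).equalsIgnoreCase(expectedDigest(md5Line));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        String md5Line = "5d41402abc4b2a76b9719d911017c592 *example.bin";
        System.out.println(md5Hex(data));
        System.out.println(matches(data, md5Line));
    }
}
```

For a real image, the bytes could be read with java.nio.file.Files.readAllBytes for small files; a large fsimage would be better fed to the MessageDigest in chunks.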