Monday, February 16, 2015

MapReduce Program using Eclipse and Hadoop 2.6.0

Prerequisite:

- Single-node Hadoop should be up and running

Please follow my previous blog on single-node Hadoop setup if it is not ready yet

Hadoop Single Node Setup

- Download and install Eclipse from the location below if it is not already installed

wget http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/kepler/SR2/eclipse-jee-kepler-SR2-linux-gtk-x86_64.tar.gz
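A minimal sketch of extracting the archive and launching Eclipse, assuming the download landed as eclipse-jee-kepler-SR2-linux-gtk-x86_64.tar.gz in the directory where Eclipse should live:

tar -xzf eclipse-jee-kepler-SR2-linux-gtk-x86_64.tar.gz
cd eclipse
./eclipse &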

bhupendra@ubuntu:/home/hduser/eclipse$ ./eclipse
Step 1:
Start Eclipse and create a new Java Project as shown below


Step 2: Write the WordCount driver class


Step 3: Replace the auto-generated code in WordCount.java with the code below

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class WordCount extends Configured implements Tool{
      public int run(String[] args) throws Exception
      {
            //creating a JobConf object and assigning a job name for identification purposes
            JobConf conf = new JobConf(getConf(), WordCount.class);
            conf.setJobName("WordCount");

            //Setting configuration object with the Data Type of output Key and Value
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            //Providing the mapper and reducer class names
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);
            //We will pass 2 arguments at run time: one is the input path and the other is the output path
            Path inp = new Path(args[0]);
            Path out = new Path(args[1]);
            //the hdfs input and output directory to be fetched from the command line
            FileInputFormat.addInputPath(conf, inp);
            FileOutputFormat.setOutputPath(conf, out);

            JobClient.runJob(conf);
            return 0;
      }
   
      public static void main(String[] args) throws Exception
      {
            // this main method invokes the run method defined above via ToolRunner
            int res = ToolRunner.run(new Configuration(), new WordCount(), args);
            System.exit(res);
      }
}
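Because the driver goes through ToolRunner, generic Hadoop options can be passed on the command line before the program arguments and are stripped off before args reaches run(). A small sketch (the jar name and paths are placeholders matching the rest of this post):

hadoop jar WordCount.jar -D mapreduce.job.reduces=2 /usr/local/hadoop-2.6.0/input /usr/local/hadoop-2.6.0/output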


Step 4: Create a new WordCountMapper class

Replace its contents with the code below

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
      //hadoop supported data types
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
   
      //map method that tokenizes each input line and frames the initial key-value pairs;
      // after all lines are converted into key-value pairs, the reducer is called.
      public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
            //taking one line at a time from input file and tokenizing the same
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
       
          //iterating through all the words available in that line and forming the key value pair
            while (tokenizer.hasMoreTokens())
            {
               word.set(tokenizer.nextToken());
               //sending to the output collector, which in turn passes it to the reducer
                 output.collect(word, one);
            }
       }
}
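To make the data flow concrete, here is a small worked example (not from the post) of what this mapper emits for one input line:

input line:    to be or not to be
mapper output: (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)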


Step 5: Create a new WordCountReducer class
Replace its contents with the code below

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
      //reduce method accepts the key-value pairs from the mappers, aggregates the values per key and produces the final output
      public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
            int sum = 0;
            /*iterates through all the values available for a key, adds them together and emits the
            key along with the sum of its values*/
          while (values.hasNext())
          {
               sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
      }
}
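Continuing the same worked example, the framework groups the mapper output by key before calling reduce, so the reducer sees and emits:

reducer input:  (be,[1,1]) (not,[1]) (or,[1]) (to,[1,1])
reducer output: (be,2) (not,1) (or,1) (to,2)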

Step 6: If the above Java classes have no syntax errors, the corresponding class files will be generated automatically as follows:



Step 7:
Additionally, before step 6 we have to add the dependencies by adding the external Hadoop libraries to the project's build path.
Follow the screenshots below and add the external jars from these paths
( in my case /usr/local/hadoop-2.6.0/share/hadoop/common and /usr/local/hadoop-2.6.0/share/hadoop/mapreduce )

{HADOOP_HOME}/share/hadoop/common
{HADOOP_HOME}/share/hadoop/common/lib
{HADOOP_HOME}/share/hadoop/mapreduce
{HADOOP_HOME}/share/hadoop/yarn
{HADOOP_HOME}/share/hadoop/hdfs
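If you prefer to skip the Eclipse build-path dialogs, the same dependencies can be supplied on the command line via hadoop classpath. A rough sketch, assuming the three .java files from the steps above sit in the current directory:

mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes WordCount.java WordCountMapper.java WordCountReducer.java
jar cfe WordCount.jar WordCount -C classes .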


Step 8:
Now click on the Run menu and select Run Configurations. Click the New Configuration button and fill in the Name, Project Name and Main Class as per the screenshots






Step 9:
Now right-click the project and select Export. Under Java, select Runnable JAR file.
In Launch configuration, select the configuration you created in Step 8 (WordCountConfig).
Select an export destination (let's say the Desktop).
Under Library handling, select "Extract required libraries into generated JAR" and click Finish.




Step 10:

- Switch to the hduser user:  $ sudo su hduser

- Remove the temp files previously generated, so that all the required Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager, Secondary NameNode) start gracefully.


The temp file location is determined by the hadoop.tmp.dir property defined in the Hadoop configuration file core-site.xml.
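A minimal sketch of locating and clearing that directory, assuming hadoop.tmp.dir points somewhere like /app/hadoop/tmp (check your own core-site.xml first; clearing it wipes any existing HDFS data, which is why the NameNode is re-formatted in the next step):

grep -A 2 hadoop.tmp.dir /usr/local/hadoop-2.6.0/etc/hadoop/core-site.xml
rm -rf /app/hadoop/tmp/*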

Step 11: Format the NameNode using the command below

# hadoop namenode -format

The output will look something like below.

Step 12: Start the processes and check whether the required daemons have started gracefully. Refer to the screen and commands below.

Please note: if the temp files are still there and were not removed, some of the daemons will not start properly.
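For reference, on a single-node 2.6.0 setup the usual start commands and a jps check look something like this (paths assume the install location used in this post):

/usr/local/hadoop-2.6.0/sbin/start-dfs.sh
/usr/local/hadoop-2.6.0/sbin/start-yarn.sh
jps

If everything came up cleanly, jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager and Jps itself.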

Step 13:
Make an HDFS directory (note: these directories are not listed by ls in the terminal and are not visible in the File Browser, because they live in HDFS rather than the local filesystem):  hadoop dfs -mkdir -p /usr/local/hadoop-2.6.0/input
Copy the sample input text file into this HDFS directory:  hadoop dfs -copyFromLocal /home/bhupendra/workspace/sample1.txt /usr/local/hadoop-2.6.0/input
Change to the directory containing the exported jar and run the example WordCount program. NOTE: Don't create the output folder; it will be created by the job, and every time you run the example, give a new output directory.
hadoop jar wordcount.jar /usr/local/hadoop-2.6.0/input /usr/local/hadoop-2.6.0/output
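On Hadoop 2.x the hadoop dfs form prints a deprecation warning; the equivalent hdfs dfs commands, using the same paths as above, would be:

hdfs dfs -mkdir -p /usr/local/hadoop-2.6.0/input
hdfs dfs -copyFromLocal /home/bhupendra/workspace/sample1.txt /usr/local/hadoop-2.6.0/input
hdfs dfs -ls /usr/local/hadoop-2.6.0/input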

<< I will fix this issue later, as it prevents running the hadoop commands to create the directory and copy the file from local to the HDFS "input" directory.
To run the program, I created the input directory using the usual mkdir command and copied the file using the cp command. >>

Error:
hadoop fs -ls
15/01/30 17:03:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ls: `.': No such file or directory
Fix 1:
The problem regarding ls: `.': No such file or directory occurs because there is no home directory on HDFS for your current user. Try:
hadoop fs -mkdir -p /user/[current login user]
Then you will be able to run hadoop fs -ls
Fix 2:
Go to the Hadoop conf path
hduser@ubuntu:/usr/local/hadoop-2.6.0/etc/hadoop
vi hadoop-env.sh and add the following lines

export HADOOP_PREFIX=/usr/local/hadoop-2.6.0
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop

< Please note that the environment variables in hadoop-env.sh override the variables inside the .bashrc file. Hence it is mandatory to add the above lines to hadoop-env.sh. >

Step 14:
Run the job using the command below
hadoop jar WordCount.jar /usr/local/hadoop-2.6.0/input /usr/local/hadoop-2.6.0/output


Step 15:
Browse the Hadoop web UI

http://localhost:50070/dfshealth.html#tab-overview

Step 16:
Browse the output file 
http://localhost:50070/explorer.html#/usr/local/hadoop-2.6.0/output
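The same output can also be checked from the terminal; with the old mapred API the reducer output file is typically named part-00000 (a sketch using the output path from Step 14):

hdfs dfs -ls /usr/local/hadoop-2.6.0/output
hdfs dfs -cat /usr/local/hadoop-2.6.0/output/part-00000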

Step 17:
Stop all daemons when you are done with the job
http://localhost:8088/cluster
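The daemons can be stopped from the sbin directory (again assuming the install path used in this post):

/usr/local/hadoop-2.6.0/sbin/stop-yarn.sh
/usr/local/hadoop-2.6.0/sbin/stop-dfs.sh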

hduser@ubuntu:/usr/local/hadoop-2.6.0/etc/hadoop$ hadoop fs -ls hdfs://localhost:54310
Found 1 items
drwxr-xr-x   - hduser supergroup          0 2015-07-17 10:42 hdfs://localhost:54310/user/hduser/input
hduser@ubuntu:/usr/local/hadoop-2.6.0/etc/hadoop$ 
