Thursday, 25 July 2013

Hadoop file manipulations in Java

In order to get the FileStatus of a file in the Hadoop file system (using the Configuration, FileSystem, Path, and FileStatus classes from org.apache.hadoop.conf and org.apache.hadoop.fs):

            Configuration config = new Configuration();
            FileSystem hdfs = FileSystem.get(config);
            Path path = new Path(fileName);
            FileStatus fileStatus = hdfs.getFileStatus(path);


In order to get the last modification and access times of a file in the Hadoop file system:
 
            long modificationTime = fileStatus.getModificationTime();
            long accessTime = fileStatus.getAccessTime();
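
Both values are expressed in milliseconds since the Unix epoch; to print them in human-readable form, you can wrap them in a java.util.Date (a minimal sketch):

            System.out.println("Modified: " + new java.util.Date(modificationTime));
            System.out.println("Accessed: " + new java.util.Date(accessTime));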
            


In order to get the replication factor and block size of a file in the Hadoop file system:
            short replica = fileStatus.getReplication();
            long blockSize = fileStatus.getBlockSize();
            


In order to get the group and owner of a file in the Hadoop file system:
            String group = fileStatus.getGroup();
            String owner = fileStatus.getOwner();


List all files in a directory:
            if (fileStatus.isDir()) {
                FileStatus[] status = hdfs.listStatus(path);
                for (int i = 0; i < status.length; i++) {
                    Path cur = status[i].getPath();
                    System.out.println(cur.toUri().getPath());
                }
            } else {
                System.out.println(fileName + ": is not a directory");
            }
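
Putting the pieces together, here is a minimal runnable sketch; the class name HdfsFileInfo and the command-line fileName argument are illustrative, not part of any Hadoop API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileInfo {
    public static void main(String[] args) throws Exception {
        String fileName = args[0];  // HDFS path passed on the command line
        Configuration config = new Configuration();
        FileSystem hdfs = FileSystem.get(config);
        FileStatus fileStatus = hdfs.getFileStatus(new Path(fileName));

        System.out.println("Modified:    " + fileStatus.getModificationTime());
        System.out.println("Accessed:    " + fileStatus.getAccessTime());
        System.out.println("Replication: " + fileStatus.getReplication());
        System.out.println("Block size:  " + fileStatus.getBlockSize());
        System.out.println("Owner/group: " + fileStatus.getOwner() + "/" + fileStatus.getGroup());
    }
}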

Tuesday, 23 July 2013

List data nodes in a Hadoop cluster

Java code to list the data nodes in a Hadoop cluster (DistributedFileSystem lives in org.apache.hadoop.hdfs, DatanodeInfo in org.apache.hadoop.hdfs.protocol):

    Configuration conf = new Configuration();
    try {
        FileSystem fs = FileSystem.get(conf);
        DistributedFileSystem hdfs = (DistributedFileSystem) fs;
        DatanodeInfo[] dataNodeStats = hdfs.getDataNodeStats();

        // Write one data node hostname per line to a local file
        BufferedWriter buffWrite = new BufferedWriter(new FileWriter(file));
        for (int i = 0; i < dataNodeStats.length; i++) {
            buffWrite.write(dataNodeStats[i].getHost());
            buffWrite.newLine();
        }
        buffWrite.close();
    } catch (IOException e) {
        logger.error("IOException when writing the node list to file \t" + e.toString());
    }
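
The same information is available from the command line; the dfsadmin report prints the status of every data node in the cluster:

    $ hadoop dfsadmin -report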

Monday, 22 July 2013

Set replication factor in HDFS

In the command line:
For existing file in HDFS:
To set replication of an individual file to 4:
     hadoop fs -setrep -w 4 /hdfs/path/tofile

You can also do this recursively for a directory.
 To set replication for a directory to 1:
       hadoop fs -setrep -R -w 1 /hdfs/path/toDirectory
 To change the replication of the entire HDFS to 2:
    hadoop fs -setrep -R -w 2 /
 To copy a new file into HDFS with replication 2:
       hadoop fs -D dfs.replication=2 -copyFromLocal /local/path/tofile /hdfs/path/tofile

In a Java program, for a file:

 Configuration conf = new Configuration();
 FileSystem fs = FileSystem.get(conf);
 Path hdfsPath = new Path("/hdfs/hdfsFile");
 short replication = 2;
 fs.setReplication(hdfsPath, replication);

setReplication() returns true if successful, and false if the file does not exist or is a directory.
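
To confirm that the change took effect, you can read the factor back through getFileStatus(); a small follow-up sketch using the same fs and hdfsPath:

 short current = fs.getFileStatus(hdfsPath).getReplication();
 System.out.println("Replication is now: " + current);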

Saturday, 20 July 2013

Hadoop MapReduce new API

The Hadoop MapReduce API changed starting with Hadoop 0.20.x.

Older API: org.apache.hadoop.mapred
Newer API: org.apache.hadoop.mapreduce

The classes in the org.apache.hadoop.mapred package have been deprecated.
The current MapReduce tutorial on the Apache Hadoop page is still written for the old API.
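
The most visible difference is the Mapper signature: the old API passes results through an OutputCollector (plus a Reporter), while the new API bundles both into a single Context. A sketch of the contrast, for a word-count-style mapper:

// Old API (org.apache.hadoop.mapred): Mapper is an interface
public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    output.collect(word, one);
}

// New API (org.apache.hadoop.mapreduce): Mapper is a class you extend
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    context.write(word, one);
}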

WordCount program with the new API

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
/**
 * The map class of WordCount.
 */
public static class TokenCounterMapper extends Mapper<Object, Text, Text, IntWritable> {
       
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
/**
 * The reducer class of WordCount
 */
public static class TokenCounterReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
/**
 * The main entry point.
 */
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Example Hadoop 0.20.1 WordCount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenCounterMapper.class);
    job.setReducerClass(TokenCounterReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}


Compile WordCount.java:
$ mkdir wordcount_classes
$ javac -cp classpath -d wordcount_classes WordCount.java

where classpath is:
  CDH4:          /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*
  Apache Hadoop: ${HADOOP_HOME}/hadoop-core-1.1.2.jar

Create a JAR:
$ jar -cvf wordcount.jar -C wordcount_classes/ .



Execute the program:
$ hadoop jar wordcount.jar WordCount /wordcount/input /wordcount/output

Type mismatch in key/value from map

If the mapper emits key/value types that do not match what the job is configured to expect, Hadoop fails at runtime with an error like "Type mismatch in key from map: expected ..., received ...". By default, the map output classes are assumed to be the same as the job's final output classes (those set with setOutputKeyClass/setOutputValueClass); when they differ, declare the map output types explicitly, as shown below.
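
In the WordCount driver above, the explicit declarations would be:

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);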