Thursday, 25 July 2013

Hadoop file manipulations in Java

In order to get the FileStatus of a file in the Hadoop file system (using the Configuration, FileSystem, Path, and FileStatus classes from org.apache.hadoop.conf and org.apache.hadoop.fs):

            Configuration config = new Configuration();
            FileSystem hdfs = FileSystem.get(config);
            Path path = new Path(fileName);
            FileStatus fileStatus = hdfs.getFileStatus(path);


In order to get the last modification and access times of a file in the Hadoop file system:
 
            long modificationTime = fileStatus.getModificationTime();
            long accessTime = fileStatus.getAccessTime();
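
Both values are expressed in milliseconds since the Unix epoch; to print them in human-readable form, you can wrap them in a java.util.Date (a minimal sketch):

            System.out.println("Modified: " + new java.util.Date(modificationTime));
            System.out.println("Accessed: " + new java.util.Date(accessTime));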
            


In order to get the replication factor and block size of a file in the Hadoop file system:
            short replica = fileStatus.getReplication();
            long blockSize = fileStatus.getBlockSize();
            


In order to get the group and owner of a file in the Hadoop file system:
            String group = fileStatus.getGroup();
            String owner = fileStatus.getOwner();


List all files in a directory:
            if (fileStatus.isDir()) {
                FileStatus[] status = hdfs.listStatus(path);
                for (int i = 0; i < status.length; i++) {
                    Path cur = status[i].getPath();
                    System.out.println(cur.toUri().getPath());
                }
            } else {
                System.out.println(fileName + ": is not a directory");
            }
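
Putting the pieces together, here is a minimal runnable sketch; the class name HdfsFileInfo and the command-line fileName argument are illustrative, not part of any Hadoop API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileInfo {
    public static void main(String[] args) throws Exception {
        String fileName = args[0];  // HDFS path passed on the command line
        Configuration config = new Configuration();
        FileSystem hdfs = FileSystem.get(config);
        FileStatus fileStatus = hdfs.getFileStatus(new Path(fileName));

        System.out.println("Modified:    " + fileStatus.getModificationTime());
        System.out.println("Accessed:    " + fileStatus.getAccessTime());
        System.out.println("Replication: " + fileStatus.getReplication());
        System.out.println("Block size:  " + fileStatus.getBlockSize());
        System.out.println("Owner/group: " + fileStatus.getOwner() + "/" + fileStatus.getGroup());
    }
}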

Tuesday, 23 July 2013

List data nodes in a Hadoop cluster

Java code to list the data nodes in a Hadoop cluster (DistributedFileSystem lives in org.apache.hadoop.hdfs, DatanodeInfo in org.apache.hadoop.hdfs.protocol):

    Configuration conf = new Configuration();
    try {
        FileSystem fs = FileSystem.get(conf);
        DistributedFileSystem hdfs = (DistributedFileSystem) fs;
        DatanodeInfo[] dataNodeStats = hdfs.getDataNodeStats();

        // Write one data node hostname per line to a local file
        BufferedWriter buffWrite = new BufferedWriter(new FileWriter(file));
        for (int i = 0; i < dataNodeStats.length; i++) {
            buffWrite.write(dataNodeStats[i].getHost());
            buffWrite.newLine();
        }
        buffWrite.close();
    } catch (IOException e) {
        logger.error("IOException when writing the node list to file \t" + e.toString());
    }
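
The same information is available from the command line; the dfsadmin report prints the status of every data node in the cluster:

    $ hadoop dfsadmin -report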

Monday, 22 July 2013

Set replication factor in HDFS

In the command line:
For existing file in HDFS:
To set replication of an individual file to 4:
     hadoop fs -setrep -w 4 /hdfs/path/tofile

You can also do this recursively for a directory.
 To set replication for a directory to 1:
       hadoop fs -setrep -R -w 1 /hdfs/path/toDirectory
 To change the replication of the entire HDFS to 2:
    hadoop fs -setrep -R -w 2 /
 To copy a new file into HDFS with replication 2:
       hadoop fs -D dfs.replication=2 -copyFromLocal /local/path/tofile /hdfs/path/tofile

In a Java program, for a file:

 Configuration conf = new Configuration();
 FileSystem fs = FileSystem.get(conf);
 Path hdfsPath = new Path("/hdfs/hdfsFile");
 short replication = 2;
 fs.setReplication(hdfsPath, replication);

setReplication() returns true if successful, and false if the file does not exist or is a directory.
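
To confirm that the change took effect, you can read the factor back through getFileStatus(); a small follow-up sketch using the same fs and hdfsPath:

 short current = fs.getFileStatus(hdfsPath).getReplication();
 System.out.println("Replication is now: " + current);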

Saturday, 20 July 2013

Hadoop MapReduce new API

The Hadoop MapReduce API changed starting with Hadoop 0.20.x.

Older API: org.apache.hadoop.mapred
Newer API: org.apache.hadoop.mapreduce

The classes in the org.apache.hadoop.mapred package have been deprecated.
The current MapReduce tutorial on the Apache Hadoop page is still written for the old API.
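
The most visible difference is the Mapper signature: the old API passes results through an OutputCollector (plus a Reporter), while the new API bundles both into a single Context. A sketch of the contrast, for a word-count-style mapper:

// Old API (org.apache.hadoop.mapred): Mapper is an interface
public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    output.collect(word, one);
}

// New API (org.apache.hadoop.mapreduce): Mapper is a class you extend
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    context.write(word, one);
}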

WordCount program with the new API

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
/**
 * The map class of WordCount.
 */
public static class TokenCounterMapper extends Mapper<Object, Text, Text, IntWritable> {
       
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
/**
 * The reducer class of WordCount
 */
public static class TokenCounterReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
/**
 * The main entry point.
 */
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Example Hadoop 0.20.1 WordCount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenCounterMapper.class);
    job.setReducerClass(TokenCounterReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}


Compile WordCount.java:
$ mkdir wordcount_classes
$ javac -cp classpath -d wordcount_classes WordCount.java

where classpath is:
  CDH4:          /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*
  Apache Hadoop: ${HADOOP_HOME}/hadoop-core-1.1.2.jar

Create a JAR:
$ jar -cvf wordcount.jar -C wordcount_classes/ .



Execute the program:
$ hadoop jar wordcount.jar WordCount /wordcount/input /wordcount/output

Type mismatch in key/value from map

If the mapper emits key/value types that do not match what the job is configured to expect, Hadoop fails at runtime with an error like "Type mismatch in key from map: expected ..., received ...". By default, the map output classes are assumed to be the same as the job's final output classes (those set with setOutputKeyClass/setOutputValueClass); when they differ, declare the map output types explicitly, as shown below.
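
In the WordCount driver above, the explicit declarations would be:

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);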