
Hadoop - Programming and Installation


The previous chapters covered how Hadoop works and its underlying principles. This chapter uses wordcount, the classic word-counting case, to show how to develop a concrete Java program. The code is as follows:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SimpleWordCountMapper extends MapReduceBase implements
  Mapper<LongWritable, Text, Text, IntWritable> {

  // the (k, v) objects this mapper emits
  static final Text word = new Text();
  static final IntWritable one = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text documentContents,
    OutputCollector<Text, IntWritable> collector, Reporter reporter)
    throws IOException {
    String[] tokens = documentContents.toString().split("\\s+"); // split on whitespace
    for (String wordString : tokens) {
      if (wordString.length() > 0) {
        word.set(wordString.toLowerCase());
        collector.collect(word, one); // emit (word, 1)
      }
    }
  }
}

Note that Mapper<LongWritable, Text, Text, IntWritable> takes four type parameters: the input key and value types followed by the output key and value types. This is visible in the map method signature: LongWritable key and Text documentContents are the input, while OutputCollector<Text, IntWritable> collector receives the map output; the Reporter argument is used for progress reporting.

Next, the reduce code:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SimpleWordCountReducer
  extends MapReduceBase implements
  Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> counts,
    OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
    int count = 0;
    while (counts.hasNext()) {
      count += counts.next().get(); // sum this word's occurrences
    }
    output.collect(key, new IntWritable(count)); // emit (word, count)
  }
}

In the reduce code, the Reducer<Text, IntWritable, Text, IntWritable> parameters again give the input and output types: the input is the Text key with its Iterator<IntWritable> counts, and the output is emitted through OutputCollector<Text, IntWritable> output, with Reporter again handling progress reporting.
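
The classes above are only the map and reduce halves; to actually submit the job they need a driver. Below is a minimal sketch using the same old mapred API; the class name WordCountJob and the command-line argument layout are assumptions for illustration, not part of the original article:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver class (not shown in the original article)
public class WordCountJob {

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountJob.class);
    conf.setJobName("wordcount");

    conf.setMapperClass(SimpleWordCountMapper.class);
    conf.setReducerClass(SimpleWordCountReducer.class);

    // output (k, v) types of both map and reduce
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // input and output paths come from the command line
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}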

Unit testing Hadoop jobs

  The MRUnit framework is a test framework for Hadoop jobs. A Mapper test example:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class MapperTest {

    MapDriver<Text, Text, Text, Text> mapDriver;

    @Before
    public void setUp() {
        WordMapper mapper = new WordMapper();
        mapDriver = MapDriver.newMapDriver(mapper);
    }

    @Test
    public void testMapper() throws IOException {
        mapDriver.withInput(new Text("a"), new Text("ein"));
        mapDriver.withInput(new Text("a"), new Text("zwei"));
        mapDriver.withInput(new Text("c"), new Text("drei"));
        mapDriver.withOutput(new Text("a"), new Text("ein"));
        mapDriver.withOutput(new Text("a"), new Text("zwei"));
        mapDriver.withOutput(new Text("c"), new Text("drei"));
        mapDriver.runTest();
    }
}
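
The WordMapper class under test is not shown in the article. Judging from the expected outputs, it simply passes each (key, value) pair through unchanged; a minimal sketch under that assumption, written against the new mapreduce API to match the mrunit.mapreduce driver used above:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical identity mapper, inferred from the test expectations
public class WordMapper extends Mapper<Text, Text, Text, Text> {

  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(key, value); // pass the pair through unchanged
  }
}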

A Reducer test example:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class ReducerTest {

    ReduceDriver<Text, Text, Text, Text> reduceDriver;

    @Before
    public void setUp() {
        AllTranslationsReducer reducer = new AllTranslationsReducer();
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
    }

    @Test
    public void testReducer() throws IOException {
        List<Text> values = new ArrayList<Text>();
        values.add(new Text("ein"));
        values.add(new Text("zwei"));
        reduceDriver.withInput(new Text("a"), values);
        reduceDriver.withOutput(new Text("a"), new Text("|ein|zwei"));
        reduceDriver.runTest();
    }
}
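
AllTranslationsReducer is likewise not shown in the article. The expected output "|ein|zwei" suggests it concatenates all values for a key, each prefixed with '|'; a sketch under that assumption:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer, inferred from the test expectations
public class AllTranslationsReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    StringBuilder translations = new StringBuilder();
    for (Text translation : values) {
      translations.append("|").append(translation.toString()); // prefix each value with '|'
    }
    context.write(key, new Text(translations.toString()));
  }
}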

The Maven dependency configuration:

<dependency>
   <groupId>org.apache.mrunit</groupId>
   <artifactId>mrunit</artifactId>
   <version>1.0.0</version>
   <classifier>hadoop1</classifier>
   <scope>test</scope>
</dependency>

Run mvn clean test to compile the classes and execute both tests.

Packaging a Hadoop application with Maven

  Once our MapReduce Java classes are written, they must be packaged into a jar. Compiling and packaging requires the Hadoop dependencies, which Maven can resolve for us.

With Maven installed, run the following command to generate a skeleton project with an empty pom.xml:

mvn archetype:generate -DgroupId=com.jordiburgos -DartifactId=wordcount
 -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

Add the Hadoop dependencies:

<repositories>
 <repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
 </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.2.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.2.1</version>
    </dependency>

    <dependency>
        <groupId>jdk.tools</groupId>
        <artifactId>jdk.tools</artifactId>
        <version>1.7</version>
        <scope>system</scope>
        <systemPath>${java.home}/../lib/tools.jar</systemPath>
    </dependency>
   
</dependencies>

Next, build the jar to upload to Hadoop. This assembly descriptor puts the project jar at the root of the archive and its dependencies in the lib output directory:

<fileSets>
    <fileSet>
        <directory>${project.build.directory}</directory>
        <outputDirectory>/</outputDirectory>
        <includes>
            <include>*.jar</include>
        </includes>
    </fileSet>
</fileSets>

<dependencySets>
    <dependencySet>
        <useProjectArtifact>false</useProjectArtifact>
        <outputDirectory>/lib</outputDirectory>
    </dependencySet>
</dependencySets>
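
The article does not show how this assembly descriptor is attached to the build. The usual wiring is through the maven-assembly-plugin in pom.xml; a sketch assuming the descriptor above is saved as src/main/assembly/hadoop-job.xml (the path is an assumption):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <descriptors>
            <descriptor>src/main/assembly/hadoop-job.xml</descriptor>
        </descriptors>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>

With this in place, mvn package produces the job archive under target/.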


Simple Hadoop installation steps on Linux

First, download and install the Hadoop distribution (CDH3 in this example):
$ curl -O http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u6.tar.gz
$ cd /usr/local
$ tar xzvf /path/to/hadoop-0.20.2-cdh3u6.tar.gz
$ ln -s hadoop-0.20.2-cdh3u6 hadoop

Configuration (using the bundled pseudo-distributed example config):
$ cd hadoop
$ mv conf conf.original
$ cp -Rp example-confs/conf.pseudo conf

Add /usr/local/hadoop/bin to the PATH environment variable:
$ export PATH=/usr/local/hadoop/bin:$PATH

Format HDFS
First create the temporary data directory /var/lib/hadoop-0.20/ and open up permissions:
$ chmod 0777 -R /var/lib/hadoop-0.20/
$ chmod 0777 -R /usr/local/hadoop/
Log in as a regular (non-root) user, then run:
$ hadoop namenode -format

Output containing:
INFO common.Storage: Storage directory /var/lib/hadoop-0.20/cache/hadoop/dfs/name has been successfully formatted.
indicates the format succeeded.

Then start each daemon (run each in its own terminal):
$ hadoop namenode
$ hadoop datanode
$ hadoop jobtracker
$ hadoop tasktracker
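
With the four daemons running, the packaged job can be submitted. A hypothetical end-to-end run, assuming the assembled jar is named wordcount.jar and the driver is the WordCountJob sketched earlier (both names are assumptions):

$ hadoop fs -mkdir /input
$ hadoop fs -put document.txt /input
$ hadoop jar wordcount.jar WordCountJob /input /output
$ hadoop fs -cat /output/part-00000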
