Hadoop (2): Hadoop's Hello World (Installing and Using Standalone Mode)

This article is also published on my personal blog, liaosi's blog: Hadoop (2): Hadoop's Hello World (Installing and Using Standalone Mode)

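The examples in this article run on a VMware virtual machine with 64-bit CentOS 7, Hadoop 2.8.2, and JDK 1.8, logged in as the dedicated hadoop account (see Hadoop (1): Introduction to Hadoop and Pre-installation Preparation).
Before installing Hadoop, make sure that a Java JDK is installed on the system and that the Java environment variables are configured.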

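A Hadoop cluster can be started in one of three modes:

  • Standalone mode: once Hadoop is downloaded onto the system, it is configured by default to run in non-distributed mode as a single Java process. This mode is well suited to debugging.
  • Pseudo-distributed mode: Hadoop can also run on a single node in so-called pseudo-distributed mode. Here each Hadoop daemon (for HDFS, YARN, MapReduce, and so on) runs as a separate Java process. This mode is well suited to development.
  • Fully distributed mode: a truly distributed setup, which requires several independent servers forming a cluster.

This article walks through an example in standalone mode.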

I. Downloading and Configuring Hadoop

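1. Download Hadoop from the official site, http://hadoop.apache.org/rele..., and extract it into a directory on the server (here I am logged in as the hadoop user and extract it to ${HOME}/app).
[Screenshot: downloading and extracting Hadoop]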

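2. Set the Java installation directory in Hadoop's runtime environment configuration file
Edit ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh and set JAVA_HOME to the root directory of the Java installation.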
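
As a rough sketch of this step, the shell function below rewrites the JAVA_HOME line in hadoop-env.sh. The function name set_java_home and the JDK path in the usage comment are assumptions for illustration, not part of Hadoop, and sed -i is used in its GNU form (as found on CentOS). Adjust both to your own installation.

```shell
# Hypothetical helper: point JAVA_HOME in a hadoop-env.sh file at a given JDK directory.
# Usage (the JDK path is an assumption; use your real installation path):
#   set_java_home "$HADOOP_HOME/etc/hadoop/hadoop-env.sh" /usr/java/jdk1.8.0
set_java_home() {
  env_file=$1
  java_dir=$2
  if grep -q '^export JAVA_HOME=' "$env_file"; then
    # Replace the existing export line in place (GNU sed syntax).
    sed -i "s|^export JAVA_HOME=.*|export JAVA_HOME=${java_dir}|" "$env_file"
  else
    # No export line yet: append one at the end of the file.
    printf 'export JAVA_HOME=%s\n' "$java_dir" >> "$env_file"
  fi
}
```

Editing the file by hand works just as well; a helper like this is mainly useful when the same setup is repeated on several machines.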

3. Configure the Hadoop environment variables
Add the following to /etc/profile:

      export HADOOP_HOME=/home/hadoop/app/hadoop-2.8.2
      export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

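Then run source /etc/profile (or log in again) so that the changes take effect.
[Screenshot omitted: the author's /etc/profile with these lines added]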

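4. Run the hadoop version command to verify that the environment variables are configured correctly. Normally you will see output similar to the following: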

   [hadoop@server01 hadoop]$ hadoop version
   Hadoop 2.8.2
   Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 66c47f2a01ad9637879e95f80c41f798373828fb
   Compiled by jdu on 2017-10-19T20:39Z
   Compiled with protoc 2.5.0
   From source with checksum dce55e5afe30c210816b39b631a53b1d
   This command was run using /home/hadoop/app/hadoop-2.8.2/share/hadoop/common/hadoop-common-2.8.2.jar
   [hadoop@server01 hadoop]$

II. Usage Examples

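Hadoop ships with a MapReduce example program, $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar. It demonstrates basic MapReduce functionality and can also be used for real computations; its programs include wordcount, terasort, join, grep, and others.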

You can run the following command to list the MapReduce programs that this .jar file supports.

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar
[hadoop@server01 mapreduce]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
[hadoop@server01 mapreduce]$

Next, we demonstrate how to use this bundled MapReduce program to count the words in files.

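1. Create a directory to hold the data we want to process. It can be anywhere (here I create an input directory under /home/hadoop/hadoopdata). Then put the files you want to analyze into it (here I copy Hadoop's configuration files into the input directory).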

cd /home/hadoop/hadoopdata
mkdir input
cp /home/hadoop/app/hadoop-2.8.2/etc/hadoop/*.xml input
ls -l input
[hadoop@server01 hadoopdata]$ cp /home/hadoop/app/hadoop-2.8.2/etc/hadoop/*.xml input
[hadoop@server01 hadoopdata]$ ll input
total 52
-rw-r--r--. 1 hadoop hadoop 4942 Apr 30 11:43 capacity-scheduler.xml
-rw-r--r--. 1 hadoop hadoop 1144 Apr 30 11:43 core-site.xml
-rw-r--r--. 1 hadoop hadoop 9683 Apr 30 11:43 hadoop-policy.xml
-rw-r--r--. 1 hadoop hadoop  854 Apr 30 11:43 hdfs-site.xml
-rw-r--r--. 1 hadoop hadoop  620 Apr 30 11:43 httpfs-site.xml
-rw-r--r--. 1 hadoop hadoop 3518 Apr 30 11:43 kms-acls.xml
-rw-r--r--. 1 hadoop hadoop 5546 Apr 30 11:43 kms-site.xml
-rw-r--r--. 1 hadoop hadoop  871 Apr 30 11:43 mapred-site.xml
-rw-r--r--. 1 hadoop hadoop 1067 Apr 30 11:43 yarn-site.xml
[hadoop@server01 hadoopdata]$ 

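2. In this example, we use all the files in the input directory as input, select the words that match the regular expression dfs[a-z.]+, and count how many times each one occurs. Run the following command in /home/hadoop/hadoopdata to start the Hadoop job.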

    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar grep input output 'dfs[a-z.]+'

If the job succeeds, it prints a series of progress messages and writes its results to the output directory. View them with cat output/*; here the matching word dfsadmin appears once:

    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=123
    File Output Format Counters 
        Bytes Written=23
[hadoop@server01 hadoopdata]$ cat output/*
1   dfsadmin
[hadoop@server01 hadoopdata]$ ll output/
total 4
-rw-r--r--. 1 hadoop hadoop 11 Apr 30 12:51 part-r-00000
-rw-r--r--. 1 hadoop hadoop  0 Apr 30 12:51 _SUCCESS
[hadoop@server04 hadoopdata]$ 
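
For intuition about what the grep job computes, roughly the same counts can be produced locally with standard Unix tools. This is a minimal sketch, not how Hadoop implements the job: the function name local_grep_count is hypothetical, and it assumes that matches contain no whitespace.

```shell
# Hypothetical helper: count regex matches across all files in a directory,
# printing "count<TAB>match" with the most frequent match first.
local_grep_count() {
  dir=$1
  regex=$2
  # -o prints only the matched text, -h omits file names, -E enables extended regex.
  grep -ohE -- "$regex" "$dir"/* \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '{ print $1 "\t" $2 }'
}
```

Running local_grep_count input 'dfs[a-z.]+' in /home/hadoop/hadoopdata is a quick way to sanity-check the job's result before or after running it.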

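Note that Hadoop does not overwrite existing results by default. If you run another job that also writes to the output directory, it fails with an error saying that the output directory already exists, so delete the output directory first.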
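
If you would rather keep earlier results than delete them, one option is to rename the old directory before each run. The sketch below assumes a hypothetical helper named fresh_output_dir; it is a local convenience and not part of Hadoop.

```shell
# Hypothetical helper: if the output directory already exists, move it aside
# under a timestamped name so the next job can create a fresh one.
# Usage: fresh_output_dir output && hadoop jar <examples jar> wordcount input output
fresh_output_dir() {
  out_dir=$1
  if [ -e "$out_dir" ]; then
    backup="${out_dir}.$(date +%Y%m%d%H%M%S)"
    mv -- "$out_dir" "$backup"
    echo "moved existing $out_dir to $backup" >&2
  fi
}
```

Renaming is a design choice here: it avoids an accidental rm -rf on the wrong path, at the cost of old result directories accumulating until you clean them up.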

3. After deleting the output directory, we run the wordcount program to count the words:

[hadoop@server04 hadoopdata]$ rm -rf output/
[hadoop@server04 hadoopdata]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar wordcount input output

The results look like this (excerpt):

  File Input Format Counters 
      Bytes Read=26548
  File Output Format Counters 
      Bytes Written=10400
[hadoop@server04 hadoopdata]$ cat output/*
"*" 18
"AS 8
"License"); 8
"alice,bob  18
"clumping"  1
"kerberos".   1
"simple"  1
'HTTP/' 1
'none'  1
'random'    1
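
To see why tokens such as "License"); keep their punctuation, it helps to reproduce the count locally. The sketch below assumes that the wordcount example splits its input on whitespace only, which is what the output above suggests; the function name local_wordcount is hypothetical, and the ordering may differ from Hadoop's in edge cases.

```shell
# Hypothetical helper: whitespace-delimited word count over all files in a directory,
# printing "word<TAB>count" sorted by word in byte order.
local_wordcount() {
  dir=$1
  # Turn every run of whitespace into a single newline, drop any empty line,
  # then count identical tokens.
  cat "$dir"/* \
    | tr -s ' \t\n\r\f' '\n' \
    | grep -v '^$' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ print $2 "\t" $1 }'
}
```

Running local_wordcount input and comparing the first lines with cat output/* is a simple cross-check of the job's result.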

With that, we have successfully used Hadoop's bundled MapReduce program to count the words in our files.
