Setting Up a hadoop-2.6.0-cdh5.15.0 Cluster

梦回唐朝

I. Pre-deployment Preparation

1. Machines

A test cluster can be built on 3 servers:

Hostname   IP               OS
data1      192.168.66.152   CentOS 7
data2      192.168.66.153   CentOS 7
data3      192.168.66.154   CentOS 7

2. Component versions and downloads

Component   Version                      Download
hadoop      hadoop-2.6.0-cdh5.15.0       https://archive.cloudera.com/...
hive        hive-1.1.0-cdh5.15.0         https://archive.cloudera.com/...
zookeeper   zookeeper-3.4.5-cdh5.15.0    https://archive.cloudera.com/...
hbase       hbase-1.2.0-cdh5.15.0        https://archive.cloudera.com/...
kafka       kafka_2.12-0.11.0.3          http://kafka.apache.org/downl...
flink       flink-1.10.1-bin-scala_2.12  https://flink.apache.org/down...
jdk         jdk-8u251-linux-x64          https://www.oracle.com/java/t...

3. Cluster node layout

Machine   Services
data1     NameNode, DataNode, ResourceManager, NodeManager, JournalNode, QuorumPeerMain, DFSZKFailoverController, HMaster, HRegionServer, Kafka
data2     NameNode, DataNode, ResourceManager, NodeManager, JournalNode, QuorumPeerMain, DFSZKFailoverController, HMaster, HRegionServer, Kafka
data3     DataNode, NodeManager, HRegionServer, JournalNode, QuorumPeerMain, Kafka

II. Deployment

1. Change the hostnames

The 3 servers' default hostname is localhost. To make it convenient to address the machines by hostname later, change the hostname of each of the 3 servers.
Log in to each server and edit its /etc/hostname file, naming the machines data1, data2, and data3 respectively.
Then edit /etc/hosts on all 3 machines so they can resolve each other, appending the following mappings at the end of the file:

192.168.66.152 data1
192.168.66.153 data2
192.168.66.154 data3

The hostname changes take effect after a reboot.
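The hosts-file lookup can be sanity-checked locally; the sketch below uses a scratch copy of the three entries (the /tmp path is illustrative), since editing /etc/hosts itself requires root.

```shell
# A scratch copy of the three /etc/hosts entries (illustrative path)
cat > /tmp/hosts.demo <<'EOF'
192.168.66.152 data1
192.168.66.153 data2
192.168.66.154 data3
EOF

# Look up the IP for data2 the way the resolver would
awk '$2 == "data2" { print $1 }' /tmp/hosts.demo   # -> 192.168.66.153
```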

2. Add a hadoop user and group

On all 3 servers, create a dedicated group and user named hadoop for operating the hadoop cluster.

# Add the hadoop group
sudo groupadd hadoop

# Add the hadoop user and assign it to the hadoop group
sudo useradd -g hadoop hadoop

# Set a password for the hadoop user
sudo passwd hadoop

# Grant the hadoop user sudo privileges by editing /etc/sudoers
sudo vi /etc/sudoers

# Add a line below "root    ALL=(ALL)     ALL"
hadoop  ALL=(ALL)       ALL

# Switch to the newly added hadoop user; the rest of the installation is done as this user
su hadoop

3. Passwordless SSH setup

During the Hadoop installation the configured packages are copied to the other machines many times. To avoid typing a password for every ssh/scp, set up passwordless SSH login.

# On data1, generate a public/private key pair with ssh-keygen
# -t selects the rsa algorithm
# -P sets the passphrase; -P '' means an empty passphrase (without -P you would have to press Enter three times; with it, only once)
# -f sets the output path for the key files
ssh-keygen  -t rsa -P '' -f ~/.ssh/id_rsa

# Enter the .ssh directory; it now contains id_rsa (private key) and id_rsa.pub (public key)
cd ~/.ssh

# Append the public key to an authorized_keys file
cat id_rsa.pub >> authorized_keys

# Copy the authorized_keys generated above to data2 and data3
# (note this overwrites any existing authorized_keys on those hosts)
scp ~/.ssh/authorized_keys hadoop@data2:~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys hadoop@data3:~/.ssh/authorized_keys

# On each machine, set the permissions of authorized_keys to 600 (sshd rejects looser permissions)
chmod 600 ~/.ssh/authorized_keys

# Verify the passwordless login
# If ssh data2 / ssh data3 from data1 no longer prompts for a password, the setup succeeded
ssh data2
ssh data3

4. Disable the firewall

Since the hadoop cluster is deployed on an internal network, it is recommended to disable the firewall beforehand to avoid odd connectivity problems during deployment.

# Check the firewall status (systemctl status firewalld also works)
firewall-cmd --state

# Stop the firewall for the current session (it comes back after a reboot)
sudo systemctl stop firewalld

# Disable the firewall at boot
sudo systemctl disable firewalld

5. Server time synchronization

Some cluster services require the servers' clocks to be in sync, HBase in particular: if the 3 machines' clocks drift too far apart, the HBase service fails to start, so configure time synchronization up front. Either ntp or chrony can be used (chrony is recommended); on CentOS 7 chrony is installed by default, so only the configuration is needed.

5.1 Chrony server configuration

We use data1 as the chrony server and the other two machines (data2, data3) as chrony clients, i.e. data2 and data3 will synchronize their time from data1.

# Log in to data1
# Edit /etc/chrony.conf
sudo vi /etc/chrony.conf

# Comment out the default time servers
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst

# Add one line with our own time source
# This IP is data1's own address, meaning data1 serves its own clock as the reference (useful when there is no internet access)
# With internet access, the Alibaba Cloud NTP servers can be used instead:
# server ntp1.aliyun.com iburst
# server ntp2.aliyun.com iburst
# server ntp3.aliyun.com iburst
# server ntp4.aliyun.com iburst
server 192.168.66.152 iburst

# Allow machines in this subnet to synchronize from this server
allow 192.168.66.0/24

# Set the stratum this server advertises when serving its local clock
local stratum 10

# Restart the chrony service
sudo systemctl restart chronyd.service

# Enable chrony at boot
sudo systemctl enable chronyd.service

# Check the chrony service status
systemctl status chronyd.service

5.2 Chrony client configuration

Perform the following on data2 and data3:

# Log in to data2 and data3
# Edit /etc/chrony.conf
sudo vi /etc/chrony.conf

# Comment out the default time servers
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst

# Add one line with our own time source
# This IP is data1's address; the client will synchronize its time from data1
server 192.168.66.152 iburst

# Restart the chrony service
sudo systemctl restart chronyd.service

# Enable chrony at boot
sudo systemctl enable chronyd.service

# Check the chrony service status
systemctl status chronyd.service

5.3 Verify synchronization

# Use timedatectl on each of data1, data2, and data3 to verify synchronization
timedatectl

# The command returns something like:
      Local time: Wed 2020-06-17 18:46:41 CST
  Universal time: Wed 2020-06-17 10:46:41 UTC
        RTC time: Wed 2020-06-17 10:46:40
       Time zone: Asia/Shanghai (CST, +0800)
     NTP enabled: yes
NTP synchronized: yes  (this reads yes once synchronization succeeds)
 RTC in local TZ: no
      DST active: n/a

# If NTP synchronized reads no, synchronization failed; re-check the configuration
# If the configuration is correct and it still reads no, try:
sudo timedatectl set-local-rtc 0

6. Install the JDK

Install JDK 8 on all 3 machines and configure its environment variables; remember to source the profile file after changing it.

7. Deploy zookeeper

# Upload the zookeeper package to data1
# Extract the archive and move it to the target directory
tar -zxvf zookeeper-3.4.5-cdh5.15.0.tar.gz -C /usr/local/
mv /usr/local/zookeeper-3.4.5-cdh5.15.0 /usr/local/zookeeper

# Enter zookeeper's conf directory to edit the configuration
# Copy the default zoo_sample.cfg to zoo.cfg
cp zoo_sample.cfg zoo.cfg

# Edit zoo.cfg
vi zoo.cfg

# Change the dataDir=/tmp/zookeeper parameter
dataDir=/usr/local/zookeeper/data

# Append the zk ensemble configuration at the end of zoo.cfg
# The template is server.X=A:B:C, where X is a number identifying the server (its myid)
# A is the server's IP address or hostname
# B is the port this server uses to exchange messages with the ensemble leader
# C is the port used for leader election
server.1=data1:2888:3888
server.2=data2:2888:3888
server.3=data3:2888:3888
# The remaining parameters in zoo.cfg can stay at their defaults

# Create the data directory that the dataDir parameter above points to
mkdir -p /usr/local/zookeeper/data

# In that data directory, create a myid file containing a unique numeric id
# The id uniquely identifies this server and must be unique across the whole ensemble
# zookeeper uses it to pick the matching server.X entry: id 1 corresponds to server.1 in zoo.cfg
cd /usr/local/zookeeper/data
touch myid
echo 1 > myid

# data1's zk configuration is now complete
# Distribute it to the other two machines (data2 and data3)
scp -rp /usr/local/zookeeper hadoop@data2:/usr/local/zookeeper
scp -rp /usr/local/zookeeper hadoop@data3:/usr/local/zookeeper

# On data2 and data3, edit /usr/local/zookeeper/data/myid
# Change the myid content to 2 on data2
# Change the myid content to 3 on data3
vi /usr/local/zookeeper/data/myid

# Configure the zookeeper environment variables on all 3 machines,
# so zk commands can be run without entering zk's bin directory every time
sudo vi /etc/profile

# Append these two lines at the end of the file
export ZK_HOME=/usr/local/zookeeper
export PATH=$ZK_HOME/bin:$PATH

# Remember to source the file after editing
source /etc/profile

# Configuration is complete; start the zk service on each of the 3 machines
zkServer.sh start

# After startup, check the zk status
zkServer.sh status
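The myid-to-server.X mapping described above can be sketched locally with scratch files (the /tmp paths are illustrative, not the real installation): ZooKeeper reads its own myid and picks the matching server.X line from zoo.cfg.

```shell
# Scratch zoo.cfg and myid, mirroring the ensemble configured above
mkdir -p /tmp/zkdemo
printf 'server.1=data1:2888:3888\nserver.2=data2:2888:3888\nserver.3=data3:2888:3888\n' > /tmp/zkdemo/zoo.cfg
echo 2 > /tmp/zkdemo/myid   # what data2's myid would contain

# ZooKeeper matches its own myid against the server.X entries:
grep "^server.$(cat /tmp/zkdemo/myid)=" /tmp/zkdemo/zoo.cfg   # -> server.2=data2:2888:3888
```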

8. Deploy hadoop

8.1 Install hadoop

# Upload the hadoop package to data1
# Extract the archive and move it to the target directory
tar -zxvf hadoop-2.6.0-cdh5.15.0.tar.gz -C /usr/local/
mv /usr/local/hadoop-2.6.0-cdh5.15.0 /usr/local/hadoop

# Configure the hadoop environment variables
sudo vi /etc/profile

# Append at the end of the file
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

# Remember to source the file after editing
source /etc/profile

8.2 Edit the hadoop-env.sh file

# Enter hadoop's configuration directory
cd /usr/local/hadoop/etc/hadoop

# Edit hadoop-env.sh
vi hadoop-env.sh

# Change export JAVA_HOME=${JAVA_HOME} to the jdk installation directory
export JAVA_HOME=/usr/local/jdk1.8.0_251

8.3 Edit the core-site.xml file

# By default this file contains only an empty <configuration> element; add the following properties inside it
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://cdhbds</value>
  <description>
   The name of the default file system.  
   A URI whose scheme and authority determine the FileSystem implementation.  
   The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class.  
   The uri's authority is used to determine the host, port, etc. for a filesystem.
  </description>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadooptmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>io.native.lib.available</name>
  <value>true</value>
  <description>Should native hadoop libraries, if present, be used.</description>
</property>

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
  <description>
    A comma-separated list of the compression codec classes that can
    be used for compression/decompression. In addition to any classes specified
    with this property (which take precedence), codec classes on the classpath
    are discovered using a Java ServiceLoader.</description>
</property>

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
  <description>Number of minutes between trash checkpoints. If zero, the trash feature is disabled.</description>
</property>

<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>1440</value>
  <description> 
    Number of minutes between trash checkpoints. Should be smaller or equal to fs.trash.interval. 
    If zero, the value is set to the value of fs.trash.interval </description>
</property>

<property>
    <name>ha.zookeeper.quorum</name>
    <value>data1:2181,data2:2181,data3:2181</value>
    <description>The 3 zookeeper nodes</description>
</property>

8.4 Edit the hdfs-site.xml file

# By default this file contains only an empty <configuration> element; add the following properties inside it
<property>
    <name>dfs.nameservices</name>
    <value>cdhbds</value>
    <description>
        Comma-separated list of nameservices.
    </description>
</property>

<property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:50010</value>
    <description>
       The datanode server address and port for data transfer.
       If the port is 0 then the server will start on a free port.
    </description>
</property>

<property>
    <name>dfs.datanode.balance.bandwidthPerSec</name>
    <value>52428800</value>
</property>

<property>
    <name>dfs.datanode.balance.max.concurrent.moves</name>
    <value>250</value>
</property>

<property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:50075</value>
    <description>
       The datanode http server address and port.
       If the port is 0 then the server will start on a free port.
    </description>
</property>

<property>
    <name>dfs.datanode.ipc.address</name>
    <value>0.0.0.0:50020</value>
    <description>
       The datanode ipc server address and port.
       If the port is 0 then the server will start on a free port.
    </description>
</property>

<property>
    <name>dfs.ha.namenodes.cdhbds</name>
    <value>nn1,nn2</value>
    <description></description>
</property>

<property>
    <name>dfs.namenode.rpc-address.cdhbds.nn1</name>
    <value>data1:8020</value>
    <description>RPC address of namenode nn1</description>
</property>
                        
<property>
    <name>dfs.namenode.rpc-address.cdhbds.nn2</name>
    <value>data2:8020</value>
    <description>RPC address of namenode nn2</description>
</property>
                                    
<property>
    <name>dfs.namenode.http-address.cdhbds.nn1</name>
    <value>data1:50070</value>
    <description>HTTP address of namenode nn1</description>
</property>
                                                
<property>
    <name>dfs.namenode.http-address.cdhbds.nn2</name>
    <value>data2:50070</value>
    <description>HTTP address of namenode nn2</description>
</property>

<property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/namenode</value>
    <description>
      Determines where on the local filesystem the DFS name node should store the name table.
      If this is a comma-delimited list of directories,then name table is replicated in all of the directories,
      for redundancy.</description>
    <final>true</final>
</property>

<property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/data/checkpoint</value>
    <description></description>
</property>

<property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/datanode</value>
    <description>Determines where on the local filesystem an DFS data node should store its blocks.
         If this is a comma-delimited list of directories,then data will be stored in all named directories,
         typically on different devices.Directories that do not exist are ignored.
    </description>
    <final>true</final>
</property>

<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>

<property>
    <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
    <value>true</value>
    <description>
   Boolean which enables backend datanode-side support for the experimental DistributedFileSystem*getFileVBlockStorageLocations API.
    </description>
</property>

<property>
    <name>dfs.permissions.enabled</name>
    <value>true</value>
    <description>
        If "true", enable permission checking in HDFS.
        If "false", permission checking is turned off,but all other behavior is unchanged.
        Switching from one parameter value to the other does not change the mode,owner or group of files or directories.
    </description>
</property>

<property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://data1:8485;data2:8485;data3:8485/cdhbds</value>
    <description>The edit log is stored on 3 journalnode nodes; these are their hosts and ports</description>
</property>
            
<property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/data/journaldata/</value>
    <description>Local storage path for the journalnode data</description>
</property>

<property>
    <name>dfs.journalnode.rpc-address</name>
    <value>0.0.0.0:8485</value>
</property>
        
<property>
    <name>dfs.journalnode.http-address</name>
    <value>0.0.0.0:8480</value>
</property>

<property>
    <name>dfs.client.failover.proxy.provider.cdhbds</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    <description>This class determines which namenode is currently active</description>
</property>

<property>
    <name>dfs.ha.fencing.methods</name>
    <value>shell(/bin/true)</value>
</property>

<property>
    <name>dfs.ha.fencing.ssh.connect-timeout</name>
    <value>10000</value>
</property>

<property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
    <description>
          Whether automatic failover is enabled. See the HDFS High Availability documentation for details 
          on automatic HA configuration.
    </description>
</property>

<property>
     <name>dfs.namenode.handler.count</name>
     <value>20</value>
     <description>The number of server threads for the namenode.</description>
</property>

8.5 Edit the mapred-site.xml file

# Copy mapred-site.xml.template to mapred-site.xml
cp mapred-site.xml.template mapred-site.xml

# Edit mapred-site.xml and add the following inside the <configuration> element
<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>

<property>
    <name>mapreduce.shuffle.port</name>
    <value>8350</value>
</property>

<property>
    <name>mapreduce.jobhistory.address</name>
    <value>0.0.0.0:10121</value>
</property>

<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>0.0.0.0:19868</value>
</property>

<property>
    <name>mapreduce.jobtracker.http.address</name>
    <value>0.0.0.0:50330</value>
</property>

<property>
    <name>mapreduce.tasktracker.http.address</name>
    <value>0.0.0.0:50360</value>
</property>

<property>
    <name>mapreduce.map.output.compress</name> 
    <value>true</value>
</property>
              
<property>
    <name>mapreduce.map.output.compress.codec</name> 
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

<property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
</property>

<property>
    <name>mapreduce.job.counters.max</name>
    <value>560</value>
    <description>Limit on the number of counters allowed per job.</description>
</property>

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx4096m</value>
</property>

<property>
    <name>mapreduce.map.memory.mb</name>
    <value>3072</value>
</property>

<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
</property>

<property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
</property>

<property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>1</value>
</property>

<property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>300</value>
</property>
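One thing worth double-checking in the values above: mapred.child.java.opts sets -Xmx4096m, which is larger than the 3072 MB map container (mapreduce.map.memory.mb). YARN kills containers whose processes exceed the container size, so the task heap is usually kept at roughly 80% of the container. A quick sketch of that rule of thumb (the 80% figure is a common convention, not something mandated by Hadoop):

```shell
# Rule-of-thumb -Xmx sizes (~80%) for the container sizes configured above
map_container_mb=3072
reduce_container_mb=4096
echo "map heap:    $(( map_container_mb * 80 / 100 ))m"      # -> map heap:    2457m
echo "reduce heap: $(( reduce_container_mb * 80 / 100 ))m"   # -> reduce heap: 3276m
```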

8.6 Edit the yarn-env.sh file

# Edit yarn-env.sh
vi yarn-env.sh

# Change export JAVA_HOME=${JAVA_HOME} to the jdk installation directory
export JAVA_HOME=/usr/local/jdk1.8.0_251

8.7 Edit the yarn-site.xml file

# Add the following inside the <configuration> element
<!-- Site specific YARN configuration properties -->
<property>
    <name>yarn.resourcemanager.connect.retry-interval.ms</name>
    <value>2000</value>
</property>

<property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
</property>

<property>
    <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
    <value>true</value>
</property>

<property>
    <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
    <value>true</value>
</property>

<property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarn-rm-cluster</value>
</property>

<property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
</property>

<property>
    <description>Id of the current ResourceManager. Must be set explicitly on each ResourceManager to the appropriate value.</description>
    <name>yarn.resourcemanager.ha.id</name>
    <value>rm1</value>
</property>

<property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
</property>

<property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>

<property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>data1:2181,data2:2181,data3:2181</value>
</property>

<property>
    <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
    <value>5000</value>
</property>

<property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

<!-- The fair scheduler depends on an allocation file, which is configured in the next step -->
<property>
    <name>yarn.scheduler.fair.allocation.file</name>
    <value>fair-scheduler.xml</value>
</property>

<!-- RM1 configs -->
<property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>data1:8032</value>
</property>

<property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
    <value>data1:8030</value>
</property>

<property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>data1:50030</value>
</property>

<property>
    <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
    <value>data1:8031</value>
</property>

<property>
    <name>yarn.resourcemanager.admin.address.rm1</name>
    <value>data1:8033</value>
</property>

<property>
    <name>yarn.resourcemanager.ha.admin.address.rm1</name>
    <value>data1:8034</value>
</property>

<!-- RM2 configs -->
<property>
    <name>yarn.resourcemanager.address.rm2</name>
    <value>data2:8032</value>
</property>

<property>
    <name>yarn.resourcemanager.scheduler.address.rm2</name>
    <value>data2:8030</value>
</property>

<property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
    <value>data2:50030</value>
</property>

<property>
    <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
    <value>data2:8031</value>
</property>

<property>
    <name>yarn.resourcemanager.admin.address.rm2</name>
    <value>data2:8033</value>
</property>

<property>
    <name>yarn.resourcemanager.ha.admin.address.rm2</name>
    <value>data2:8034</value>
</property>

<!-- Node Manager Configs -->
<property>
    <description>Address where the localizer IPC is.</description>
    <name>yarn.nodemanager.localizer.address</name>
    <value>0.0.0.0:23344</value>
</property>

<property>
    <description>NM Webapp address.</description>
    <name>yarn.nodemanager.webapp.address</name>
    <value>0.0.0.0:23999</value>
</property>

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

<property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>112640</value>
</property>

<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
</property>

<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>31</value>
</property>

<property>
    <name>yarn.scheduler.increment-allocation-mb</name>
    <value>512</value>
</property>

<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
</property>

<property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/data/yarn/local</value>
</property>

<property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/data/yarn/logs</value>
</property>
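With the NodeManager sizes above (112640 MB of memory, 31 vcores) and the 1024 MB minimum allocation, it is easy to estimate the per-node container ceiling; a small sketch of the arithmetic:

```shell
# Per-node NodeManager capacity from the yarn-site.xml values above
nm_mem_mb=112640
min_alloc_mb=1024
nm_vcores=31

# With 1024 MB minimum allocations, memory alone would allow this many containers:
echo "by memory: $(( nm_mem_mb / min_alloc_mb ))"   # -> by memory: 110
# but only 31 vcores are available, so 1-vcore containers cap out at 31 first
echo "by vcores: $nm_vcores"                        # -> by vcores: 31
```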

8.8 Create the fair-scheduler.xml file

# Create a fair-scheduler.xml file in the hadoop configuration directory with the following content
<?xml version="1.0" encoding="UTF-8"?>
<!--
       Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<!--
  This file contains pool and user allocations for the Fair Scheduler.
  Its format is explained in the Fair Scheduler documentation at
  http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
  The documentation also includes a sample config file.
-->
<allocations>
<!-- Define a queue named dev and set its minimum and maximum resources -->
<queue name="dev">
    <minResources>10240 mb, 10 vcores</minResources>
    <maxResources>51200 mb, 18 vcores</maxResources>
    <schedulingMode>fair</schedulingMode>
    <weight>5</weight>
    <maxRunningApps>30</maxRunningApps>
</queue>
</allocations>

8.9 Edit the slaves file

# Edit the slaves file
vi slaves

# Replace localhost with the following three lines, making these 3 machines the cluster's worker nodes
# i.e. the datanode and nodemanager services will run on them
data1
data2
data3

8.10 Distribute the hadoop package

# Distribute the configured hadoop package from data1 to the other two machines (data2 and data3)
scp -rp /usr/local/hadoop hadoop@data2:/usr/local/hadoop
scp -rp /usr/local/hadoop hadoop@data3:/usr/local/hadoop

# After distribution, one setting in data2's yarn-site.xml must still be changed
# data1 and data2 are the two ResourceManager HA machines
# Change the value of the following property from rm1 to rm2, otherwise starting the
# ResourceManager service on data2 fails with a "data1: port in use" style error
<property>
    <description>Id of the current ResourceManager. Must be set explicitly on each ResourceManager to the appropriate value.</description>
    <name>yarn.resourcemanager.ha.id</name>
    <value>rm2</value>
</property>

# Configure the hadoop environment variables on data2 and data3 as well
sudo vi /etc/profile

# Append at the end of the file
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

# Remember to source the file after editing
source /etc/profile

8.11 Initialize and start the cluster

# First make sure the zookeeper ensemble from earlier started successfully
# Start the journalnode service on each of the 3 machines
hadoop-daemon.sh start journalnode

# Format the NameNode on data1
hdfs namenode -format

# Copy the namenode metadata directory from data1 to data2, so both namenodes start from identical metadata
# This is the directory configured in hdfs-site.xml as /data/namenode:
# <property>
#    <name>dfs.namenode.name.dir</name>
#    <value>/data/namenode</value>
# </property>
scp -rp /data/namenode hadoop@data2:/data/namenode

# Initialize the ZKFC state in zookeeper, on data1
hdfs zkfc -formatZK

# Start the hdfs distributed storage system from data1
start-dfs.sh

# Start the yarn cluster from data1
# This starts the ResourceManager service on data1 and the NodeManager service on data1, data2, and data3
start-yarn.sh

# Start the standby ResourceManager service on data2
yarn-daemon.sh start resourcemanager

8.12 Verify the hadoop cluster

(1) Open the hdfs web UI
From a Windows machine, visit http://data1:50070 to see the hdfs cluster overview.

(2) Open the yarn web UI
From a Windows machine, visit http://data1:50030 to see the yarn cluster resource overview.

9. Deploy hive

Since hive's metadata is stored in MySQL, a MySQL database must be installed beforehand.
(1) Install the hive package

# Upload the hive package to data1, then extract it and move it to the target directory
tar -zxvf hive-1.1.0-cdh5.15.0.tar.gz -C /usr/local/
mv /usr/local/hive-1.1.0-cdh5.15.0 /usr/local/hive

# Configure the hive environment variables
# Add hive's path settings to /etc/profile
export HIVE_HOME=/usr/local/hive
export PATH=$HIVE_HOME/bin:$PATH

# Remember to source /etc/profile after editing
source /etc/profile

(2) Copy the MySQL driver

Copy the MySQL driver jar into hive's lib directory, i.e. /usr/local/hive/lib.

(3) Edit the hive-env.sh file

# In hive's conf directory, copy hive-env.sh.template to hive-env.sh
cp hive-env.sh.template hive-env.sh

# Set the following two parameters in hive-env.sh
HADOOP_HOME=/usr/local/hadoop
export HIVE_CONF_DIR=/usr/local/hive/conf

(4) Edit the hive-site.xml file

# Set the hive-site.xml content as follows
<configuration>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://192.168.66.240:3306/hive?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
  <description>password to use against metastore database</description>
</property>

<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
  <description> This controls whether the final outputs of a query (to a local/HDFS file or a Hive table) is compressed. The compression codec and other options are determined from Hadoop config variables mapred.output.compress* </description>
</property>

<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
  <description> This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from Hadoop config variables mapred.output.compress* </description>
</property>

<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>true</value>
  <description>creates necessary schema on a startup if one doesn't exist. set this to false, after creating it once</description>
</property>

<property>
  <name>hive.mapjoin.check.memory.rows</name>
  <value>100000</value>
  <description>The number means after how many rows processed it needs to check the memory usage</description>
</property>

<property>
  <name>hive.auto.convert.join</name>
  <value>true</value>
  <description>Whether Hive enables the optimization about converting common join into mapjoin based on the input file size</description>
</property>

<property>
  <name>hive.auto.convert.join.noconditionaltask</name>
  <value>true</value>
  <description>Whether Hive enables the optimization about converting common join into mapjoin based on the input file 
    size. If this parameter is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the
    specified size, the join is directly converted to a mapjoin (there is no conditional task).
  </description>
</property>

<property>
  <name>hive.auto.convert.join.noconditionaltask.size</name>
  <value>10000000</value>
  <description>If hive.auto.convert.join.noconditionaltask is off, this parameter does not take affect. However, if it
    is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than this size, the join is directly
    converted to a mapjoin(there is no conditional task). The default is 10MB
  </description>
</property>

<property>
  <name>hive.auto.convert.join.use.nonstaged</name>
  <value>false</value>
  <description>For conditional joins, if input stream from a small alias can be directly applied to join operator without
    filtering or projection, the alias need not to be pre-staged in distributed cache via mapred local task.
    Currently, this is not working with vectorization or tez execution engine.
  </description>
</property>

<property>
  <name>hive.mapred.mode</name>
  <value>nonstrict</value>
  <description>The mode in which the Hive operations are being performed.
     In strict mode, some risky queries are not allowed to run. They include:
       Cartesian Product.
       No partition being picked up for a query.
       Comparing bigints and strings.
       Comparing bigints and doubles.
       Orderby without limit.
  </description>
</property>

<property>
  <name>hive.exec.parallel</name>
  <value>true</value>
  <description>Whether to execute jobs in parallel</description>
</property>

<property>
  <name>hive.exec.parallel.thread.number</name>
  <value>8</value>
  <description>How many jobs at most can be executed in parallel</description>
</property>

<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
  <description>Whether or not to allow dynamic partitions in DML/DDL.</description>
</property>

<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
  <description>In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions.</description>
</property>

<property>  
  <name>hive.metastore.uris</name>  
  <value>thrift://data1:9083</value>  
  <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>  
</property>

<property>
  <name>hive.server2.enable.impersonation</name>
  <value>false</value>
  <description>Enable user impersonation for HiveServer2</description>
</property>

<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>

<property>
  <name>hive.input.format</name>
  <value>org.apache.hadoop.hive.ql.io.CombineHiveInputFormat</value>
</property>

<property>
  <name>hive.merge.mapfiles</name>
  <value>true</value>
</property>

<property>
  <name>hive.merge.mapredfiles</name>
  <value>true</value>
</property>

<property>
  <name>hive.merge.size.per.task</name>
  <value>256000000</value>
</property>

<property>
  <name>hive.merge.smallfiles.avgsize</name>
  <value>256000000</value>
</property>

<property>
    <name>hive.server2.logging.operation.enabled</name>
    <value>true</value>
</property>

<!--SENTRY META STORE-->
<!-- <property>
<name>hive.metastore.filter.hook</name>
<value>org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook</value>
</property>

<property>  
    <name>hive.metastore.pre.event.listeners</name>  
    <value>org.apache.sentry.binding.metastore.MetastoreAuthzBinding</value>  
    <description>list of comma separated listeners for metastore events.</description>
</property>

<property>
    <name>hive.metastore.event.listeners</name>  
    <value>org.apache.sentry.binding.metastore.SentryMetastorePostEventListener</value>  
    <description>list of comma separated listeners for metastore, post events.</description>
</property> -->

<!--SENTRY SESSION-->
<!--<property>
   <name>hive.security.authorization.task.factory</name>
   <value>org.apache.sentry.binding.hive.SentryHiveAuthorizationTaskFactoryImpl</value>
</property>

<property>
   <name>hive.server2.session.hook</name>
   <value>org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook</value>
</property>

<property>
   <name>hive.sentry.conf.url</name>
   <value>file:///usr/local/hive-1.1.0-cdh5.15.0/conf/sentry-site.xml</value>
</property> -->
</configuration>

(5) Edit the hive-log4j.properties file

# Copy hive-log4j.properties.template to hive-log4j.properties
cp hive-log4j.properties.template hive-log4j.properties

# Edit hive-log4j.properties and change hive's log directory
hive.log.dir=/data/hive/logs

(6) Start the metastore and hiveserver2 services

# Start the metastore service
nohup hive --service metastore &

# Start the hiveserver2 service (needed if anything will connect to hive over jdbc)
nohup hive --service hiveserver2 &

(7) Verify hive

# Run the hive command on data1 to enter hive's command-line client
hive

10. Deploy HBase

(1) Install the HBase package

# Upload the HBase package to data1, then extract it and move it to the target directory
tar -zxvf hbase-1.2.0-cdh5.15.0.tar.gz -C /usr/local/
mv /usr/local/hbase-1.2.0-cdh5.15.0 /usr/local/hbase

# Configure the hbase environment variables
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin

# Remember to source /etc/profile after editing
source /etc/profile

(2) Edit the hbase-env.sh file

# Edit hbase-env.sh and set HBASE_MANAGES_ZK=false, so hbase uses the external zookeeper instead of its bundled one
export HBASE_MANAGES_ZK=false

(3) Edit the hbase-site.xml file

# Edit hbase-site.xml and add the following configuration
<!-- hbase's data directory on hdfs -->
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://cdhbds/hbase</value>
</property>
<!-- The hdfs HA nameservice name -->
<property>
    <name>dfs.nameservices</name>
    <value>cdhbds</value>
</property>
<!-- Enable fully distributed cluster mode -->
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.tmp.dir</name>
    <value>/data/hbase/tmp</value>
</property>
<property>
    <name>hbase.master.port</name>
    <value>16000</value>
</property>
<!-- The zookeeper ensemble hbase registers with -->
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>data1,data2,data3</value>
</property>
<!-- The client port of that zookeeper ensemble -->
<property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
</property>

(4) Copy the HDFS configuration files

Copy core-site.xml and hdfs-site.xml from the Hadoop configuration directory into HBase's conf directory.

(5) Configure the regionservers file

# List the hostnames of the HRegionServer nodes: edit the regionservers file and add
data1
data2
data3

(6) Configure HMaster high availability

# Create a backup-masters file in HBase's conf directory and add the hostname of the standby HMaster
data2

(7) Distribute the HBase package

# Copy the configured HBase package from data1 to the other two machines (data2 and data3)
scp -rp /usr/local/hbase hadoop@data2:/usr/local/
scp -rp /usr/local/hbase hadoop@data3:/usr/local/

(8) Start the HBase cluster

# Run the start command on data1
start-hbase.sh

# After startup, run jps on each of the 3 machines to check the HBase processes:
# data1: HMaster and HRegionServer
# data2: HMaster and HRegionServer
# data3: HRegionServer
jps
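The per-node jps check can be wrapped in a small script. A sketch, with the jps output stubbed for illustration (on a real node you would use `procs=$(jps)`):

```shell
# Sketch: verify that the expected HBase processes appear in jps output.
# The jps output is stubbed here; in practice use: procs=$(jps)
procs="20481 HMaster
20632 HRegionServer
21007 Jps"

status=ok
for p in HMaster HRegionServer; do
  if echo "$procs" | grep -qw "$p"; then
    echo "$p: running"
  else
    echo "$p: missing"
    status=fail
  fi
done
echo "check: $status"
```

On data3 the expected list would be just HRegionServer, per the node plan above.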

(9) Verify HBase

# Run hbase shell on data1 to enter the HBase command-line client
hbase shell

You can also open the HBase web UI at http://data1:60010

11. Deploy Kafka

(1) Install the Kafka package

# Upload the Kafka package to data1 and extract it to the target location
tar -zxvf kafka_2.12-0.11.0.3.tgz -C /usr/local/
mv /usr/local/kafka_2.12-0.11.0.3 /usr/local/kafka

(2) Modify the server.properties file

# Edit server.properties and change the following settings
# Unique ID of this broker
broker.id=0
# Set to the current machine's hostname
listeners=PLAINTEXT://data1:9092
# Kafka log directory (this is also where message data is stored)
log.dirs=/data/kafka-logs
# ZooKeeper connection string
zookeeper.connect=data1:2181,data2:2181,data3:2181

(3) Distribute the Kafka package

# Copy the Kafka package from data1 to the other two machines (data2 and data3)
scp -rp /usr/local/kafka hadoop@data2:/usr/local/
scp -rp /usr/local/kafka hadoop@data3:/usr/local/

# On data2, change the following two settings in server.properties
# Unique ID of this broker
broker.id=1
# Set to the current machine's hostname
listeners=PLAINTEXT://data2:9092

# Likewise, change the corresponding settings on data3
# Unique ID of this broker
broker.id=2
# Set to the current machine's hostname
listeners=PLAINTEXT://data3:9092
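Since broker.id follows directly from the data1/data2/data3 naming (data1 → 0, data2 → 1, data3 → 2), the per-host edits can be scripted. A sketch with the config file stubbed out; on a real broker you would target config/server.properties and derive the host with `hostname`:

```shell
# Sketch: derive broker.id and listeners from the hostname, assuming the
# data<N> naming convention used in this cluster (data1 -> broker.id 0, ...).
cfg=$(mktemp)
printf 'broker.id=0\nlisteners=PLAINTEXT://data1:9092\nlog.dirs=/data/kafka-logs\n' > "$cfg"

host=data2                     # on a real node: host=$(hostname)
id=$(( ${host#data} - 1 ))     # strip the "data" prefix, shift to zero-based

sed -i "s|^broker.id=.*|broker.id=$id|" "$cfg"
sed -i "s|^listeners=.*|listeners=PLAINTEXT://$host:9092|" "$cfg"
cat "$cfg"
```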

(4) Start the Kafka cluster

# Start Kafka on each of the 3 machines:
# enter Kafka's bin directory and run the start command
# (the -daemon flag starts Kafka in the background)
cd /usr/local/kafka/bin
./kafka-server-start.sh -daemon ../config/server.properties

(5) Verify Kafka

# Run jps on each of the 3 machines to check that the Kafka process started
jps

# Create a topic
bin/kafka-topics.sh --create --zookeeper data1:2181,data2:2181,data3:2181 --replication-factor 3 --partitions 1 --topic test

# Start a console producer
bin/kafka-console-producer.sh --broker-list data1:9092,data2:9092,data3:9092 --topic test

# Start a console consumer
bin/kafka-console-consumer.sh --bootstrap-server data1:9092,data2:9092,data3:9092 --from-beginning --topic test

12. Deploy Flink on YARN

This deployment uses Flink's YARN mode with high availability (HA).

(1) Install the Flink package

# Upload the Flink package to data1 and extract it to the target location
tar -zxvf flink-1.10.1-bin-scala_2.12.tgz -C /usr/local/
mv /usr/local/flink-1.10.1 /usr/local/flink

(2) Modify the flink-conf.yaml file

# Enter Flink's conf directory and edit flink-conf.yaml
vi flink-conf.yaml

# Change the following settings
taskmanager.numberOfTaskSlots: 4

high-availability: zookeeper
high-availability.storageDir: hdfs://cdhbds/flink/ha/
high-availability.zookeeper.quorum: data1:2181,data2:2181,data3:2181
high-availability.zookeeper.path.root: /flink

state.backend: filesystem
state.checkpoints.dir: hdfs://cdhbds/flink/flink-checkpoints
state.savepoints.dir: hdfs://cdhbds/flink/flink-savepoints

jobmanager.archive.fs.dir: hdfs://cdhbds/flink/completed-jobs/
historyserver.archive.fs.dir: hdfs://cdhbds/flink/completed-jobs/

yarn.application-attempts: 10

(3) Adjust the logging configuration

Flink's conf directory ships with both log4j and logback configuration files, so starting the cluster produces this warning:
org.apache.flink.yarn.AbstractYarnClusterDescriptor           - The configuration directory ('/root/flink-1.7.1/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.

One of the two must go; renaming log4j.properties to log4j.properties.bak is enough:

# Keep only the logback configuration
mv log4j.properties log4j.properties.bak

(4) Configure the Hadoop classpath

# This Flink distribution does not bundle Hadoop. The official docs describe
# two ways to integrate it: add a hadoop classpath entry to the environment,
# or copy flink-shaded-hadoop-2-uber-xx.jar into Flink's lib directory.
# Here we use the first approach.
# Edit /etc/profile and append the following line
export HADOOP_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)

# After editing the environment variables, reload /etc/profile
source /etc/profile

(5) Start the Flink cluster in yarn-session mode

# Start the cluster in yarn-session mode from Flink's bin directory:
# -s  number of slots per TaskManager
# -jm JobManager memory
# -tm TaskManager memory
# -nm application name shown in YARN
# -d  detached mode (run in the background)
./yarn-session.sh -s 4 -jm 1024m -tm 4096m -nm flink-test -d

(6) Start the history server

# Start the Flink history server
./historyserver.sh start