I am setting up Hadoop 2.2.0 and HBase 0.98.3 (whose bundled hadoop-common jars are already version 2.2.0) on three hosts. One machine runs Debian 7.5, is named psyDebian, and runs the NameNode and a DataNode; the other two run CentOS 6.5, are named centos1 and centos2, and each runs a DataNode. HDFS itself works fine, but when I run HBase on top of it, it starts up normally and then, after a short while, HMaster and HRegionServer exit on their own. Right after startup, jps on the master node shows:
5416 NameNode
5647 SecondaryNameNode
5505 DataNode
398 Jps
32745 HMaster
32670 HQuorumPeer
After ten-odd seconds, jps shows:
5416 NameNode
5647 SecondaryNameNode
5505 DataNode
423 Jps
32670 HQuorumPeer
The HRegionServer processes on the slave nodes disappear in the same way.
hbase-hadoop-master-psyDebian.log contains errors like the following:
2014-07-16 09:57:01,061 WARN [main-SendThread(centos1:2181)] zookeeper.ClientCnxn: Session 0x0 for server centos1/192.168.1.110:2181, unexpected error, closing socket connection and attempting reconnect
...
2014-07-16 09:57:02,128 WARN [main] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=centos1:2181,psyDebian:2181,centos2:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
2014-07-16 09:57:02,129 ERROR [main] zookeeper.RecoverableZooKeeper: ZooKeeper create failed after 4 attempts
2014-07-16 09:57:02,129 ERROR [main] master.HMasterCommandLine: Master exiting
...
java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster
...
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
It looks like a ZooKeeper problem, but I am using the ZooKeeper bundled with HBase: HBASE_MANAGES_ZK=true is already set, and the HQuorumPeer process stays up the whole time.
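In case it helps narrow things down, each quorum member can be probed from the master with ZooKeeper's four-letter commands (hostnames are the ones from my hbase.zookeeper.quorum setting below; 2181 is the default clientPort); a healthy member answers "imok" to ruok:
# run on psyDebian; every quorum member should answer "imok"
echo ruok | nc psyDebian 2181
echo ruok | nc centos1 2181
echo ruok | nc centos2 2181
# "stat" additionally reports whether that member is leader or follower
echo stat | nc centos1 2181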
There is a warning in hbase-hadoop-master-psyDebian.out:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hbase/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
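As I understand it, this duplicate-binding warning is not fatal by itself (SLF4J simply picks one of the two jars); if one wants to silence it, moving aside either of the two bindings listed above should be enough, for example:
mv /opt/hbase/lib/slf4j-log4j12-1.7.5.jar /opt/hbase/lib/slf4j-log4j12-1.7.5.jar.bak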
hbase-hadoop-zookeeper-psyDebian.log just repeats the following message over and over:
2014-07-16 10:05:57,502 INFO [WorkerReceiver[myid=0]] quorum.FastLeaderElection: Notification: 1 (n.leader), 0x0 (n.zxid), 0x26 (n.round), LEADING (n.state), 1 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)
On the slave node, hbase-hadoop-regionserver-centos1.log reads:
2014-07-16 09:57:05,150 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
And hbase-hadoop-hbase-centos1.out on the slave node even reports a class-not-found error:
Exception in thread "main" java.lang.NoClassDefFoundError: hbase
Caused by: java.lang.ClassNotFoundException: hbase
Since I have only just started using HBase and don't know the source code, I can't track this down at the source level. I have been searching online for four or five days without finding anything useful. The clocks on the three machines are synchronized, the firewalls are all turned off, and I have deleted the temporary files and restarted many times, yet HMaster and HRegionServer still exit after ten-odd seconds. My key configuration files follow:
hdfs-site.xml under hadoop
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/hadoop_tmp/dfs/data</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/hadoop_tmp/dfs/name</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://psyDebian:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop_tmp</value>
  </property>
</configuration>
slaves
psyDebian
centos1
centos2
hbase-site.xml under hbase
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://psyDebian:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>psyDebian:60000</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>psyDebian,centos1,centos2</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hadoop/zookeeper_tmp</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>90000</value>
  </property>
  <property>
    <name>hbase.regionserver.restart.on.zk.expire</name>
    <value>true</value>
  </property>
</configuration>
regionservers
centos1
centos2
Answering my own question. Today, while running a Hadoop job, I hit an error, and by following the link given in the error message and checking the possible causes one by one, I ended up removing the line
127.0.0.1 psyDebian
from /etc/hosts (and the corresponding entries on the slave nodes). After that the job ran fine. I then tried HBase again and there was no problem at all; creating tables works as well. I had known from the start that the 127.0.1.1 line in the hosts file has to be removed, but I had not realized that the "127.0.0.1 hostname" entry has to go too. And with that, the problem that had troubled me for about a week was solved.
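For reference, the shape of the change on the master was roughly as follows (192.168.1.110 for centos1 comes from the log above; the other addresses are placeholders, since only the pattern matters):
# /etc/hosts on psyDebian -- before (the 127.0.1.1 line had already been removed)
127.0.0.1     localhost
127.0.0.1     psyDebian      # this is the entry that also had to go
<master-ip>   psyDebian
192.168.1.110 centos1
<centos2-ip>  centos2

# /etc/hosts on psyDebian -- after: hostnames resolve only to the LAN addresses
127.0.0.1     localhost
<master-ip>   psyDebian
192.168.1.110 centos1
<centos2-ip>  centos2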