Hadoop HA: no automatic failover to the standby namenode when the active namenode's host goes down

The problem

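    I recently set up HA on our company's experimental Hadoop cluster. Afterwards I found that if I simply kill the active namenode process, failover to the standby namenode happens automatically, but if the node hosting the active namenode goes down entirely (init 0), there is no automatic failover to the standby namenode.
    The zkfc log on the standby node shows that it keeps trying to connect over ssh to the node that hosted the former active namenode. Because that node is down, the ssh connection can never succeed, so the same error repeats in a loop. Isn't Hadoop HA designed for exactly this scenario?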

The relevant part of the error log:

2018-12-03 19:11:47,484 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
2018-12-03 19:11:47,484 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2018-12-03 19:11:47,484 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to shell04...
2018-12-03 19:11:47,484 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to shell04 port 22
2018-12-03 19:11:50,488 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to connect to shell04 as user root
com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route to host
        at com.jcraft.jsch.Util.createSocket(Util.java:394)
        at com.jcraft.jsch.Session.connect(Session.java:215)
        at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
        at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
        at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:532)
        at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:921)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:820)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.NoRouteToHostException: No route to host
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at java.net.Socket.connect(Socket.java:528)
        at java.net.Socket.<init>(Socket.java:425)
        at java.net.Socket.<init>(Socket.java:208)
        at com.jcraft.jsch.Util$1.run(Util.java:362)
        at java.lang.Thread.run(Thread.java:745)
2018-12-03 19:11:50,490 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2018-12-03 19:11:50,490 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
2018-12-03 19:11:50,490 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
java.lang.RuntimeException: Unable to fence NameNode at shell04/192.168.254.143:9000
        at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
        at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:921)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:820)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

Solution

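    Since the problem is caused by ssh fencing, what would happen if fencing were attempted with the shell method as well? So I added shell to dfs.ha.fencing.methods (at first I did not know it could be configured alongside sshfence):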

<!-- Fencing methods are listed one per line and tried in order -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence
shell(/bin/true)
</value>
</property>

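    /bin/true is used because the shell method does not need to actually kill the namenode here. If the active node is reachable, it has already been fenced by sshfence; if the active node is unreachable, the shell method runs instead, and its main purpose is simply to let the fencing step complete so failover can proceed. In testing, after running init 0 on the active node, the zkfc log on the standby namenode showed that it still attempts sshfence first and then immediately runs the shell method, which solved the problem.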
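To make the ordering concrete, here is a minimal Python sketch of the "try each fencing method in order until one succeeds" behaviour described above. It is an illustration only, not Hadoop's code: the function names `fence`, `sshfence_host_down`, and `shell_bin_true` are hypothetical stand-ins, and the two method functions simply return fixed results that mimic a powered-off host.

```python
from typing import Callable, List, Tuple

# A fencing method is modelled as (name, callable returning True on success).
FenceMethod = Tuple[str, Callable[[], bool]]


def fence(methods: List[FenceMethod]) -> Tuple[bool, List[str]]:
    """Try each method in order and stop at the first one that succeeds.

    Returns (fenced, names_of_methods_tried).
    """
    tried: List[str] = []
    for name, attempt in methods:
        tried.append(name)
        if attempt():
            return True, tried
    return False, tried


def sshfence_host_down() -> bool:
    # Hypothetical stand-in for sshfence when the target host is powered off:
    # the ssh connection cannot be established, so the method fails.
    return False


def shell_bin_true() -> bool:
    # Hypothetical stand-in for shell(/bin/true): the command exits with 0,
    # which counts as a successful fence.
    return True


# Only sshfence configured: fencing fails, so the standby cannot become active.
only_ssh = fence([("sshfence", sshfence_host_down)])

# sshfence followed by shell(/bin/true): the second method lets fencing succeed.
with_fallback = fence([
    ("sshfence", sshfence_host_down),
    ("shell(/bin/true)", shell_bin_true),
])

print(only_ssh)       # (False, ['sshfence'])
print(with_fallback)  # (True, ['sshfence', 'shell(/bin/true)'])
```

The first call mirrors the "Unable to fence service by any configured method" error in the log, and the second mirrors the behaviour observed after adding shell(/bin/true).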
