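The Problem
I recently set up HA on my company's Hadoop test cluster. Afterwards I found that killing the active NameNode process triggers an automatic failover to the standby NameNode, but if the node hosting the active NameNode goes down entirely (init 0), the automatic failover to the standby NameNode does not happen.
The zkfc log on the standby node showed that it kept trying to connect over SSH to the node that had hosted the active NameNode. Since that node was down, the SSH connection could never succeed, so the same error repeated in a loop. Isn't this exactly the scenario Hadoop HA is designed for?
The relevant part of the log is as follows: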
2018-12-03 19:11:47,484 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
2018-12-03 19:11:47,484 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2018-12-03 19:11:47,484 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to shell04...
2018-12-03 19:11:47,484 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to shell04 port 22
2018-12-03 19:11:50,488 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to connect to shell04 as user root
com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route to host
    at com.jcraft.jsch.Util.createSocket(Util.java:394)
    at com.jcraft.jsch.Session.connect(Session.java:215)
    at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
    at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
    at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:532)
    at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
    at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
    at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
    at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:921)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:820)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.NoRouteToHostException: No route to host
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at java.net.Socket.connect(Socket.java:528)
    at java.net.Socket.<init>(Socket.java:425)
    at java.net.Socket.<init>(Socket.java:208)
    at com.jcraft.jsch.Util$1.run(Util.java:362)
    at java.lang.Thread.run(Thread.java:745)
2018-12-03 19:11:50,490 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2018-12-03 19:11:50,490 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
2018-12-03 19:11:50,490 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
java.lang.RuntimeException: Unable to fence NameNode at shell04/192.168.254.143:9000
    at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
    at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
    at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
    at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
    at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:921)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:820)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
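The Solution
Since the failure comes from sshfence, I wondered what would happen if fencing were also attempted with the shell method. So I added a shell(/bin/true) entry to dfs.ha.fencing.methods, alongside sshfence (at first I did not know the two could coexist).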
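For reference, a minimal sketch of what the resulting hdfs-site.xml entry might look like. It assumes sshfence is used without arguments (as the SshFenceByTcpPort(null) line in the log suggests) and that the rest of the HA configuration, such as the SSH private key setting, is already in place. The fencing methods are listed one per line, in the order they should be tried:

```xml
<!-- Sketch only: the property name and both fencing methods are standard Hadoop,
     but the exact value used on this cluster is an assumption. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
```

The methods are tried in order and fencing stops at the first one that succeeds. /bin/true always exits with status 0, so the shell method always reports success when it is reached.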
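The reason for using /bin/true is that the shell method does not actually need to kill the NameNode here. If the active node is reachable, sshfence has already fenced it. If the active node is unreachable, the shell method runs instead, and its main purpose is to let the fencing step complete so that the failover can proceed. In testing, after running init 0 on the active node, the zkfc log on the standby NameNode showed that sshfence was still attempted first, followed immediately by the shell method, and the problem was solved.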