1. Checking whether the NameNodes are alive from the command line

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
hdfs haadmin -getAllServiceState

2. Tracing the call flow of -getAllServiceState

org.apache.hadoop.ha.HAAdmin#runCmd

if ("-transitionToActive".equals(cmd)) {
  return transitionToActive(cmdLine);
} else if ("-transitionToStandby".equals(cmd)) {
  return transitionToStandby(cmdLine);
} else if ("-getServiceState".equals(cmd)) {
  return getServiceState(cmdLine);
  // TODO the `hdfs haadmin -getAllServiceState` command ends up in the branch below
} else if ("-getAllServiceState".equals(cmd)) {
  return getAllServiceState();
} else if ("-checkHealth".equals(cmd)) {
  return checkHealth(cmdLine);
} else if ("-help".equals(cmd)) {
  return help(argv);
} else {
  // we already checked command validity above, so getting here
  // would be a coding error
  throw new AssertionError("Should not get here, command: " + cmd);
}

org.apache.hadoop.ha.HAAdmin#getAllServiceState

protected int getAllServiceState() {
    // TODO at this point the IDs that get fetched are nn1 and nn2
    /**
     * <property>
     *       <name>dfs.namenode.rpc-address.nnCluster.nn1</name>
     *       <value>node1:8020</value>
     * </property>
     *
     * <property>
     *       <name>dfs.namenode.rpc-address.nnCluster.nn2</name>
     *       <value>node2:8020</value>
     * </property>
     */
    Collection<String> targetIds = getTargetIds(null);
    if (targetIds.isEmpty()) {
      errOut.println("Failed to get service IDs");
      return -1;
    }

    // TODO so targetIds here should be nn1 and nn2
    for (String targetId : targetIds) {
      // TODO the nameserviceId is resolved here, i.e. the nnCluster above, which is also configured
      /**
       * <property>
       *       <name>dfs.nameservices</name>
       *       <value>nnCluster</value>
       * </property>
       */
      // TODO this yields the NameNode addresses on node1 and node2
      HAServiceTarget target = resolveTarget(targetId);
      String address = target.getAddress().getHostName() + ":"
          + target.getAddress().getPort();
      try {
        // TODO the NN state is fetched over RPC
        HAServiceProtocol proto = target.getProxy(getConf(),
            rpcTimeoutForChecks);

        // TODO proto.getServiceStatus() is the call that goes over RPC
        out.println(String.format("%-50s %-10s", address, proto
            .getServiceStatus().getState()));
      } catch (IOException e) {
        out.println(String.format("%-50s %-10s", address,
            "Failed to connect: " + e.getMessage()));
      }
    }
    return 0;
  }
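
To make the configuration lookup described in the comments above concrete, here is a minimal sketch in plain Java. It does not use Hadoop's own classes: the `Map` stands in for `hdfs-site.xml`, the class and method names are hypothetical, and it assumes a single nameservice. The keys `dfs.nameservices` and `dfs.namenode.rpc-address.<nameservice>.<nnId>` come from the excerpt; `dfs.ha.namenodes.<nameservice>` is the standard HA key that lists the NameNode IDs, and the value `nn1,nn2` is assumed here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HaTargetResolutionSketch {

    /** Resolves every NameNode ID of the (single) nameservice to its RPC address. */
    public static Map<String, String> resolveTargets(Map<String, String> conf) {
        String nameservice = conf.get("dfs.nameservices");
        String[] nnIds = conf.get("dfs.ha.namenodes." + nameservice).split(",");
        Map<String, String> targets = new LinkedHashMap<>();
        for (String nnId : nnIds) {
            String id = nnId.trim();
            targets.put(id, conf.get("dfs.namenode.rpc-address." + nameservice + "." + id));
        }
        return targets;
    }

    /** Formats one report line with the same pattern getAllServiceState() uses. */
    public static String formatLine(String address, String state) {
        return String.format("%-50s %-10s", address, state);
    }

    public static void main(String[] args) {
        // Hypothetical in-memory copy of the relevant hdfs-site.xml entries.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("dfs.nameservices", "nnCluster");
        conf.put("dfs.ha.namenodes.nnCluster", "nn1,nn2");
        conf.put("dfs.namenode.rpc-address.nnCluster.nn1", "node1:8020");
        conf.put("dfs.namenode.rpc-address.nnCluster.nn2", "node2:8020");

        // The real command asks each NameNode for its state over RPC;
        // this sketch only shows how the addresses are derived.
        for (String address : resolveTargets(conf).values()) {
            System.out.println(formatLine(address, "(state via RPC)"));
        }
    }
}
```

The point of the sketch is only the key layout: nameservice first, then the NameNode IDs, then one RPC address per ID.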

3. NameNode HA architecture diagram

[Image: NameNode HA architecture diagram.png]

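1. The NameNodes compete to register in ZooKeeper by creating an ephemeral znode, ActiveStandbyElectorLock, and writing the NN's host, port, nameserviceId, namenodeId and related information into it. Whichever NN creates it successfully becomes Active.
2. After registering successfully, that NN also creates a persistent znode, ActiveBreadCrumb. It is used during a switchover: if the breadcrumb does not belong to the NameNode that is becoming Active, the NameNode it names has to be fenced. Through the watcher mechanism attached to the create, the FailoverController sends commands to each NN so that each one settles into its own state and role.
3. The health monitor periodically connects to the NN and checks its state, which may trigger a re-election, i.e. steps 1-2 are repeated.
4. The FailoverController keeps a heartbeat with ZooKeeper; when the registered ephemeral znode disappears, a re-election is triggered as well.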
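
The four steps above can be replayed on a toy model. The sketch below is not the real `ActiveStandbyElector`: a `HashMap` stands in for ZooKeeper, and the class and method names are hypothetical. It only illustrates the interplay the list describes, an ephemeral lock that vanishes with the session and a persistent breadcrumb that tells the next winner whom to fence.

```java
import java.util.HashMap;
import java.util.Map;

class ElectionSketch {
    static final String LOCK = "ActiveStandbyElectorLock"; // ephemeral znode
    static final String BREADCRUMB = "ActiveBreadCrumb";   // persistent znode

    /** Toy stand-in for ZooKeeper: znode name -> NameNode ID stored in it. */
    private final Map<String, String> znodes = new HashMap<>();
    private String fenced; // last NameNode that had to be fenced, or null

    /** Tries to create the lock; the first caller wins and becomes Active. */
    public boolean tryBecomeActive(String nnId) {
        if (znodes.putIfAbsent(LOCK, nnId) != null) {
            return false; // lock already held: stay Standby and watch the lock
        }
        String previous = znodes.get(BREADCRUMB);
        if (previous != null && !previous.equals(nnId)) {
            fenced = previous; // the old Active left its breadcrumb behind: fence it
        }
        znodes.put(BREADCRUMB, nnId);
        return true;
    }

    /** A lost session removes only the ephemeral lock; the breadcrumb survives. */
    public void sessionExpired(String nnId) {
        znodes.remove(LOCK, nnId);
    }

    public String active() {
        return znodes.get(LOCK);
    }

    public String lastFenced() {
        return fenced;
    }

    public static void main(String[] args) {
        ElectionSketch zk = new ElectionSketch();
        System.out.println("nn1 wins: " + zk.tryBecomeActive("nn1"));
        System.out.println("nn2 wins: " + zk.tryBecomeActive("nn2"));
        zk.sessionExpired("nn1"); // e.g. the ZKFC on nn1 loses its heartbeat
        System.out.println("nn2 wins after expiry: " + zk.tryBecomeActive("nn2"));
        System.out.println("fenced: " + zk.lastFenced());
    }
}
```

Because the breadcrumb is persistent, a winner that finds someone else's ID in it knows the previous Active went away without cleaning up.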

4. NameNode HA source-code architecture diagram

[Image: 4,NameNode HA.png]

Core code:
org.apache.hadoop.ha.ZKFailoverController#doGracefulFailover

....
// TODO get the current Active NameNode
HAServiceTarget oldActive = getCurrentActive();
if (oldActive == null) {
  // No node is currently active. So, if we aren't already
  // active ourselves by means of a normal election, then there's
  // probably something preventing us from becoming active.
  throw new ServiceFailedException(
      "No other node is currently active.");
}

// TODO if the current Active NameNode is already the local node that is supposed to become Active, there is nothing to do
if (oldActive.getAddress().equals(localTarget.getAddress())) {
  LOG.info("Local node " + localTarget + " is already active. " +
      "No need to failover. Returning success.");
  return;
}

// Phase 2b: get the other nodes
// TODO get the NameNodes on the other nodes
List<HAServiceTarget> otherNodes = getAllOtherNodes();
List<ZKFCProtocol> otherZkfcs = new ArrayList<ZKFCProtocol>(otherNodes.size());

// Phase 3: ask the other nodes to yield from the election.
long st = System.nanoTime();
HAServiceTarget activeNode = null;
for (HAServiceTarget remote : otherNodes) {
  // same location, same node - may not always be == equality
  if (remote.getAddress().equals(oldActive.getAddress())) {
    activeNode = remote;
    continue;
  }
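  // TODO collect the other ZKFCs: each one is asked to give up its claim to the Active role and sit out the re-election, so that the local standby can become active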
  otherZkfcs.add(cedeRemoteActive(remote, timeout));
}

assert
  activeNode != null : "Active node does not match any known remote node";

// Phase 3b: ask the old active to yield
// TODO this asks the current active to become standby
otherZkfcs.add(cedeRemoteActive(activeNode, timeout));

// Phase 4: wait for the normal election to make the local node
// active.
// TODO wait until the local node has been elected Active
ActiveAttemptRecord attempt = waitForActiveAttempt(timeout + 60000, st);
....
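
The phase order in the excerpt above can be sketched on a toy model. This is not Hadoop code: the class, its fields and the `cede` helper are hypothetical, and the election is reduced to "the only remaining candidate wins". In the real ZKFC a ceded node rejoins the election after a timeout, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class GracefulFailoverSketch {
    private final Set<String> candidates = new LinkedHashSet<>(); // nodes taking part in the election
    private String active;                                        // current lock holder, or null
    public final List<String> log = new ArrayList<>();            // order in which things happened

    public GracefulFailoverSketch(List<String> nodes, String initialActive) {
        candidates.addAll(nodes);
        active = initialActive;
    }

    /** The remote node leaves the election; an active node also gives up the lock. */
    private void cede(String node) {
        candidates.remove(node);
        if (node.equals(active)) {
            active = null;
        }
        log.add("cede " + node);
    }

    /** Follows the phase order of doGracefulFailover as seen from `local`. */
    public void failoverTo(String local) {
        if (active == null) {
            throw new IllegalStateException("No other node is currently active.");
        }
        if (active.equals(local)) {
            log.add("already active");
            return;
        }
        String oldActive = active;
        // Phase 3: the other standbys yield first ...
        for (String node : new ArrayList<>(candidates)) {
            if (!node.equals(local) && !node.equals(oldActive)) {
                cede(node);
            }
        }
        // Phase 3b: ... and the old active yields last.
        cede(oldActive);
        // Phase 4: `local` is the only candidate left, so the election picks it.
        active = candidates.iterator().next();
        log.add("elected " + active);
    }

    public String active() {
        return active;
    }

    public static void main(String[] args) {
        GracefulFailoverSketch s =
            new GracefulFailoverSketch(List.of("nn1", "nn2", "nn3"), "nn1");
        s.failoverTo("nn2");
        System.out.println(s.log);
    }
}
```

Ceding the old active last keeps a serving NameNode in place until every other competitor has stepped aside.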

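The core logic in plain words:
If the NN that this ZKFC manages is already Active, nothing is done. If it is Standby, the following happens:
1. First, every other Standby ZKFC is asked to yield: it drops its ZooKeeper connection and makes its NN Standby (which it already is), then takes part in the election again later. In practice that just means watching the lock held by the current Active node.
2. Then the Active ZKFC is asked to yield: it drops its ZooKeeper connection and turns its NN from Active into Standby, deleting ActiveBreadCrumb along the way, and then takes part in the election again later.
3. At that moment the original Standby acquires the lock and becomes the Active node through becomeActive, while the original Active node goes through becomeStandby.
4. While becoming Active, fenceOldActive runs, driven by the following configuration: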

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>

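That configuration is what is supposed to fence the old NameNode. But shell(/bin/true) does not actually do anything, so how is the isolation achieved?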

This is where the JournalNode comes in.

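In brief:
After initialization, the Active NN writes its editlog to the 2N+1 JournalNodes (JNs). Every editlog entry carries a sequence number, and a write counts as successful as soon as a majority of the JNs (at least N+1) report success.
The Standby periodically reads a batch of editlog entries from the JNs and applies them to the FsImage it holds in memory. It then sends its FSImage to the Active NameNode over HTTP, where it replaces the Active NameNode's FSImage.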

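How fencing works: every time a NameNode writes to the editlog, it passes an epoch number to the JNs. Each JN compares it with the epoch it has stored. If the incoming epoch is greater than or equal to the stored one, the write is allowed and the JN updates its stored epoch to the newest value; otherwise the request is rejected. During a switchover, the Standby increments the epoch by 1 as it becomes Active, so even if the previous NameNode still tries to write to the JNs, those writes fail.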
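
The epoch rule and the majority rule described above can be sketched together. This is a simplified model, not the real JournalNode protocol: the class names, the `journal` and `quorumWrite` methods and the string-based edits are all hypothetical, and details such as recovery and transaction IDs are left out.

```java
import java.util.ArrayList;
import java.util.List;

class EpochFencingSketch {

    /** Toy JournalNode: remembers the highest epoch it has seen and the edits it accepted. */
    public static class JournalNode {
        private long storedEpoch = 0;
        private final List<String> edits = new ArrayList<>();

        /** Accepts the write only if the writer's epoch is not older than the stored one. */
        public boolean journal(long writerEpoch, String edit) {
            if (writerEpoch < storedEpoch) {
                return false; // stale writer: rejected, this is the fencing
            }
            storedEpoch = writerEpoch; // keep the newest epoch
            edits.add(edit);
            return true;
        }

        public List<String> edits() {
            return edits;
        }
    }

    /** A write succeeds when a majority (N+1 out of 2N+1) of the JNs accept it. */
    public static boolean quorumWrite(List<JournalNode> jns, long epoch, String edit) {
        int acks = 0;
        for (JournalNode jn : jns) {
            if (jn.journal(epoch, edit)) {
                acks++;
            }
        }
        return acks >= jns.size() / 2 + 1;
    }

    public static void main(String[] args) {
        List<JournalNode> jns = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            jns.add(new JournalNode());
        }
        System.out.println("old active, epoch 1: " + quorumWrite(jns, 1, "mkdir /a"));
        System.out.println("new active, epoch 2: " + quorumWrite(jns, 2, "mkdir /b"));
        System.out.println("old active again, epoch 1: " + quorumWrite(jns, 1, "mkdir /c"));
    }
}
```

Once the new Active has written with the higher epoch, the old Active can no longer reach a majority, which is why a no-op fencing method such as shell(/bin/true) does not leave the shared editlog unprotected.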


journey