vivo Internet Big Data Team - Lv Jia

The first stable version of Hadoop 3.x was released at the end of 2017 with many major improvements.

On the HDFS side, it adds new features such as Erasure Coding, support for more than two NameNodes, Router-Based Federation, Standby NameNode Read, FairCallQueue, and the intra-DataNode balancer. These features bring many benefits in stability, performance, cost, and more, so we planned to upgrade our HDFS clusters to HDFS 3.x.

This article describes how we rolling-upgraded from CDH 5.14.4 (HDFS 2.6.0) to HDP-3.1.4.0-315 (HDFS 3.1.1), one of the few industry cases of rolling from a CDH cluster to an HDP cluster. What problems did we encounter during the upgrade, and how did we solve them? We hope our experience makes a useful reference.

1. Background

vivo's offline data warehouse Hadoop clusters are built on CDH 5.14.4, whose Hadoop version is 2.6.0+CDH 5.14.4+2785: a distribution in which Cloudera applied some optimization patches on top of Apache Hadoop 2.6.0.

In recent years, with the growth of vivo's business, data has exploded, and the offline data warehouse HDFS clusters have expanded from one to ten, totaling nearly 10,000 nodes. As the HDFS clusters grow, some pain points of our current HDFS version have been exposed:

  • On our current low version of HDFS, the NameNode often hits RPC performance problems in production, and slow NameNode RPC responses delay users' Hive/Spark offline tasks.
  • Some of these RPC performance issues are fixed in HDFS 3.x. At present, online NameNode RPC problems can only be resolved by backporting patches from higher HDFS versions.
  • Frequent patch backports increase the complexity of maintaining the HDFS code, and each patch requires restarting NameNodes or DataNodes to take effect, which raises the operation and maintenance cost of the HDFS clusters.
  • The online HDFS clusters serve the business through viewfs. The company has many business lines, and many departments have applied for independent HDFS clients to access the offline data warehouse clusters; after any online HDFS configuration change, updating all these HDFS client configurations is very time-consuming and troublesome.
  • HDFS 2.x does not support EC, so cold data cannot use EC to reduce storage costs.

The first stable version of Hadoop 3.x was released at the end of 2017 with many major improvements. On the HDFS side, it added new features such as Erasure Coding, support for more than two NameNodes, Router-Based Federation, Standby NameNode Read, FairCallQueue, and the intra-DataNode balancer. These HDFS 3.x features bring many benefits in stability, performance, and cost:

  • Standby NameNode Read, FairCallQueue, and the HDFS 3.x NameNode RPC optimization patches can greatly improve the stability and RPC performance of our current HDFS clusters.
  • HDFS RBF replaces viewfs, simplifying the HDFS client configuration update process and removing the pain of updating many online HDFS client configurations.
  • HDFS EC can be applied to cold data to reduce storage costs.

Based on the above pain points and benefits, we decided to upgrade the offline data warehouse HDFS cluster to HDFS 3.x version.

2. HDFS upgrade version selection

Since our Hadoop clusters are built on CDH 5.14.4, we first considered upgrading to a higher CDH version. CDH 7 provides an HDFS 3.x distribution, but unfortunately it has no free version, so we could only choose between the Apache version and the HDP distribution provided by Hortonworks.

Since Apache Hadoop does not come with a management tool, managing and distributing configurations across a 10,000-node HDFS cluster would be extremely inconvenient. We therefore chose the Hortonworks HDP distribution, with Ambari as the HDFS management tool.

The latest stable free Hadoop distribution provided by Hortonworks is HDP-3.1.4.0-315, whose Hadoop version is Apache Hadoop 3.1.1.

3. Developing the HDFS upgrade plan

3.1 Upgrade plan

HDFS officially provides two upgrade schemes: Express and RollingUpgrade.

  • Express: stop the existing HDFS service, then start the service with the new HDFS version, interrupting normal online business.
  • RollingUpgrade: a rolling upgrade without stopping the service, imperceptible to users.

Given the great impact that suspending the HDFS service would have on the business, we chose the RollingUpgrade plan.

3.2 Downgrade plan

In the RollingUpgrade scheme, there are two rollback methods: Rollback and RollingDowngrade.

  • Rollback restores both the HDFS version and the data state to the moment before the upgrade, losing any data written after the upgrade started.
  • RollingDowngrade only rolls back the HDFS version; the data is unaffected.

Our online HDFS clusters cannot tolerate data loss, so we chose the RollingDowngrade rollback solution.

3.3 HDFS client upgrade plan

Online computing components such as Spark, Hive, and Flink, as well as the OLAP engines, rely heavily on the HDFS Client. Some of these components run versions too old to support an HDFS 3.x client and would have to be upgraded first, so upgrading the HDFS Client carries high risk.

After several rounds of testing in the test environment, we verified that HDFS 2.x clients can read and write HDFS 3.x normally.

Therefore, in this HDFS upgrade we only upgrade the server-side components (NameNode, JournalNode, and DataNode) and keep the HDFS 2.x Client, deferring its upgrade until components such as YARN are upgraded.

3.4 HDFS rolling upgrade steps

The RollingUpgrade procedure is described in the official Hadoop upgrade documentation. The general steps are as follows (a sketch of the corresponding admin commands follows the list):

  1. Upgrade the JournalNodes: restart each JournalNode in turn with the new version.
  2. Prepare the NameNode upgrade: generate the rollback fsimage file.
  3. Restart the Standby NameNode with the new version of Hadoop, and restart its ZKFC.
  4. Perform a NameNode HA failover so that the upgraded NameNode becomes the Active node.
  5. Restart the other NameNode with the new version of Hadoop, and restart its ZKFC.
  6. Upgrade the DataNodes: rolling-restart all DataNodes with the new version of Hadoop.
  7. Run Finalize to confirm that the HDFS cluster upgrade to the new version is complete.
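
For reference, here is a minimal sketch of the admin commands behind these steps, based on the official rolling-upgrade documentation (host and port placeholders are illustrative):

 # Step 2: prepare the rolling upgrade and generate the rollback fsimage
 hdfs dfsadmin -rollingUpgrade prepare
 # Poll until the rollback image is reported as ready
 hdfs dfsadmin -rollingUpgrade query
 # Steps 3 and 5: start an upgraded NameNode in rolling-upgrade mode
 hdfs namenode -rollingUpgrade started
 # Step 6: gracefully shut down a DataNode before restarting it with the new version
 hdfs dfsadmin -shutdownDatanode <DATANODE_HOST:IPC_PORT> upgrade
 hdfs dfsadmin -getDatanodeInfo <DATANODE_HOST:IPC_PORT>
 # Step 7: once everything is verified, finalize the upgrade
 hdfs dfsadmin -rollingUpgrade finalize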

4. Coexistence of the management tools

In the HDFS 2.x cluster, HDFS, YARN, Hive, HBase, and the other components are managed by CM. Since only HDFS is upgraded, HDFS 3.x is managed by Ambari, while components such as YARN and Hive remain under CM. The HDFS 2.x client is not upgraded and continues to be managed by CM. ZooKeeper continues to use the ZK instances originally deployed by CM.

Specific implementation: the CM Server node also deploys the Ambari Server, and each CM Agent node also deploys an Ambari Agent.

[Figure: Ambari deployed alongside CM on the same master/slave nodes]

As shown in the figure above, we use Ambari to deploy the HDFS 3.x NameNode/DataNode components on the master/slave nodes. Because of port conflicts, the HDFS 3.x components deployed by Ambari cannot start, so they have no impact on the online HDFS 2.x cluster deployed by CM.

After the HDFS upgrade begins, the master nodes stop the CM-managed JN/ZKFC/NN and start the Ambari-managed JN/ZKFC/NN, and the slave nodes stop the CM DN and start the Ambari DN. The management tool is thus switched from CM to Ambari at the same time as HDFS is upgraded.

5. Problems encountered during HDFS rolling upgrade and downgrade

5.1 Incompatibilities fixed by the HDFS community

The HDFS community has fixed critical incompatibilities affecting rolling upgrades and downgrades. The relevant issues are HDFS-13596, HDFS-14396, and HDFS-14831.

[HDFS-13596]: After the Active NameNode is upgraded, EC-related data structures are written to the EditLog, causing the Standby NameNode to fail while reading the EditLog and shut down directly.

[HDFS-14396]: After the NameNode is upgraded to HDFS 3.x, EC-related data structures are written into the fsimage file, which a NameNode downgraded to HDFS 2.x cannot recognize.

[HDFS-14831]: Fixes the fsimage incompatibility after an HDFS downgrade caused by the StringTable modification introduced by the NameNode upgrade.

The HDP HDFS version we upgraded to already includes the fixes for these three issues. Beyond them, we encountered the following additional incompatibilities during the upgrade:

5.2 Unknown protocol during the JournalNode upgrade

During the JournalNode upgrade, the following error occurred:

Unknown protocol: org.apache.hadoop.hdfs.qjournal.protocol.InterQJournalProtocol
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcNoSuchProtocolException): Unknown protocol: org.apache.hadoop.hdfs.qjournal.protocol.InterQJournalProtocol
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.getProtocolImpl(ProtobufRpcEngine.java:557)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:596)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1444)
        at org.apache.hadoop.ipc.Client.call(Client.java:1354)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy14.getEditLogManifestFromJournal(Unknown Source)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.InterQJournalProtocolTranslatorPB.getEditLogManifestFromJournal(InterQJournalProtocolTranslatorPB.java:75)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer.syncWithJournalAtIndex(JournalNodeSyncer.java:250)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer.syncJournals(JournalNodeSyncer.java:226)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer.lambda$startSyncJournalsDaemon$0(JournalNodeSyncer.java:186)
        at java.lang.Thread.run(Thread.java:748)

Cause: HDFS 3.x adds the InterQJournalProtocol, which is used to synchronize old edit log segments between JournalNodes.

HDFS-14942 addresses this issue by lowering the log level from ERROR to DEBUG. The problem does not affect the upgrade: once all three HDFS 2.x JNs have been upgraded to HDFS 3.x, data synchronizes normally between the JNs.
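
If the ERROR noise is a concern while JournalNodes of mixed versions coexist, one possible workaround (an assumption on our part; we did not end up needing it) is to temporarily disable the inter-JN syncer that issues these InterQJournalProtocol calls, and re-enable it after all JNs run HDFS 3.x:

 <!-- hdfs-site.xml on the JournalNodes; dfs.journalnode.enable.sync
      defaults to true in Hadoop 3.x. Remove the override (or set it back
      to true) once every JN has been upgraded. -->
 <property>
   <name>dfs.journalnode.enable.sync</name>
   <value>false</value>
 </property>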

5.3 NameNode upgrade: DatanodeProtocol.proto is incompatible

After the NameNode is upgraded, DatanodeProtocol.proto is incompatible, causing DataNode block reports to fail.

(1) HDFS 2.6.0 version

DatanodeProtocol.proto
 message HeartbeatResponseProto {
  repeated DatanodeCommandProto cmds = 1; // Returned commands can be null
  required NNHAStatusHeartbeatProto haStatus = 2;
  optional RollingUpgradeStatusProto rollingUpgradeStatus = 3;
  optional uint64 fullBlockReportLeaseId = 4 [ default = 0 ];
  optional RollingUpgradeStatusProto rollingUpgradeStatusV2 = 5;
}

(2) HDFS 3.1.1 version

DatanodeProtocol.proto
 message HeartbeatResponseProto {
  repeated DatanodeCommandProto cmds = 1; // Returned commands can be null
  required NNHAStatusHeartbeatProto haStatus = 2;
  optional RollingUpgradeStatusProto rollingUpgradeStatus = 3;
  optional RollingUpgradeStatusProto rollingUpgradeStatusV2 = 4;
  optional uint64 fullBlockReportLeaseId = 5 [ default = 0 ];
}

We can see that the 4th and 5th fields of HeartbeatResponseProto have swapped positions between the two versions.

The cause is that Hadoop 3.1.1 includes HDFS-9788, which addresses compatibility with lower versions during HDFS upgrades, while HDFS 2.6.0 does not include it, making the two DatanodeProtocol.proto definitions incompatible.

During the HDFS upgrade, we do not need to be compatible with lower-version HDFS servers; we only need to be compatible with lower-version HDFS clients.

Therefore, our HDFS 3.x build does not need HDFS-9788. We reverted the HDFS-9788 changes in Hadoop 3.1.1 to keep DatanodeProtocol.proto compatible with HDFS 2.6.0.

5.4 NameNode upgrade: layoutVersion is incompatible

After the NameNode is upgraded, the NameNode layoutVersion changes, making the EditLog incompatible; after downgrading from HDFS 3.x, the HDFS 2.x NameNode cannot start.

 2021-04-12 20:15:39,571 ERROR org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: caught exception initializing XXX:8480/getJournal
id=test-53-39&segmentTxId=371054&storageInfo=-60%3A1589021536%3A0%3Acluster7
org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$LogHeaderCorruptException: Unexpected version of the file system log file: -64. Current version = -60.
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.readLogVersion(EditLogFileInputStream.java:397)
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.init(EditLogFileInputStream.java:146)
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOpImpl(EditLogFileInputStream.java:192)
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOp(EditLogFileInputStream.java:250)
        at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
        at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
        at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
        at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
        at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
        at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
        at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:188)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:141)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:903)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:756)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:324)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1150)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:797)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:614)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:676)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)

After upgrading from HDFS 2.6.0 to HDFS 3.1.1, the NameNode layoutVersion changed from -60 to -64. To solve this problem, we first need to understand under what circumstances the NameNode layoutVersion changes.

New features introduced by HDFS version upgrades change the NameNode layoutVersion. The official Hadoop upgrade documentation points out that new features should be disabled during an HDFS rolling upgrade so that the layoutVersion remains unchanged, allowing the upgraded HDFS 3.x cluster to be rolled back to HDFS 2.x.

Next: which new feature introduced between HDFS 2.6.0 and HDFS 3.1.1 caused the NameNode layoutVersion to change? Checking the related issues HDFS-5223, HDFS-8432, and HDFS-3107 shows that HDFS 2.7.0 introduced the truncate feature, changing the NameNode layoutVersion to -61. Now look at the HDFS 3.x NameNodeLayoutVersion code:

NameNodeLayoutVersion
 public enum Feature implements LayoutFeature {
  ROLLING_UPGRADE(-55, -53, -55, "Support rolling upgrade", false),
  EDITLOG_LENGTH(-56, -56, "Add length field to every edit log op"),
  XATTRS(-57, -57, "Extended attributes"),
  CREATE_OVERWRITE(-58, -58, "Use single editlog record for " +
    "creating file with overwrite"),
  XATTRS_NAMESPACE_EXT(-59, -59, "Increase number of xattr namespaces"),
  BLOCK_STORAGE_POLICY(-60, -60, "Block Storage policy"),
  TRUNCATE(-61, -61, "Truncate"),
  APPEND_NEW_BLOCK(-62, -61, "Support appending to new block"),
  QUOTA_BY_STORAGE_TYPE(-63, -61, "Support quota for specific storage types"),
  ERASURE_CODING(-64, -61, "Support erasure coding");

The TRUNCATE, APPEND_NEW_BLOCK, QUOTA_BY_STORAGE_TYPE, and ERASURE_CODING features all set minCompatLV to -61.

Next, look at the logic that determines the effective NameNode layoutVersion:

FSNamesystem
 static int getEffectiveLayoutVersion(boolean isRollingUpgrade, int storageLV,
    int minCompatLV, int currentLV) {
  if (isRollingUpgrade) {
    if (storageLV <= minCompatLV) {
      // The prior layout version satisfies the minimum compatible layout
      // version of the current software.  Keep reporting the prior layout
      // as the effective one.  Downgrade is possible.
      return storageLV;
    }
  }
  // The current software cannot satisfy the layout version of the prior
  // software.  Proceed with using the current layout version.
  return currentLV;
}

getEffectiveLayoutVersion returns the effective layoutVersion: storageLV is the current HDFS 2.6.0 layoutVersion (-60), minCompatLV is -61, and currentLV is the upgraded HDFS 3.1.1 layoutVersion (-64).

From the judgment logic, the condition storageLV <= minCompatLV (that is, -60 <= -61) does not hold. Therefore, after upgrading to HDFS 3.1.1, the NameNode layoutVersion takes the value of currentLV, -64.
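
A minimal worked example of this decision (a standalone re-implementation for illustration, not the Hadoop class itself):

 public class LayoutVersionDemo {
   // Same decision logic as FSNamesystem#getEffectiveLayoutVersion above.
   static int effectiveLV(boolean isRollingUpgrade, int storageLV,
       int minCompatLV, int currentLV) {
     if (isRollingUpgrade && storageLV <= minCompatLV) {
       return storageLV; // prior layout kept; downgrade remains possible
     }
     return currentLV;   // new layout takes effect; downgrade is blocked
   }

   public static void main(String[] args) {
     // Stock HDFS 3.1.1: minCompatLV = -61, so -60 <= -61 is false -> -64.
     System.out.println(effectiveLV(true, -60, -61, -64)); // prints -64
     // With minCompatLV patched to -60 (see below): -60 <= -60 -> -60.
     System.out.println(effectiveLV(true, -60, -60, -64)); // prints -60
   }
 }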

This analysis shows that after truncate was introduced in HDFS 2.7.0, the community only supports keeping the NameNode layoutVersion compatible for downgrades from HDFS 3.x to HDFS 2.7 or later.

We evaluated the HDFS truncate function against our business scenarios and found that vivo's internal offline analysis does not use it. Based on this, we modified minCompatLV in HDFS 3.1.1 to -60 so that a cluster upgraded from HDFS 2.6.0 to HDFS 3.1.1 can be downgraded back to HDFS 2.6.0.

minCompatLV is modified to -60:

NameNodeLayoutVersion
 public enum Feature implements LayoutFeature {
  ROLLING_UPGRADE(-55, -53, -55, "Support rolling upgrade", false),
  EDITLOG_LENGTH(-56, -56, "Add length field to every edit log op"),
  XATTRS(-57, -57, "Extended attributes"),
  CREATE_OVERWRITE(-58, -58, "Use single editlog record for " +
    "creating file with overwrite"),
  XATTRS_NAMESPACE_EXT(-59, -59, "Increase number of xattr namespaces"),
  BLOCK_STORAGE_POLICY(-60, -60, "Block Storage policy"),
  TRUNCATE(-61, -60, "Truncate"),
  APPEND_NEW_BLOCK(-62, -60, "Support appending to new block"),
  QUOTA_BY_STORAGE_TYPE(-63, -60, "Support quota for specific storage types"),
  ERASURE_CODING(-64, -60, "Support erasure coding");

5.5 DataNode upgrade: layoutVersion is incompatible

After the DataNode is upgraded, the DataNode layoutVersion is incompatible, and an HDFS 3.x DataNode downgraded to HDFS 2.x cannot start.

 2021-04-19 10:41:01,144 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage directory [DISK]file:/data/dfs/dn/
org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /data/dfs/dn. Reported: -57. Expecting = -56.
        at org.apache.hadoop.hdfs.server.common.StorageInfo.setLayoutVersion(StorageInfo.java:178)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.setFieldsFromProperties(DataStorage.java:665)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.setFieldsFromProperties(DataStorage.java:657)
        at org.apache.hadoop.hdfs.server.common.StorageInfo.readProperties(StorageInfo.java:232)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:759)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadStorageDirectory(DataStorage.java:302)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadDataStorage(DataStorage.java:418)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:397)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:575)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1560)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1520)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:341)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:219)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673)
        at java.lang.Thread.run(Thread.java:748)

HDFS 2.6.0 DataNode layoutVersion is -56, HDFS 3.1.1 DataNode layoutVersion is -57.

The reason the DataNode layoutVersion changed: starting with HDFS 2.8.0, which committed HDFS-8791, the Hadoop community upgraded the DataNode layout, changing the DataNode Block Pool block directory structure from 256 x 256 directories to 32 x 32 directories. The purpose is to reduce the DataNode directory hierarchy and thereby mitigate the performance problems caused by du operations.
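
To illustrate the two layouts, here is a simplified sketch based on our reading of DatanodeUtil#idToBlockDir: the finalized-block subdirectory is derived from bits of the block ID, with the -56 layout masking 8 bits per level (256 x 256) and the -57 layout masking 5 bits (32 x 32). The "subdir" prefix matches the on-disk naming; the root path and block ID below are illustrative.

 import java.io.File;

 public class BlockDirDemo {
   // 256 x 256 layout (layoutVersion -56, before HDFS-8791): 8-bit masks.
   static File idToBlockDir256(File finalizedRoot, long blockId) {
     int d1 = (int) ((blockId >> 16) & 0xFF);
     int d2 = (int) ((blockId >> 8) & 0xFF);
     return new File(finalizedRoot, "subdir" + d1 + "/subdir" + d2);
   }

   // 32 x 32 layout (layoutVersion -57, after HDFS-8791): 5-bit masks.
   static File idToBlockDir32(File finalizedRoot, long blockId) {
     int d1 = (int) ((blockId >> 16) & 0x1F);
     int d2 = (int) ((blockId >> 8) & 0x1F);
     return new File(finalizedRoot, "subdir" + d1 + "/subdir" + d2);
   }

   public static void main(String[] args) {
     File root = new File("/data/dfs/dn/current/BP-xxx/current/finalized");
     long blockId = 0x123456L;                          // illustrative block ID
     System.out.println(idToBlockDir256(root, blockId)); // .../subdir18/subdir52
     System.out.println(idToBlockDir32(root, blockId));  // .../subdir18/subdir20
   }
 }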

DataNode Layout upgrade process:

  1. Rename the current directory to previous.tmp.
  2. Create a new current directory, and create hardlinks from previous.tmp into the new current directory.
  3. Rename the previous.tmp directory to previous.

[Figure: DataNode layout upgrade flow chart]

[Figure: storage directory structure during the DataNode layout upgrade]

[Figure: hardlink associations between the previous and current block files]

Looking at the DataNodeLayoutVersion code, the 32 x 32 directory structure is defined as layoutVersion -57; in other words, performing the DataNode layout upgrade entails this layoutVersion change.

DataNodeLayoutVersion
 public enum Feature implements LayoutFeature {
  FIRST_LAYOUT(-55, -53, "First datanode layout", false),
  BLOCKID_BASED_LAYOUT(-56,
      "The block ID of a finalized block uniquely determines its position " +
      "in the directory structure"),
  BLOCKID_BASED_LAYOUT_32_by_32(-57,
      "Identical to the block id based layout (-56) except it uses a smaller"
      + " directory structure (32x32)");

In the test environment, we found the following problem with the DataNode layout upgrade: creating the new current directory and establishing the hardlinks is very time-consuming. A DataNode with one million blocks takes about 5 minutes from the start of the layout upgrade until it can serve reads and writes again. This is unacceptable for our HDFS clusters with nearly 10,000 DataNodes; it would make it very difficult to complete the DataNode upgrade within the planned upgrade windows.

Therefore, we reverted HDFS-8791 in HDFS 3.1.1 so that the DataNode does not perform the layout upgrade. Tests showed that upgrading a DataNode with 1 to 2 million blocks then takes only 90 to 180 seconds, significantly shorter than with the layout upgrade.

With HDFS-8791 reverted, how do we solve the performance problem caused by DataNode du?

Combing through the HDFS 3.3.0 patches, we found HDFS-14313, which computes the space used by a DataNode from in-memory replica information instead of running du, perfectly solving the DataNode du performance problem. We backported HDFS-14313 into our upgraded HDFS 3.1.1, which solved the I/O performance problem caused by du operations after the DataNode upgrade.
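
For reference, with the HDFS-14313 patch applied, the memory-based accounting is enabled through the pluggable fs.getspaceused.classname setting; the class name below is taken from the patch, so verify it against your own backport:

 <!-- hdfs-site.xml on the DataNodes: replace du-based space accounting
      with the in-memory replica accounting introduced by HDFS-14313. -->
 <property>
   <name>fs.getspaceused.classname</name>
   <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed</value>
 </property>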

5.6 Handling the DataNode trash directory

[Figure: blocks moved to the trash directory during the DataNode rolling upgrade]

As shown in the figure above, when a DataNode deletes a block during the rolling upgrade, it does not actually delete the block file; it first moves the file into a trash directory under the BlockPool directory on disk, so that data deleted during the upgrade can be restored with the rollback fsimage if needed. The average disk utilization of our cluster has long been around 80%, which is already tight, and the large number of block files accumulating in trash during the upgrade posed a great threat to cluster stability.

Since the rollback method in our plan is rolling downgrade rather than Rollback, the blocks in trash will never be used. We therefore used a script to periodically delete the block files in trash, greatly reducing the disk pressure on the DataNodes; a sketch of such a cleanup follows.
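
A minimal sketch of such a cleanup, assuming the usual trash location under each block pool while a rolling upgrade is in progress (the data directories and age threshold are illustrative; only do this if you are certain you will never run Rollback):

 #!/bin/bash
 # Periodically purge block files moved to trash during the rolling upgrade.
 # WARNING: this forfeits the ability to Rollback; we rely on downgrade only.
 for disk in /data*/dfs/dn; do
   find "${disk}"/current/BP-*/trash -type f -mmin +60 -delete 2>/dev/null
 done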

5.7 Other issues

These are all the incompatibility problems we encountered during the HDFS upgrade and downgrade. In addition, we introduced some NameNode RPC optimization patches into our upgraded HDP HDFS 3.1.1 version.

The FoldedTreeSet red-black tree data structure used by the HDFS 3.x NameNode causes RPC performance to degrade after the NameNode has been running for a while, leaving a large number of stale DataNodes in the cluster and causing tasks to fail to read blocks. Hadoop 3.4.0's HDFS-13671 fixes this issue by replacing FoldedTreeSet with the original LightWeightResizableGSet linked-list data structure. We also introduced the HDFS-13671 patch into our upgraded HDP HDFS 3.1.1 release.

The effect after applying HDFS-13671: the number of stale DataNodes in the cluster dropped sharply.

[Figure: StaleDataNodes count dropping after the HDFS-13671 patch]

6. Test and launch

In March 2021, we launched the HDFS upgrade project for the offline data warehouse clusters. We built multiple HDFS clusters in the test environment and conducted multiple rounds of HDFS upgrade and downgrade drills in viewfs mode, continuously refining the upgrade plan and resolving the problems encountered along the way.

6.1 Full component HDFS client compatibility test

In this HDFS upgrade, only the server side is upgraded; the HDFS client stays at 2.6.0. We therefore had to ensure that the business can read and write the HDFS 3.1.1 cluster normally through the HDFS 2.6.0 client.

In the test environment, we built an HDFS test cluster resembling the production environment and, together with colleagues from the computing team and the business departments, had Hive, Spark, OLAP (Kylin, Presto, Druid), and the algorithm platforms read and write HDFS 3.1.1 through the HDFS 2.6.0 client, simulating production for a full-service compatibility test. This confirmed that the HDFS 2.6.0 client reads and writes the HDFS 3.1.1 cluster normally, with no compatibility problems.
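
For illustration, the simplest form of such a check with the old client (the nameservice and paths here are made up for the example):

 # Run with the HDFS 2.6.0 client against the upgraded 3.1.1 cluster.
 hadoop fs -put /tmp/compat_test.txt hdfs://cluster7/tmp/compat_test.txt
 hadoop fs -cat hdfs://cluster7/tmp/compat_test.txt
 hadoop fs -rm hdfs://cluster7/tmp/compat_test.txt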

6.2 Scripting upgrade operations

We strictly sorted out the HDFS upgrade and downgrade commands, along with the risks and precautions of each step. HDFS services are started and stopped through the CM and Ambari APIs. We organized all of these operations into Python scripts to reduce the risk of manual operation.

6.3 Upgrade check

We sorted out the key check items in the HDFS upgrade process to ensure that any problem during the upgrade is discovered at the first opportunity, the rollback can be executed quickly, and the impact on the business is minimized.

6.4 The production upgrade

We carried out several HDFS upgrade and downgrade drills in the test environment, completed the HDFS compatibility testing, and wrote several internal WIKI documents to record the process.

After confirming that the HDFS upgrade and downgrade worked without problems in the test environment, we began the production upgrade.

The specific milestones were as follows:

  • March to April 2021: sorted out the new features and relevant patches of HDFS 3.x, read the source code of the HDFS rolling upgrade and downgrade mechanisms, and determined the final target HDFS 3.x version. Completed porting our existing HDFS 2.x optimization patches and higher-version HDFS 3.x patches to the target HDFS 3.x version.
  • May to August 2021: conducted HDFS upgrade and downgrade drills, tested the compatibility of Hive, Spark, and OLAP (Kylin, Presto, Druid), and confirmed that the HDFS upgrade and downgrade plan was correct.
  • September 2021: upgraded the YARN log aggregation HDFS cluster (hundreds of nodes) to HDP HDFS 3.1.1. During this period, an RPC performance problem caused by a large number of ls calls from log aggregation was fixed; the business was not affected.
  • November 2021: upgraded 7 offline data warehouse HDFS clusters (about 5,000 nodes) to HDP HDFS 3.1.1, imperceptibly to users and without business impact.
  • January 2022: upgraded the remaining offline data warehouse HDFS clusters (10 clusters in total, nearly 10,000 nodes) to HDP HDFS 3.1.1, imperceptibly to users and without business impact.

After the upgrades, we kept observing each offline data warehouse cluster; the HDFS services are currently running normally.

7. Summary

It took us one year to upgrade the nearly 10,000-node offline data warehouse HDFS clusters from CDH HDFS 2.6.0 to HDP HDFS 3.1.1, successfully switching the management tool from CM to Ambari at the same time.

The HDFS upgrade process was long, but the benefits are substantial: the upgrade lays the foundation for subsequently upgrading YARN, Hive/Spark, and HBase.

On this basis, we can continue to do meaningful work, exploring further in stability, performance, cost, and other dimensions, and using technology to create visible value for the company.

References

  1. https://issues.apache.org/jira/browse/HDFS-13596
  2. https://issues.apache.org/jira/browse/HDFS-14396
  3. https://issues.apache.org/jira/browse/HDFS-14831
  4. https://issues.apache.org/jira/browse/HDFS-14942
  5. https://issues.apache.org/jira/browse/HDFS-9788
  6. https://issues.apache.org/jira/browse/HDFS-3107
  7. https://issues.apache.org/jira/browse/HDFS-8791
  8. https://issues.apache.org/jira/browse/HDFS-14313
  9. https://issues.apache.org/jira/browse/HDFS-13671
