Introduction

The traditional Hadoop ecosystem uses YARN to manage and schedule computing resources, and such systems generally show a clear resource usage cycle: real-time computing clusters consume resources mainly during the daytime, while report-style workloads are scheduled on offline computing clusters. The main problem with deploying these clusters separately from online services is low resource utilization and high cost. With business growth and sudden report-calculation demands, and to avoid having to reserve resources for offline clusters, the Tencent Cloud EMR team and the container team jointly launched Hadoop Yarn on Kubernetes Pod to improve container resource utilization and reduce resource costs, increasing the CPU usage of container clusters during idle periods by several times. This article introduces the optimization and practice of the Hadoop resource scheduler YARN in a container environment.

1. Hadoop Yarn on Kubernetes Pod hybrid deployment mode

The Hadoop Yarn on Kubernetes Pod solution provides two capabilities: elastic scaling and offline-online hybrid deployment. Elastic scaling focuses on how to use cloud-native resources to quickly scale out and supplement computing power. The offline-online hybrid deployment mode aims to make full use of the idle resources of the online cluster and minimize the need to reserve idle resources for the offline cluster.

The EMR elastic scaling module (yarn-autoscaler) provides two scaling methods: by load and by time. For load-based scaling, users can set thresholds on different metrics of a YARN queue to trigger scaling, such as available vcores, pending vcores, available memory, and pending memory. Time-based rules can also be used, with triggers specified by day, week, or month.
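The concrete rule format of the yarn-autoscaler module is not shown in this article; purely as an illustrative sketch, a load-based rule on a queue's pending vcores might be expressed along the following lines (all property names here are hypothetical and are not the module's real configuration schema):

<!-- Hypothetical sketch only: these property names are invented for illustration
     and are not the actual yarn-autoscaler configuration schema. -->
<property>
  <name>yarn-autoscaler.rule.load.metric</name>
  <value>pending-vcore</value>
</property>
<property>
  <name>yarn-autoscaler.rule.load.threshold</name>
  <value>50</value>
</property>
<property>
  <!-- Scale out when pending vcores of queue root.a stay above the threshold. -->
  <name>yarn-autoscaler.rule.load.queue</name>
  <value>root.a</value>
</property>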

When an elastic rule is triggered, the offline deployment module obtains the specifications and amount of idle computing power currently available in the online TKE cluster and calls the Kubernetes API to create the corresponding number of resources. The ex-scheduler extended scheduler ensures that Pods are created on the nodes with the most remaining resources, and each Pod is responsible for starting the YARN services.

Through this solution, YARN's NodeManager service can be quickly deployed into Pods. However, YARN's native scheduling does not take heterogeneous resources into account, which causes two problems:

1. The AM's Pod is evicted, causing the Application to fail

When a node runs short of resources, kubelet actively evicts Pods to keep the node stable. If an AM is running on that node, the entire Application is deemed to have failed, and the ResourceManager then re-allocates the AM. For computation-heavy tasks, the cost of rerunning an Application is unacceptable.

2. The limitations of Yarn's native non-exclusive partition resource sharing

Yarn's label partition feature supports exclusive and non-exclusive partitions.

  • Exclusive partition (Exclusive): for example, if an exclusive partition x is specified, YARN containers will only be allocated in the x partition.
  • Non-exclusive partition (Non-exclusive): for example, the idle resources of a non-exclusive partition x can be shared with the default partition.

Only when the default partition is specified can an Application running on the default partition use the resources of partition x.

In real-world scenarios, however, users need to allocate an exclusive partition to each business department while also setting aside a default partition for all departments to share, and the default partition is kept relatively well resourced. Each business department wants to use its own exclusive partition while also making full use of the default partition's resources; only when both its exclusive partition and the default partition run short should elastic scaling be triggered to add resources to its own exclusive partition.

2. Challenges of transforming YARN

Developing the above features involves more than the technical difficulty of the requirements themselves; it is also necessary to minimize the impact on the stability of users' existing clusters and to reduce the cost of changes on the business side.

  • Cluster stability: Hadoop YARN is a basic scheduling component of the big data system; the more changes are made, the higher the probability of failure. At the same time, introducing these features necessarily requires upgrading the Hadoop YARN of existing clusters. The upgrade must be transparent to existing business clusters and must not affect the jobs running that day.
  • Business-side usage cost: the new features must conform to existing YARN usage habits, so that business users can understand them easily and code changes on the business side are minimized.

1. AM independently chooses its resource medium

At present, the YARN community does not consider the characteristics of mixed deployment with heterogeneous resources on the cloud. In an online TKE cluster, containers are evicted when resources become tight. To avoid Application recomputation and wasted resources, the AM must be able to specify whether it may be allocated on POD-type resources.

For the independent choice of resource medium, a configuration flag is used: each NodeManager reports via RPC whether its resources may be provided to AMs, and the ResourceManager uses this reported information to place the Application's AM on a stable resource medium. The benefits of having the NodeManager report this information through configuration are obvious:

  • Decentralization: it reduces the processing logic in the ResourceManager. Otherwise, whenever resources are expanded, the resource information would have to flow into the ResourceManager through RPC or configuration. Entities should not be multiplied unnecessarily, and changes to the ResourceManager should stay lightweight.
  • Cluster stability: after upgrading YARN on an existing business cluster, there is no need to restart the NodeManagers; only the ResourceManager needs to be restarted. YARN's high availability feature ensures that the upgrade has no impact on the business. NodeManagers do not need to be restarted because an NM treats its local resources as allocatable to AMs by default.
  • Simple and easy to use: through configuration, users can freely decide which resources are allowed to host AMs, not limited to POD container resources.

2. Multi-label dynamic resource allocation

In YARN's native label design, the label expression used when submitting a task can contain only a single label. To use the resources of multiple partitions at the same time and improve utilization, the non-default partition must be set to Non-exclusive. The label expression must then solve the following three problems:

  • Resource isolation: after partition A is set to Non-exclusive, once its resources are occupied by apps from other partitions, they cannot be returned to partition A's own apps in time.
  • Free sharing of resources: only the default partition is eligible to apply for non-exclusive partition resources.
  • Dynamic selection of partition resources: when multiple partitions are shared, the partition to use cannot be chosen based on each partition's remaining resources, which affects task execution efficiency.

The Tencent Cloud EMR team extended the expression syntax to support logical operators, so that an App can apply for resources from multiple partitions. At the same time, a resource statistics module was developed to dynamically track each partition's available resources and allocate the most suitable partition to the App.
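For example, mirroring the practical drill below, a MapReduce job can ask for resources from either the x partition or the default partition by setting the extended expression (the x|| syntax comes from this solution's extended parser; shown here in XML property form rather than as a -D option):

<property>
  <!-- Extended label expression: place containers in the x partition or the
       default partition, whichever has available resources. -->
  <name>mapreduce.job.node-label-expression</name>
  <value>x||</value>
</property>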

3. Practical drill

Test environment: the NodeManagers on 172.17.48.28 and 172.17.48.17 are assigned to the default partition, and the NodeManagers on 172.17.48.29 and 172.17.48.26 are assigned to the x partition.

Queue settings:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>a,b</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.accessible-node-labels.x.capacity</name>
    <value>100</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.accessible-node-labels.y.capacity</name>
    <value>100</value>
  </property>

  <!-- configuration of queue-a -->
  <property>
    <name>yarn.scheduler.capacity.root.a.accessible-node-labels</name>
    <value>x</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.a.capacity</name>
    <value>50</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.a.accessible-node-labels.x.capacity</name>
    <value>100</value>
  </property>

  <!-- configuration of queue-b -->
  <property>
    <name>yarn.scheduler.capacity.root.b.accessible-node-labels</name>
    <value>y</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.b.capacity</name>
    <value>50</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.b.accessible-node-labels.y.capacity</name>
    <value>100</value>
  </property>
</configuration>

1. Restrict AM allocation to the 172.17.48.28 node

Configure the following item on the NodeManagers of the other three nodes:

yarn.nodemanager.am-alloc-disabled = true

After configuration, the AM of the submitted Application can only be started on the 172.17.48.28 node.
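As a minimal sketch, assuming this flag is set in yarn-site.xml like other yarn.nodemanager.* settings (the property itself comes from this solution's patched YARN, not upstream Hadoop), the entry on those three NodeManagers would look like:

<property>
  <!-- Added by the Hadoop Yarn on Kubernetes Pod solution: mark this NodeManager's
       resources as not allocatable to ApplicationMasters. -->
  <name>yarn.nodemanager.am-alloc-disabled</name>
  <value>true</value>
</property>

A NodeManager without this flag keeps the default behavior of accepting AMs, which is why existing NodeManagers did not need to be restarted during the upgrade.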




2. Use combined labels

Specify the label expression through mapreduce.job.node-label-expression; x|| means that both the x partition and the default partition are used.

hadoop jar /usr/local/service/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar pi -D mapreduce.job.queuename="a" -D mapreduce.job.node-label-expression="x||" 10 10

After submitting with this command, the Application's containers were observed to be allocated across the x and default partitions.

4. Hadoop Yarn on Kubernetes Pod best practices

The customer's big data applications and storage run on a big data cluster managed by YARN. In the production environment they face several problems, mainly insufficient computing power for big data workloads and wasted resources during the troughs of online business. For example, when offline computing power is insufficient, data timeliness cannot be guaranteed, especially when urgent ad-hoc big data query tasks arrive and no computing resources are available: existing computing tasks must either be stopped, or the new tasks must wait for existing ones to finish. Either way, the overall efficiency of task execution is greatly reduced.

Based on the Hadoop Yarn on Kubernetes Pod solution, offline tasks are automatically scaled out to clusters on the cloud and co-deployed with TKE online business clusters. This makes full use of idle cloud resources during trough periods, increases the computing power available to offline businesses, and uses the rapid elastic scaling of cloud resources to supplement offline computing power in a timely manner.

After optimizing the customer's online TKE cluster resource usage through the Hadoop Yarn on Kubernetes Pod project, the CPU usage of the cluster during idle time can be increased by 500%.

[Figure] CPU usage of the online cluster during idle periods

[Figure] CPU usage after offline-online hybrid deployment

5. Summary

This article presented the optimization and practice of YARN-based cloud-native containerization. In a hybrid-deployment cloud-native environment, the stability and efficiency of task execution are greatly improved, cluster resource utilization is effectively increased, and hardware costs are reduced. In the future, we will explore more cloud-native big data scenarios to bring more practical benefits to enterprise customers.

About the Author

Zhang Hui is a senior engineer at Tencent Cloud, currently responsible for the management and control modules of Tencent Cloud's big data product Elastic MapReduce and for the research and development of the key component Hive. He has contributed code to the Apache Hive and Apache Calcite open source projects and graduated from the University of Electronic Science and Technology of China.

