Author

Lv Yalin, head of the infrastructure and architecture R&D team at Zuoyebang, responsible for the technical middle platform and infrastructure. At Zuoyebang he has led the evolution toward cloud-native architecture and driven the adoption of containerization, service governance, a Go microservice framework, and DevOps.

A senior R&D engineer on Zuoyebang's infrastructure team, responsible for building multi-cloud K8s clusters, developing K8s components, and Linux kernel optimization and tuning.

Background

During our cloud-native containerization transformation, as clusters grew larger and mixed business-deployment scenarios became more complex, we faced more and more cluster problems and entered the deep-water zone of Kubernetes and containerization. The problems became especially obvious after tens of thousands of CronJobs were containerized and deployed into the same production cluster as online services.

Zuoyebang's online production services run on TKE on Blackstone 2.0 (bare-metal) physical machines. Each node is large and hosts many pods. CronJobs start and terminate frequently on fixed schedules, and a certain amount of resources has to be reserved for them. This raises two main problems: first, node stability, because with large numbers of pods being created and destroyed frequently, the weak isolation of cgroups lets them affect other services on the same node; second, low resource utilization caused by the reserved resources. Both problems are beyond what native Kubernetes can solve, so we needed new ideas.

The causes of these two problems and our solutions are described in detail below.

Problem 1: Node stability in the cluster

Because the business runs many minute-level scheduled tasks, pods are created and destroyed very frequently: on average, hundreds of containers are created and destroyed on a single node every minute, and node stability problems occur frequently.

A typical problem is that frequent pod creation leaves too many cgroups on the node; in particular, memory cgroups are not reclaimed in time, and reading /sys/fs/cgroup/memory/memory.stat becomes slow. Because kubelet periodically reads this file to gather memory statistics for each cgroup namespace, CPU time spent in kernel (sys) state gradually rises; once it rises far enough, some CPU cores stay stuck in kernel state for long periods, causing significant delays in sending and receiving network packets.

Running perf record cat /sys/fs/cgroup/memory/memory.stat on the node and then perf report shows that the CPU time is mostly consumed by memcg_stat_show:

[Figure: perf report showing CPU time concentrated in memcg_stat_show]

In cgroup v1, the memcg_stat_show function traverses the memcg tree multiple times for each CPU core, and when the memcg tree reaches hundreds of thousands of nodes, the time this takes is disastrous.

Why isn't a memory cgroup released as soon as its container is destroyed? Mainly because releasing a memory cgroup requires traversing all of its cached pages, which can be very slow, so the kernel defers the work: it reclaims the memory only when memory is actually needed, and releases the corresponding memory cgroup once all of its pages have been cleared. Overall, this strategy amortizes the cost of reclaiming everything up front by reclaiming lazily. On an ordinary machine this is fine, since usually only a few hundred to a few thousand containers are created. In a large-scale scheduled-task scenario, however, hundreds of containers are created and destroyed on a machine every minute while the node is under no memory pressure, so the memory cgroups are never reclaimed. After a while the machine accumulates hundreds of thousands of memory cgroups, a single read of memory.stat takes more than ten seconds, kernel-state CPU usage rises sharply, and network latency becomes significant.

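As a rough illustration of this diagnosis (not from the original article), the sketch below counts the memory cgroups present on a node and times a read of the root memory.stat file; it assumes cgroup v1 mounted at the usual /sys/fs/cgroup/memory path.

```go
// Count memory cgroups on the node and time a read of memory.stat.
// Assumes cgroup v1 mounted at /sys/fs/cgroup/memory.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

func main() {
	root := "/sys/fs/cgroup/memory"

	// Every directory under the memory controller is one memory cgroup; on an
	// affected node this count climbs into the hundreds of thousands.
	count := 0
	filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err == nil && d.IsDir() {
			count++
		}
		return nil
	})
	fmt.Printf("memory cgroups on this node: %d\n", count)

	// Reading memory.stat triggers memcg_stat_show, which walks the whole memcg
	// tree; on an affected node this single read can take more than ten seconds.
	start := time.Now()
	if _, err := os.ReadFile(filepath.Join(root, "memory.stat")); err != nil {
		fmt.Println("read memory.stat:", err)
		return
	}
	fmt.Printf("reading memory.stat took %v\n", time.Since(start))
}
```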

In addition, dockerd becomes heavily loaded and slow to respond, kubelet's PLEG (Pod Lifecycle Event Generator) times out, and the node goes NotReady, among other problems.

Problem 2: Node resource utilization in the cluster

Because we use TKE's vpc-cni network mode, the number of pod IPs on a single node is limited by how many elastic network interfaces can be bound to the node. Almost half of each node's pod IPs are reserved for scheduled-task pods, which wastes IPs, and since scheduled-task pods usually run for only a short time, the cluster holds a lot of idle resources reserved for them, which hurts overall machine resource utilization.

Other issues: scheduling speed, isolation between services

At certain times, such as midnight every day, thousands of Jobs are generated and all need to run at once. The native K8s scheduler allocates cluster resources pod by pod: the filtering (predicate) and scoring phases are executed sequentially, that is, serially. Scheduling thousands of Jobs this way takes several minutes, while most of the business requires tasks to start punctually at 00:00, with an acceptable error of no more than 3 seconds.

Some service pods are compute- or IO-intensive. They grab large amounts of the node's CPU or IO, and because cgroup isolation is incomplete, they interfere with the normal operation of other online services on the same node.

Solution ideas and plans

We therefore needed more thorough isolation, finer-grained nodes, and a faster scheduling mode for CronJob tasks.

To solve the problems above, we considered isolating scheduled-task pods from ordinary online-service pods. However, because many scheduled tasks need to communicate with services inside the cluster, they could not simply be split into a separate cluster.

The virtual nodes provided by Tencent Cloud Elastic Container Service (EKS) gave us a new way to solve these problems.

An EKS virtual node is Kubernetes in serverless form and can be added to an existing TKE cluster. Pods deployed on a virtual node have the same network connectivity as pods on normal TKE nodes, are isolated at the VM level, need no reserved resources, and are billed by actual usage, which fits our scenario well. We therefore schedule CronJob workloads onto virtual nodes, as shown in the figure:

[Figure: CronJob workloads scheduled onto EKS virtual nodes in the TKE cluster]

Task scheduler

To work around the slow, serial default scheduling in K8s, we developed a task scheduler for Job workloads, and all CronJob workloads use it. The task scheduler schedules task pods onto virtual nodes in parallel, in batches, achieving millisecond-level scheduling for large numbers of task pods; it also supports scheduling back onto standard TKE nodes when virtual nodes fail or run out of resources.
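The sketch below is a minimal illustration of the "bind a whole batch of pending pods in parallel" idea, not Zuoyebang's actual scheduler. It assumes the task pods carry a custom schedulerName of task-scheduler, live in a cronjobs namespace, and are bound to a virtual node named eklet-virtual-node; all three names are placeholders.

```go
// Bind a batch of pending task pods to a virtual node in parallel using client-go.
package main

import (
	"context"
	"fmt"
	"sync"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Grab the batch of task pods that are still pending and assigned to our scheduler.
	pods, err := client.CoreV1().Pods("cronjobs").List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase=Pending,spec.schedulerName=task-scheduler",
	})
	if err != nil {
		panic(err)
	}

	// Bind the whole batch in parallel instead of filtering and scoring one pod at a time.
	var wg sync.WaitGroup
	for i := range pods.Items {
		pod := pods.Items[i]
		wg.Add(1)
		go func(p corev1.Pod) {
			defer wg.Done()
			binding := &corev1.Binding{
				ObjectMeta: metav1.ObjectMeta{Name: p.Name, Namespace: p.Namespace},
				Target:     corev1.ObjectReference{Kind: "Node", Name: "eklet-virtual-node"},
			}
			if err := client.CoreV1().Pods(p.Namespace).Bind(ctx, binding, metav1.CreateOptions{}); err != nil {
				// In the real scheduler this is where a fallback to a standard TKE node would happen.
				fmt.Printf("bind %s failed: %v\n", p.Name, err)
			}
		}(pod)
	}
	wg.Wait()
}
```

A production scheduler would additionally spread pods across several virtual nodes, track capacity, and implement the fallback to standard TKE nodes described above.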

Resolve the differences in operations between TKE nodes and virtual nodes

Before using virtual nodes, we first had to resolve the differences between pods running on virtual nodes and pods running on standard nodes, so that the change would be invisible to business R&D teams.

Unified log collection

For log collection, EKS virtual nodes are nodeless, so DaemonSets cannot run on them, while our log-collection component runs as a DaemonSet; logs on virtual nodes therefore need a separate collection scheme. The EKS virtual node provides its own log-collection agent, which can ship the containers' standard output to a Kafka topic, and we then consume that topic.
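Below is a minimal sketch of the downstream consumer for that topic, assuming the github.com/segmentio/kafka-go client; the broker address, topic name, and consumer group are placeholders rather than the actual configuration, and the original article does not say which Kafka client is used.

```go
// Consume container stdout logs shipped by the EKS virtual-node log agent.
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka-broker:9092"}, // placeholder broker address
		Topic:   "eks-virtual-node-stdout",     // placeholder topic written by the EKS log agent
		GroupID: "log-pipeline",                // placeholder consumer group
	})
	defer r.Close()

	for {
		// Each message is one log line collected from a container on a virtual node;
		// forward it into the same pipeline used for DaemonSet-collected logs so
		// downstream consumers see no difference.
		msg, err := r.ReadMessage(context.Background())
		if err != nil {
			log.Fatalf("read message: %v", err)
		}
		log.Printf("pod log %s: %s", msg.Key, msg.Value)
	}
}
```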

Unified monitoring and alerting

For monitoring, pods on virtual nodes get the same real-time CPU, memory, disk, and network-traffic monitoring as pods on ordinary nodes: the pod sandbox exposes an exporter interface, and Prometheus handles collection uniformly, so the migration to virtual nodes is completely transparent to the business.
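As a rough sketch of the exposure side, the example below uses github.com/prometheus/client_golang to publish a per-pod metric on a /metrics endpoint for Prometheus to scrape; the metric name, port, and hard-coded value are purely illustrative and not the actual exporter running alongside the pod sandbox.

```go
// Expose a per-pod metric for Prometheus to scrape.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var podCPUUsage = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "pod_cpu_usage_cores", // hypothetical metric name
		Help: "CPU usage of a pod in cores.",
	},
	[]string{"namespace", "pod"},
)

func main() {
	prometheus.MustRegister(podCPUUsage)

	// In a real exporter this value would be read from the container runtime or
	// cgroup stats; it is hard-coded here purely to show the exposure path.
	podCPUUsage.WithLabelValues("cronjobs", "example-job-pod").Set(0.25)

	// Prometheus scrapes this endpoint the same way for virtual-node pods and
	// ordinary-node pods, so dashboards and alerts stay unchanged.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9100", nil)
}
```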

Improve startup performance

Jobs on virtual nodes need second-level startup to meet the timing requirements of scheduled tasks; for example, some business tasks must start exactly at 00:00:00, or at least within a 3-second tolerance.

Startup time is mainly spent in the following two steps:

  1. Pulling the business image
  2. Creating and initializing the pod on the virtual node

For the first step: EKS provides an image-cache feature. The first pull is slightly slower, but the image is then cached for a period of time, and the same service does not need to pull the image again on its next start. This largely eliminates slow image downloads.

For the second step: the business requires a startup-time error within 3 seconds, so after discussing with the Tencent Cloud EKS team, targeted optimizations were made for this large-scale, high-frequency, short-lived compute scenario, improving startup efficiency and reducing the time needed to initialize the runtime environment.

As a result, pods on virtual nodes now start within seconds.

Summary

With TKE plus EKS virtual nodes, we isolate normal online workloads from scheduled tasks, which effectively protects the stability of online services. Combined with our self-developed Job task scheduler and EKS capabilities such as image caching and pod startup acceleration, task pods are scheduled and started within seconds. Because TKE and virtual nodes both expose standard K8s APIs, the business migrated smoothly. Most importantly, our fixed cluster no longer needs to reserve resources for CronJob tasks, which freed up about 10% of cluster resources, and with EKS's on-demand, pay-as-you-go model the resource cost of scheduled tasks dropped by about 70%.

About us

For more cases and knowledge about cloud native, follow the official account of the same name, [Tencent Cloud Native].

Benefits:

① Reply [Manual] in the official account backstage to get the "Tencent Cloud Native Roadmap Manual" & "Tencent Cloud Native Best Practices".

② Reply [Series] in the official account backstage to get the "15-series collection of 100+ super practical cloud native original articles", covering Kubernetes cost reduction and efficiency improvement, K8s performance optimization practices, best practices, and other series.

③ Reply [White Paper] in the official account backstage to get the "Tencent Cloud Container Security White Paper" & "The Source of Cost Reduction: Cloud Native Cost Management White Paper v1.0".

[Tencent Cloud Native]: new cloud products, new cloud technologies, new cloud activities, and new cloud insights. Scan the QR code to follow the official account of the same name and get more useful content in time!
