Author
The author, Tian Qi, is a senior engineer at Tencent who focuses on large-scale online-offline colocation and on distributed resource management and scheduling. He is deeply familiar with Kubernetes and concentrates on cloud-native big data and AI.
Lead
What is online-offline colocation?
With the development of the Internet, big data, and artificial intelligence, enterprise IT environments usually run two types of workloads to meet business needs: online services and offline jobs.
Online services: usually long-running, with periodic traffic patterns and low overall resource utilization, but with extremely high SLA requirements, such as web search services and e-commerce transaction services.
Offline jobs: usually resource-intensive, but tolerant of high latency and of restarting failed tasks, such as big data analytics and machine learning training.
These two types of workloads leave great room for optimization through time-division multiplexing and complementary resource profiles, which makes them the prime scenario for colocation. Online-offline colocation means deploying offline jobs and online services on the same nodes, improving resource utilization and reducing the ever-growing cost of offline computing resources.
The value of online-offline colocation
Improved resource utilization
According to Gartner statistics, the average utilization of data centers worldwide is below 15%, which means an enormous amount of resources is wasted every year.
The main reasons for this inefficiency are:
- Business traffic is periodic. For online services, resources are usually provisioned for peak traffic in order to guarantee the SLA. For example, a food-delivery service may need 8 CPU cores during peak hours (meal times) but consume almost nothing during off-peak hours (at night), so utilization is low most of the time and resources are wasted.
- Cluster resource fragmentation. Resource fragmentation means a server still has unallocated static resources, but because the remaining amounts across dimensions (such as CPU and RAM) are unbalanced, no further workloads can be placed on it. Because mainstream resource scheduling frameworks use static allocation algorithms, fragmentation eventually accumulates and resources cannot be used effectively.
- Online and offline data centers are separated, and resource pools are divided at too coarse a granularity. Some companies completely isolate the online data center (mainly running web-type online services) from the offline data center (mainly running offline clusters such as Hadoop). At such a coarse granularity, idle resources in the online data center cannot be used by offline jobs, and vice versa: when the offline data center is idle, online services cannot use it either, so resource pools cannot be shared across IDCs.
Colocation makes full use of the idle resources on each node and thereby improves resource utilization.
Cost optimization
Online-offline colocation has already been put into practice at large and medium-sized Internet companies. The resource-utilization gains from colocation translate into considerable cost savings, which add up to enormous economic value at scale.
A simple back-of-the-envelope calculation shows roughly how much budget a 20-percentage-point increase in resource utilization can save:
Assume we currently have 10w (100,000) CPU cores in total and the average utilization per machine is 20%, so 0.2 × 10w = 2w cores are actually in use. With the same business scale, if average utilization rises to 40%, only 5w cores are needed to serve the business. Assuming an average CPU price of 300 yuan/core/year, we save 5w × 300 = 1500w yuan (15 million yuan) per year.
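Writing the same estimate out explicitly (10w = 100,000 cores; utilization rising from 20% to 40%; 300 yuan per core per year):

```latex
\text{cores in use} = 100{,}000 \times 0.2 = 20{,}000,\qquad
\text{cores needed at }40\% = \frac{20{,}000}{0.4} = 50{,}000,\qquad
\text{annual saving} = (100{,}000 - 50{,}000) \times 300 = 15{,}000{,}000 \text{ yuan}
```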
For companies under cost-control pressure, colocation is the first choice for reducing cost and increasing efficiency. For example, Google runs all of its workloads on Borg (the predecessor of Kubernetes); its resource utilization can reach 60%, which saves hundreds of millions of dollars every year.
Challenges
Scheduling guarantee
Resource reuse
In its traditional mode, Kubernetes schedules statically according to the request resources declared by the workload. If both online and offline workloads are scheduled by request, and offline jobs are scheduled first and fill up a node's requestable resources, online services can no longer be placed there; for the same reason, if online services are scheduled first and fill up the node's request resources, offline jobs cannot be scheduled. A traditional scheduler therefore cannot reuse the idle resources of online services.
Traditional resource reuse usually takes the form of time-sharing: offline jobs run only in fixed time windows, for example scheduled after midnight while online services run during the day. The time granularity of this mode is too coarse; although online resources can be reused for part of the day, the reuse is bound by a rigid schedule.
Another approach is resource reservation, which statically divides a machine's resources into online resources, offline resources, and shared resources. Because the whole machine is carved up statically, resources cannot be reused flexibly; capacity for online and offline services can only be reserved in advance. Although colocating offline jobs this way improves utilization to some extent, the reuse is insufficient, and static partitioning only pays off on machines with large resource specifications.
Therefore, efficient and automated fine-grained time-sharing of resources requires timely and accurate resource prediction, the ability to respond quickly to resource changes, and a set of safeguard measures for services when the resource level changes.
Scheduling enhancement
Because online and offline workloads have different working modes, the community typically uses different schedulers for them.
In a colocation scenario, an online scheduler and an offline scheduler are deployed in the same cluster. When resources are tight, the schedulers run into resource conflicts and can only retry, which significantly hurts scheduler throughput and scheduling performance, and ultimately the scheduling SLA.
At the same time, native Kubernetes cannot handle large-scale batch scheduling; it only supports scheduling of online workloads.
Resource Guarantee
Online and offline workloads are fundamentally different types of work. Deploying both on the same node causes resource interference: when resources are tight or traffic bursts, the online service's resource usage is disturbed by offline jobs. The most important goal of colocation is to improve single-machine resource utilization while guaranteeing the SLAs of both online and offline workloads.
- For online services, interference during traffic peaks must stay close to the level observed before colocation, with the interference rate reduced to below 5%.
- For offline jobs, whose priority is lower than that of online services, starvation or frequent eviction must be avoided, since either would hurt their total running time and SLA.
Resource isolation
The essence of a container is a restricted process: processes are isolated by namespaces and limited by cgroups. In the cloud-native era all workloads are containerized. In a colocation scenario, although cgroups can limit the resource usage of online and offline services, native cgroups face various challenges in oversold and colocation scenarios across CPU, memory, network, and disk IO.
For CPU, if a Pod specifies a limit, the cgroup quota caps the container's maximum usage and CPU shares divide CPU weight between applications. This works when resources are not tight, but when online traffic bursts, offline jobs cannot yield the cores quickly enough, which causes SLO jitter for the online service.
To guarantee the stability of online services, a common practice is CPU pinning: the online service is bound to specific logical cores so that other workloads cannot occupy them.
But this creates two problems. On one hand, exclusive cores lead to under-utilization: once a core is monopolized it cannot be fully used, whereas the whole point of colocation is to squeeze the CPU and keep it busy.
On the other hand, core binding limits parallelism for services that need parallel computation, whether online or offline. For example, a service limited to at most 4 cores but not pinned can spread its threads across all CPUs and achieve much higher parallelism, but once it is bound to 4 CPUs its maximum parallelism is 4.
Many vendors who have put colocation into production have run into the contradictions described above.
For memory, offline jobs often read large amounts of file data, causing the operating system to build up page cache. The native OS manages page cache globally rather than per container. Under the cgroup resource-limit mechanism, page cache is often not released in time, causing memory-allocation jitter in other containers; the page cache may even stay charged to a cgroup that cannot be cleaned up. In a colocation setup, if offline jobs also rely on page cache, their resources may fail to be reclaimed and throttled because the page cache is not released.
Resource interference
Hyper-Threading Technology is a very common technique in modern CPU architectures that presents one physical core to the operating system as multiple logical processors.
Simply put, modern CPUs are mostly built on a NUMA architecture: each NUMA node has a socket, each socket contains physical cores, and hyper-threading can be enabled on a physical core so that the operating system sees several logical CPUs. The "CPU" shown by the top command is a logical CPU; with hyper-threading disabled, a logical CPU corresponds to a physical core, but with it enabled, a logical CPU may be one of the hardware threads exposed by a physical core.
In a colocation setup, if an online service and an offline job are scheduled onto different logical cores of the same physical core, they will interfere with each other.
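As an illustration only (not part of the TKE product), here is a short Go sketch that reads the standard Linux sysfs topology files to find which logical CPUs share a physical core, which is exactly the sibling relationship that causes this interference:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// siblings returns, for each logical CPU, the list of logical CPUs that
// share the same physical core (its hyper-thread siblings).
func siblings() (map[string]string, error) {
	out := map[string]string{}
	dirs, err := filepath.Glob("/sys/devices/system/cpu/cpu[0-9]*")
	if err != nil {
		return nil, err
	}
	for _, d := range dirs {
		b, err := os.ReadFile(filepath.Join(d, "topology", "thread_siblings_list"))
		if err != nil {
			continue // the CPU may be offline; skip it
		}
		out[filepath.Base(d)] = strings.TrimSpace(string(b))
	}
	return out, nil
}

func main() {
	m, err := siblings()
	if err != nil {
		panic(err)
	}
	for cpu, sib := range m {
		// e.g. cpu0 -> "0,32" means logical CPUs 0 and 32 share one physical core,
		// so an online task pinned to 0 and an offline task pinned to 32 would interfere.
		fmt.Printf("%s shares a physical core with: %s\n", cpu, sib)
	}
}
```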
The TKE solution
TKE designed and implemented a colocation solution for the scenario of Tencent's internal self-developed businesses moving to the cloud.
Scheduling guarantee
Colocation nodes automatically report extended offline resources, and offline jobs are isolated inside an offline cgroup "big box", which guarantees elastic reuse and reclamation of resources.
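For illustration, a hedged sketch of how a node agent could advertise such extended offline resources by patching node status with client-go. The resource name `tke.cloud.tencent.com/offline-cpu` and the node name are assumptions, not TKE's actual identifiers:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes the agent runs on the node, e.g. as a DaemonSet
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	nodeName := "node-1" // in practice, read from a NODE_NAME downward-API env var

	// Advertise 10 cores of reclaimable offline CPU as an extended resource.
	// The resource name is hypothetical; "/" in it is escaped as "~1" per JSON Patch.
	// The value would be refreshed as the predicted idle online capacity changes.
	patch := []byte(`[{"op":"add","path":"/status/capacity/tke.cloud.tencent.com~1offline-cpu","value":"10"}]`)

	_, err = cs.CoreV1().Nodes().Patch(context.TODO(), nodeName,
		types.JSONPatchType, patch, metav1.PatchOptions{}, "status")
	if err != nil {
		panic(err)
	}
	fmt.Println("extended offline resource advertised")
}
```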
For scheduling enhancement, TKE adopts a multi-scheduler shared-state scheduling model: first, to solve the problem of reusing online resources during scheduling; second, to address scheduling conflicts, scheduling performance, scalability, and reliability.
Resource reuse
As mentioned above, traditional Kubernetes schedules statically according to the request resources declared by the workload. If both online and offline workloads are scheduled by request and offline jobs are scheduled first, occupying the node's request resources, online services cannot be scheduled; likewise, if online services are scheduled first, offline jobs cannot be scheduled, and a traditional scheduler cannot reuse the resources of online services.
First, for resource reuse, TKE accurately predicts the idle resources of online services and exposes them as extended resources, so that the offline scheduler can see how much offline capacity is reusable and schedule against it.
- For offline workloads whose request resources can be modified, a webhook dynamically rewrites the native resource expression (such as CPU) in the Pod spec into an extended-resource expression, so the business is unaware of the change; the Pod becomes a BestEffort-type Pod that the offline scheduler accounts for (see the sketch after this list).
- For BestEffort-type offline workloads, online resources are reused elastically based on the extended offline resources reported by the colocation nodes; these extended resources change in real time with the node's load level. The resource-reuse discussion above has already described the specific approach.
- For offline workloads whose request resources cannot be modified (such as driver Pods), scheduling follows their real requests, which may conflict with the online scheduler, because in a colocated cluster the cluster's request resources are usually already filled by online services.
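A hedged sketch of the mutation step such a webhook might perform; the extended-resource name is an assumption and the webhook wiring (AdmissionReview handling, TLS, patch encoding) is omitted. Removing the native cpu request/limit and re-expressing the amount as an extended resource leaves no cpu/memory requests, so the Pod's QoS class becomes BestEffort while the offline scheduler can still account for the amount:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// offlineCPU is an assumed extended-resource name; TKE's real name may differ.
const offlineCPU corev1.ResourceName = "tke.cloud.tencent.com/offline-cpu"

// mutate rewrites the native cpu requests/limits of an offline Pod into the
// extended offline-cpu resource (a real webhook would treat memory similarly).
func mutate(pod *corev1.Pod) {
	for i := range pod.Spec.Containers {
		c := &pod.Spec.Containers[i]
		if cpu, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
			c.Resources.Requests[offlineCPU] = cpu
			delete(c.Resources.Requests, corev1.ResourceCPU)
		}
		if cpu, ok := c.Resources.Limits[corev1.ResourceCPU]; ok {
			c.Resources.Limits[offlineCPU] = cpu
			delete(c.Resources.Limits, corev1.ResourceCPU)
		}
	}
}

func main() {
	pod := &corev1.Pod{Spec: corev1.PodSpec{Containers: []corev1.Container{{
		Name: "offline-worker",
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("4")},
			Limits:   corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("4")},
		},
	}}}}
	mutate(pod)
	fmt.Println(pod.Spec.Containers[0].Resources.Requests) // cpu replaced by the extended resource
}
```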
Second, after resources are reused there must be a layer of restriction so that offline workloads cannot over-consume the host's resources. At the level of underlying resource limits, online and offline services are restricted under different cgroup hierarchies:
- For online services, resource requirements are set normally, scheduling follows their request resources, and their resource limits are set according to Kubernetes' native cgroup QoS management;
- For resource-intensive offline workloads such as workers, all offline Pods are confined to a resource pool built from a parent cgroup hierarchy, that is, an offline "big box". The advantage is that offline jobs can make full use of idle online resources, yet when online resources need to be reclaimed, all offline tasks can be throttled together to keep online services stable.
Why not simply treat offline tasks as native Kubernetes BestEffort Pods?
Because under the Kubernetes mechanism, BestEffort workloads have no resource limits set at the cgroup layer. Once such a Pod misbehaves, it risks preempting online resources and crowding out memory, so a restricted parent-level cgroup must be used to contain them (a minimal sketch of such an offline parent cgroup follows).
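A minimal sketch, assuming cgroup v1 mounted at the usual /sys/fs/cgroup paths and a hypothetical parent group named `offline`, of how such an offline "big box" could be created and its total quota adjusted as the predicted idle online capacity changes. TKE's actual hierarchy and values are not shown here:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

const (
	cpuRoot = "/sys/fs/cgroup/cpu/offline"    // hypothetical parent cgroup for all offline Pods
	memRoot = "/sys/fs/cgroup/memory/offline" // cgroup v1 layout assumed
)

func write(dir, file, val string) error {
	return os.WriteFile(filepath.Join(dir, file), []byte(val), 0644)
}

// resize caps the whole offline "big box": every offline container is placed
// under this parent, so shrinking the quota throttles all offline tasks at
// once when online resources need to be reclaimed.
func resize(cpuCores float64, memBytes int64) error {
	for _, d := range []string{cpuRoot, memRoot} {
		if err := os.MkdirAll(d, 0755); err != nil {
			return err
		}
	}
	period := 100000 // 100 ms CFS period
	quota := int(cpuCores * float64(period))
	if err := write(cpuRoot, "cpu.cfs_period_us", strconv.Itoa(period)); err != nil {
		return err
	}
	if err := write(cpuRoot, "cpu.cfs_quota_us", strconv.Itoa(quota)); err != nil {
		return err
	}
	return write(memRoot, "memory.limit_in_bytes", strconv.FormatInt(memBytes, 10))
}

func main() {
	// e.g. currently allow offline tasks 6 reclaimable cores and 16 GiB in total
	if err := resize(6, 16<<30); err != nil {
		fmt.Println("resize failed:", err)
	}
}
```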
Because Kubernetes' native cgroup manager does not support custom cgroup hierarchies or updating their resources, the industry tends to patch the kubelet code and modify its cgroup manager. The TKE full-stack colocation solution, by contrast, is zero-intrusion to Kubernetes: it uses CRI interception to modify the underlying cgroups of the Kubernetes QoS hierarchy without changing any Kubernetes code, which is a highlight for customers.
Users do not need to do anything special to enjoy the resource-utilization improvement, and the solution stays fully compatible with community Kubernetes features. The colocation framework can be enabled and disabled automatically at single-node granularity, so colocation happens only where and when you need it.
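For illustration, a hedged sketch of the CRI-interception idea rather than TKE's actual implementation: a gRPC proxy sits between the kubelet and the real container runtime and redirects offline sandboxes into the offline "big box" cgroup before forwarding the request. It uses the k8s.io/cri-api runtime v1 types; the offline annotation key and the cgroup path are assumptions, and the gRPC plumbing that forwards every other CRI call unchanged is omitted:

```go
package main

import (
	"fmt"

	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// offlineAnnotation marks Pods classified as offline; the key is hypothetical.
const offlineAnnotation = "mixed-qos.tke.example/offline"

// offlineCgroupParent is the parent cgroup of the offline "big box".
const offlineCgroupParent = "/kubepods/offline"

// rewriteSandbox is the core of a CRI-intercepting proxy: it redirects offline
// sandboxes into the offline parent cgroup before the request reaches the runtime.
func rewriteSandbox(req *runtimeapi.RunPodSandboxRequest) {
	cfg := req.GetConfig()
	if cfg == nil || cfg.GetAnnotations()[offlineAnnotation] != "true" {
		return // online Pods keep the cgroup parent chosen by the kubelet
	}
	if cfg.Linux == nil {
		cfg.Linux = &runtimeapi.LinuxPodSandboxConfig{}
	}
	cfg.Linux.CgroupParent = offlineCgroupParent
}

func main() {
	req := &runtimeapi.RunPodSandboxRequest{
		Config: &runtimeapi.PodSandboxConfig{
			Annotations: map[string]string{offlineAnnotation: "true"},
			Linux:       &runtimeapi.LinuxPodSandboxConfig{CgroupParent: "/kubepods/besteffort/pod123"},
		},
	}
	rewriteSandbox(req)
	fmt.Println(req.Config.Linux.CgroupParent) // now under the offline big box
}
```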
For resource prediction and load handling, TKE uses an exponentially decaying sliding-window algorithm so that rises in resource usage are sensed quickly and drops are sensed slowly, achieving automated, fine-grained time-sharing. Rises must be sensed quickly because load spikes of online services are usually short-lived; they need to be detected immediately so that resources can be reclaimed at once and offline jobs can yield.
When load drops, online resources are generally not reduced immediately; the service is allowed to run for a while to confirm that the load has truly stabilized, and only then do offline jobs start reusing online resources again.
The algorithm first buckets the raw data. Bucketing removes the need to store individual historical points and reduces storage: storing raw points for a cluster of 100,000 containers grows linearly, whereas bucketed statistics keep the storage constant. Bucketing yields a histogram, over which a sliding-window computation produces the mean and quantile values used as the resource prediction; a time-decay function then increases the weight of the most recent points, diluting the influence of older history in the histogram and giving a short-term forecast of current resource usage.
In addition, the buckets are non-uniform, which is what makes rises fast to sense and drops slow to sense. Intuitively, suppose every point has the same weight of 1 (the weights are in effect the heights of the histogram bars) and we take P90 as the predicted value; P90 is determined by the histogram's area. If most current points are low-load points, P90 is a relatively low value. When the load of some points suddenly jumps, they all fall into the larger (wider) buckets, whose area immediately becomes dominant, so P90 jumps up into those buckets at once.
Conversely, when the load falls from high to low, the low-load points land in the narrow front buckets; their width is small, so their area cannot become dominant right away and P90 does not drop quickly. Only after most points are low does their height in the small buckets grow enough for those buckets to dominate, and only then does the P90 prediction come down.
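A minimal sketch of this idea, under assumptions of my own choosing (exponentially growing bucket boundaries, half-life decay, and a P90 computed over bucket areas, i.e. weight times bucket width, which is my reading of the description above); TKE's real bucket layout and decay constants are not given in this article:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// decayingHistogram stores per-bucket weights that decay exponentially with
// age, so recent samples dominate the quantile estimate.
type decayingHistogram struct {
	bounds   []float64 // non-uniform (exponentially growing) upper bounds, in cores
	weights  []float64
	halfLife time.Duration
	lastTick time.Time
}

func newHistogram(halfLife time.Duration) *decayingHistogram {
	bounds := []float64{}
	for b := 0.1; b < 64; b *= 1.5 { // buckets get wider at higher load
		bounds = append(bounds, b)
	}
	return &decayingHistogram{bounds: bounds, weights: make([]float64, len(bounds)),
		halfLife: halfLife, lastTick: time.Now()}
}

// Add records one CPU-usage sample (in cores), decaying older weights first.
func (h *decayingHistogram) Add(usage float64, now time.Time) {
	decay := math.Exp2(-now.Sub(h.lastTick).Seconds() / h.halfLife.Seconds())
	for i := range h.weights {
		h.weights[i] *= decay
	}
	h.lastTick = now
	for i, b := range h.bounds {
		if usage <= b || i == len(h.bounds)-1 {
			h.weights[i]++
			break
		}
	}
}

// Quantile accumulates per-bucket area (weight x bucket width): wide high-load
// buckets dominate quickly when bursts appear (fast rise), while narrow
// low-load buckets take much longer to dominate (slow fall).
func (h *decayingHistogram) Quantile(q float64) float64 {
	areas := make([]float64, len(h.weights))
	total, prev := 0.0, 0.0
	for i, w := range h.weights {
		areas[i] = w * (h.bounds[i] - prev)
		prev = h.bounds[i]
		total += areas[i]
	}
	acc := 0.0
	for i, a := range areas {
		acc += a
		if total > 0 && acc/total >= q {
			return h.bounds[i]
		}
	}
	return h.bounds[len(h.bounds)-1]
}

func main() {
	h := newHistogram(5 * time.Minute)
	now := time.Now()
	for i := 0; i < 100; i++ { // steady low load around 1 core...
		h.Add(1.0, now.Add(time.Duration(i)*time.Second))
	}
	fmt.Println("P90 before burst:", h.Quantile(0.9))
	for i := 0; i < 5; i++ { // ...then a short burst to 16 cores is sensed almost immediately
		h.Add(16.0, now.Add(time.Duration(100+i)*time.Second))
	}
	fmt.Println("P90 after burst:", h.Quantile(0.9))
}
```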
Scheduling enhancement
In a colocation scenario, each scheduler works independently and their views of cluster state are not synchronized, so multiple schedulers may pick the same node at the same time and conflict when writing resources.
The mainstream distributed resource scheduling approaches are the following:
Approach | Resource view | Interference | Allocation granularity | Cluster-wide policy |
---|---|---|---|---|
Single global scheduler (Kubernetes/Borg) | Global view | None (serial) | Global search, one scheduling unit at a time | Strict priority |
Static partitioning (e.g. by label) | Fixed subset | None (partitioned) | Per-partition policy, one scheduling unit at a time | Schedulers independent of each other |
Two-level scheduling (Mesos/YARN) | Dynamic subset | Pessimistic concurrency | Global search with resource hoarding; risk of deadlock or long waits | Strict fairness |
Shared state (Omega) | Global view | Optimistic concurrency | Each scheduler decides its own allocation | Each scheduler's own policy |
If a single scheduler handles both online and offline workloads, its logic becomes complicated, features pile up, and it is hard to maintain and iterate on. Especially at large cluster scale, the scheduler suffers in performance, reliability, iteration speed, and flexibility.
If two schedulers are simply deployed in the same cluster, they schedule against the same cluster state, but state updates are not synchronized between them. This causes scheduling conflicts: two schedulers may pick the same node at the same time when the node actually has room for only one of the two Pods.
Shared-state scheduling performs well on all counts: shared resource views, concurrency, flexibility of resource allocation, and flexible support for multiple schedulers. TKE therefore adopts shared-state, optimistically concurrent scheduling. This approach places high demands on the performance and reliability of the coordinator, but it achieves real resource sharing and a globally consistent resource view, and it lets customers deploy different schedulers for different scenarios.
TKE designed and implemented a scheduling coordinator. A Kubernetes scheduler only needs an extension plugin that submits a reservation in the Reserve phase to participate in shared-state concurrent scheduling.
The coordinator is driven by gRPC calls in a message-driven mode for performance and throughput, and it is currently a very lightweight design (a toy sketch follows the list below):
- Requests and events received by the coordinator are placed into multi-dimensional back-end queues
- Each node's queue prioritizes state-update events and high-priority resource requests
- Queues for different nodes are processed concurrently
- Each request performs only the most basic resource-conflict check, which is very lightweight
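A hedged toy model of the optimistic-concurrency idea behind such a coordinator, not TKE's implementation: each scheduler reserves resources on a node, and the coordinator admits the reservation only if the node's remaining capacity still covers it; a rejected scheduler simply retries elsewhere.

```go
package main

import (
	"fmt"
	"sync"
)

// reservation is what a scheduler would submit from its Reserve extension point.
type reservation struct {
	Node      string
	Scheduler string
	MilliCPU  int64
}

type nodeState struct {
	mu           sync.Mutex
	freeMilliCPU int64
}

// coordinator keeps an authoritative per-node view of remaining capacity. The
// conflict check is a single compare under a per-node lock, so requests for
// different nodes are processed fully concurrently.
type coordinator struct {
	mu    sync.Mutex
	nodes map[string]*nodeState
}

func newCoordinator(capacity map[string]int64) *coordinator {
	c := &coordinator{nodes: map[string]*nodeState{}}
	for n, milli := range capacity {
		c.nodes[n] = &nodeState{freeMilliCPU: milli}
	}
	return c
}

// Reserve returns true if the reservation is admitted.
func (c *coordinator) Reserve(r reservation) bool {
	c.mu.Lock()
	ns, ok := c.nodes[r.Node]
	c.mu.Unlock()
	if !ok {
		return false
	}
	ns.mu.Lock()
	defer ns.mu.Unlock()
	if ns.freeMilliCPU < r.MilliCPU {
		return false // conflict: another scheduler got there first
	}
	ns.freeMilliCPU -= r.MilliCPU
	return true
}

func main() {
	c := newCoordinator(map[string]int64{"node-1": 4000})
	// Two schedulers race for the same node; only one reservation can fit.
	fmt.Println(c.Reserve(reservation{"node-1", "online-scheduler", 3000}))  // true
	fmt.Println(c.Reserve(reservation{"node-1", "offline-scheduler", 3000})) // false
}
```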
Resource Guarantee
For single-machine resource guarantees, TKE works together with the TencentOS kernel to provide full-dimensional resource guarantees, with strong resource isolation and protection mechanisms on both the Kubernetes side and the kernel side.
Multi-priority strategy
In terms of priority, TKE uses fine-grained cpuset orchestration according to service priority: high-priority online services are pinned to cores via cpuset; mid-priority online services share CPU through cgroup quota and cpu shares; offline services are placed under a single offline cgroup resource pool, the offline "big box", which may use all CPU cores, or offline services can be pinned to a CPU subset completely disjoint from the high-priority online services.
For scenarios that require core pinning, a hyper-threading-avoidance core allocation algorithm is used. It follows a greedy strategy and preferentially binds whole physical cores rather than handing out logical CPUs indiscriminately, so the allocation is aware of the CPU topology (see the sketch below).
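A hedged sketch of such a greedy, topology-aware selection, assuming the physical-core to logical-CPU mapping has already been read from sysfs (as in the earlier topology example); it is not TKE's actual allocator:

```go
package main

import (
	"fmt"
	"sort"
)

// allocate picks `need` logical CPUs for a high-priority online workload,
// greedily taking whole physical cores first so the workload never shares a
// physical core's hyper-threads with an offline task, and topping up with
// leftover single threads only if it must.
func allocate(topology map[int][]int, free map[int]bool, need int) []int {
	// Deterministic order over physical cores.
	cores := make([]int, 0, len(topology))
	for core := range topology {
		cores = append(cores, core)
	}
	sort.Ints(cores)

	picked := []int{}
	// Pass 1: whole physical cores whose hyper-threads are all free.
	for _, core := range cores {
		threads := topology[core]
		allFree := true
		for _, t := range threads {
			if !free[t] {
				allFree = false
				break
			}
		}
		if allFree && len(picked)+len(threads) <= need {
			picked = append(picked, threads...)
		}
	}
	// Pass 2: top up with any remaining free logical CPUs (shared physical cores).
	for _, core := range cores {
		for _, t := range topology[core] {
			if len(picked) >= need {
				return picked
			}
			if free[t] && !contains(picked, t) {
				picked = append(picked, t)
			}
		}
	}
	return picked
}

func contains(s []int, v int) bool {
	for _, x := range s {
		if x == v {
			return true
		}
	}
	return false
}

func main() {
	// physical core -> its hyper-thread siblings (logical CPUs)
	topology := map[int][]int{0: {0, 4}, 1: {1, 5}, 2: {2, 6}, 3: {3, 7}}
	free := map[int]bool{0: true, 4: true, 1: true, 5: false, 2: true, 6: true, 3: true, 7: true}
	fmt.Println(allocate(topology, free, 4)) // prefers whole cores 0 and 2: [0 4 2 6]
}
```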
For memory binding scenarios, a NUMA-aware scheduling strategy is used when placing Pods.
Resource isolation
Beyond the traditional cgroup-based limits and isolation, such as CPU quota and CPU shares for resource limits and cpuset for core binding, TKE has also made deep kernel-level customizations and optimizations for cloud-native scenarios.
For CPU, to cope with bursty traffic and hyper-threading interference at the micro level, the TencentOS cloud-native kernel supports scheduling offline tasks in the BT scheduling class, which quickly suppresses offline-task interference at the kernel level while preventing offline tasks from starving.
BT scheduling prevents starvation among offline tasks. When the resource consumption of online services spikes, the priority of offline services is usually lowered and they receive fewer and fewer time slices; the offline task at the head of the queue never finishes, starving every other offline job behind it. Tencent's kernel BT scheduling class avoids such starvation between offline jobs and lets them execute in turn.
For memory, the TencentOS kernel provides cgroup-level cache cleaning to release a container's page cache promptly; for example, after an offline job completes, the page cache it generated can be cleaned up in time.
For network and disk IO, the cloud-native kernel provides self-developed controls whose interfaces are all standard cgroup interfaces; TKE colocation makes full use of these QoS capabilities to guarantee container QoS.
Interference detection
Business SLO interference is detected on two fronts: system-level indicators and application-level indicators.
At the system level, TKE collects various system resource indicators, such as CPI (cycles per instruction) and system-call behavior, to detect interference from system metrics.
At the application level, TKE lets online services define their own SLO interference thresholds; the colocation system calls back and checks the service's SLO. Once the real SLO falls short of expectations, a series of measures is taken to remove the interference source, including a state-machine-controlled signal-handling flow that progressively compresses offline resources, forbids offline scheduling, and evicts offline tasks, to guarantee the business SLO and the stability of the whole node.
For the SLO of offline jobs, TKE supports dynamic priority adjustment and elastic bursting to the public cloud, to avoid long waits or frequent eviction of offline jobs and to ensure that they finish within the specified time.
Summary and outlook
This article has focused on the two most critical aspects of online-offline colocation, resource guarantees and scheduling guarantees, and has described TKE's colocation solution. For resource guarantees, application priority classes, kernel enhancements, interference detection, and hyper-threading avoidance ensure resource isolation between applications; for scheduling guarantees, the fast-rise/slow-fall prediction algorithm, the offline big box, and shared-state scheduling provide elastic resource reuse and resolve scheduling conflicts.
Looking ahead, the first direction for colocation is undifferentiated colocation: it will no longer be limited to online versus offline, but will mix more complex workload types with more priority levels, which calls for pooling of multi-priority resource pools, preemption and reclamation between pools, and deeper exploration of system-level interference, such as CPI-based detection and eBPF-based observability. The second is the ultimate combination of colocation and elasticity: extreme resource sharing between hybrid-cloud IDCs and the public cloud, and time-shared scheduling of resources across multiple clouds and clusters, to achieve the essential cloud goal of reducing cost and increasing efficiency.
References
TencentOS: ten years of iterative cloud-native evolution: https://mp.weixin.qq.com/s/Cbck85WmivAW0mtMYdeEIw
Intel® Hyper-Threading Technology: https://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html