Author
The author, Tian Qi, is a senior engineer at Tencent who focuses on large-scale online-offline colocation and on distributed resource management and scheduling. He is deeply familiar with Kubernetes and concentrates on cloud-native big data and AI.
Lead
What is online-offline colocation?
With the development of the Internet, big data, and artificial intelligence, enterprise IT environments usually run two types of workloads to meet business needs: online services and offline jobs.
Online services: usually long-running, with periodic traffic patterns and low overall resource utilization, but with extremely high SLA requirements, such as web search services and e-commerce transaction services.
Offline jobs: usually resource-intensive, but tolerant of high latency and of restarting failed tasks, such as big data analytics and machine learning training.
These two types of workloads leave great room for optimization through time-division multiplexing and complementary resource profiles, which makes them the prime scenario for colocation. Online-offline colocation means deploying offline jobs and online services on the same nodes, improving resource utilization and reducing the ever-growing cost of offline computing resources.
The value of online-offline colocation
Improved resource utilization
According to Gartner statistics, the average utilization of data centers worldwide is below 15%, which means an enormous amount of resources is wasted every year.
The main reasons for this inefficiency are:
- Business traffic is periodic. For online services, resources are usually provisioned for peak traffic in order to guarantee the SLA. For example, a food-delivery service may need 8 CPU cores during peak hours (meal times) but consume almost nothing during off-peak hours (at night), so utilization is low most of the time and resources are wasted.
- Cluster resource fragmentation. Resource fragmentation means a server still has unallocated static resources, but because the remaining amounts across dimensions (such as CPU and RAM) are unbalanced, no further workloads can be placed on it. Because mainstream resource scheduling frameworks use static allocation algorithms, fragmentation eventually accumulates and resources cannot be used effectively.
- Online and offline data centers are separated, and resource pools are divided at too coarse a granularity. Some companies completely isolate the online data center (mainly running web-type online services) from the offline data center (mainly running offline clusters such as Hadoop). At such a coarse granularity, idle resources in the online data center cannot be used by offline jobs, and vice versa: when the offline data center is idle, online services cannot use it either, so resource pools cannot be shared across IDCs.
Colocation makes full use of the idle resources on each node and thereby improves resource utilization.
Cost optimization
Online-offline colocation has already been put into practice at large and medium-sized Internet companies. The resource-utilization gains from colocation translate into considerable cost savings, which add up to enormous economic value at scale.
A simple back-of-the-envelope calculation shows roughly how much budget a 20-percentage-point increase in resource utilization can save:
Assume we currently have 10w (100,000) CPU cores in total and the average utilization per machine is 20%, so 0.2 × 10w = 2w cores are actually in use. With the same business scale, if average utilization rises to 40%, only 5w cores are needed to serve the business. Assuming an average CPU price of 300 yuan/core/year, we save 5w × 300 = 1500w yuan (15 million yuan) per year.
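Writing the same estimate out explicitly (10w = 100,000 cores; utilization rising from 20% to 40%; 300 yuan per core per year):

```latex
\text{cores in use} = 100{,}000 \times 0.2 = 20{,}000,\qquad
\text{cores needed at }40\% = \frac{20{,}000}{0.4} = 50{,}000,\qquad
\text{annual saving} = (100{,}000 - 50{,}000) \times 300 = 15{,}000{,}000 \text{ yuan}
```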
For companies under cost-control pressure, colocation is the first choice for reducing cost and increasing efficiency. For example, Google runs all of its workloads on Borg (the predecessor of Kubernetes); its resource utilization can reach 60%, which saves hundreds of millions of dollars every year.
Challenges
Scheduling guarantee
Resource reuse
In its traditional mode, Kubernetes schedules statically according to the request resources declared by the workload. If both online and offline workloads are scheduled by request, and offline jobs are scheduled first and fill up a node's requestable resources, online services can no longer be placed there; for the same reason, if online services are scheduled first and fill up the node's request resources, offline jobs cannot be scheduled. A traditional scheduler therefore cannot reuse the idle resources of online services.
Traditional resource reuse usually takes the form of time-sharing: offline jobs run only in fixed time windows, for example scheduled after midnight while online services run during the day. The time granularity of this mode is too coarse; although online resources can be reused for part of the day, the reuse is bound by a rigid schedule.
Another approach is resource reservation, which statically divides a machine's resources into online resources, offline resources, and shared resources. Because the whole machine is carved up statically, resources cannot be reused flexibly; capacity for online and offline services can only be reserved in advance. Although colocating offline jobs this way improves utilization to some extent, the reuse is insufficient, and static partitioning only pays off on machines with large resource specifications.
Therefore, efficient and automated fine-grained time-sharing of resources requires timely and accurate resource prediction, the ability to respond quickly to resource changes, and a set of safeguard measures for services when the resource level changes.
Scheduling enhancement
Because online and offline workloads have different working modes, the community typically uses different schedulers for them.
In a colocation scenario, an online scheduler and an offline scheduler are deployed in the same cluster. When resources are tight, the schedulers run into resource conflicts and can only retry, which significantly hurts scheduler throughput and scheduling performance, and ultimately the scheduling SLA.
At the same time, native Kubernetes cannot handle large-scale batch scheduling; it only supports scheduling of online workloads.
Resource Guarantee
Online and offline workloads are fundamentally different types of work. Deploying both on the same node causes resource interference: when resources are tight or traffic bursts, the online service's resource usage is disturbed by offline jobs. The most important goal of colocation is to improve single-machine resource utilization while guaranteeing the SLAs of both online and offline workloads.
- For online services, interference during traffic peaks must stay close to the level observed before colocation, with the interference rate reduced to below 5%.
- For offline jobs, whose priority is lower than that of online services, starvation or frequent eviction must be avoided, since either would hurt their total running time and SLA.
Resource isolation
The essence of a container is a restricted process: processes are isolated by namespaces and limited by cgroups. In the cloud-native era all workloads are containerized. In a colocation scenario, although cgroups can limit the resource usage of online and offline services, native cgroups face various challenges in oversold and colocation scenarios across CPU, memory, network, and disk IO.
For CPU, if a Pod specifies a limit, the cgroup quota caps the container's maximum usage and CPU shares divide CPU weight between applications. This works when resources are not tight, but when online traffic bursts, offline jobs cannot yield the cores quickly enough, which causes SLO jitter for the online service.
To guarantee the stability of online services, a common practice is CPU pinning: the online service is bound to specific logical cores so that other workloads cannot occupy them.
But this creates two problems. On one hand, exclusive cores lead to under-utilization: once a core is monopolized it cannot be fully used, whereas the whole point of colocation is to squeeze the CPU and keep it busy.
On the other hand, core binding limits parallelism for services that need parallel computation, whether online or offline. For example, a service limited to at most 4 cores but not pinned can spread its threads across all CPUs and achieve much higher parallelism, but once it is bound to 4 CPUs its maximum parallelism is 4.
Many vendors who have put colocation into production have run into the contradictions described above.
For memory, offline jobs often read large amounts of file data, causing the operating system to build up page cache. The native OS manages page cache globally rather than per container. Under the cgroup resource-limit mechanism, page cache is often not released in time, causing memory-allocation jitter in other containers; the page cache may even stay charged to a cgroup that cannot be cleaned up. In a colocation setup, if offline jobs also rely on page cache, their resources may fail to be reclaimed and throttled because the page cache is not released.
Resource interference
Hyper-Threading Technology is a very common technique in modern CPU architectures that presents one physical core to the operating system as multiple logical processors.
Simply put, modern CPUs are mostly built on a NUMA architecture: each NUMA node has a socket, each socket contains physical cores, and hyper-threading can be enabled on a physical core so that the operating system sees several logical CPUs. The "CPU" shown by the top command is a logical CPU; with hyper-threading disabled, a logical CPU corresponds to a physical core, but with it enabled, a logical CPU may be one of the hardware threads exposed by a physical core.
In a colocation setup, if an online service and an offline job are scheduled onto different logical cores of the same physical core, they will interfere with each other.
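As an illustration only (not part of the TKE product), here is a short Go sketch that reads the standard Linux sysfs topology files to find which logical CPUs share a physical core, which is exactly the sibling relationship that causes this interference:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// siblings returns, for each logical CPU, the list of logical CPUs that
// share the same physical core (its hyper-thread siblings).
func siblings() (map[string]string, error) {
	out := map[string]string{}
	dirs, err := filepath.Glob("/sys/devices/system/cpu/cpu[0-9]*")
	if err != nil {
		return nil, err
	}
	for _, d := range dirs {
		b, err := os.ReadFile(filepath.Join(d, "topology", "thread_siblings_list"))
		if err != nil {
			continue // the CPU may be offline; skip it
		}
		out[filepath.Base(d)] = strings.TrimSpace(string(b))
	}
	return out, nil
}

func main() {
	m, err := siblings()
	if err != nil {
		panic(err)
	}
	for cpu, sib := range m {
		// e.g. cpu0 -> "0,32" means logical CPUs 0 and 32 share one physical core,
		// so an online task pinned to 0 and an offline task pinned to 32 would interfere.
		fmt.Printf("%s shares a physical core with: %s\n", cpu, sib)
	}
}
```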
The TKE solution
TKE designed and implemented a colocation solution for the scenario of Tencent's internal self-developed businesses moving to the cloud.
Scheduling guarantee
Colocation nodes automatically report extended offline resources, and offline jobs are isolated inside an offline cgroup "big box", which guarantees elastic reuse and reclamation of resources.
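For illustration, a hedged sketch of how a node agent could advertise such extended offline resources by patching node status with client-go. The resource name `tke.cloud.tencent.com/offline-cpu` and the node name are assumptions, not TKE's actual identifiers:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes the agent runs on the node, e.g. as a DaemonSet
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	nodeName := "node-1" // in practice, read from a NODE_NAME downward-API env var

	// Advertise 10 cores of reclaimable offline CPU as an extended resource.
	// The resource name is hypothetical; "/" in it is escaped as "~1" per JSON Patch.
	// The value would be refreshed as the predicted idle online capacity changes.
	patch := []byte(`[{"op":"add","path":"/status/capacity/tke.cloud.tencent.com~1offline-cpu","value":"10"}]`)

	_, err = cs.CoreV1().Nodes().Patch(context.TODO(), nodeName,
		types.JSONPatchType, patch, metav1.PatchOptions{}, "status")
	if err != nil {
		panic(err)
	}
	fmt.Println("extended offline resource advertised")
}
```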
For scheduling enhancement, TKE adopts a multi-scheduler shared-state scheduling model: first, to solve the problem of reusing online resources during scheduling; second, to address scheduling conflicts, scheduling performance, scalability, and reliability.
Resource reuse
As mentioned above, traditional Kubernetes schedules statically according to the request resources declared by the workload. If both online and offline workloads are scheduled by request and offline jobs are scheduled first, occupying the node's request resources, online services cannot be scheduled; likewise, if online services are scheduled first, offline jobs cannot be scheduled, and a traditional scheduler cannot reuse the resources of online services.
First, for resource reuse, TKE accurately predicts the idle resources of online services and exposes them as extended resources, so that the offline scheduler can see how much offline capacity is reusable and schedule against it.
- For offline workloads whose request resources can be modified, a webhook dynamically rewrites the native resource expression (such as CPU) in the Pod spec into an extended-resource expression, so the business is unaware of the change; the Pod becomes a BestEffort-type Pod that the offline scheduler accounts for (see the sketch after this list).
- For BestEffort-type offline workloads, online resources are reused elastically based on the extended offline resources reported by the colocation nodes; these extended resources change in real time with the node's load level. The resource-reuse discussion above has already described the specific approach.
- For offline workloads whose request resources cannot be modified (such as driver Pods), scheduling follows their real requests, which may conflict with the online scheduler, because in a colocated cluster the cluster's request resources are usually already filled by online services.
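A hedged sketch of the mutation step such a webhook might perform; the extended-resource name is an assumption and the webhook wiring (AdmissionReview handling, TLS, patch encoding) is omitted. Removing the native cpu request/limit and re-expressing the amount as an extended resource leaves no cpu/memory requests, so the Pod's QoS class becomes BestEffort while the offline scheduler can still account for the amount:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// offlineCPU is an assumed extended-resource name; TKE's real name may differ.
const offlineCPU corev1.ResourceName = "tke.cloud.tencent.com/offline-cpu"

// mutate rewrites the native cpu requests/limits of an offline Pod into the
// extended offline-cpu resource (a real webhook would treat memory similarly).
func mutate(pod *corev1.Pod) {
	for i := range pod.Spec.Containers {
		c := &pod.Spec.Containers[i]
		if cpu, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
			c.Resources.Requests[offlineCPU] = cpu
			delete(c.Resources.Requests, corev1.ResourceCPU)
		}
		if cpu, ok := c.Resources.Limits[corev1.ResourceCPU]; ok {
			c.Resources.Limits[offlineCPU] = cpu
			delete(c.Resources.Limits, corev1.ResourceCPU)
		}
	}
}

func main() {
	pod := &corev1.Pod{Spec: corev1.PodSpec{Containers: []corev1.Container{{
		Name: "offline-worker",
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("4")},
			Limits:   corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("4")},
		},
	}}}}
	mutate(pod)
	fmt.Println(pod.Spec.Containers[0].Resources.Requests) // cpu replaced by the extended resource
}
```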
Second, after resources are reused there must be a layer of restriction so that offline workloads cannot over-consume the host's resources. At the level of underlying resource limits, online and offline services are restricted under different cgroup hierarchies:
- For online services, resource requirements are set normally, scheduling follows their request resources, and their resource limits are set according to Kubernetes' native cgroup QoS management;
- For resource-intensive offline workloads such as workers, all offline Pods are confined to a resource pool built from a parent cgroup hierarchy, that is, an offline "big box". The advantage is that offline jobs can make full use of idle online resources, yet when online resources need to be reclaimed, all offline tasks can be throttled together to keep online services stable.
Why not simply treat offline tasks as native Kubernetes BestEffort Pods?
Because under the Kubernetes mechanism, BestEffort workloads have no resource limits set at the cgroup layer. Once such a Pod misbehaves, it risks preempting online resources and crowding out memory, so a restricted parent-level cgroup must be used to contain them (a minimal sketch of such an offline parent cgroup follows).
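A minimal sketch, assuming cgroup v1 mounted at the usual /sys/fs/cgroup paths and a hypothetical parent group named `offline`, of how such an offline "big box" could be created and its total quota adjusted as the predicted idle online capacity changes. TKE's actual hierarchy and values are not shown here:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

const (
	cpuRoot = "/sys/fs/cgroup/cpu/offline"    // hypothetical parent cgroup for all offline Pods
	memRoot = "/sys/fs/cgroup/memory/offline" // cgroup v1 layout assumed
)

func write(dir, file, val string) error {
	return os.WriteFile(filepath.Join(dir, file), []byte(val), 0644)
}

// resize caps the whole offline "big box": every offline container is placed
// under this parent, so shrinking the quota throttles all offline tasks at
// once when online resources need to be reclaimed.
func resize(cpuCores float64, memBytes int64) error {
	for _, d := range []string{cpuRoot, memRoot} {
		if err := os.MkdirAll(d, 0755); err != nil {
			return err
		}
	}
	period := 100000 // 100 ms CFS period
	quota := int(cpuCores * float64(period))
	if err := write(cpuRoot, "cpu.cfs_period_us", strconv.Itoa(period)); err != nil {
		return err
	}
	if err := write(cpuRoot, "cpu.cfs_quota_us", strconv.Itoa(quota)); err != nil {
		return err
	}
	return write(memRoot, "memory.limit_in_bytes", strconv.FormatInt(memBytes, 10))
}

func main() {
	// e.g. currently allow offline tasks 6 reclaimable cores and 16 GiB in total
	if err := resize(6, 16<<30); err != nil {
		fmt.Println("resize failed:", err)
	}
}
```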
Because Kubernetes' native cgroup manager does not support custom cgroup hierarchies or updating their resources, the industry tends to patch the kubelet code and modify its cgroup manager. The TKE full-stack colocation solution, by contrast, is zero-intrusion to Kubernetes: it uses CRI interception to modify the underlying cgroups of the Kubernetes QoS hierarchy without changing any Kubernetes code, which is a highlight for customers.
Users do not need to do anything special to enjoy the resource-utilization improvement, and the solution stays fully compatible with community Kubernetes features. The colocation framework can be enabled and disabled automatically at single-node granularity, so colocation happens only where and when you need it.
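For illustration, a hedged sketch of the CRI-interception idea rather than TKE's actual implementation: a gRPC proxy sits between the kubelet and the real container runtime and redirects offline sandboxes into the offline "big box" cgroup before forwarding the request. It uses the k8s.io/cri-api runtime v1 types; the offline annotation key and the cgroup path are assumptions, and the gRPC plumbing that forwards every other CRI call unchanged is omitted:

```go
package main

import (
	"fmt"

	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// offlineAnnotation marks Pods classified as offline; the key is hypothetical.
const offlineAnnotation = "mixed-qos.tke.example/offline"

// offlineCgroupParent is the parent cgroup of the offline "big box".
const offlineCgroupParent = "/kubepods/offline"

// rewriteSandbox is the core of a CRI-intercepting proxy: it redirects offline
// sandboxes into the offline parent cgroup before the request reaches the runtime.
func rewriteSandbox(req *runtimeapi.RunPodSandboxRequest) {
	cfg := req.GetConfig()
	if cfg == nil || cfg.GetAnnotations()[offlineAnnotation] != "true" {
		return // online Pods keep the cgroup parent chosen by the kubelet
	}
	if cfg.Linux == nil {
		cfg.Linux = &runtimeapi.LinuxPodSandboxConfig{}
	}
	cfg.Linux.CgroupParent = offlineCgroupParent
}

func main() {
	req := &runtimeapi.RunPodSandboxRequest{
		Config: &runtimeapi.PodSandboxConfig{
			Annotations: map[string]string{offlineAnnotation: "true"},
			Linux:       &runtimeapi.LinuxPodSandboxConfig{CgroupParent: "/kubepods/besteffort/pod123"},
		},
	}
	rewriteSandbox(req)
	fmt.Println(req.Config.Linux.CgroupParent) // now under the offline big box
}
```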
For resource prediction and load handling, TKE uses an exponentially decaying sliding-window algorithm so that rises in resource usage are sensed quickly and drops are sensed slowly, achieving automated, fine-grained time-sharing. Rises must be sensed quickly because load spikes of online services are usually short-lived; they need to be detected immediately so that resources can be reclaimed at once and offline jobs can yield.
When load drops, online resources are generally not reduced immediately; the service is allowed to run for a while to confirm that the load has truly stabilized, and only then do offline jobs start reusing online resources again.
The algorithm first buckets the raw data. Bucketing removes the need to store individual historical points and reduces storage: storing raw points for a cluster of 100,000 containers grows linearly, whereas bucketed statistics keep the storage constant. Bucketing yields a histogram, over which a sliding-window computation produces the mean and quantile values used as the resource prediction; a time-decay function then increases the weight of the most recent points, diluting the influence of older history in the histogram and giving a short-term forecast of current resource usage.
In addition, the buckets are non-uniform, which is what makes rises fast to sense and drops slow to sense. Intuitively, suppose every point has the same weight of 1 (the weights are in effect the heights of the histogram bars) and we take P90 as the predicted value; P90 is determined by the histogram's area. If most current points are low-load points, P90 is a relatively low value. When the load of some points suddenly jumps, they all fall into the larger (wider) buckets, whose area immediately becomes dominant, so P90 jumps up into those buckets at once.
Conversely, when the load falls from high to low, the low-load points land in the narrow front buckets; their width is small, so their area cannot become dominant right away and P90 does not drop quickly. Only after most points are low does their height in the small buckets grow enough for those buckets to dominate, and only then does the P90 prediction come down.
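A minimal sketch of this idea, under assumptions of my own choosing (exponentially growing bucket boundaries, half-life decay, and a P90 computed over bucket areas, i.e. weight times bucket width, which is my reading of the description above); TKE's real bucket layout and decay constants are not given in this article:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// decayingHistogram stores per-bucket weights that decay exponentially with
// age, so recent samples dominate the quantile estimate.
type decayingHistogram struct {
	bounds   []float64 // non-uniform (exponentially growing) upper bounds, in cores
	weights  []float64
	halfLife time.Duration
	lastTick time.Time
}

func newHistogram(halfLife time.Duration) *decayingHistogram {
	bounds := []float64{}
	for b := 0.1; b < 64; b *= 1.5 { // buckets get wider at higher load
		bounds = append(bounds, b)
	}
	return &decayingHistogram{bounds: bounds, weights: make([]float64, len(bounds)),
		halfLife: halfLife, lastTick: time.Now()}
}

// Add records one CPU-usage sample (in cores), decaying older weights first.
func (h *decayingHistogram) Add(usage float64, now time.Time) {
	decay := math.Exp2(-now.Sub(h.lastTick).Seconds() / h.halfLife.Seconds())
	for i := range h.weights {
		h.weights[i] *= decay
	}
	h.lastTick = now
	for i, b := range h.bounds {
		if usage <= b || i == len(h.bounds)-1 {
			h.weights[i]++
			break
		}
	}
}

// Quantile accumulates per-bucket area (weight x bucket width): wide high-load
// buckets dominate quickly when bursts appear (fast rise), while narrow
// low-load buckets take much longer to dominate (slow fall).
func (h *decayingHistogram) Quantile(q float64) float64 {
	areas := make([]float64, len(h.weights))
	total, prev := 0.0, 0.0
	for i, w := range h.weights {
		areas[i] = w * (h.bounds[i] - prev)
		prev = h.bounds[i]
		total += areas[i]
	}
	acc := 0.0
	for i, a := range areas {
		acc += a
		if total > 0 && acc/total >= q {
			return h.bounds[i]
		}
	}
	return h.bounds[len(h.bounds)-1]
}

func main() {
	h := newHistogram(5 * time.Minute)
	now := time.Now()
	for i := 0; i < 100; i++ { // steady low load around 1 core...
		h.Add(1.0, now.Add(time.Duration(i)*time.Second))
	}
	fmt.Println("P90 before burst:", h.Quantile(0.9))
	for i := 0; i < 5; i++ { // ...then a short burst to 16 cores is sensed almost immediately
		h.Add(16.0, now.Add(time.Duration(100+i)*time.Second))
	}
	fmt.Println("P90 after burst:", h.Quantile(0.9))
}
```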
Scheduling enhancement
In a colocation scenario, each scheduler works independently and their views of cluster state are not synchronized, so multiple schedulers may pick the same node at the same time and conflict when writing resources.
The mainstream distributed resource scheduling approaches are the following:
Approach | Resource view | Interference | Allocation granularity | Cluster-wide policy |
---|---|---|---|---|
Single global scheduler (Kubernetes/Borg) | Global view | None (serial) | Global search, one scheduling unit at a time | Strict priority |
Static partitioning (e.g. by label) | Fixed subset | None (partitioned) | Per-partition policy, one scheduling unit at a time | Schedulers independent of each other |
Two-level scheduling (Mesos/YARN) | Dynamic subset | Pessimistic concurrency | Global search with resource hoarding; risk of deadlock or long waits | Strict fairness |
Shared state (Omega) | Global view | Optimistic concurrency | Each scheduler decides its own allocation | Each scheduler's own policy |
If a single scheduler handles both online and offline workloads, its logic becomes complicated, features pile up, and it is hard to maintain and iterate on. Especially at large cluster scale, the scheduler suffers in performance, reliability, iteration speed, and flexibility.
If two schedulers are simply deployed in the same cluster, they schedule against the same cluster state, but state updates are not synchronized between them. This causes scheduling conflicts: two schedulers may pick the same node at the same time when the node actually has room for only one of the two Pods.
Shared-state scheduling performs well on all counts: shared resource views, concurrency, flexibility of resource allocation, and flexible support for multiple schedulers. TKE therefore adopts shared-state, optimistically concurrent scheduling. This approach places high demands on the performance and reliability of the coordinator, but it achieves real resource sharing and a globally consistent resource view, and it lets customers deploy different schedulers for different scenarios.
TKE designed and implemented a scheduling coordinator. A Kubernetes scheduler only needs an extension plugin that submits a reservation in the Reserve phase to participate in shared-state concurrent scheduling.
The coordinator is driven by gRPC calls in a message-driven mode for performance and throughput, and it is currently a very lightweight design (a toy sketch follows the list below):
- Requests and events received by the coordinator are placed into multi-dimensional back-end queues
- Each node's queue prioritizes state-update events and high-priority resource requests
- Queues for different nodes are processed concurrently
- Each request performs only the most basic resource-conflict check, which is very lightweight
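A hedged toy model of the optimistic-concurrency idea behind such a coordinator, not TKE's implementation: each scheduler reserves resources on a node, and the coordinator admits the reservation only if the node's remaining capacity still covers it; a rejected scheduler simply retries elsewhere.

```go
package main

import (
	"fmt"
	"sync"
)

// reservation is what a scheduler would submit from its Reserve extension point.
type reservation struct {
	Node      string
	Scheduler string
	MilliCPU  int64
}

type nodeState struct {
	mu           sync.Mutex
	freeMilliCPU int64
}

// coordinator keeps an authoritative per-node view of remaining capacity. The
// conflict check is a single compare under a per-node lock, so requests for
// different nodes are processed fully concurrently.
type coordinator struct {
	mu    sync.Mutex
	nodes map[string]*nodeState
}

func newCoordinator(capacity map[string]int64) *coordinator {
	c := &coordinator{nodes: map[string]*nodeState{}}
	for n, milli := range capacity {
		c.nodes[n] = &nodeState{freeMilliCPU: milli}
	}
	return c
}

// Reserve returns true if the reservation is admitted.
func (c *coordinator) Reserve(r reservation) bool {
	c.mu.Lock()
	ns, ok := c.nodes[r.Node]
	c.mu.Unlock()
	if !ok {
		return false
	}
	ns.mu.Lock()
	defer ns.mu.Unlock()
	if ns.freeMilliCPU < r.MilliCPU {
		return false // conflict: another scheduler got there first
	}
	ns.freeMilliCPU -= r.MilliCPU
	return true
}

func main() {
	c := newCoordinator(map[string]int64{"node-1": 4000})
	// Two schedulers race for the same node; only one reservation can fit.
	fmt.Println(c.Reserve(reservation{"node-1", "online-scheduler", 3000}))  // true
	fmt.Println(c.Reserve(reservation{"node-1", "offline-scheduler", 3000})) // false
}
```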
Resource Guarantee
For single-machine resource guarantees, TKE works together with the TencentOS kernel to provide full-dimensional resource guarantees, with strong resource isolation and protection mechanisms on both the Kubernetes side and the kernel side.
Multi-priority strategy
In terms of priority, TKE uses fine-grained cpuset orchestration according to service priority: high-priority online services are pinned to cores via cpuset; mid-priority online services share CPU through cgroup quota and cpu shares; offline services are placed under a single offline cgroup resource pool, the offline "big box", which may use all CPU cores, or offline services can be pinned to a CPU subset completely disjoint from the high-priority online services.
For scenarios that require core pinning, a hyper-threading-avoidance core allocation algorithm is used. It follows a greedy strategy and preferentially binds whole physical cores rather than handing out logical CPUs indiscriminately, so the allocation is aware of the CPU topology (see the sketch below).
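A hedged sketch of such a greedy, topology-aware selection, assuming the physical-core to logical-CPU mapping has already been read from sysfs (as in the earlier topology example); it is not TKE's actual allocator:

```go
package main

import (
	"fmt"
	"sort"
)

// allocate picks `need` logical CPUs for a high-priority online workload,
// greedily taking whole physical cores first so the workload never shares a
// physical core's hyper-threads with an offline task, and topping up with
// leftover single threads only if it must.
func allocate(topology map[int][]int, free map[int]bool, need int) []int {
	// Deterministic order over physical cores.
	cores := make([]int, 0, len(topology))
	for core := range topology {
		cores = append(cores, core)
	}
	sort.Ints(cores)

	picked := []int{}
	// Pass 1: whole physical cores whose hyper-threads are all free.
	for _, core := range cores {
		threads := topology[core]
		allFree := true
		for _, t := range threads {
			if !free[t] {
				allFree = false
				break
			}
		}
		if allFree && len(picked)+len(threads) <= need {
			picked = append(picked, threads...)
		}
	}
	// Pass 2: top up with any remaining free logical CPUs (shared physical cores).
	for _, core := range cores {
		for _, t := range topology[core] {
			if len(picked) >= need {
				return picked
			}
			if free[t] && !contains(picked, t) {
				picked = append(picked, t)
			}
		}
	}
	return picked
}

func contains(s []int, v int) bool {
	for _, x := range s {
		if x == v {
			return true
		}
	}
	return false
}

func main() {
	// physical core -> its hyper-thread siblings (logical CPUs)
	topology := map[int][]int{0: {0, 4}, 1: {1, 5}, 2: {2, 6}, 3: {3, 7}}
	free := map[int]bool{0: true, 4: true, 1: true, 5: false, 2: true, 6: true, 3: true, 7: true}
	fmt.Println(allocate(topology, free, 4)) // prefers whole cores 0 and 2: [0 4 2 6]
}
```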
For memory binding scenarios, a NUMA-aware scheduling strategy is used when placing Pods.
Resource isolation
Beyond the traditional cgroup-based limits and isolation, such as CPU quota and CPU shares for resource limits and cpuset for core binding, TKE has also made deep kernel-level customizations and optimizations for cloud-native scenarios.
For CPU, to cope with bursty traffic and hyper-threading interference at the micro level, the TencentOS cloud-native kernel supports scheduling offline tasks in the BT scheduling class, which quickly suppresses offline-task interference at the kernel level while preventing offline tasks from starving.
BT scheduling prevents starvation among offline tasks. When the resource consumption of online services spikes, the priority of offline services is usually lowered and they receive fewer and fewer time slices; the offline task at the head of the queue never finishes, starving every other offline job behind it. Tencent's kernel BT scheduling class avoids such starvation between offline jobs and lets them execute in turn.
For memory, the TencentOS kernel provides cgroup-level cache cleaning to release a container's page cache promptly; for example, after an offline job completes, the page cache it generated can be cleaned up in time.
For network and disk IO, the cloud-native kernel provides self-developed controls whose interfaces are all standard cgroup interfaces; TKE colocation makes full use of these QoS capabilities to guarantee container QoS.
Interference detection
Business SLO interference is detected on two fronts: system-level indicators and application-level indicators.
At the system level, TKE collects various system resource indicators, such as CPI (cycles per instruction) and system-call behavior, to detect interference from system metrics.
At the application level, TKE lets online services define their own SLO interference thresholds; the colocation system calls back and checks the service's SLO. Once the real SLO falls short of expectations, a series of measures is taken to remove the interference source, including a state-machine-controlled signal-handling flow that progressively compresses offline resources, forbids offline scheduling, and evicts offline tasks, to guarantee the business SLO and the stability of the whole node.
For the SLO of offline jobs, TKE supports dynamic priority adjustment and elastic bursting to the public cloud, to avoid long waits or frequent eviction of offline jobs and to ensure that they finish within the specified time.
Summary and outlook
This article has focused on the two most critical aspects of online-offline colocation, resource guarantees and scheduling guarantees, and has described TKE's colocation solution. For resource guarantees, application priority classes, kernel enhancements, interference detection, and hyper-threading avoidance ensure resource isolation between applications; for scheduling guarantees, the fast-rise/slow-fall prediction algorithm, the offline big box, and shared-state scheduling provide elastic resource reuse and resolve scheduling conflicts.
Looking ahead, the first direction for colocation is undifferentiated colocation: it will no longer be limited to online versus offline, but will mix more complex workload types with more priority levels, which calls for pooling of multi-priority resource pools, preemption and reclamation between pools, and deeper exploration of system-level interference, such as CPI-based detection and eBPF-based observability. The second is the ultimate combination of colocation and elasticity: extreme resource sharing between hybrid-cloud IDCs and the public cloud, and time-shared scheduling of resources across multiple clouds and clusters, to achieve the essential cloud goal of reducing cost and increasing efficiency.
References
TencentOS: ten years of iterative cloud-native evolution: https://mp.weixin.qq.com/s/Cbck85WmivAW0mtMYdeEIw
Intel® Hyper-Threading Technology: https://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html