Authors: Zhang Zuowei, Li Tao

In April 2022, Koordinator, Alibaba Cloud's cloud-native co-location system, was officially open-sourced. After several months of iteration, Koordinator has released four versions that can effectively help enterprise customers improve the efficiency and stability of cloud-native workloads and reduce computing costs.

Yesterday (June 15th), in the Alibaba Yunqi live broadcast room, two technical experts from the Koordinator community, Zhang Zuowei (Youyi) and Li Tao (Lu Feng), shared how Koordinator's architecture and features address the challenges of co-location scenarios, especially how they improve the efficiency and stability of co-located workloads, as well as the community's thinking and planning for subsequent technical evolution. We have organized the core content of the live broadcast below, hoping to bring you some in-depth inspiration.

Click the link below to watch the live replay:

https://yqh.aliyun.com/live/detail/28787

Follow the Alibaba Cloud Native official WeChat account and reply [0615] in the background to get the full slide deck.

Introduction and Development of Co-location Technology

The concept of co-location can be understood from two perspectives. From the node dimension, co-location means deploying multiple containers on the same node, where the applications in those containers include both online and offline types. From the cluster dimension, co-location means deploying a variety of applications in one cluster and, through predictive analysis of application characteristics, letting services stagger their resource peaks and valleys, thereby improving the utilization of cluster resources.

Based on the above understanding, we can clarify the target problems and technical solutions that co-location needs to address. In essence, our original motivation for implementing co-location comes from the relentless pursuit of data center resource efficiency. According to an Accenture report, the average machine utilization of public cloud data centers in 2011 was less than 10%, which means the resource costs borne by enterprises were extremely high. Meanwhile, it has become an inevitable trend for big data to move to the cloud in a cloud-native way. According to Pepperdata's survey report in December 2021, a considerable number of enterprise big data platforms have begun migrating to cloud-native technology: more than 77% of respondents expected 50% of their big data applications to migrate to the Kubernetes platform by the end of 2021. As a result, mixing batch-type tasks with online service-type applications has become a common choice in the industry, and public data shows that leading technology companies have greatly improved their resource utilization through co-location.

Administrators in different roles have their own specific concerns when facing co-location technology.

For cluster resource administrators, they expect to simplify the management of cluster resources, gain clear insight into the resource capacity, allocation, and usage of various applications, improve cluster resource utilization, and reduce IT costs.

For administrators of online applications, they are more concerned about the mutual interference between containers deployed together, because resource competition is more likely to occur under co-location, causing tail latency in application response times and a decline in service quality.

Administrators of offline applications, for their part, expect the co-location system to provide tiered and reliable resource overcommitment to meet the differentiated resource quality requirements of different job types.

In response to the above problems, Koordinator provides the following mechanisms to fully meet the technical requirements that different roles place on a co-location system:

  • A resource priority and service quality (QoS) model for co-location scenarios
  • A stable and reliable resource overcommitment mechanism
  • Fine-grained container resource orchestration and isolation mechanisms
  • Enhanced scheduling capabilities for multiple types of workloads
  • Fast access for complex types of workloads

Introduction to Koordinator

The figure below shows the overall architecture of the Koordinator system and the division of responsibilities among its components. The green parts are components of the native K8s system, and the blue parts are Koordinator's extensions on top of it. Architecturally, Koordinator can be divided into two dimensions: central control and single-node resource management. On the central side, Koordinator provides extension capabilities both inside and outside the scheduler; on the node side, Koordinator provides two components, Koordlet and Koord Runtime Proxy, which are responsible for fine-grained resource management and QoS guarantees on each node.

[Figure: overall architecture of the Koordinator system]

The functions of each Koordinator component are as follows:

  • Koord-Manager
    • SLO-Controller: provides core control capabilities such as resource overcommitment, co-location SLO management, and fine-grained scheduling enhancement.
    • Recommender: provides elasticity capabilities for applications based on resource profiling.
    • Colocation Profile Webhook: simplifies the use of the Koordinator co-location model, provides one-click access for applications, and automatically injects the relevant priority and QoS configuration.
  • Koord-Scheduler extensions: enhanced scheduling capabilities for co-location scenarios.
  • Koord-Descheduler: provides a flexible and scalable rescheduling mechanism.
  • Koord Runtime Proxy: acts as a proxy between Kubelet and the container runtime, meets the resource management requirements of different scenarios, and provides a plug-in registration framework and an injection mechanism for resource parameters.
  • Koordlet: responsible for Pod QoS guarantees on each node, provides fine-grained container metric collection as well as interference detection and policy adjustment, and supports a series of Runtime Proxy plug-ins for injecting fine-grained isolation parameters.

A core concept in Koordinator's design is priority. Koordinator defines four levels: Product, Mid, Batch, and Free. A Pod needs to specify the resource priority it applies for, and the scheduler schedules it based on the total amount of resources available at each priority. The total amount of resources at each priority is affected by the requests and usage of higher-priority resources; for example, resources that have been requested at Product priority but are not actually used will be allocated out again at Batch priority. Koordinator publishes the concrete capacity of each resource priority on the Node in the form of standard extended resources.

The following figure shows the capacity of each resource priority on a node. The black line (total) represents the total physical resources of the node, and the red line represents the real usage of the higher-priority Product resources. The gap between the blue line and the black line reflects the overcommitted capacity at Batch priority: when Product usage is at a trough, Batch priority can obtain more overcommitted resources. In fact, how aggressive or conservative the resource priority policy is determines how much cluster resource can be overcommitted, which can also be seen from the overcommitment of the Mid resource priority shown by the green line in the figure.

[Figure: per-priority resource capacity of a node over time]
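Conceptually, the Batch capacity shown by the blue line can be approximated as below; this is a simplified sketch, and the actual calculation in Koordinator also involves reserved resources and smoothed usage metrics:

batchAllocatable ≈ nodeTotal − systemReserved − prodUsage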

As shown in the following table, Koordinator defines each resource priority in the form of a standard K8s PriorityClass, which represents the priority of the resources a Pod applies for. When multiple priorities of resources are overcommitted and node resources become tight, low-priority Pods are suppressed or evicted. In addition, Koordinator provides a Pod-level sub-priority for fine-grained control at the scheduler level (queuing, preemption, etc.).

[Table: Koordinator PriorityClass definitions]
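For reference, here is a minimal sketch of how such a resource priority maps onto a standard K8s PriorityClass. The name matches the koord-batch PriorityClass used in the profile example later in this article, while the value shown is only illustrative of the Batch priority band; the actual value ranges are defined by the Koordinator community:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: koord-batch
value: 5000                  # illustrative value within the Batch band
globalDefault: false
description: "Koordinator Batch priority (sketch)"

The Pod-level sub-priority mentioned above is what the koordinatorPriority field injects in the ClusterColocationProfile example later in this article.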

Another core concept in Koordinator's design is Quality of Service. Koordinator extends its QoS model via Pod annotations, which represent the resource quality of a Pod at runtime, mainly reflected in differences in isolation parameters. When node resources become tight, Pods with higher QoS levels are satisfied first. As shown in the table below, Koordinator divides QoS into three categories: System (system-level services), Latency Sensitive (latency-sensitive online services), and Best Effort (resource-consuming offline applications), with Latency Sensitive further subdivided into LSE, LSR, and LS.

[Table: Koordinator QoS classes]

Priority and QoS are two orthogonal dimensions that can be used in combination, although some combinations are restricted by the model definition and actual requirements. The following table shows the combinations commonly used in co-location scenarios, where "O" marks a common combination and "X" marks a rarely used one.

[Table: common Priority and QoS combinations]

Typical usage examples of each combination are as follows.

  • Typical scenarios:
    • Prod + LS: a typical online application, which usually has high requirements on latency and resource quality and also needs a certain amount of resource elasticity (see the sketch after this list).
    • Batch + BE: low-quality offline workloads in co-location scenarios with considerable tolerance for resource quality, such as batch Spark/MR tasks and AI training tasks.
  • Enhancements of typical scenarios:
    • Prod + LSR/LSE: for sensitive online applications with extremely high latency requirements, which accept sacrificing resource elasticity in exchange for better determinism (such as CPU core binding).
    • Mid/Free + BE: compared with "Batch + BE", the main difference is the level of resource quality required.
  • Atypical scenarios:
    • Mid/Batch/Free + LS: for low-priority online services, near-line computing, and AI inference tasks. Compared with big data tasks, these tasks cannot accept BE-level resource quality and cause relatively little interference to other applications; compared with typical online services, they can tolerate relatively lower resource quality, such as being evicted to a certain degree.
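As a concrete illustration of the most common combination, a Prod + LS online-service Pod only needs the Koordinator QoS label plus a Prod-level PriorityClass. The sketch below assumes the koord-prod PriorityClass is installed and uses a hypothetical nginx workload:

apiVersion: v1
kind: Pod
metadata:
  name: web-server                  # hypothetical online application
  labels:
    koordinator.sh/qosClass: LS     # Koordinator QoS protocol label
spec:
  priorityClassName: koord-prod     # Prod resource priority
  containers:
  - name: app
    image: nginx:1.21               # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "4"
        memory: 8Gi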

Quick Start

Koordinator supports flexible access for various co-located workloads. Here we take Spark as an example to show how to use co-location overcommitted resources. There are two ways to run Spark jobs in a K8s cluster. One is Spark Submit, that is, using the local Spark client to connect directly to the K8s cluster; this way is simple and fast but lacks overall management capabilities and is often used for development and self-testing. The other is the Spark Operator: as shown in the figure below, it defines a SparkApplication CRD to describe Spark jobs; users submit a SparkApplication CR to the APIServer through the kubectl client, and the Spark Operator is then responsible for the job life cycle and the management of the Driver Pod.

[Figure: submitting Spark jobs via Spark Operator]

With Koordinator, the ColocationProfile Webhook automatically injects the relevant co-location configuration (including QoS, priority, extended resources, etc.) into the Pods of the Spark job, as shown below. On the node side, koordlet guarantees the QoS of the Spark Pods so that co-location does not affect the performance of online applications. By co-locating Spark with online applications, the overall resource utilization of the cluster can be effectively improved.

# Spark Driver Pod example
apiVersion: v1
kind: Pod
metadata:
  labels:
    koordinator.sh/qosClass: BE
  ...
spec:
  containers:
  - args:
    - driver
    ...
    resources:
      limits:
        koordinator.sh/batch-cpu: "1000"
        koordinator.sh/batch-memory: 3456Mi
      requests:
        koordinator.sh/batch-cpu: "1000"
        koordinator.sh/batch-memory: 3456Mi
...
...

Introduction to Key Technologies

Resource Overcommitment

When using a K8s cluster, users find it difficult to accurately evaluate the resource usage of online applications and often do not know how to set the Request and Limit of their Pods, so they tend to set larger resource specifications to ensure application stability. In actual production, the real CPU utilization of most online applications is quite low most of the time, often only ten or twenty percent, which wastes a lot of allocated but unused resources.

[Figure: allocated but unused resources of an online application, with the reclaimable part marked Reclaimed]

Koordinator reclaims and reuses this allocated-but-unused portion through its resource overcommitment mechanism. Based on metric data, Koordinator evaluates how many resources of an online application's Pods can be reclaimed (in the figure above, the part marked Reclaimed), and these reclaimable resources can be overcommitted to low-priority workloads such as offline tasks. To make these resources easy for low-priority workloads to consume, Koordinator publishes the overcommitted amounts in the node status (as shown in the node info below). When an online application has a burst of requests to process and needs more resources, Koordinator helps it take those resources back through a rich set of QoS enhancement mechanisms, guaranteeing its service quality.

# node info
allocatable:
  koordinator.sh/batch-cpu: 50k    # milli-core
  koordinator.sh/batch-memory: 50Gi

# pod info
annotations:
  koordinator.sh/resource-limit: '{"cpu": "5k"}'
resources:
  requests:
    koordinator.sh/batch-cpu: 5k   # milli-core
    koordinator.sh/batch-memory: 5Gi

Load-Aware Scheduling

Resource overcommitment can greatly improve cluster resource utilization, but it also makes utilization imbalance between nodes more prominent. This phenomenon exists in non-co-located environments too, but because native K8s does not support resource overcommitment, node utilization is usually not very high, which masks the problem to some extent. Once co-location pushes utilization to a much higher level, the problem becomes exposed.

Utilization imbalance generally shows up as imbalance across nodes and as local load hotspots, which may affect the overall performance of workloads. On heavily loaded nodes there may also be severe resource contention between online applications and offline tasks, degrading the runtime quality of online applications.

[Figure: uneven utilization across cluster nodes]

To solve this problem, Koordinator's scheduler provides a configurable scheduling plugin to control cluster utilization. Relying on the node metrics reported by koordlet, the plugin filters out nodes whose load exceeds a certain threshold during scheduling. This both prevents Pods from landing on highly loaded nodes where they cannot obtain good resource guarantees, and prevents those nodes from deteriorating further. In the scoring phase, the plugin prefers nodes with lower utilization, and a time-window and estimation mechanism avoids the situation where too many Pods are dispatched to a cold node at once and overheat it after a while.

[Figure: load-aware scheduling based on node metrics]
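As a sketch of what enabling such a plugin could look like in a scheduler configuration, consider the snippet below; the plugin name and threshold fields are illustrative assumptions rather than the authoritative Koordinator API, so consult the community docs for the exact schema:

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: koord-scheduler
    pluginConfig:
      - name: LoadAwareScheduling      # assumed plugin name
        args:
          usageThresholds:             # assumed field names
            cpu: 65      # filter nodes above 65% CPU utilization
            memory: 95   # filter nodes above 95% memory utilization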

Application Access Management - ClusterColocationProfile

From the very beginning of open-sourcing the Koordinator project, we wanted to lower the threshold for using the co-location system so that everyone can quickly benefit from co-location technology. Therefore, Koordinator provides the ClusterColocationProfile CRD. With this CRD and the corresponding Webhook, the co-location capability can be enabled with one click for specific Namespaces or workloads as needed, without intruding on existing components in the cluster. The Webhook automatically injects the Koordinator priority, QoS configuration, and other co-location protocols into newly created Pods according to the rules described in the CRD.

 apiVersion: config.koordinator.sh/v1alpha1
kind: ClusterColocationProfile
metadata:
  name: colocation-profile-example
spec:
  namespaceSelector:
    matchLabels:
      koordinator.sh/enable-colocation: "true"
  selector:
    matchLabels:
      sparkoperator.k8s.io/launched-by-spark-operator: "true"
  qosClass: BE
  priorityClassName: koord-batch
  koordinatorPriority: 1000
  schedulerName: koord-scheduler
  labels:
    koordinator.sh/mutated: "true"
  annotations: 
    koordinator.sh/intercepted: "true"
  patch:
    spec:
      terminationGracePeriodSeconds: 30

For example, the above ClusterColocationProfile instance means that in every Namespace labeled koordinator.sh/enable-colocation=true, the Pods created by Spark Operator jobs will be converted into BE-type Pods. (Note: Spark Operator adds the label sparkoperator.k8s.io/launched-by-spark-operator=true to the Pods it creates, indicating that they are managed by Spark Operator.)

Co-location access can then be completed in just the following steps:

$ kubectl apply -f profile.yaml
$ kubectl label ns spark-job koordinator.sh/enable-colocation=true
$ # submit the Spark job; the Pods created by Spark Operator are co-located with other LS Pods

QoS Enhancements – CPU Suppress

To guarantee the runtime quality of online applications in co-location scenarios, Koordinator provides rich QoS enhancement capabilities on the node side.

Let us first introduce the CPU Suppress (dynamic CPU suppression) feature. As mentioned earlier, online applications do not fully use the resources they request most of the time, leaving plenty of idle resources. Besides being overcommitted to newly created offline tasks, these idle resources can also be shared with offline tasks already running on the node when no new offline tasks need to be scheduled. As shown in the figure, when koordlet finds that an online application's resources are idle and the CPU used by offline tasks has not exceeded the safety threshold, the idle CPU within the safety threshold can be shared with the offline tasks so that they finish faster. The load of the online applications therefore determines how much CPU the BE Pods can use: when the online load rises, koordlet suppresses the BE Pods through CPU Suppress and returns the shared CPU to the online applications.

[Figure: CPU Suppress dynamically adjusting the CPU available to BE Pods]
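Configuration-wise, the suppression safety threshold is expected to live in koordlet's SLO configuration. The sketch below assumes a slo-controller-config ConfigMap; the key and field names are illustrative and should be checked against the Koordinator docs:

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config      # assumed name and namespace
  namespace: koordinator-system
data:
  resource-threshold-config: |
    {
      "clusterStrategy": {
        "enable": true,
        "cpuSuppressThresholdPercent": 65
      }
    }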

QoS Enhancements – Eviction Based on Resource Satisfaction

With CPU Suppress, offline tasks may be suppressed frequently when the load of online applications rises. Although this protects the runtime quality of online applications well, it still affects offline tasks. Even though offline tasks have low priority, frequent suppression leads to unsatisfactory performance and can seriously affect offline service quality. In extreme cases, frequent suppression can even hurt online applications: if an offline task holds a special resource, such as a kernel global lock, while it is suppressed, problems like priority inversion may occur, although this does not happen very often.

[Figure: frequent suppression of offline tasks]

To solve this problem, Koordinator proposes an eviction mechanism based on resource satisfaction. We define CPU satisfaction as the ratio of the total CPU actually allocated to the total CPU expected to be allocated. When the CPU satisfaction of the offline task group falls below the threshold while its CPU utilization exceeds 90%, koordlet evicts some low-priority offline tasks to release resources for higher-priority offline tasks. Through this mechanism, the resource demands of offline tasks can be better satisfied.
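Expressed as a rule (a sketch; the concrete satisfaction threshold in koordlet is configurable):

cpuSatisfaction = cpuActuallyAllocated / cpuExpected
evict the lowest-priority BE Pods when cpuSatisfaction < threshold and BE CPU utilization > 90%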

QoS Enhancements – CPU Burst

We know that CPU utilization is the average CPU usage over a period of time. Most of the time we observe utilization at a coarse time granularity, at which the curve looks basically stable. But when observed at a finer granularity, CPU usage shows obvious and unstable burstiness. The figure below compares the utilization observed at 1s granularity (purple) with that observed at 100ms granularity (green).

[Figure: CPU utilization observed at 1s vs. 100ms granularity]

Fine-grained observation shows that CPU bursts and throttling are the norm. In the Linux kernel, the CPU consumption of a cgroup is controlled by the CFS bandwidth controller, which caps the cgroup's CPU usage. As a result, services are often severely throttled for short periods under burst traffic, producing long-tail latency and degrading service quality. As shown in the figure below, Req2 is not processed until around the 200ms mark because the CPU is throttled.

[Figure: Req2 delayed by CFS throttling]

To solve this problem, Koordinator helps online applications handle bursts with CPU Burst technology. CPU Burst allows a workload to spend CPU quota left over from normal operation when burst requests arrive. For example, while a container's everyday CPU usage stays below its CPU limit, the spare quota accumulates; when the container suddenly needs a large amount of CPU, CPU Burst lets it temporarily run beyond the limit, drawing on the accumulated quota. As shown in the figure below, thanks to the accumulated CPU resources, the bursty Req2 avoids throttling through CPU Burst and is processed quickly.

[Figure: CPU Burst letting Req2 avoid throttling]
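At the Pod level, CPU Burst is expected to be switchable via an annotation in the Koordinator protocol style; the key and policy value below are assumptions for illustration, not a confirmed API:

apiVersion: v1
kind: Pod
metadata:
  name: online-app                      # hypothetical online service
  labels:
    koordinator.sh/qosClass: LS
  annotations:
    koordinator.sh/cpuBurst: '{"policy": "auto"}'   # assumed key and value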

QoS Enhancements – Group Identity

In co-location scenarios, although the Linux kernel provides various mechanisms to meet the scheduling requirements of workloads with different priorities, when an online application and an offline task run on the same physical core, the online application inevitably suffers interference and performance degradation because they share physical resources. Alibaba Cloud Linux 2 supports the Group Identity feature starting from kernel version kernel-4.19.91-24.al7. Group Identity is a special priority scheduling mechanism implemented at the granularity of cgroups: simply put, when online applications need more resources, Group Identity can temporarily suppress offline tasks so that online applications respond quickly.

Using this feature is relatively simple: configure cpu.bvt_warp_ns of the cpu cgroup. In Koordinator, BE-type offline tasks are configured with -1, the lowest priority, while online application types such as LS/LSR are set to 2, the highest priority.

[Figure: Group Identity settings per QoS class]
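Koordlet applies these values per QoS class; a sketch of cluster-level configuration is shown below, where the ConfigMap key and field names are assumptions for illustration, not the confirmed API:

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config      # assumed, as in the CPU Suppress sketch
  namespace: koordinator-system
data:
  resource-qos-config: |
    {
      "clusterStrategy": {
        "lsClass": { "cpuQOS": { "enable": true, "groupIdentity": 2 } },
        "beClass": { "cpuQOS": { "enable": true, "groupIdentity": -1 } }
      }
    }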

QoS Enhancements – Memory QoS

Containers face two main constraints when using memory:

  • The container's own memory limit: when the container's own memory usage (including page cache) approaches its limit, the kernel's memory reclaim subsystem is triggered, affecting the memory allocation and release performance of applications in the container.
  • The node's memory limit: when container memory is overcommitted (Memory Limit > Request) and the whole machine runs short of memory, the kernel's global memory reclaim is triggered; this has a large performance impact and in extreme cases can make the whole node abnormal.

To improve application runtime performance and node stability, Koordinator introduces the Memory QoS capability. When it is enabled, koordlet adaptively configures the memory subsystem (memcg) to optimize the performance of memory-sensitive applications while guaranteeing fair memory use across the node.

[Figure: Memory QoS]
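Like the other QoS features, per-Pod Memory QoS is expected to follow the annotation protocol; the key and value below are illustrative assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: latency-sensitive-app          # hypothetical
  labels:
    koordinator.sh/qosClass: LS
  annotations:
    koordinator.sh/memoryQOS: '{"policy": "auto"}'  # assumed key and value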

Follow-up Evolution Plan

Fine-grained CPU Orchestration

We are designing and implementing fine-grained CPU orchestration.

Why provide this orchestration mechanism? As co-location pushes resource utilization into deeper waters, it becomes necessary to tune resource runtime performance more thoroughly. More refined resource scheduling can better guarantee runtime quality, so that utilization can be pushed to an even higher level through co-location.

We subdivided the LS type for online applications in Koordinator QoS into three types: LSE, LSR, and LS. The split QoS types provide stronger isolation and higher runtime quality, make the overall Koordinator QoS semantics more precise and complete, and remain compatible with the existing K8s QoS semantics.

In addition, we designed a set of rich and flexible CPU orchestration strategies for Koordinator QoS, as shown in the following table.

[Table: CPU orchestration strategies designed for Koordinator QoS]

[Table: CPU orchestration strategy corresponding to each Koordinator QoS class]

In addition, for the LSR type, two core-binding strategies are provided to help users balance performance and economy (see the annotation sketch after the figure below):

  • SameCore strategy: better isolation, but less flexibility.
  • Spread strategy: moderate isolation that can be reinforced with other isolation strategies; when used properly it can yield better performance than SameCore, and it leaves some room for flexibility.

[Figure: SameCore vs. Spread core-binding strategies]
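How a user might choose a strategy per workload is sketched below through a Pod annotation; since this feature is still being designed, the annotation key and policy names are illustrative assumptions rather than a finalized API:

apiVersion: v1
kind: Pod
metadata:
  name: latency-critical-app            # hypothetical LSR Pod
  labels:
    koordinator.sh/qosClass: LSR
  annotations:
    # assumed annotation: FullPCPUs ~ SameCore-style binding,
    # SpreadByPCPUs ~ Spread-style binding
    scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy": "FullPCPUs"}'
spec:
  priorityClassName: koord-prod
  containers:
  - name: app
    image: my-app:latest                # placeholder
    resources:
      requests:
        cpu: "4"
      limits:
        cpu: "4"                        # integer CPUs, as core binding requires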

Koordinator's fine-grained CPU orchestration is compatible with the existing K8s CPU Manager and NUMA Topology Manager mechanisms. That is, introducing Koordinator into an existing cluster does not affect existing Pods, and the capability can be rolled out safely in grayscale.

Resource Reservation

Resource reservation is another feature we are designing, and it helps solve several pain points of resource management. For example, the Internet business scenarios we are familiar with often have pronounced peaks and valleys, so we can reserve resources before a peak arrives to guarantee that resources are available for the peak traffic. Another example is scaling out: after a scale-out is initiated, Pods may stay pending in the cluster because there are no resources; if resource availability can be confirmed before scaling, machines can be added in time when resources are lacking, giving a much better experience. There are also scenarios such as rescheduling, where resource reservation can guarantee that evicted Pods will definitely have resources available, greatly reducing the resource risk of rescheduling and making the rescheduling capability safer to use.

Koordinator's resource reservation mechanism does not intrude on the existing APIs or code of the K8s community. It supports PodTemplateSpec, imitating a Pod going through the scheduler to find the most suitable node. It also supports declaring owners, so that matching Pods can preferentially use the reserved resources: when a real Pod is scheduled, it first tries to find suitable reserved resources according to its characteristics, and otherwise falls back to the idle resources in the cluster.

The following is an example of the Reservation CRD (the final design adopted by the Koordinator community shall prevail):

 kind: Reservation
metadata:
  name: my-reservation
  namespace: default
spec:
  template: ... # a copy of the Pod's spec
  resourceOwners:
    controller:
      apiVersion: apps/v1
      kind: Deployment
      name: deployment-5b8df84dd
  timeToLiveInSeconds: 300 # 300 seconds
  nodeName: node-1
status:
  phase: Available
  ...

Fine-grained GPU Scheduling

Fine-grained GPU scheduling is a capability we expect to provide in the future. GPUs differ considerably from CPUs in resource characteristics. In model-training scenarios such as machine learning, the same training job can show very different performance under different topologies: performance varies with the topology combination among the workers of a job, and even on a single node there can be huge performance differences between GPU cards depending on whether NVLINK is used. This makes the scheduling and allocation logic for GPUs very complicated. Moreover, when GPU and CPU computing tasks are co-located in a cluster, how to avoid wasting either resource is also an optimization problem to consider.

[Figure: GPU topology and its impact on scheduling]

Resource Specification Recommendation

Koordinator will also provide the ability to recommend resource specifications based on profiling. As mentioned earlier, it is difficult for users to accurately assess an application's resource usage: what is the relationship between Request and Limit, how should they be set, and which combination suits the application best? Pod resource specifications are often overestimated or underestimated, causing wasted resources or even stability risks.

Koordinator will provide resource profiling capabilities that collect, process, and analyze historical data to recommend more accurate resource specifications.

[Figure: resource profiling and specification recommendation]

Community Building

So far, we have released four versions in the past two months. The earlier versions mainly provided resource overcommitment and QoS enhancement capabilities, and also open-sourced the new koord-runtime-proxy component. In version 0.4, we began working on the scheduler, starting with the load-aware scheduling capability. The Koordinator community is currently implementing version 0.5, in which Koordinator will provide fine-grained CPU scheduling and resource reservation. In future plans, we will also implement rescheduling, gang scheduling, GPU scheduling, elastic quota, and other new capabilities.


We look forward to your feedback on any issues you encounter with Koordinator, and to your help improving documentation, fixing bugs, and adding new features.

  • If you find a typo, try to fix it!
  • If you find a bug, try to fix it!
  • If you find some redundant codes, try to remove them!
  • If you find some test cases missing, try to add them!
  • If you could enhance a feature, please DO NOT hesitate!
  • If you find code implicit, try to add comments to make it clear!
  • If you find code ugly, try to refactor that!
  • If you can help to improve documents, it could not be better!
  • If you find document incorrect, just do it and fix that!
  • ...

In addition, we hold a regular community meeting every other week on Tuesdays from 19:30 to 20:30, and like-minded partners are welcome to join the exchange groups below for more information.

[Image: WeChat group QR code]

[Image: DingTalk group QR code]

Click here to learn about the Koordinator project now!

