This article describes how Meituan tackles large-scale cluster management and designs a sound cluster scheduling system. It discusses the problems and challenges Meituan cares most about when adopting cloud-native technology represented by Kubernetes, together with the corresponding rollout strategy. It also covers some special support built for Meituan's business scenarios. We hope it helps or inspires readers interested in the cloud-native field.

Introduction

The cluster scheduling system plays an important role in the enterprise data center. As cluster scale and the number of applications keep growing, the complexity developers face in handling business problems has also increased significantly. How can we manage large-scale clusters, design a sound cluster scheduling system, and at the same time ensure stability, reduce costs, and improve efficiency? This article answers these questions one by one.

| Remarks: The article was first published in "New Programmer 003" Developer Column in the Cloud Native Era.

Introduction to Cluster Scheduling System

The cluster scheduling system, also known as the data center resource scheduling system, is generally used to solve resource management and task scheduling problems in the data center. It also provides automated operations capabilities that reduce the cost of operating and managing services. Well-known cluster scheduling systems in the industry include the open source OpenStack, YARN, Mesos, and Kubernetes, as well as systems built by major Internet companies, such as Google's Borg, Microsoft's Apollo, Baidu's Matrix, and Alibaba's Fuxi and ASI.

As core IaaS infrastructure at Internet companies, the cluster scheduling system has gone through several architectural evolutions over the past decade. As businesses moved from monolithic architectures to SOA (Service-Oriented Architecture) and then to microservices, the underlying IaaS layer gradually moved from the era of bare-metal physical machines to the era of containers. The core problem has not changed during this evolution, but its complexity has grown exponentially with the rapid expansion of cluster size and application count. This article explains the challenges of large-scale cluster management and the design ideas behind cluster scheduling systems. Using the Meituan cluster scheduling system as an example, it then describes a series of cloud-native practices: building a unified scheduling service for multiple clusters, continuously improving resource utilization, and providing a Kubernetes engine service that enables PaaS components, all to give businesses a better computing service experience.

The challenge of large-scale cluster management

As we all know, the rapid growth of business has brought about a surge in the scale of servers and the number of data centers. For developers, in the business scenario of a large-scale cluster scheduling system, the two problems that must be solved are:

  1. How to manage the deployment and scheduling of large-scale clusters in data centers. In particular, in cross-data-center scenarios, how to achieve resource elasticity and scheduling capability, raise resource utilization as far as possible while guaranteeing application service quality, and fully reduce data center costs.
  2. How to transform the underlying infrastructure and build a cloud-native operating system for business teams: improve the computing service experience, automate disaster recovery response and deployment upgrades for applications, reduce the mental burden of managing underlying resources, and let business teams focus on the business itself.

Challenges of Operating Large-Scale Clusters

Solving these two problems in a real production environment breaks down into the following four challenges of operating and managing large-scale clusters:

  1. How to meet users' diverse needs and respond quickly. Business scheduling requirements and scenarios are rich and dynamic. A platform service such as a cluster scheduling system must, on the one hand, deliver functionality quickly to meet business needs in time. On the other hand, it must abstract those requirements into general capabilities that can be implemented on the platform and iterated on over the long term. This tests the platform team's technical evolution planning: without care, the team gets trapped in endless development of business-specific features, and although business needs are met, the team's work becomes low-level and repetitive.
  2. How to improve the resource utilization of data centers running online applications while guaranteeing application service quality. Resource scheduling has long been recognized as a hard problem in the industry. With the rapid growth of the cloud computing market, cloud vendors keep increasing their investment in data centers, yet data center resource utilization is very low, which makes the problem more serious. Gartner research found that the CPU utilization of data center servers worldwide is only 6% to 12%, and even Amazon Elastic Compute Cloud (EC2) reaches only 7% to 17%, which shows how much is wasted. The reason is that online applications are very sensitive to resource utilization, so the industry has to reserve extra resources to guarantee the quality of service (QoS) of important applications. When multiple applications are co-located, the cluster scheduling system needs to eliminate interference between them and isolate their resources from one another.
  3. How to automatically handle instance failures for applications, especially stateful ones, hide differences between data centers, and reduce how much users need to know about the underlying layer. As service applications keep growing and the cloud computing market matures, distributed applications are often deployed in data centers in different regions, or even across different cloud environments, forming multi-cloud or hybrid cloud deployments. The cluster scheduling system needs to provide a unified infrastructure for business teams, implement a hybrid multi-cloud architecture, and hide the underlying heterogeneous environment. At the same time, it should reduce the complexity of operating and managing applications, raise their level of automation, and give businesses a better operations experience.
  4. How to address the performance and stability risks of cluster management that arise when a single cluster is too large or there are too many clusters. The complexity of managing a cluster's own life cycle grows with the size and number of clusters. Taking Meituan as an example, our multi-center, multi-cluster deployment across two regions avoids, to some extent, the hidden dangers of oversized clusters and solves problems such as business isolation and regional latency. With the emergence of edge cluster scenarios and the demand to move PaaS components such as databases onto the cloud, the number of small clusters can be expected to rise noticeably. As a result, the complexity of cluster management, the cost of monitoring configuration, and operations costs all increase significantly. The cluster scheduling system then needs to provide more effective operating standards and ensure operational safety, alert self-healing, and change efficiency.

Trade-offs when designing a cluster scheduling system

To address the above challenges, a good cluster scheduler will play a key role. But there is never a perfect system in reality, so when designing a cluster scheduling system, we need to make trade-offs among several contradictions according to the actual scenario:

  1. System throughput versus scheduling quality. System throughput is an important criterion for judging a system, but in a cluster scheduling system for online services, scheduling quality matters more. Each scheduling decision has a long-lasting effect (days, weeks, or even months) and is not adjusted unless something goes wrong. A poor scheduling decision therefore directly increases service latency. The higher the required scheduling quality, the more constraints must be evaluated, the worse the scheduling performance, and the lower the system throughput.
  2. Architectural complexity versus scalability. The more functions and configuration options the system exposes to upper-layer PaaS users, and the more features it supports to improve the user experience (such as application resource preemption and reclamation, or self-healing of abnormal application instances), the higher the system complexity and the more prone the system is to conflicts.
  3. Reliability versus single-cluster scale. The larger a single cluster, the larger the schedulable range, but the greater the reliability challenge, because the blast radius grows and a failure has more impact. When a single cluster is small, scheduling concurrency improves, but the schedulable range shrinks, the probability of scheduling failure rises, and cluster management becomes more complex.

At present, cluster scheduling systems in the industry fall into five architectures: monolithic scheduler, two-level scheduler, shared-state scheduler, distributed scheduler, and hybrid scheduler (see Figure 1 below). Each makes different choices according to its own scenario, and none is absolutely better or worse.

Figure 1: Classification of cluster scheduling system architectures (from Malte Schwarzkopf, "The evolution of cluster scheduler architectures")

  • A monolithic scheduler uses complex scheduling algorithms combined with global cluster information to compute high-quality placements, but with high latency. Examples are Google's Borg and the open source Kubernetes.
  • A two-level scheduler addresses the limitations of the monolithic scheduler by separating resource scheduling from job scheduling. It allows job scheduling logic to be tailored to specific applications while still sharing cluster resources between different jobs, but it cannot support preemption by high-priority applications. Representative systems are Apache Mesos and Hadoop YARN.
  • A shared-state scheduler addresses the limitations of the two-level scheduler in a semi-distributed way. Each scheduler holds a copy of the cluster state and updates that copy independently. Once a local copy changes, the state of the entire cluster is updated, but sustained resource contention degrades scheduler performance. Representative systems are Google's Omega and Microsoft's Apollo.
  • A distributed scheduler uses relatively simple scheduling algorithms to place tasks in parallel at large scale with high throughput and low latency. Because the algorithms are simple and lack a global view of resource usage, high-quality job placement is hard to achieve. A representative system is UC Berkeley's Sparrow.
  • A hybrid scheduler splits the workload across centralized and distributed components, using complex algorithms for long-running tasks and a distributed layout for short-running tasks. Microsoft's Mercury takes this approach.
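To make the monolithic approach more concrete, here is a minimal sketch of a filter-then-score placement pass over global cluster state. The `Node` and `Pod` types and the "most free share after placement" scoring rule are illustrative assumptions, not the actual Borg or Kubernetes implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpu_total: float      # cores on the host
    cpu_requested: float  # cores already reserved by placed pods


@dataclass
class Pod:
    name: str
    cpu_request: float    # cores the pod asks for


def schedule(pod: Pod, nodes: List[Node]) -> Optional[Node]:
    """Filter out nodes that cannot fit the pod, then pick the best-scoring one."""
    feasible = [n for n in nodes if n.cpu_total - n.cpu_requested >= pod.cpu_request]
    if not feasible:
        return None  # scheduling failure: no node in the schedulable range fits

    def score(n: Node) -> float:
        # Hypothetical rule: prefer the node with the largest free share after placement.
        return (n.cpu_total - n.cpu_requested - pod.cpu_request) / n.cpu_total

    best = max(feasible, key=score)
    best.cpu_requested += pod.cpu_request  # commit the decision to the global state
    return best
```

Every decision reads the whole node list, which is what gives a monolithic scheduler its placement quality and also what limits its throughput as the cluster grows.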

How to judge a scheduling system therefore depends mainly on the actual scheduling scenario. Take YARN and Kubernetes, the two most widely used systems in the industry, as examples. Although both are general-purpose resource schedulers, YARN focuses on short offline batch tasks and Kubernetes focuses on long-running online services. Beyond the differences in architecture and functionality (Kubernetes is a monolithic scheduler, YARN is a two-level scheduler), their design philosophies and perspectives also differ. YARN is task-oriented: it focuses on resource reuse and on avoiding repeated remote copies of data, with the goal of running tasks at lower cost and higher speed. Kubernetes is service-oriented: it focuses on staggering peaks, service profiling, and resource isolation, with the goal of guaranteeing service quality.

The evolution of the Meituan cluster scheduling system

While rolling out containerization, Meituan changed the core engine of its cluster scheduling system from OpenStack to Kubernetes to fit its business scenarios. By the end of 2019, Meituan had met its goal of containerizing more than 98% of online business. However, it still faced low resource utilization and high operations costs:

  • Overall cluster resource utilization was not high. For example, average CPU utilization was still only at the industry average, far behind other first-tier Internet companies.
  • The containerization rate of stateful services was too low. In particular, products such as MySQL and Elasticsearch did not use containers, leaving considerable room to optimize operations costs and resource costs.
  • From the perspective of business needs, VM products will exist for a long time. VM scheduling and container scheduling were two separate environments, which made the team's virtualization products expensive to operate and maintain.

We therefore decided to start a cloud-native transformation of the cluster scheduling system. The aim was to build a large-scale, highly available scheduling system with multi-cluster management and automated operations capabilities, one that supports scheduling policy recommendation and self-service configuration, provides cloud-native extension capabilities at the lower layers, and improves resource utilization while guaranteeing application service quality. The core work was organized around three directions: maintaining stability, reducing costs, and improving efficiency.

  • Maintaining stability: improve the robustness and observability of the scheduling system; reduce coupling between modules and lower complexity; improve the automated operations capabilities of the multi-cluster management platform; optimize the performance of core components; ensure the availability of large-scale clusters.
  • Reducing costs: deeply optimize the scheduling model and connect cluster scheduling with single-machine scheduling. Move from static resource scheduling to dynamic resource scheduling, and introduce offline business containers to combine free competition with strong control. Improve resource utilization and reduce IT costs while guaranteeing the service quality of high-priority business applications.
  • Improving efficiency: let users adjust scheduling policies themselves to meet individual business needs, actively embrace the cloud-native field, and provide PaaS components with core capabilities including orchestration, scheduling, cross-cluster support, and high availability, to improve operations efficiency.

Figure 2: Architecture of the Meituan cluster scheduling system

In the end, the Meituan cluster scheduling system architecture is divided by domain into three layers (see Figure 2 above): the scheduling platform layer, the scheduling policy layer, and the scheduling engine layer.

  • The platform layer is responsible for business onboarding. It connects to Meituan's infrastructure, encapsulates native interfaces and logic, and provides container management interfaces (scale-out, update, restart, scale-in) and related functions.
  • The policy layer provides unified scheduling across multiple clusters. It continuously optimizes scheduling algorithms and policies, and raises CPU utilization and allocation rates by classifying services according to information such as service level and the resources each service is sensitive to.
  • The engine layer provides Kubernetes services and ensures the stability of the cloud-native clusters of multiple PaaS components. It pushes general capabilities down into the orchestration engine to lower the cost of adopting cloud native for businesses.

Through refined operations and polishing of product features, we now manage nearly one million container and virtual machine instances at Meituan, have raised resource utilization from the industry average to a first-class level, and support the containerization and cloud-native adoption of PaaS components.

Multi-cluster unified scheduling: improve data center resource utilization

Resource utilization is one of the most important indicators for judging a cluster scheduling system. Although we completed containerization in 2019, containerization is a means, not an end. Our goal is to bring users more benefits by switching from the VM technology stack to the container technology stack, for example by comprehensively reducing their computing costs.

Improvement in resource utilization is limited by individual hotspot hosts in the cluster. When a service scales out, its containers may land on a hotspot host, and service performance indicators such as TP95 then fluctuate. Like other companies in the industry, we could then only guarantee service quality by adding resource redundancy. The root cause is that the Kubernetes scheduling engine allocates resources by considering only Request and Limit (Kubernetes lets a container declare a request value, Request, and a constraint value, Limit, as the resource quota the user applies for), which is static resource allocation. As a result, although different hosts are allocated the same amount of resources, their actual resource utilization differs considerably because the services they run differ.
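The gap between static allocation and real load can be illustrated with a small sketch. The `Host` type, the numbers, and the two selection functions are hypothetical and only show why a request-based view cannot tell a hotspot host from an idle one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    name: str
    cpu_total: float      # cores on the host
    cpu_requested: float  # sum of container Requests placed on the host
    cpu_used: float       # measured usage in cores


def allocation_rate(h: Host) -> float:
    """Static view: share of the host reserved by Requests."""
    return h.cpu_requested / h.cpu_total


def utilization(h: Host) -> float:
    """Dynamic view: share of the host actually in use."""
    return h.cpu_used / h.cpu_total


def pick_by_request(hosts: List[Host]) -> Host:
    # Request-only placement: on a tie, the first host wins, hot or not.
    return min(hosts, key=allocation_rate)


def pick_by_load(hosts: List[Host]) -> Host:
    # Load-aware placement: prefer the host with the lowest measured usage.
    return min(hosts, key=utilization)


hosts = [Host("hot", 64, 48, 51.2), Host("cool", 64, 48, 12.8)]
print(pick_by_request(hosts).name, pick_by_load(hosts).name)
```

Both hosts have the same allocation rate, so the request-only rule may send a new container to the hotspot, whereas the load-aware rule avoids it.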

In academia and industry, there are two common approaches to resolving the conflict between resource efficiency and application service quality. The first is to solve it from a global perspective with an efficient task scheduler. The second is to strengthen resource isolation between applications through single-machine resource management. Either way, we need a full picture of the cluster state, so we did three things:

  • We systematically established the relationships between cluster state, host state, and service state. Combined with a scheduling simulation platform, and taking both peak and average utilization into account, this enabled prediction and scheduling based on the historical load of hosts and the real-time load of services.
  • Through a self-developed dynamic load adjustment system and a cross-cluster rescheduling system, we linked cluster scheduling with single-machine scheduling and implemented service quality assurance policies for different resource pools according to business classification.
  • After three iterations, we built our own cluster federation service, which better solves resource preemption and state data synchronization, improves scheduling concurrency across clusters, and implements computing separation, cluster mapping, load balancing, and cross-cluster orchestration control (see Figure 3 below).
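As a rough illustration of load-aware prediction, the sketch below blends a host's historical peak and average utilization and never predicts below its current load. The blending weight, the threshold, and the function names are assumptions made for this example; the article does not describe Meituan's actual prediction model.

```python
from statistics import mean
from typing import Dict, Optional, Sequence, Tuple


def predicted_utilization(history: Sequence[float], current: float,
                          peak_weight: float = 0.6) -> float:
    """Blend historical peak and average; never predict below the current load."""
    blended = peak_weight * max(history) + (1 - peak_weight) * mean(history)
    return max(blended, current)


def pick_host(hosts: Dict[str, Tuple[Sequence[float], float]],
              threshold: float = 0.7) -> Optional[str]:
    """Pick the host with the lowest predicted utilization under the threshold.

    `hosts` maps a host name to (historical utilization samples, current utilization).
    """
    predicted = {name: predicted_utilization(hist, cur)
                 for name, (hist, cur) in hosts.items()}
    eligible = {name: p for name, p in predicted.items() if p <= threshold}
    if not eligible:
        return None  # every host is predicted to become a hotspot
    return min(eligible, key=eligible.get)


hosts = {
    "h1": ([0.2, 0.3, 0.9], 0.25),  # quiet now, but with a high historical peak
    "h2": ([0.4, 0.5, 0.5], 0.45),  # busier now, but with a flat history
}
print(pick_host(hosts))
```

A host that looks idle at the moment can still be rejected if its history shows high peaks, which is the point of considering peak and average utilization together.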

Figure 3: Architecture of cluster federation V3

The third version of the cluster federation service (Figure 3) is divided into a Proxy layer and a Worker layer according to modules, and is deployed independently:

  • The Proxy layer selects a suitable cluster for scheduling based on weighted cluster-state factors, and selects a suitable Worker to dispatch the request to. The Proxy module uses etcd for service registration, leader election, and discovery. The Leader node handles preemption tasks during scheduling, while all nodes can serve query tasks.
  • The Worker layer handles part of a cluster's query requests. When tasks for a cluster are blocked, additional Worker instances for that cluster can be started quickly to relieve the problem. A large cluster corresponds to multiple Worker instances, and the Proxy distributes scheduling requests across them, which improves scheduling concurrency and reduces the load on each Worker.
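The Proxy behavior described above can be sketched as weighted cluster selection followed by dispatch to one of that cluster's Workers. The factor names, weights, cluster names, and round-robin dispatch are hypothetical; leader election through etcd is omitted from this sketch.

```python
from itertools import cycle
from typing import Dict, List, Tuple

# Hypothetical cluster-state factors and weights.
WEIGHTS = {"free_cpu_ratio": 0.5, "free_mem_ratio": 0.3, "health": 0.2}


def cluster_score(state: Dict[str, float]) -> float:
    """Weighted sum of cluster-state factors; higher is better."""
    return sum(weight * state[factor] for factor, weight in WEIGHTS.items())


class Proxy:
    def __init__(self, clusters: Dict[str, Dict[str, float]],
                 workers: Dict[str, List[str]]) -> None:
        self.clusters = clusters
        # One round-robin iterator per cluster spreads requests over its Workers.
        self._next_worker = {name: cycle(ws) for name, ws in workers.items()}

    def route(self) -> Tuple[str, str]:
        """Return (cluster, worker) for the next scheduling request."""
        cluster = max(self.clusters, key=lambda c: cluster_score(self.clusters[c]))
        return cluster, next(self._next_worker[cluster])


proxy = Proxy(
    clusters={
        "cluster-a": {"free_cpu_ratio": 0.2, "free_mem_ratio": 0.4, "health": 1.0},
        "cluster-b": {"free_cpu_ratio": 0.6, "free_mem_ratio": 0.5, "health": 1.0},
    },
    workers={"cluster-a": ["w1"], "cluster-b": ["w1", "w2"]},
)
print(proxy.route(), proxy.route())
```

Adding a Worker to a busy cluster only means extending that cluster's worker list, which mirrors how Worker instances are scaled out when tasks are blocked.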

Finally, through unified scheduling across multiple clusters, we moved from a static resource scheduling model to a dynamic one. This reduced the proportion of hotspot hosts and of resource fragments, guaranteed the service quality of high-priority business applications, and raised the average CPU utilization of servers in online business clusters by 10 percentage points. Average cluster resource utilization is calculated as Sum(nodeA.cpu.current cores used + nodeB.cpu.current cores used + ...) / Sum(nodeA.cpu.total cores + nodeB.cpu.total cores + ...), sampled once per minute, with all samples of the day averaged.
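The utilization calculation can be written out as a short sketch. The `Sample` layout is an assumption for illustration, and "current number of cores" is read here as cores currently in use.

```python
from statistics import mean
from typing import Dict, List, Tuple

# One per-minute sample: node name -> (cores in use, total cores).
Sample = Dict[str, Tuple[float, float]]


def cluster_utilization(sample: Sample) -> float:
    """Cores in use across all nodes divided by total cores across all nodes."""
    used = sum(u for u, _ in sample.values())
    total = sum(t for _, t in sample.values())
    return used / total


def daily_average(samples: List[Sample]) -> float:
    """Average of the per-minute cluster utilization values of one day."""
    return mean([cluster_utilization(s) for s in samples])


samples = [
    {"nodeA": (8, 32), "nodeB": (24, 32)},  # 32 of 64 cores in use
    {"nodeA": (16, 32), "nodeB": (0, 32)},  # 16 of 64 cores in use
]
print(daily_average(samples))
```

Summing cores before dividing weights each node by its size, so a large idle host lowers the cluster figure more than a small one.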

Scheduling Engine Service: Empowering PaaS Service Cloud Native Landing

Besides resource scheduling, the cluster scheduling system also addresses how services use computing resources. As the book "Software Engineering at Google" points out, the cluster scheduling system, as one of the key components of Compute as a Service, must solve resource scheduling (breaking physical machines down into resource dimensions such as CPU and memory) and resource contention (the "noisy neighbor" problem), and it must also handle application management (automated instance deployment, environment monitoring, exception handling, maintaining the number of service instances, determining the amount of resources a business needs, supporting different service types, and so on). To some extent, application management matters more than resource scheduling, because it directly affects the efficiency of business development and operations and the effectiveness of service disaster recovery. After all, labor costs in the Internet industry are higher than machine costs.

Containerizing complex stateful applications has always been a hard problem in the industry, because the distributed systems involved often maintain their own state machines. When such an application is scaled or upgraded, ensuring the availability of existing instances and the connectivity between them is far more complicated and thorny than for stateless applications. Although we had containerized stateless services, we had not yet realized the full value of a good cluster scheduling system. Managing computing resources well requires managing service state, decoupling resources from services, and improving service resilience, which is exactly what the Kubernetes engine is good at.

Based on the customized Kubernetes version optimized by Meituan, we created the Meituan Kubernetes Engine Service MKE:

  • Strengthening cluster operations: we improved automated cluster operations, including cluster self-healing, the alerting system, and event log analysis, and we continue to improve cluster observability.
  • Setting benchmarks with key businesses: we worked closely with several important PaaS components to quickly address user pain points such as sidecar upgrade management, grayscale iteration of operators, and alert separation.
  • Continuously improving the product experience: we keep optimizing the Kubernetes engine. Besides supporting user-defined operators, MKE provides a general scheduling and orchestration framework (see Figure 4) that helps users onboard to MKE at lower cost and benefit from the technology.

Figure 4: Scheduling and orchestration framework of the Meituan Kubernetes Engine Service

In the process of promoting the implementation of cloud native, a widely concerned question is: what is the difference between managing stateful applications based on the Kubernetes cloud native approach compared to building a management platform by ourselves?

For this problem, we need to consider the root of the problem - operability:

  • Building on Kubernetes means the system forms a closed loop, so there is no need to worry about the data inconsistencies that often arise between two separate systems.
  • Failure response can be at the millisecond level, which reduces the system's RTO (Recovery Time Objective: the longest service interruption the business can tolerate, and also the shortest time required from the occurrence of a disaster to the recovery of the business system's service functions).
  • The complexity of operating the system is also reduced, and services achieve automatic disaster recovery. Besides the service itself, the configuration and state data the service depends on can be restored together.
  • Compared with the previous siloed ("chimney-style") management platforms of individual PaaS components, general capabilities can be pushed down into the engine service, reducing development and maintenance costs. Relying on the engine service also hides the underlying heterogeneous environment and enables service management across data centers and multi-cloud environments.
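The closed-loop idea behind these points can be sketched as a reconcile function that compares declared state with observed state and returns the actions needed to converge. This is a simplified illustration with hypothetical names and statuses; a real operator for a stateful service would also restore configuration and state data, which is not shown.

```python
from typing import Dict, List


def reconcile(desired_replicas: int, observed: Dict[str, str]) -> List[str]:
    """Return the actions that move the observed instances toward the desired count.

    `observed` maps an instance name to its status, for example "Running" or "Failed".
    """
    healthy = sorted(name for name, status in observed.items() if status == "Running")
    failed = sorted(name for name, status in observed.items() if status != "Running")

    # Abnormal instances are removed first.
    actions = [f"delete {name}" for name in failed]
    if len(healthy) < desired_replicas:
        # Replace what is missing.
        actions += ["create"] * (desired_replicas - len(healthy))
    else:
        # Scale in any surplus healthy instances.
        actions += [f"delete {name}" for name in healthy[desired_replicas:]]
    return actions


print(reconcile(3, {"db-0": "Running", "db-1": "Failed", "db-2": "Running"}))
```

Because the same loop handles failures, scale-out, and scale-in, exception handling does not need a separate management platform that could drift out of sync with the cluster.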

Looking Ahead: Building a Cloud-Native Operating System

We believe that cluster management in the cloud-native era will fully transform from the previous functions of managing hardware and resources to an application-centric cloud-native operating system. With this as the goal, the Meituan cluster scheduling system also needs to make efforts from the following aspects:

  • Application link delivery management. As business scale and link complexity grow, the operational complexity of the PaaS components and underlying infrastructure that a business depends on already exceeds what most people expect, and it is even harder for newcomers who have just taken over a project. We therefore need to let businesses deliver services through declarative configuration and operate them in a self-service way, give businesses a better operations experience, improve the availability and observability of applications, and reduce the burden of managing underlying resources.
  • Edge computing solutions. As Meituan's business scenarios keep expanding, demand for edge computing nodes has grown much faster than expected. We will draw on industry best practices to form an edge solution that suits Meituan, provide edge node management capabilities to the services that need them as soon as possible, and achieve cloud-edge-device collaboration.
  • Online-offline co-location capabilities. There is an upper limit to how far the resource utilization of online business clusters can be raised. According to the 2019 data center cluster data that Google disclosed in the paper "Borg: the Next Generation", the resource utilization of online tasks is only about 30% when offline tasks are excluded, which shows that pushing further carries higher risk and a low return on investment. The Meituan cluster scheduling system will continue to explore online-offline co-location. However, because Meituan's offline data centers are relatively independent, our path will differ from the common approach in the industry. We will start by co-locating online services with near-real-time tasks, complete the underlying capabilities, and then explore co-locating online tasks with offline tasks.

Summary

Overall, the Meituan cluster scheduling system is designed on the principle of fitness for purpose: first meet the basic needs of the business and ensure system stability, then gradually refine the architecture, improve performance, and enrich functionality. Accordingly, we made the following choices:

  • Between system throughput and scheduling quality, we give priority to the throughput the business requires of the system. We do not over-pursue the quality of any single scheduling decision, and instead adjust and improve placements through rescheduling.
  • Between architectural complexity and scalability, we reduce coupling between system modules and lower system complexity, and we require that extension features can be degraded.
  • Between reliability and single-cluster scale, we control the size of each cluster through unified scheduling across multiple clusters, which ensures system reliability and reduces the blast radius.
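The first choice, fast initial placement followed by rescheduling, can be sketched as a periodic pass that moves instances off hotspot hosts. The data layout, the threshold, and the "move the smallest instance first" policy are assumptions for illustration, not Meituan's rescheduling algorithm.

```python
from typing import Dict, List, Tuple


def reschedule(load: Dict[str, List[int]],
               hot_threshold: int = 80) -> List[Tuple[str, str, int]]:
    """Move instances off hosts whose total load exceeds the threshold.

    `load` maps a host name to the loads of its instances, in percent of host capacity.
    Returns the moves performed as (source host, destination host, instance load).
    """
    moves = []
    for src in sorted(load):
        while load[src] and sum(load[src]) > hot_threshold:
            inst = min(load[src])  # hypothetical policy: move the smallest instance first
            dst = min((h for h in load if h != src),
                      key=lambda h: sum(load[h]), default=None)
            if dst is None or sum(load[dst]) + inst > hot_threshold:
                break  # no host can absorb the instance without becoming hot
            load[src].remove(inst)
            load[dst].append(inst)
            moves.append((src, dst, inst))
    return moves


cluster = {"h1": [50, 30, 20], "h2": [10], "h3": [40]}
print(reschedule(cluster))
```

The initial scheduler can stay simple and fast because placement mistakes are corrected later, at the cost of occasional instance migrations.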

In the future, we will continue to optimize and iterate Meituan's cluster scheduling system based on the same logic, and completely transform it into an application-centric cloud-native operating system.

About the Author

Tan Lin, from Meituan Basic R&D Platform/Basic Technology Department.


| This article is produced by Meituan technical team, and the copyright belongs to Meituan. Welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication, please indicate "The content is reproduced from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial activities, please send an email to tech@meituan.com to apply for authorization.

