Author: Zhimin, Shuyuan

Looking back on 2021, what were the important events in the cloud-native field?

1. Accelerated implementation of container-based distributed cloud management:

At the Alibaba Cloud Summit in May 2021, Alibaba Cloud released "One Cloud, Multiple Forms," a multi-form deployment model: a single cloud based on the Apsara architecture can fully cover computing scenarios ranging from core regions to customer data centers, providing customers with low-cost, low-latency, localized public cloud products.

Before the release of One Cloud, Multiple Forms, Alibaba Cloud Container Service had already released the ability to register off-cloud Kubernetes clusters at the 2019 Yunqi Conference, supporting unified management of Kubernetes clusters both on and off the cloud. In 2021, Alibaba Cloud Container Service further upgraded the unified management of container clusters across the central cloud, local clouds, and edge clouds. Mature cloud-native observability and security protection capabilities can now be deployed into the user's own environment, and advanced cloud middleware, data analysis, and AI capabilities can be brought down to local environments, meeting customers' needs for product richness and data governance and accelerating business innovation. Relying on powerful elastic computing power and hosted elastic nodes, enterprises can expand from local environments to the cloud on demand, achieving second-level scaling and calmly handling periodic or sudden traffic peaks.

As of 2021, using Kubernetes to shield the differences between heterogeneous environments and build a distributed cloud architecture has become a consensus among enterprises and cloud vendors.

2. Knative 1.0 is officially released:

As an open-source serverless orchestration framework built on Kubernetes, Knative provides serverless application orchestration on top of standardized Kubernetes APIs. Knative supports many features: traffic-based autoscaling, grayscale (canary) releases, multi-version management, scale-to-zero, event-driven Eventing, and more. According to the CNCF 2020 China Cloud Native Survey Report, Knative has become the first choice for running serverless workloads on Kubernetes.

In November 2021, Knative released version 1.0, and in the same month Google announced that it would donate Knative to the Cloud Native Computing Foundation (CNCF). Alibaba Cloud provides hosted Knative, with enhancements such as cold-start optimization and prediction-based intelligent elasticity built on Alibaba Cloud infrastructure, realizing a deep integration of community standards and cloud service strengths.
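The traffic-based autoscaling and scale-to-zero behavior described above can be illustrated with a toy calculation (the function name and defaults are illustrative assumptions, not Knative's actual code): a KPA-style autoscaler divides observed request concurrency by a per-pod target.

```python
import math

def desired_replicas(observed_concurrency: float,
                     target_concurrency: float = 10.0,
                     max_replicas: int = 100) -> int:
    """Return the replica count a KPA-style autoscaler would request.

    Scale-to-zero: with no in-flight requests, the service is scaled
    down to 0 pods; later traffic triggers a cold start.
    """
    if observed_concurrency <= 0:
        return 0  # scale to zero
    raw = math.ceil(observed_concurrency / target_concurrency)
    return min(raw, max_replicas)
```

With a target of 10 concurrent requests per pod, 25 in-flight requests yield 3 pods, and an idle service drops to 0; this is why the cold-start optimizations mentioned above matter for the first request after idling.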

What breakthroughs did container technology make in 2021? What problems lie behind them?

In 2021, enterprises embraced containers more actively, with higher requirements for the startup efficiency, resource overhead, and scheduling efficiency of core container technologies. The Alibaba Cloud container team supported a new generation of container architecture upgrades, with full-stack optimizations spanning bare metal, the operating system, and other layers, to keep tapping the potential of containers.

Efficient scheduling: the newly upgraded Cybernetes scheduler supports NUMA load awareness, topology scheduling, and fine-grained resource isolation and colocation on multi-architecture Shenlong machines, improving application performance by 30%. In addition, much end-to-end optimization has been done on the scheduler: in a 1,000-node cluster it can schedule more than 20,000 Pods/min, ensuring that both online services and offline tasks run efficiently on K8s.
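One common scoring strategy behind such schedulers, bin-packing, can be sketched minimally (the names and the CPU-only model are simplifying assumptions, not the Cybernetes implementation): score each feasible node by how full it would be after placement, and prefer the fullest one, so that load is consolidated and whole nodes stay free for large pods.

```python
def binpack_score(requested_cpu: float, node_used_cpu: float,
                  node_capacity_cpu: float) -> float:
    """Higher score for nodes that end up fuller (bin-packing).
    Returns -1 for nodes the pod does not fit on."""
    if node_used_cpu + requested_cpu > node_capacity_cpu:
        return -1.0
    return (node_used_cpu + requested_cpu) / node_capacity_cpu

def pick_node(pod_cpu, nodes):
    """nodes: list of (name, used_cpu, capacity_cpu) tuples."""
    scored = [(binpack_score(pod_cpu, used, cap), name)
              for name, used, cap in nodes]
    best_score, best_name = max(scored)
    return best_name if best_score >= 0 else None
```

Given a 1-core pod and nodes at 3/4 and 1/4 cores used, bin-packing picks the fuller node (score 1.0 vs 0.5); a spreading strategy would simply invert the score.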

High-performance container networking: the latest generation of Alibaba Cloud's container network, Terway 3.0, offloads virtualization network overhead to the Shenlong chip on the one hand, and implements container service forwarding and network policy via eBPF in the OS kernel on the other, achieving zero loss and high performance.

Container-optimized OS: for container scenarios, Alibaba Cloud launched LifseaOS, a lightweight, fast, and secure container-optimized operating system with atomic image management. Compared with a traditional operating system, the number of software packages is reduced by 60% and the image size by 70%, and boot time drops from more than 1 minute to about 2 seconds. It supports read-only images and ostree technology for managing OS image versions, so that updating software packages or fixed configurations is done at whole-image granularity.

High-density deployment and extreme elasticity: based on Alibaba Cloud Security Sandbox 2.0, resource overhead inside the sandboxed container is reduced to a minimum of about 30 MB, enabling a high-density serving capability of 2,000 instances on a single physical machine. At the same time, by shortening the control links and simplifying components, supplemented by optimizations to sandbox memory allocation, host cgroup management, and the IO path, serverless scenarios can launch 3,000 elastic container instances in 6 seconds.

What are the trends in the scale of enterprise container adoption? What are the core demands?

As enterprises use containers at ever larger scale, the scope of container use within enterprises has gradually expanded from online business to AI and big data, bringing more and more demand for managing heterogeneous resources such as GPUs and for managing AI tasks and jobs. At the same time, developers are considering how to use cloud-native technologies to support more types of workloads with a unified architecture and technology stack, avoiding the "chimney" systems, duplicated investment, and extra operations burden that come from running different workloads on different architectures and technologies.

Deep learning and AI tasks are among the important workloads for which the community seeks cloud-native support. Alibaba Cloud proposed the definition, technical overview, and reference architecture of "Cloud Native AI" to provide practical best practices for this new technical field, and launched a cloud-native AI suite. Through the scheduling and management of data-computing tasks, and the containerized unified scheduling and operation of various heterogeneous computing resources, it significantly improves the resource utilization efficiency of heterogeneous computing clusters such as GPU/NPU and the delivery speed of AI projects.

Based on the characteristics of AI computing tasks, many extensions and enhancements were made on top of the core Kubernetes Scheduler Framework, providing task scheduling strategies such as gang scheduling, capacity scheduling, and binpack to improve cluster resource utilization. The team also actively works with the K8s community to continue evolving the K8s scheduler framework, ensuring that the scheduler can be extended with various scheduling strategies on demand through the standard plugin mechanism to meet the scheduling needs of various workloads, while avoiding the data-inconsistency risks that custom out-of-tree schedulers can introduce into cluster resource allocation.
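Gang scheduling's all-or-nothing semantics, which matter for distributed training jobs, can be sketched with a toy first-fit model (names and the CPU-only placement are illustrative assumptions, not the actual scheduler plugin):

```python
def try_gang_schedule(pod_requests, node_free):
    """Attempt to place a gang of pods; return assignments or None.

    All-or-nothing (gang scheduling): either every pod in the group
    gets a node, or none is admitted. This avoids the deadlock where
    a distributed training job holds partial resources forever while
    waiting for the rest of its workers.
    """
    free = dict(node_free)          # work on a copy; commit only on success
    assignments = {}
    for pod, cpu in pod_requests.items():
        # first-fit over nodes with enough free CPU
        node = next((n for n, f in free.items() if f >= cpu), None)
        if node is None:
            return None             # one pod cannot fit -> reject the whole gang
        free[node] -= cpu
        assignments[pod] = node
    return assignments
```

A two-worker gang that fits is placed in full; adding a third worker that does not fit causes the entire group to be rejected rather than partially admitted.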

It supports GPU shared scheduling, topology-aware scheduling, and scheduling of customized chips such as NPU/FPGA to improve the resource utilization of AI tasks. At the same time, Alibaba Cloud's self-developed cGPU solution provides GPU memory and compute isolation without modifying the application container.
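The bookkeeping side of GPU sharing can be illustrated with a toy model (illustrative only; cGPU enforces isolation at the kernel level, which this sketch does not attempt): the scheduler tracks memory slices handed out on one physical card and rejects requests that would oversubscribe it.

```python
class SharedGPU:
    """Toy accounting model for GPU memory sharing (GiB) on one card."""

    def __init__(self, total_mem_gib: int):
        self.total = total_mem_gib
        self.allocations = {}       # pod name -> GiB granted

    def allocate(self, pod: str, mem_gib: int) -> bool:
        """Grant a memory slice if the card still has room."""
        used = sum(self.allocations.values())
        if used + mem_gib > self.total:
            return False            # would oversubscribe the card
        self.allocations[pod] = mem_gib
        return True
```

Two 8 GiB pods fill a 16 GiB card; a third request is rejected, so each tenant keeps its promised share.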

Driven by the separation of computing and storage, Fluid provides an efficient and convenient data-abstraction layer that decouples data from storage. Through data-affinity scheduling and distributed cache engine acceleration, it brings data and computing together and thereby speeds up computing's access to data. Alluxio and JindoFS are supported as cache engines.

It supports elastic scaling of heterogeneous resources such as GPUs, and avoids unnecessary cloud resource consumption through intelligent peak shaving and valley filling. Both elastic model training and elastic model inference are supported.
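Peak shaving can be illustrated by scaling on a moving average rather than on instantaneous load, so that a momentary spike does not trigger a costly GPU scale-up (a minimal sketch; the names and window size are assumptions):

```python
import math

def smoothed_replicas(qps_samples, target_qps_per_replica, window=3):
    """Scale on a moving average of recent load samples.

    Short spikes are 'shaved': one high sample barely moves the
    average, so replicas grow only when load is sustained.
    """
    recent = qps_samples[-window:]
    avg = sum(recent) / len(recent)
    return max(1, math.ceil(avg / target_qps_per_replica))
```

With 100 QPS per replica, a one-off spike to 400 QPS after steady 100 QPS yields only 2 replicas, while sustained 400 QPS yields 4.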

What new requirements have enterprises raised for container applications?

With the development of industries and services such as 5G, IoT, audio and video, live streaming, and CDN, we see an industry trend: enterprises have begun to push more computing power and services closer to data sources or end users, to achieve better response times and lower costs.

This differs clearly from the traditional centralized cloud computing model, and it is what gives rise to edge computing. As an extension of cloud computing, edge computing will be widely used in hybrid cloud/distributed cloud, IoT, and other scenarios. It requires future infrastructure to be decentralized, edge facilities to be autonomous, and edge cloud hosting capabilities to be strong. The new frontier of cloud-native architecture, a "cloud-edge-device integrated" IT infrastructure, has begun to appear before the entire industry; this is also what enterprises demand when landing cloud-native technologies and containerized applications in new scenarios.

The cloud-native architecture and technology stack for edge computing needs to solve problems such as cloud-edge O&M coordination, elasticity coordination, network coordination, edge IoT device management, lightweight footprint, and cost optimization. In response to these new cloud-edge-device requirements, the OpenYurt community (a CNCF Sandbox project) released versions 0.4 and 0.5 in 2021, continuously improving the IoT device management, resource overhead, and network collaboration capabilities of edge containers.

From a technical perspective, what are the main problems that urgently need to be solved in container development?

With the large-scale use of K8s applications in enterprises, continuously improving the overall stability of K8s clusters is the core challenge. As a distributed system, a K8s cluster is highly complex, and a problem in any part of the application, infrastructure, or deployment process can cause the business system to fail. This requires enterprises applying K8s not only to build a high-availability system for cloud-native container technology, but also to upgrade their overall cloud-native operations philosophy.

Use SLO definitions to drive the observability system: build normalized performance stress-testing capabilities for K8s capacity, so that for the business scenarios running on a cluster, numbers such as node count, Pod count, job count, and QPS of the core verbs are clearly understood. Combined with real business scenarios, SLOs are defined, and continuous attention is paid to golden signals such as request volume, latency, error count, and saturation.
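The SLO bookkeeping described above can be sketched as a small error-budget calculation (a minimal illustration; the names and the request-count budget model are assumptions, not a specific monitoring product):

```python
def slo_report(total_requests: int, error_count: int,
               slo_target: float = 0.999) -> dict:
    """Compute availability and remaining error budget for an SLO.

    The error budget is the number of failed requests the SLO
    tolerates over the measurement window; burning it down faster
    than expected is the usual trigger for freezing risky changes.
    """
    availability = 1 - error_count / total_requests
    budget = (1 - slo_target) * total_requests      # errors allowed
    return {
        "availability": availability,
        "slo_met": availability >= slo_target,
        "error_budget_remaining": budget - error_count,
    }
```

At a 99.9% target, 500 errors out of 1,000,000 requests leaves half the error budget; 5 errors out of 1,000 already breaches the SLO.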

Normalized fault drills and chaos testing: for example ChaosBlade, which applies chaos-engineering concepts to inject different abnormal cases into risky actions on container clusters, simulating faults at every layer from VM, K8s, network, and storage to the application.

Fine-grained flow control and risk control: protection against anomalies found during stress testing and fault drills. With the API Priority and Fairness feature (beta in Kubernetes 1.20), fine-grained flow control policies can be applied; Alibaba Cloud Container Service also has a built-in self-developed UserAgentLimiter to further protect K8s.
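The kind of per-client flow control mentioned here (for example, limiting requests by user agent) is commonly built on a token bucket; a minimal sketch follows (illustrative only, not the actual APF or UserAgentLimiter implementation):

```python
class TokenBucket:
    """Minimal token-bucket rate limiter.

    Tokens refill continuously at `rate` per second up to `burst`;
    each admitted request consumes one token, so short bursts pass
    but sustained overload is throttled.
    """

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at bucket capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With rate 1/s and burst 2, two back-to-back requests pass, the third is rejected, and after one second of quiet the client is admitted again.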

In addition to building global high-availability capabilities, the SRE team's platform capabilities need to be built:

Create a unified K8s operations service interface, accumulate O&M and observability capabilities, and enable every SRE/DEV to go on-call or provide support interchangeably. There are two sub-goals: 1) avoid problems as much as possible; 2) find, locate, and recover from problems as quickly as possible, and build a global high-availability emergency response system.

Emphasize practice and drills: practice is scenario-based and unites knowledge and action, forming a closed loop from knowledge triggering action to action being completed, then cycling again. Train through competition: extreme scenarios such as the Double Eleven promotion, power outages, and network interruptions require dedicated stability construction, and capacity planning, stress testing, component management, and so on all need such special scenarios to mature. With the arena set, fighting this battle well requires working together, and a large collaboration mechanism gradually takes shape.

Solidify knowledge and build up playbooks: this is about creating standards. In the process of making standards, some of them land in systems first, some are deposited into playbooks, and some are reflected in processes; these processes must embody the best practices of excellent engineers and SREs. Systems, playbooks, and processes are constantly transformed into and complement one another.

What will container technology focus on in 2022? What is the room for imagination in the future of containers?

Recently, the international consulting firm Forrester released the global container capability report "The Forrester Wave™: Public Cloud Container Platforms, Q1 2022". The report shows that Alibaba Cloud is the only Chinese service provider in the report's "Leaders" quadrant, with the highest score for comprehensive container product capability.

Alibaba Cloud container technology will focus on several directions in 2022:

Green and low-carbon: continue to leverage the efficient scheduling and elasticity of container technology to help enterprises improve overall IT efficiency. Combined with the latest energy-saving data center technology, the new generation of the Shenlong architecture, self-developed chips, and a container-optimized operating system, full-stack optimization up and down the stack improves overall application performance and scheduling efficiency. In a data-driven way, intelligent scheduling and real-time adjustment based on applications' runtime resource profiles simplify application resource configuration, further improve colocated deployment of applications, reduce resource costs, and facilitate enterprises' overall FinOps management.

AI engineering: for AI to become enterprise productivity, engineering technology must solve the full-lifecycle management problems of model development, deployment, management, prediction, and inference. We see three urgent tasks in the field of AI engineering: data and computing power, the scaling of scheduling and programming paradigms, and the standardization of development and services. These require continuous optimization of efficient scheduling on heterogeneous architectures such as GPUs, combined with technologies like distributed caching and distributed dataset acceleration, together with Kubeflow Arena's AI task pipelines and lifecycle management, to comprehensively upgrade AI engineering capabilities.

Intelligent autonomy: promote an intelligent container operations system by introducing more data-driven and intelligent means, reduce the burden on enterprises of managing complex container clusters and applications, and strengthen the self-healing and self-recovery capabilities of K8s masters, components, and nodes, while providing friendlier capabilities such as exception diagnosis, K8s configuration recommendation, and elasticity prediction.

Security and compliance: comprehensively promote the evolution from DevOps to DevSecOps. Optimize the overall security definition, signing, synchronization, and third-party delivery of OCI artifacts such as Helm charts and Operators; strengthen north-south and east-west network isolation and governance for containers and promote zero-trust link security; and further improve the performance and observability of secure containers and confidential-computing containers.

Click here to visit the official website of Alibaba Cloud ACK Anywhere.


Alibaba Cloud Cloud Native