Building a production-grade observability system on Alibaba Cloud Container Service (ACK)

Author: Feng Shichun (Xingji)

Introduction to the ACK observability system

Panoramic overview

[Figure: panorama pyramid of the ACK observability system]

The figure above shows the panorama pyramid of the ACK observability system, which is divided into four layers from top to bottom:

The top layer is Business Monitoring, the layer closest to the user's business. It covers front-end traffic, PV, front-end performance, and JS response speed. The Ingress dashboard of the container service monitors request volume and Ingress status. Users can also emit custom business logs and build custom business monitoring on top of the container service's log monitoring.

The second layer is Application Performance Monitoring, which covers the user's applications. ARMS APM provides capabilities such as Java profiling and tracing, and supports multi-language monitoring through the OpenTracing and OpenTelemetry protocols.

The third layer is Container Monitoring, which covers container cluster resources, the container runtime layer, the container engine, and the stability of the cluster itself. Alibaba Cloud Prometheus shows resource usage at different levels of the cluster in global-view dashboards, including performance, utilization (water level), and cloud resources. The event center and log center cover the event system and the log system.

The bottom layer is Infrastructure Monitoring, which covers the different cloud resources, the virtualization layer, the operating system kernel layer, and so on. Both the container layer and the infrastructure layer can use the eBPF-based non-intrusive architecture and K8s monitoring capabilities for network and call tracing.

Each layer maps, to varying degrees, onto the three pillars of observability: Logging, Tracing, and Metrics.

Scenario 1: Observability in practice for anomaly diagnosis

A user's anomaly diagnosis case

At 9:00 in the morning, during a surge in traffic across multiple services, the user received a container alarm indicating that a Pod was down and service traffic was affected. The user reacted quickly, restarted or scaled out the Pods of the core business, and then looked for the root cause. First, a top-down analysis of ingress traffic on the Ingress dashboard showed that the success rate of external requests had dropped and that responses with 4XX return codes had appeared, confirming that the anomaly was affecting user-facing services. Analysis at the resource and load level then showed that the load followed the daily 9-to-5 traffic pattern, with an obvious utilization spike at 9:00 in the morning that coincided with the fault.

The alarm pinpointed the affected Pods of the core business. By analyzing the business logs and the ARMS APM monitoring of the Java application, the user traced the problem to a cache bug triggered by the traffic surge at 9:00, which caused frequent database reads and writes. The call chain also pointed to a possible slow database query. Fixing the bug finally closed out the incident.

This process is a typical walk through the different capabilities involved in ACK anomaly diagnosis, and it shows how the parts of the ACK observability system work together.
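The first step of the diagnosis above, reading the success rate and the share of 4XX responses from ingress traffic, can be sketched as follows. This is a minimal illustration and not the Ingress dashboard's actual logic: the function name, the sample data, and the choice to count 2XX and 3XX responses as successful are all assumptions.

```python
from collections import Counter


def summarize_status_codes(status_codes):
    """Summarize a list of HTTP status codes taken from ingress access logs.

    Assumption for this sketch: 2XX and 3XX responses count as successful.
    """
    total = len(status_codes)
    if total == 0:
        return {"total": 0, "success_rate": None, "client_error_rate": None}

    # Group codes by their hundreds digit: 2 -> 2XX, 4 -> 4XX, and so on.
    classes = Counter(code // 100 for code in status_codes)
    return {
        "total": total,
        "success_rate": (classes[2] + classes[3]) / total,
        "client_error_rate": classes[4] / total,
    }


# Hypothetical sample: 90 successful requests and 10 client errors.
sample_codes = [200] * 90 + [404] * 6 + [499] * 4
summary = summarize_status_codes(sample_codes)
print(summary)
```

A drop in `success_rate` together with a rise in `client_error_rate` is the kind of signal that, in the case above, confirmed the anomaly was visible to end users.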

Introduction to the event system

Community K8s includes a mature event system that provides events at the application layer and at the runtime layer. ACK builds on the community event system and extends it from the application layer down to the underlying infrastructure, so that events cover the whole stack.

Application exceptions: for K8s application events, ACK monitors events from users' grayscale releases and from abnormal behavior in features such as HPA.
Management and control operation events: ACK adds cluster management and control events, covering abnormal user operations on the cluster and important changes, and even cost and budget overruns.
Cluster core component exceptions: cluster stability depends largely on the health of the cluster's core components. Exception events have been enhanced for core components such as the API Server, etcd, Scheduler, and CCM, so that users are notified of anomalies immediately. Events from user-side add-on components, such as Terway and storage, are also included.
Cluster container engine layer exceptions: coverage of the container engine layer has been enhanced, including exceptions in the container runtime, kubelet, and cgroups.
Node exceptions: these include OS and kernel-layer exceptions, such as operating system kernel crashes and operating system configuration exceptions, as well as resource-layer exceptions, such as network, storage, and other cloud resource exceptions. This enhanced coverage supports the operation and maintenance of container services.
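To make the layering above concrete, the sketch below groups warning events by layer using the `type` and `source.component` fields of a K8s Event object. The mapping from component to layer is an illustrative assumption, not ACK's actual classification rules, and the sample events are made up.

```python
from collections import Counter

# Illustrative mapping from an event's source component to the layers
# listed above. This is an assumption for the sketch, not ACK's real rules.
LAYER_BY_COMPONENT = {
    "horizontal-pod-autoscaler": "application",
    "deployment-controller": "application",
    "default-scheduler": "cluster core component",
    "kubelet": "container engine",
    "kernel-monitor": "node",
}


def classify_event(event):
    """Return the layer for one event, or 'unclassified' if unknown."""
    component = (event.get("source") or {}).get("component", "")
    return LAYER_BY_COMPONENT.get(component, "unclassified")


def count_warnings_by_layer(events):
    """Count events of type Warning per layer; Normal events are ignored."""
    warnings = (e for e in events if e.get("type") == "Warning")
    return Counter(classify_event(e) for e in warnings)


# Hypothetical events, shaped like K8s Event objects.
sample_events = [
    {"type": "Warning", "reason": "FailedScheduling",
     "source": {"component": "default-scheduler"}},
    {"type": "Warning", "reason": "BackOff",
     "source": {"component": "kubelet"}},
    {"type": "Normal", "reason": "Pulled",
     "source": {"component": "kubelet"}},
    {"type": "Warning", "reason": "OOMKilling",
     "source": {"component": "kernel-monitor"}},
    {"type": "Warning", "reason": "CustomFailure",
     "source": {"component": "custom-controller"}},
]
warning_counts = count_warnings_by_layer(sample_events)
print(dict(warning_counts))
```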

ACK provides an out-of-the-box event center. Once it is enabled with one click, users get the full capabilities of the underlying event system. It includes a preset event center dashboard that highlights and counts important events, and it offers powerful, flexible, and easy-to-use data analysis, which forms the basis of an event-driven operations system. Because an ACK K8s cluster is largely about managing the life cycle of resources, the event center also lets users follow a resource's life cycle with events as anchor points. Users can debug and tune performance at the most important points in that life cycle and respond quickly to abnormal Pod states.

Introduction to the log system

The first usage scenario for logging in K8s is important traffic, such as Ingress. The ACK log center provides default dashboards for important scenarios such as Ingress. After enabling the Ingress dashboard with one click, you can quickly view the cluster's Ingress traffic, along with PV, UV, and statistics on abnormal application status, in a form that is fast, clear, and easy to use.

The second usage scenario is auditing. Resources in a cluster are often accessed and used by different accounts, so cluster security needs close attention. We provide an audit log dashboard that quickly analyzes how cluster resources are accessed and used, and raises alarms and early warnings for unauthorized access, making the cluster environment more secure.

We provide a cloud-native, non-intrusive way to collect logs. Users only need a simple CRD or an annotation on the Pod to ship logs to the log center, where they can use its multi-dimensional analysis capabilities.

Introduction to the metrics system

The Metrics system is the most commonly used system for stability assurance and performance tuning. Key metrics such as utilization can be visualized on dashboards. ARMS Prometheus dashboards are preset on the product side: after purchasing an ACK K8s cluster, you can enable them with one click. The preset dashboards for important scenarios distill mature experience from operating businesses on K8s. The Prometheus solution spans different services. It covers the core K8s application, network, and control plane metrics on the container service side, as well as AI scenarios with GPU metrics, external storage scenarios such as CSI storage, and metrics for resource optimization and cost optimization.

With the unified Prometheus data link, the solution supports not only metrics for container scenarios but also metrics from different cloud products and from the middleware running on those cloud products. Metrics at every level can be displayed through the same Prometheus data link.

The core components of the ACK cluster control plane, including the API Server, etcd, Scheduler, and CCM, have also been enhanced. An ACK Pro cluster not only hosts these core components and maintains their SLA, but also exposes transparent performance metrics to users, giving them peace of mind.

Scenario 2: Stability assurance - ACK helps the 2022 Winter Olympics run smoothly

Metrics are an important supporting capability for stability assurance. ACK was honored to serve the 2022 Winter Olympics and help its systems run smoothly.

Several core business systems of the Winter Olympics were deployed on ACK clusters, including the international official website, the competition venue systems, and the ticketing system. Most of these core systems use a Java-based microservice architecture, with nearly a thousand Deployment instances in actual use. We evaluated capacity through stress testing and used a first-screen O&M dashboard customized for the Winter Olympics to watch the stability of applications and clusters in real time, ensuring smooth access to the Winter Olympics systems.

Scenario 3: Stability assurance in practice for production-scale clusters

Many users run large-scale production systems. Once a cluster reaches thousands of nodes, intensive, large-scale access to cluster resources easily leads to cluster stability problems.

For example, when users access cluster resources frequently and intensively in a large-scale cluster, the number of API Server requests rises, and so does the number of mutating requests. If the API Server load becomes too high, it drops requests. This is the API Server's degradation behavior, and it affects users' service releases and changes.

As another example, intensive access to cluster resources can also saturate the API Server's bandwidth. The number of requests soars, request latency (RT) climbs, and a single API call may take tens of seconds, which seriously affects the user's business.

We provide a monitoring dashboard for the core components of the control plane. It quickly surfaces the API Server's utilization and request latency. From there, the API Server access log shows which application and which action drove utilization and request volume up, so that the specific application can be found, the damage contained, and the problem solved.
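The last step, finding which application and which action is driving API Server load, can be sketched as an aggregation over access log entries. The example assumes the log is available as JSON lines shaped like K8s audit events, with `userAgent`, `verb`, and `objectRef.resource` fields. The function name and the sample entries are hypothetical, and the ACK dashboard's own analysis may work differently.

```python
import json
from collections import Counter


def top_callers(audit_lines, limit=3):
    """Rank (user agent, verb, resource) combinations by request count."""
    counts = Counter()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        entry = json.loads(line)
        object_ref = entry.get("objectRef") or {}
        key = (
            entry.get("userAgent", "unknown"),
            entry.get("verb", "unknown"),
            object_ref.get("resource", "unknown"),
        )
        counts[key] += 1
    return counts.most_common(limit)


# Hypothetical audit entries: one client lists Pods far more than others.
sample_lines = [
    '{"userAgent": "batch-job/1.0", "verb": "list", "objectRef": {"resource": "pods"}}',
    '{"userAgent": "batch-job/1.0", "verb": "list", "objectRef": {"resource": "pods"}}',
    '{"userAgent": "batch-job/1.0", "verb": "list", "objectRef": {"resource": "pods"}}',
    '{"userAgent": "kubectl/v1.22", "verb": "get", "objectRef": {"resource": "nodes"}}',
    '{"userAgent": "ci-runner/2.3", "verb": "patch", "objectRef": {"resource": "deployments"}}',
    '{"userAgent": "ci-runner/2.3", "verb": "patch", "objectRef": {"resource": "deployments"}}',
]
ranking = top_callers(sample_lines)
for (agent, verb, resource), count in ranking:
    print(f"{count:>4}  {agent}  {verb} {resource}")
```

Grouping by verb as well as by client separates read-heavy callers, such as frequent `list` requests, from callers that issue many mutating requests.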

Prometheus for ACK Pro

Alibaba Cloud recently launched Prometheus for ACK Pro, an upgraded Prometheus service that brings multiple data sources together on a single dashboard, including cluster event logs, eBPF-based non-intrusive application metrics, and network metrics, for a consistent experience. Using the dashboard's correlation analysis, users can investigate from different angles across these data sources, moving from overview to details.

Introduction to the tracing system

Application-layer tracing

In the ACK observability system, the Tracing system provides the ability to locate the final root cause. It is divided into two parts:

The first part is tracing at the application layer. It provides ARMS APM capabilities, supports the OpenTracing and OpenTelemetry protocols, and can therefore support applications in multiple languages. It also provides non-intrusive APM for Java: you only need to add an annotation to the Pod, and the Pod of the Java application gets real-time monitoring data. You can view real-time application utilization, JVM performance metrics, and the application's upstream and downstream relationships in distributed calls. Profiling and code stack-level call monitoring are also supported. Applications in different languages can be aggregated into the same distributed call tracing map, so a distributed call can be followed from top to bottom to locate and diagnose problems.

Cluster network and call tracing

The second part is cluster network and call tracing.

Recently, we launched a tracing capability at the network layer based on eBPF. eBPF instrumentation provides network tracing at the kernel level with zero code changes and very low performance overhead. It provides a global topology, a network topology view for quickly locating problematic call chains, and a resource-level view. It also supports viewing metrics, tracing, and logging together in a unified global architecture view.

Building an AIOps system based on ACK observability capabilities

With an event-driven AIOps system, users can use events as a unified data source for discovering problems and delivering notifications, and as a bridge to AI-driven O&M operations. With the ACK event center at the core, a unified event format specification is defined, so K8s events reach users in a unified format and through a unified event processing flow. Users can subscribe to events to perform intelligent operation and maintenance and to build their own systems. Applications can also push business events, which can then trigger intelligent O&M actions such as automatic scale-out or scale-in.
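The subscribe-and-react pattern described above can be sketched with a small in-process event bus. Everything here is hypothetical: the `EventBus` class, the `QueueBacklog` business event, and the scaling rule are illustrations of the pattern, not the ACK event center's API.

```python
import math
from collections import defaultdict


class EventBus:
    """Minimal in-process stand-in for an event center (hypothetical)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        """Register a handler for one event type."""
        self._handlers[event_type].append(handler)

    def publish(self, event):
        """Deliver an event to its subscribers and collect their decisions."""
        handlers = self._handlers.get(event["type"], [])
        return [handler(event) for handler in handlers]


def scale_on_backlog(event, per_replica=100, max_replicas=10):
    """Hypothetical rule: one replica per 100 pending items, capped at 10."""
    wanted = math.ceil(event["pending"] / per_replica)
    return {"action": "scale", "replicas": max(1, min(wanted, max_replicas))}


bus = EventBus()
bus.subscribe("QueueBacklog", scale_on_backlog)

# A business event pushed by an application (made-up payload).
decisions = bus.publish({"type": "QueueBacklog", "pending": 250})
print(decisions)
```

Keeping the event format uniform is what lets a handler like this be reused for K8s events and business events alike.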

In addition, we provide an ACK alarm center. Through a unified alarm configuration, it helps users quickly set up the subscription, delivery, troubleshooting, and handling workflow that an AIOps system needs for operation and maintenance.

The alarm center gives users a unified configuration for quickly establishing exception rule sets for anomaly diagnosis in ACK scenarios. First, it provides out-of-the-box alarm capabilities, with rule sets for common container-scenario exceptions built in. Second, an ITOps system can be built on fine-grained subscriptions to alarm messages: through the alarm center's subscription configuration, each exception is delivered to the people who can actually resolve it. ACK also includes standard exceptions and the SOP manuals for handling them. When an alarm fires, it indicates the exception type and gives users a standard SOP repair process.
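The subscription idea above, routing each exception type to the group that can resolve it and attaching the matching SOP, can be sketched as follows. The `Subscription` structure, the group names, the exception types, and the SOP identifiers are all hypothetical and do not reflect the alarm center's real configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    """One routing rule: exception type -> contact group and SOP (hypothetical)."""
    exception_type: str
    contact_group: str
    sop: str


def route_alarm(alarm, subscriptions, default_group="cluster-admins"):
    """Return one delivery per matching subscription, or a default delivery."""
    matches = [s for s in subscriptions
               if s.exception_type == alarm["exception_type"]]
    if not matches:
        # No rule matched: fall back to a default group without an SOP.
        return [{"contact_group": default_group, "sop": None,
                 "message": alarm["message"]}]
    return [{"contact_group": s.contact_group, "sop": s.sop,
             "message": alarm["message"]} for s in matches]


# Made-up rule set for illustration.
rules = [
    Subscription("PodOOMKilled", "app-owners", "sop-raise-memory-limit"),
    Subscription("NodeDiskPressure", "infra-oncall", "sop-clean-node-disk"),
]

deliveries = route_alarm(
    {"exception_type": "NodeDiskPressure", "message": "disk usage above 90%"},
    rules,
)
print(deliveries)
```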

Building a FinOps system based on ACK observability capabilities

More and more users face the problem of reducing costs and improving efficiency, either while migrating to the cloud or in the governance stage after migration. The main pain points are:

Before going to the cloud - it is difficult to plan how to migrate;
After going to the cloud - there are many types of cloud products and cluster resources, so billing is hard to track;
Highly SaaS-based applications are deployed together and share the same cluster, which makes cost sharing difficult;
Every year new services go online and old ones go offline, so the usage relationship between clusters and resources keeps changing, which makes continuous optimization and governance difficult;
In the past, this was generally managed with Excel sheets. In cloud-native scenarios, the variety of user applications and billable resource types makes that approach hard to sustain.

ACK provides a cloud-native enterprise IT cost management solution that estimates and allocates the cost of cluster resources through a multi-dimensional cost allocation and estimation model. Root cause drill-down and trend prediction provide cost insight, and the cost of multiple application services on one cluster can be split at a fine-grained level. The solution also covers costs in multi-cluster scenarios and includes expert services for enterprise cloud-native IT cost governance.
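As a rough illustration of cost allocation, the sketch below splits a cluster's cost across applications in proportion to their CPU and memory requests and reports the unallocated remainder as idle cost. The equal weighting of CPU and memory, the function name, and all figures are assumptions made for the example. The source does not describe the details of ACK's actual model.

```python
def allocate_cost(cluster_cost, capacity, requests, cpu_weight=0.5):
    """Split cluster cost by weighted CPU and memory requests.

    Assumption: an application's share is a weighted average of its share
    of cluster CPU and its share of cluster memory.
    """
    mem_weight = 1 - cpu_weight
    allocated = {}
    for app, req in requests.items():
        share = (cpu_weight * req["cpu"] / capacity["cpu"]
                 + mem_weight * req["memory_gib"] / capacity["memory_gib"])
        allocated[app] = round(cluster_cost * share, 2)
    # Whatever no application requested is treated as idle cost.
    idle = round(cluster_cost - sum(allocated.values()), 2)
    return allocated, idle


# Made-up figures: a cluster costing 1000 per month with two applications.
capacity = {"cpu": 100, "memory_gib": 400}
requests = {
    "checkout": {"cpu": 40, "memory_gib": 100},
    "search": {"cpu": 20, "memory_gib": 200},
}
allocated, idle = allocate_cost(1000, capacity, requests)
print(allocated, idle)
```

Reporting the idle remainder separately keeps unused capacity visible instead of hiding it inside each application's bill.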

In addition, we have launched built-in application resource profiling and intelligent resource recommendations, which recommend appropriately sized resources, support budget control, and ultimately optimize costs for different scenarios, such as big data, AI, and games.

Finally, diverse environments, including multi-cloud and hybrid cloud, can be displayed and managed on a unified plane.

Customer case

As a leading company in Internet finance, China Property & Casualty has a cluster scale of 1,000 cores and manages, operates, and maintains multiple SaaS-based online businesses. It operates in a highly sensitive industry.

In the process of moving from a traditional IT architecture to cloud native, China Property & Casualty faced challenges such as difficulty in capacity planning, difficulty in calculating costs, difficulty in finding idle resources, and difficulty in balancing cost optimization against business stability.

Using ACK's cost management solution, we carried out stress testing and capacity planning for the company, used ACK cost analysis to split and analyze bills by business, identified and optimized idle resources, and provided an optimization strategy for resource allocation. Container Service also provides fine-grained container deployment and optimization methods such as elastic policies.

Before going to the cloud, the idle rate of allocated resources in the customer's clusters was more than 30%. With the cost management solution we provided, the idle rate dropped below 10%, which is an industry-leading level.

