vivo Internet Server Team-YuanPeng

1. Overview

With the spread of container technology and Kubernetes becoming the de facto standard for container scheduling and management, cloud-native concepts and technical architectures have gradually come into wide use in production environments. In cloud-native systems, characterized by high elasticity, dynamic application lifecycle management, and microservices, traditional monitoring systems struggle to cope, and so a new generation of cloud-native monitoring systems has emerged.

Today, monitoring systems built around Prometheus have become the de facto standard in the cloud-native monitoring field. As a new generation of cloud-native monitoring system, Prometheus offers powerful query capabilities, efficient storage, and convenient operation and configuration. However, no system is omnipotent: in the face of complex and diverse production environments, a single Prometheus instance still cannot meet every monitoring requirement, and suitable monitoring methods and systems must be built according to the characteristics of the environment.

Based on vivo's practical experience with container cluster monitoring, this article discusses how to build a cloud-native monitoring architecture, the challenges encountered, and the corresponding countermeasures.

2. Cloud native monitoring system

2.1 Features and Values of Cloud Native Monitoring

Compared with traditional monitoring, cloud-native monitoring has its own characteristics and value, which can be summarized in the following table:

2.2 Introduction to Cloud Native Monitoring Ecosystem

The monitoring-related projects in the CNCF landscape are shown below (see https://landscape.cncf.io/); several of them are highlighted here:

  • Prometheus (graduated)

Prometheus is both a powerful monitoring system and an efficient time series database, with a complete monitoring solution ecosystem built around it. A single Prometheus instance can efficiently process a large amount of monitoring data, and its friendly yet powerful PromQL syntax can be used to flexibly query monitoring data and configure alerting rules.
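As a minimal illustration of PromQL's query capability, the sketch below sends a PromQL expression to Prometheus's HTTP query API (/api/v1/query); the Prometheus address and the example expression are assumptions, not part of any specific deployment described here.

```python
# Minimal sketch: run a PromQL query against Prometheus's HTTP API.
# The address below is an assumption; replace it with your Prometheus endpoint.
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # hypothetical address

def promql_query(expr: str) -> list:
    """Evaluate a PromQL expression and return the instant vector result."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": expr},
        timeout=10,
    )
    resp.raise_for_status()
    body = resp.json()
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

if __name__ == "__main__":
    # Example: per-process CPU usage rate over the last 5 minutes.
    for sample in promql_query("rate(process_cpu_seconds_total[5m])"):
        print(sample["metric"], sample["value"])
```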

Prometheus is the second CNCF "graduated" project after Kubernetes (and currently the only graduated project in the monitoring category). Its open source community is active, with nearly 40,000 stars on GitHub.

Prometheus's pull-based metric collection model is widely adopted: many applications expose their own monitoring metrics directly through a Prometheus-compatible metrics endpoint, and for most applications that do not, an exporter can be found in the community to expose their metrics indirectly.
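For an application that wants to expose metrics in the Prometheus format itself, a minimal sketch using the official prometheus_client Python library might look like the following; the metric names and the port are illustrative assumptions.

```python
# Minimal sketch: expose application metrics on /metrics for Prometheus to pull.
# Metric names and the port are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS_TOTAL = Counter("app_requests_total", "Total number of handled requests")
QUEUE_SIZE = Gauge("app_queue_size", "Current length of the internal work queue")

def handle_request() -> None:
    """Pretend to handle a request and update the metrics."""
    REQUESTS_TOTAL.inc()
    QUEUE_SIZE.set(random.randint(0, 100))

if __name__ == "__main__":
    start_http_server(8000)  # serves the /metrics endpoint on port 8000
    while True:
        handle_request()
        time.sleep(1)
```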

However, Prometheus still has shortcomings. For example, it only supports single-machine deployment: its built-in time series database uses local storage, so storage capacity is limited by the disk of a single machine, and when large volumes of data are stored, queries over historical data degrade severely and become a serious bottleneck. Therefore, in large-scale production scenarios, a single Prometheus instance can neither store long-term historical data nor provide high availability.

  • Cortex (incubating)

Cortex extends Prometheus with multi-tenancy, gives Prometheus the ability to connect to persistent storage, supports horizontal scaling of Prometheus instances, and provides a unified query entry point over the data of multiple Prometheus instances.

  • Thanos (incubating)

Thanos provides a low-cost solution for long-term storage of historical monitoring data by writing Prometheus monitoring data to object storage. Through its Querier component, Thanos provides a global view (global query) over a Prometheus cluster; it can apply Prometheus's sample compression mechanism to the historical data in object storage, and its downsampling feature speeds up queries over large-scale historical data without significant loss of precision.

  • Grafana

Grafana is an open source metric analysis and visualization suite, mainly used in the monitoring field to customize and display charts of time series data. Its UI is very flexible, with rich plug-ins and powerful extension capabilities, and it supports many different data sources (Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, Druid, etc.). Grafana also provides visual alert configuration, continuously evaluating alert rules and sending alert notifications.

In addition, the Grafana community provides monitoring and alerting dashboard configurations for a large number of common systems and components, which can be downloaded online with one click, making setup simple and convenient.

  • VictoriaMetrics

VictoriaMetrics is a high-performance, cost-effective and scalable monitoring solution and time series database. It can serve as a long-term remote storage backend for Prometheus, supports PromQL queries, and is compatible with Grafana, so it can replace Prometheus as a Grafana data source. It features simple installation and configuration, low memory usage, a high compression ratio, high performance, and support for horizontal scaling.
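Because VictoriaMetrics exposes a Prometheus-compatible query API, the same PromQL query pattern shown earlier works against it with only the URL changed. The sketch below assumes a single-node VictoriaMetrics instance on its default port (8428); a cluster deployment uses a different URL prefix on the vmselect component.

```python
# Minimal sketch: run a PromQL range query against VictoriaMetrics'
# Prometheus-compatible API. The address assumes a single-node instance
# on the default port; adjust it for your deployment.
import time

import requests

VM_URL = "http://victoriametrics.example.com:8428"  # hypothetical address

def vm_range_query(expr: str, hours: int = 1, step: str = "60s") -> list:
    """Query the last `hours` hours of data for a PromQL expression."""
    end = int(time.time())
    start = end - hours * 3600
    resp = requests.get(
        f"{VM_URL}/api/v1/query_range",
        params={"query": expr, "start": start, "end": end, "step": step},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    expr = "sum(rate(node_cpu_seconds_total[5m])) by (instance)"
    for series in vm_range_query(expr):
        print(series["metric"], len(series["values"]), "points")
```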

  • Alertmanager

Alertmanager is an alerting component that receives alerts from Prometheus, processes them with grouping, silencing, inhibition and other strategies, and routes them to the designated receivers. Alerts can be sent to different receivers according to different rules, and a variety of common receivers are supported, such as Email and Slack, or, via webhook, domestic IM tools such as WeCom (enterprise WeChat) and DingTalk.
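To illustrate the webhook integration path mentioned above, here is a minimal sketch of a receiver that accepts Alertmanager's JSON notification payload and reposts each alert to an internal notification endpoint; the listening port and the downstream URL are assumptions.

```python
# Minimal sketch: an Alertmanager webhook receiver that forwards alerts
# to an internal notification endpoint. Port and downstream URL are
# illustrative assumptions.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

ALERT_CENTER_URL = "http://alert-center.internal/api/send"  # hypothetical endpoint

class AlertmanagerWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Alertmanager sends grouped alerts; each entry carries status,
        # labels and annotations.
        for alert in payload.get("alerts", []):
            message = {
                "status": alert.get("status"),
                "labels": alert.get("labels", {}),
                "summary": alert.get("annotations", {}).get("summary", ""),
            }
            requests.post(ALERT_CENTER_URL, json=message, timeout=5)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9000), AlertmanagerWebhook).serve_forever()
```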


2.3 How to build a simple cloud native monitoring system

Having looked at the common tools in the cloud-native monitoring field, how do we build a simple cloud-native monitoring system? The following figure shows the official solution provided by the Prometheus community:

[Figure: the Prometheus community's official monitoring architecture]

(Source: Prometheus Community )

This architecture works as follows:

  • All monitoring components are deployed in a cloud-native way, i.e. containerized and managed uniformly by Kubernetes.
  • Prometheus is responsible for metric collection and monitoring data storage, and can discover collection targets automatically through file-based configuration or Kubernetes service discovery.
  • Applications expose monitoring data for Prometheus to pull, either through their own metrics endpoint or through a corresponding exporter.
  • Short-lived custom metrics can be collected by scripts and pushed to the Pushgateway component for Prometheus to pull (a minimal push sketch follows this list).
  • Prometheus evaluates the configured alerting rules and sends alerts to Alertmanager, which processes them according to its rules and policies and routes them to the alert receivers.
  • Grafana uses Prometheus as a data source, queries the monitoring data via PromQL, and displays it on monitoring dashboards.
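As referenced in the list above, short-lived jobs can push their metrics to Pushgateway. A minimal sketch with the prometheus_client library might look like this; the Pushgateway address, job name and metrics are illustrative assumptions.

```python
# Minimal sketch: a short-lived batch job pushes its result metrics to
# Pushgateway so Prometheus can later pull them. Address, job name and
# metrics are illustrative assumptions.
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY_ADDR = "pushgateway.example.com:9091"  # hypothetical address

def run_batch_job() -> None:
    registry = CollectorRegistry()
    duration = Gauge(
        "batch_job_duration_seconds",
        "Duration of the last batch job run",
        registry=registry,
    )
    last_success = Gauge(
        "batch_job_last_success_timestamp",
        "Unix time of the last successful batch job run",
        registry=registry,
    )
    start = time.time()
    # ... do the actual batch work here ...
    duration.set(time.time() - start)
    last_success.set_to_current_time()
    # Metrics are grouped under the given job name on the Pushgateway.
    push_to_gateway(PUSHGATEWAY_ADDR, job="demo_batch_job", registry=registry)

if __name__ == "__main__":
    run_batch_job()
```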

2.4 How to build a cloud-native monitoring system with open capabilities, stability and efficiency

The above is the community's official recipe for building a monitoring system, but applying it in a production environment runs into the following problems:

  • A single Prometheus instance cannot store large amounts of long-term historical data;
  • It does not provide high availability;
  • It cannot scale horizontally;
  • It lacks multi-dimensional monitoring and statistical analysis capabilities.

So for large-scale complex production environments, how to build a cloud-native monitoring system with open capabilities, stability and efficiency?

Combining vivo's own practical experience with container cluster monitoring, the various cloud-native monitoring documents available, and the architectures shared by vendors at technical conferences, we can summarize a cloud-native monitoring architecture suitable for large-scale production scenarios. Its layered model is shown in the figure after the following list.

  • The system is deployed in a cloud-native manner, with Kubernetes as the unified control and management plane at the bottom layer.
  • The monitoring layer uses a Prometheus cluster for collection. The cluster adopts an open source or self-developed high-availability scheme to avoid single points of failure and provide load balancing; metrics are exposed through each application's or component's own metrics endpoint or through an exporter.
  • Alert data is processed by the alerting component according to the configured rules and then forwarded by webhook to the company's alert center or other notification channels.
  • The data storage layer uses a highly available and scalable time series database for long-term monitoring data, and also writes a copy of the monitoring data to a suitable data warehouse for statistical analysis in more dimensions, providing a foundation for data analysis at the upper layer.
  • The data analysis and processing layer can further analyze and process the monitoring data, produce reports in more dimensions, mine more valuable information, and even apply machine learning and other techniques to achieve automated operation goals such as failure prediction and self-healing.

[Figure: layered model of the cloud-native monitoring architecture]

3. vivo container cluster monitoring architecture

The architecture of any system must be designed around the characteristics of the production environment and the needs of the business: on the premise of meeting business requirements and high availability, and after weighing factors such as technical difficulty, R&D investment, and operation and maintenance cost, we design the architecture best suited to the current scenario. Therefore, before explaining the design of vivo's container cluster monitoring architecture in detail, some background is needed:

  • Production Environment

vivo currently runs multiple containerized production clusters distributed across several data centers, with the largest single cluster at 1000~2000 nodes.

  • Monitoring needs

To ensure high availability in production, the monitoring scope mainly covers container cluster metrics, physical machine metrics, and container (business) metrics. Business monitoring and alerting also need to be displayed and configured through the company's basic monitoring platform.

  • Alerting requirements

Alerts need to be configured visually, sent to the company's alert center, and support policies such as hierarchical grouping.

  • Data analysis needs

Rich weekly, monthly and quarterly statistical reports are required.

3.1 Monitoring High Availability Architecture Design

Given the environment and requirements described above, the design of vivo's current monitoring architecture is as follows:

  • Each production cluster has independent monitoring nodes for deploying the monitoring components. Prometheus is split into several groups according to the collection targets, and each group is deployed with two replicas to ensure high availability.
  • VictoriaMetrics is used for data storage. A VictoriaMetrics cluster is deployed in each data center, and the Prometheus instances of the clusters in the same data center remote-write their monitoring data to it; VictoriaMetrics is configured with multi-replica storage to ensure storage high availability.
  • Grafana is backed by a MySQL cluster, so Grafana itself is stateless and is deployed with multiple replicas. Grafana uses VictoriaMetrics as its data source.
  • Monitoring of Prometheus itself is implemented through dial-test (probe) monitoring, so that alerts are received promptly when Prometheus becomes abnormal (a minimal probe sketch is shown after the figure below).
  • Cluster-level alerts use Grafana's visual alert configuration, and a self-developed webhook forwards the alerts to the company's alert center; the webhook also implements alert handling strategies such as hierarchical grouping.
  • Container-level (business) monitoring data is forwarded to Kafka through a self-developed adapter and then stored in the company's basic monitoring platform for business monitoring display and alert configuration; a copy is also stored in Druid for statistical reports in more dimensions.

[Figure: vivo container cluster monitoring architecture]
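The dial-test monitoring of Prometheus mentioned in the list above is an internal facility, but the idea can be sketched as a simple probe against Prometheus's built-in /-/healthy and /-/ready endpoints; the target addresses and the raise_alert() helper are hypothetical placeholders.

```python
# Minimal sketch of the dial-test idea: periodically probe Prometheus's
# built-in health endpoints and raise an alert when a probe fails.
# The target addresses and raise_alert() are illustrative placeholders,
# not vivo's actual implementation.
import time

import requests

PROMETHEUS_TARGETS = [
    "http://prometheus-0.monitoring:9090",  # hypothetical replica addresses
    "http://prometheus-1.monitoring:9090",
]

def raise_alert(target: str, reason: str) -> None:
    # Placeholder: in practice this would notify the company alert center.
    print(f"ALERT: {target} unhealthy: {reason}")

def probe(target: str) -> None:
    for path in ("/-/healthy", "/-/ready"):
        try:
            resp = requests.get(target + path, timeout=5)
            if resp.status_code != 200:
                raise_alert(target, f"{path} returned {resp.status_code}")
        except requests.RequestException as exc:
            raise_alert(target, f"{path} unreachable: {exc}")

if __name__ == "__main__":
    while True:
        for t in PROMETHEUS_TARGETS:
            probe(t)
        time.sleep(30)
```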

The previous sections introduced the community's Cortex and Thanos high-availability monitoring solutions, both of which have production-level practice in the industry. Why, then, did we choose the Prometheus dual-replica + VictoriaMetrics solution?

The main reasons are as follows:

  • There is relatively little practical documentation about Cortex available online.
  • Thanos requires object storage. During actual deployment we found that Thanos is complex to configure and its parameters can be hard to tune. Thanos also needs its Sidecar component deployed inside the Prometheus Pod, while our Prometheus deployment uses the Operator's automated deployment, which makes embedding the Sidecar troublesome. In addition, when monitoring the Thanos components during testing, we found that Thanos often caused CPU and network spikes due to the compaction and transfer of Prometheus storage files. Overall, we judged that the later maintenance cost of Thanos would be relatively high, so it was not adopted.
  • VictoriaMetrics is relatively simple to deploy and configure, offers high storage, query and compression performance, supports data deduplication, does not depend on external systems, only requires configuring remote-write in Prometheus to ingest monitoring data, and is fully compatible with Grafana (a small write-and-query sketch follows this list). It meets our requirements for long-term historical data storage and high availability at a low maintenance cost. From our monitoring of VictoriaMetrics' own components, its runtime metrics have been stable, and it has run without failure since going into production.
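As a quick way to verify the VictoriaMetrics write path (separate from the Prometheus remote_write configuration itself), the sketch below pushes a few samples in Prometheus text exposition format to VictoriaMetrics' /api/v1/import/prometheus endpoint and reads them back with PromQL; the address and metric names are assumptions.

```python
# Minimal sketch: write a few test samples into VictoriaMetrics using its
# Prometheus-text-format import endpoint, then read them back with PromQL.
# This is only a verification aid; in production Prometheus writes data via
# remote_write. Address and metric names are illustrative assumptions.
import requests

VM_URL = "http://victoriametrics.example.com:8428"  # hypothetical single-node address

def import_samples() -> None:
    # One sample per line, in Prometheus exposition format.
    lines = "\n".join([
        'demo_metric{cluster="test",instance="a"} 1',
        'demo_metric{cluster="test",instance="b"} 2',
    ])
    resp = requests.post(f"{VM_URL}/api/v1/import/prometheus", data=lines, timeout=10)
    resp.raise_for_status()

def read_back() -> list:
    # Note: very recently written samples may take a short while to become
    # visible to queries.
    resp = requests.get(
        f"{VM_URL}/api/v1/query",
        params={"query": 'sum(demo_metric{cluster="test"})'},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    import_samples()
    print(read_back())
```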

3.2 High Availability Design of Monitoring Data Forwarding Layer Components

Since Prometheus is deployed with two replicas for high availability, data deduplication has to be considered in the design. VictoriaMetrics itself supports deduplication at storage time, so deduplication on the VictoriaMetrics side is solved naturally. But how should the monitoring data forwarded through Kafka to the basic monitoring platform and the OLAP side be deduplicated?

Our solution, shown in the figure below, achieves deduplication through "group election" in the self-developed Adapter. Each Prometheus replica corresponds to a group of Adapters, the two groups elect a leader, and only the group elected as leader forwards data. This not only achieves deduplication but also leverages the Kubernetes Service to load-balance across the Adapters.

In addition, the Adapter is able to detect Prometheus failures. When the Prometheus instance paired with the leader group fails, the leader Adapter detects it and automatically gives up leadership, so forwarding switches to the other group of Adapters, ensuring the effectiveness of the "dual-replica high availability + deduplication" scheme.

[Figure: Adapter group election and data forwarding]
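The Adapter is a vivo-internal component, so the sketch below is only a conceptual illustration of the "group election plus Prometheus health awareness" behavior described above; try_acquire_leadership(), release_leadership(), forward_to_kafka() and the addresses are hypothetical placeholders, not vivo's real implementation.

```python
# Conceptual sketch of the Adapter behavior described above: both Adapter
# groups receive data from their paired Prometheus replica, but only the
# group holding leadership forwards to Kafka, and leadership is released
# when the paired Prometheus becomes unhealthy.
import time

import requests

PAIRED_PROMETHEUS = "http://prometheus-0.monitoring:9090"  # hypothetical address

def try_acquire_leadership() -> bool:
    # Placeholder: in practice this would contend for a shared lease
    # (for example a Kubernetes Lease object shared by both groups).
    return True

def release_leadership() -> None:
    # Placeholder: give up the shared lease so the other group can take over.
    pass

def forward_to_kafka(batch) -> None:
    # Placeholder: produce the batch of samples to the Kafka topic.
    print(f"forwarding {len(batch)} samples")

def paired_prometheus_healthy() -> bool:
    try:
        r = requests.get(PAIRED_PROMETHEUS + "/-/healthy", timeout=5)
        return r.status_code == 200
    except requests.RequestException:
        return False

def run(fetch_pending_batches) -> None:
    """fetch_pending_batches() returns the batches buffered since the last
    call (hypothetical hook for data remote-written by the paired Prometheus)."""
    is_leader = False
    while True:
        if not paired_prometheus_healthy():
            if is_leader:
                release_leadership()  # hand over to the other Adapter group
                is_leader = False
        elif not is_leader:
            is_leader = try_acquire_leadership()
        if is_leader:
            for batch in fetch_pending_batches():
                forward_to_kafka(batch)  # only the leader forwards, so downstream sees one copy
        time.sleep(5)
```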

4. Challenges and Countermeasures of Container Monitoring Practice

Some of the difficulties and challenges we encountered in the practice of container cluster monitoring, together with our countermeasures, are summarized as follows:

5. Future Outlook

The goal of monitoring is more efficient and reliable operations, detecting problems accurately and in a timely manner. A higher goal is to achieve automated operations, or even intelligent operations (AIOps), on top of monitoring.

Based on the current experience in container cluster monitoring, the improvements that can be made in the monitoring architecture in the future include:

  • Automatic sharding of Prometheus and automatic load balancing of collection targets;
  • AI-based predictive analysis of potential failures;
  • Failure self-healing;
  • Setting appropriate alert thresholds through data analysis;
  • Optimizing alert management and control strategies.

There is no one-size-fits-all architecture design; it must evolve continuously as the production environment, the requirements and the technology develop. On the road of cloud-native monitoring, we will stay true to our original aspiration and keep forging ahead.

