vivo Server Monitoring: Architecture Design and Practice

1. Business Background

In today's era of information explosion, information flows freely around the world over the Internet, giving rise to a wide variety of platforms and software systems. As the number of businesses grows, so does system complexity.

When a problem in a core business affects the user experience but developers do not notice it in time, it may already be too late by the time it is discovered. Likewise, when a server's CPU usage keeps climbing or its disk fills up, operation and maintenance personnel need to find and handle the issue promptly. Both cases call for an effective monitoring system that provides monitoring and early warning.

How to monitor and maintain these services and servers is something our developers and operation and maintenance personnel cannot ignore. This article is about 5,000 words long. It explains the principles of vivo server monitoring and the evolution of its architecture, organized systematically so that readers can refer to it when selecting monitoring technology.

vivo server monitoring aims to provide one-stop data monitoring for server applications, covering system monitoring, JVM monitoring, and custom business metric monitoring, together with real-time, multi-dimensional, multi-channel alarm services. It helps users grasp the status of their applications from multiple angles in a timely manner: early warning and fault discovery beforehand, and detailed data afterwards for tracking down and locating problems, which improves service availability. At present, more than 200 business parties have connected to vivo server monitoring. This article covers server monitoring only. Our company also has other excellent monitoring systems, including general monitoring, call chain monitoring, and client monitoring.

1.1 The basic process of a monitoring system

Whether it is an open source monitoring system or a self-developed monitoring system, the overall process is similar.

1) Data collection: this can include JVM monitoring data such as GC counts, thread counts, and the sizes of the old and young generations; system monitoring data such as disk usage, disk read/write throughput, network egress and ingress traffic, and the number of TCP connections; and business monitoring data such as error logs, access logs, video playback volume, PV, and UV.

2) Data transmission: the collected data is reported to the monitoring system via messages or over HTTP.

3) Data storage: some systems use an RDBMS such as MySQL or Oracle, some use a time series database such as OpenTSDB or InfluxDB, and some store data directly in HBase.

4) Data visualization: metrics are displayed graphically as line charts, bar charts, pie charts, and so on.

5) Monitoring alarms: flexible alarm settings, with support for notification channels such as email, SMS, and IM.

1.2 How to use the monitoring system in a standardized way

Before using a monitoring system, we first need to understand the basic working principles of the monitored object. For JVM monitoring, for example, we need to know the JVM memory structure and the common garbage collection mechanisms. Second, we need to decide how to describe and define the state of the monitored object. For example, to monitor the interface performance of a business function, we can monitor the interface's request volume, latency, and error count. Once we have determined how to monitor the object's state, we need to define reasonable alarm thresholds and alarm types, so that alarm notifications help developers find faults in time. Finally, we should establish a complete fault handling process, so that alarms receive a quick response and online faults are handled promptly.

2. The architecture and evolution of the vivo server monitoring system

Before introducing the architecture of the vivo server monitoring system, let's take a look at the OpenTSDB time series database. First, we explain why we chose OpenTSDB. The reasons are as follows:

1) Each collected monitoring metric has a single value at a given point in time, with no complex structure or relationships.

2) Monitoring metrics change over time.

3) OpenTSDB is a distributed, scalable time series database built on HBase, so the storage layer does not require much effort on our part, and it inherits HBase's high throughput and good scalability.

4) It is open source, implemented in Java, and provides an HTTP-based application programming interface, so problems can be investigated and the code modified quickly.

2.1 Introduction to OpenTSDB

1) OpenTSDB is a distributed, scalable time series database based on HBase, whose main purpose is to serve as a monitoring system. For example, it can collect monitoring data from large-scale clusters (including network devices, operating systems, and applications), store and query that data, support second-level data collection and permanent storage, support capacity planning, and integrate easily into existing monitoring systems. The system architecture diagram of OpenTSDB is as follows:

(from official documentation)

The unit of storage is the Data Point, that is, the value of a Metric at a certain point in time. A Data Point consists of the following parts:

Metric, monitoring metric name;
Tags, Metric tags, used to mark information such as machine names, including TagKey and TagValue;
Value, the actual value corresponding to the Metric, an integer or a decimal;
Timestamp, timestamp.
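
The four parts above can be modeled as a small value class. The following is only an illustrative sketch: the class name `DataPointSketch`, its field names, and the sample metric and tag values are hypothetical and do not come from vivo's code. The `toPutLine` method renders the point in the `put` line format used by OpenTSDB's telnet-style interface.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative model of one data point; class and field names are hypothetical. */
final class DataPointSketch {
    final String metric;             // monitoring metric name
    final Map<String, String> tags;  // TagKey -> TagValue, e.g. host=web01
    final double value;              // an integer or a decimal
    final long timestampSeconds;     // Unix time in seconds

    DataPointSketch(String metric, Map<String, String> tags, double value, long timestampSeconds) {
        this.metric = metric;
        // Sorted, read-only copy so the rendered tag order is deterministic.
        this.tags = Collections.unmodifiableMap(new TreeMap<>(tags));
        this.value = value;
        this.timestampSeconds = timestampSeconds;
    }

    /** Renders "put <metric> <timestamp> <value> <tagk=tagv> ...". */
    String toPutLine() {
        StringBuilder sb = new StringBuilder("put ")
            .append(metric).append(' ')
            .append(timestampSeconds).append(' ')
            .append(value);
        for (Map.Entry<String, String> e : tags.entrySet()) {
            sb.append(' ').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> tags = new TreeMap<>();
        tags.put("host", "web01"); // sample values, not from the article
        DataPointSketch point = new DataPointSketch("sys.cpu.usage", tags, 0.51, 1600000000L);
        System.out.println(point.toPutLine());
    }
}
```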

OpenTSDB's core storage consists of two tables: tsdb and tsdb-uid. The tsdb table stores the monitoring data, as shown below:

(Image source: https://www.jianshu.com )

The Row Key is the concatenation of the Metric, the Timestamp rounded down to the hour, and each TagKey and TagValue, with every name replaced by its mapped byte ID. Under column family t, the Qualifier is the number of seconds by which the Timestamp exceeds that hour, and the cell value is the Value.
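
The split of a timestamp between the row key and the qualifier can be sketched as follows. This is a simplified illustration, not OpenTSDB's code: the class and method names are hypothetical, and OpenTSDB's real qualifier encoding also packs format flags alongside the offset, which is omitted here.

```java
/** Sketch of how a timestamp splits into the row-key hour and the qualifier offset. */
final class RowKeyTimeSketch {
    static final long SECONDS_PER_HOUR = 3600L;

    /** Timestamp rounded down to the top of the hour; this part belongs to the row key. */
    static long baseHour(long timestampSeconds) {
        return timestampSeconds - (timestampSeconds % SECONDS_PER_HOUR);
    }

    /** Seconds elapsed since the top of the hour; this part belongs to the qualifier. */
    static int offsetSeconds(long timestampSeconds) {
        return (int) (timestampSeconds % SECONDS_PER_HOUR);
    }

    public static void main(String[] args) {
        long ts = 1600000123L; // arbitrary sample timestamp
        System.out.println("base hour: " + baseHour(ts));
        System.out.println("offset:    " + offsetSeconds(ts));
    }
}
```

Under this layout, all points of one series that fall within the same hour share a row and differ only in their qualifier.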

The tsdb-uid table stores the byte mappings just mentioned, as shown below:

(Image source: https://www.jianshu.com )

"001" in the figure stands for tagk=host or tagv=static; the table supports both forward and reverse lookups.

2) How we use OpenTSDB:

We do not use the REST interface provided by OpenTSDB; the client connects to HBase directly;
The Thrd thread that performs compaction is disabled in our build;
Data buffered in Redis is fetched and written to OpenTSDB in batches every 10 seconds.
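
The batching policy in the last item can be sketched roughly as below. This is not vivo's implementation: an in-memory queue stands in for Redis, the `writer` callback stands in for the actual write to OpenTSDB, and the class name, method names, and batch size limit are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Sketch of buffering points and writing them in batches on a fixed interval. */
final class BatchFlusherSketch {
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>(); // stand-in for Redis
    private final Consumer<List<String>> writer; // stand-in for the storage write
    private final int maxBatchSize;              // hypothetical limit per flush

    BatchFlusherSketch(Consumer<List<String>> writer, int maxBatchSize) {
        this.writer = writer;
        this.maxBatchSize = maxBatchSize;
    }

    void offer(String point) {
        buffer.offer(point);
    }

    /** Drains up to maxBatchSize buffered points and writes them as one batch. */
    int flushOnce() {
        List<String> batch = new ArrayList<>();
        buffer.drainTo(batch, maxBatchSize);
        if (!batch.isEmpty()) {
            writer.accept(batch);
        }
        return batch.size();
    }

    /** Schedules flushOnce every 10 seconds; the caller shuts the scheduler down. */
    ScheduledExecutorService start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::flushOnce, 10, 10, TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) {
        BatchFlusherSketch flusher = new BatchFlusherSketch(
            batch -> System.out.println("writing " + batch.size() + " points"), 500);
        flusher.offer("sys.cpu.usage 1600000000 0.51 host=web01");
        flusher.offer("sys.cpu.usage 1600000060 0.48 host=web01");
        flusher.flushOnce(); // called directly here instead of waiting for the timer
    }
}
```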

2.2 Points to note when using OpenTSDB in practice

1) Precision problem

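// Assumes asynchbase (org.hbase.async.Bytes) and OpenTSDB (net.opentsdb.uid.UniqueId) are on the classpath.
import org.hbase.async.Bytes;
import net.opentsdb.uid.UniqueId;

String value = "0.51";
float f = Float.parseFloat(value);               // parse the text as a 32-bit float
int raw = Float.floatToRawIntBits(f);            // IEEE 754 bit pattern of the float
byte[] float_bytes = Bytes.fromInt(raw);         // the 4 bytes that get stored
int raw_back = Bytes.getInt(float_bytes, 0);     // read the 4 bytes back
double decode = Float.intBitsToFloat(raw_back);  // widening float to double exposes the error
/**
 * Printed output:
 * Parsed Float: 0.51
 * Encode Raw: 1057132380
 * Encode Bytes: 3F028F5C
 * Decode Raw: 1057132380
 * Decoded Float: 0.5099999904632568
 */
System.out.println("Parsed Float: " + f);
System.out.println("Encode Raw: " + raw);
System.out.println("Encode Bytes: " + UniqueId.uidToString(float_bytes));
System.out.println("Decode Raw: " + raw_back);
System.out.println("Decoded Float: " + decode);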

As the code above shows, OpenTSDB stores this floating-point value as a 4-byte float and cannot know the precision the writer intended, so a precision problem appears on conversion: "0.51" is stored, but "0.5099999904632568" is read back.
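
The same effect can be reproduced with the JDK alone, which helps separate it from OpenTSDB itself. The sketch below uses `java.nio.ByteBuffer` in place of the byte helpers above; the class and method names are hypothetical. The second method shows one possible way to recover the intended decimal on the reading side, by going through the float's shortest decimal string. It is offered as an illustration only, not as the approach vivo uses.

```java
import java.nio.ByteBuffer;

/** JDK-only reproduction of the float round trip, plus one possible workaround. */
final class FloatPrecisionSketch {
    /** Encodes the text as a 4-byte float, decodes it, and widens the float to double. */
    static double roundTripWiden(String text) {
        byte[] bytes = ByteBuffer.allocate(4).putFloat(Float.parseFloat(text)).array();
        float decoded = ByteBuffer.wrap(bytes).getFloat();
        return decoded; // implicit float -> double widening keeps the binary error visible
    }

    /** Same round trip, but converts through the float's shortest decimal string. */
    static double roundTripViaString(String text) {
        byte[] bytes = ByteBuffer.allocate(4).putFloat(Float.parseFloat(text)).array();
        float decoded = ByteBuffer.wrap(bytes).getFloat();
        return Double.parseDouble(Float.toString(decoded));
    }

    public static void main(String[] args) {
        System.out.println("widened:    " + roundTripWiden("0.51"));
        System.out.println("via string: " + roundTripViaString("0.51"));
    }
}
```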

2) Aggregate function problem

Most of OpenTSDB's aggregation functions, including sum, avg, max, and min, use LERP (linear interpolation): when a series has no value at a timestamp, one is filled in by interpolation. This is very unfriendly to use cases that need missing values to stay empty. For details, see the OpenTSDB documentation on interpolation.

At present, the OpenTSDB used by vmonitor server monitoring is built from source code that we modified. We added a nimavg function which, together with the built-in zimsum function, meets our needs for handling null values.
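
To make the difference concrete, the sketch below compares a sum that fills a missing point by linear interpolation with a sum that treats the missing point as zero, which is the behavior of zimsum. The series values and the class and method names are made up for illustration; this is not OpenTSDB's aggregator code.

```java
/** Compares linear interpolation with zero-if-missing when summing two series. */
final class InterpolationSketch {
    /** Linearly interpolates the value at time t between (t0, y0) and (t1, y1). */
    static double lerp(long t0, double y0, long t1, double y1, long t) {
        return y0 + (y1 - y0) * (double) (t - t0) / (double) (t1 - t0);
    }

    public static void main(String[] args) {
        // Series A has points at t=0 (10.0) and t=20 (30.0) but nothing at t=10.
        // Series B has a point at t=10 (5.0).
        double aInterpolated = lerp(0, 10.0, 20, 30.0, 10);
        double b = 5.0;

        // LERP-style sum: A contributes a value it never reported.
        System.out.println("sum with LERP:        " + (aInterpolated + b));
        // Zero-if-missing sum: A contributes nothing at t=10.
        System.out.println("sum with zero-if-missing: " + (0.0 + b));
    }
}
```

With interpolation, series A contributes 20.0 at t=10 even though it reported nothing there, so the sum is 25.0 instead of 5.0.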

3. Strategy for collecting, reporting, and storing monitoring data

To reduce the cost of integrating with monitoring, and to keep RabbitMQ reporting failures and CDN configuration synchronization failures from affecting the monitoring system, the collection layer reports directly to the agent layer over HTTP. Queues in both the collection layer and the data agent layer are used to rescue as much data as possible when a disaster occurs.

The detailed process description is as follows:

1) The collector (vmonitor-collector) collects and compresses data once a minute according to the monitoring configuration and stores it in a local queue (maximum length 100, that is, at most 100 minutes of data). It then signals that HTTP reporting can proceed, and the data is reported to the gateway.

2) The gateway (vmonitor-gateway) authenticates the reported data and discards it if it is invalid. It also checks whether the downstream layer is currently circuit-broken because of failures. If so, it notifies the collection layer to put the data back into its queue.

3) The gateway checks the version number of the monitoring configuration carried in the report. If the version is out of date, the gateway includes the latest monitoring configuration in its response and requires the collection layer to update its configuration.

4) The gateway stores the reported data in the Redis queue for the corresponding application (the cache queue for a single application has a maximum length of 10,000). Once the data is queued, the gateway returns the HTTP response immediately to indicate that it has received the data, and the collection layer can remove that entry.

5) The gateway decompresses and aggregates the data from the Redis queue (this step is suspended while the circuit breaker is open), then stores the result to OpenTSDB over HTTP. If a large number of storage operations fail, the circuit breaker is triggered.
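
Two mechanisms from the steps above lend themselves to a short sketch: the collector's bounded local queue, whose entries are removed only after the gateway acknowledges them, and the gateway's circuit breaker around storage writes. Everything below is hypothetical and simplified. The class names, the drop-oldest policy when the queue is full, and the rule of opening the breaker after a fixed number of consecutive failures are assumptions, since the article does not specify them.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class ReportFlowSketch {
    /** Bounded local queue for the collector; drops the oldest batch when full (assumed policy). */
    static final class LocalQueue {
        private final Deque<String> queue = new ArrayDeque<>();
        private final int capacity;

        LocalQueue(int capacity) {
            this.capacity = capacity;
        }

        synchronized void add(String batch) {
            if (queue.size() == capacity) {
                queue.pollFirst(); // make room by discarding the oldest minute
            }
            queue.addLast(batch);
        }

        /** Oldest unreported batch, or null if the queue is empty. */
        synchronized String peek() {
            return queue.peekFirst();
        }

        /** Removes the oldest batch; call only after the gateway has acknowledged it. */
        synchronized void ack() {
            queue.pollFirst();
        }

        synchronized int size() {
            return queue.size();
        }
    }

    /** Minimal breaker that opens after a number of consecutive failures (assumed rule). */
    static final class Breaker {
        private final int failureThreshold;
        private int consecutiveFailures;
        private boolean open;

        Breaker(int failureThreshold) {
            this.failureThreshold = failureThreshold;
        }

        synchronized void recordSuccess() {
            consecutiveFailures = 0;
        }

        synchronized void recordFailure() {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                open = true;
            }
        }

        /** Closes the breaker again, for example after a health check passes. */
        synchronized void reset() {
            consecutiveFailures = 0;
            open = false;
        }

        synchronized boolean isOpen() {
            return open;
        }
    }

    public static void main(String[] args) {
        LocalQueue queue = new LocalQueue(100);
        for (int minute = 0; minute <= 100; minute++) {
            queue.add("batch-" + minute); // 101 batches into a queue of 100
        }
        System.out.println(queue.size() + " batches kept, oldest is " + queue.peek());

        Breaker breaker = new Breaker(3);
        for (int i = 0; i < 3; i++) {
            breaker.recordFailure();
        }
        System.out.println("breaker open: " + breaker.isOpen());
    }
}
```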

4. Core Indicators

4.1 System monitoring alarms and service monitoring alarms

After the collected data is stored in HBase through OpenTSDB, the distributed task distribution module carries out the alarm computation in a distributed fashion. If an alarm rule configured by the business party is satisfied, the corresponding alarm is triggered, and the alarm information is grouped and routed to the correct notifier. Alarms can be sent by SMS or through self-developed messaging, and alarm recipients can be looked up by name, employee ID, or pinyin. When a large number of repeated alarms arrive, the duplicates can be suppressed. All alarm information is recorded in MySQL tables for later query and statistics. The purpose of alarming is not only to help developers find faults in time and establish an emergency response mechanism, but also to help businesses organize their monitoring items and alarms according to their own characteristics, drawing on the best monitoring practices in the industry. The alarm flow chart is as follows:

4.2 Supported alarm types and calculation formulas

1) Maximum value: triggers an alarm when the specified field exceeds this value (alarm threshold unit: number).

2) Minimum value: triggers an alarm when the specified field falls below this value (alarm threshold unit: number).

3) Volatility: takes the maximum or minimum value over the 15 minutes up to the current time, together with the average over the same 15 minutes, and raises an alarm on the percentage deviation. A fluctuation baseline must be configured: the alarm threshold is evaluated only when the value reaches the baseline, and no alarm is triggered when the value is below the baseline (alarm threshold unit: percent).

Calculation formulas:

Volatility - calculation formula for upward fluctuation: float rate = (float) (max - avg) / (float) avg;
Volatility - calculation formula for downward fluctuation: float rate = (float) (avg - min) / (float) avg;
Volatility - interval fluctuation calculation formula: float rate = (float) (max - min) / (float) max;
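
The three volatility formulas can be wrapped into small helper methods, as in the sketch below. The formulas follow the ones given above; the class name, the `shouldAlarm` helper, and the reading that the baseline is compared against the observed maximum or minimum are assumptions made for illustration. Note that a zero average or maximum makes the division yield NaN or infinity, which a real implementation would need to guard against.

```java
/** Sketch of the volatility formulas over a 15-minute window. */
final class VolatilitySketch {
    static float upward(float max, float avg) {
        return (max - avg) / avg;
    }

    static float downward(float min, float avg) {
        return (avg - min) / avg;
    }

    static float interval(float max, float min) {
        return (max - min) / max;
    }

    /**
     * Assumed rule: skip the check when the observed value is below the baseline,
     * otherwise alarm when the rate, as a percentage, exceeds the threshold.
     */
    static boolean shouldAlarm(float observed, float baseline, float rate, float thresholdPercent) {
        if (observed < baseline) {
            return false;
        }
        return rate * 100f > thresholdPercent;
    }

    public static void main(String[] args) {
        // Sample window with max = 150, min = 80, avg = 100 (made-up values).
        float max = 150f, min = 80f, avg = 100f;
        System.out.println("upward:   " + upward(max, avg));
        System.out.println("downward: " + downward(min, avg));
        System.out.println("interval: " + interval(max, min));
        System.out.println("alarm:    " + shouldAlarm(max, 120f, upward(max, avg), 30f));
    }
}
```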

4) Day-over-day ratio: compares the value at the current time with the value at the same time yesterday and raises an alarm on the percentage change (alarm threshold unit: percent).

Calculation formula: float rate = (current value - previous period value) / previous period value

5) Week-over-week ratio: compares the value at the current time with the value at the same time on the same weekday last week and raises an alarm on the percentage change (alarm threshold unit: percent).

Calculation formula: float rate = (current value - previous period value) / previous period value

6) Hourly day-over-day ratio: compares the sum of the data values over the hour up to the current time with the sum over the same hour yesterday and raises an alarm on the percentage change (alarm threshold unit: percent).

Calculation formula: float rate = (float) (anHourTodaySum - anHourYesterdaySum) / (float) anHourYesterdaySum.
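
The period-over-period formulas share one shape, which the sketch below captures in a single helper and then reuses for the hourly sum comparison. The class and method names and the sample values are hypothetical. As with the volatility formulas, a previous value of zero would produce NaN or infinity and needs separate handling in practice.

```java
/** Sketch of the day-over-day, week-over-week, and hourly comparison formulas. */
final class PeriodRatioSketch {
    /** rate = (current - previous) / previous */
    static float rate(float current, float previous) {
        return (current - previous) / previous;
    }

    /** Compares the sum of the last hour today with the sum of the same hour yesterday. */
    static float hourlyRate(float[] lastHourToday, float[] sameHourYesterday) {
        return rate(sum(lastHourToday), sum(sameHourYesterday));
    }

    private static float sum(float[] values) {
        float total = 0f;
        for (float v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sample values: 120 now versus 100 in the previous period.
        System.out.println("rate:        " + rate(120f, 100f));
        System.out.println("hourly rate: " + hourlyRate(
            new float[] {10f, 20f, 30f}, new float[] {20f, 20f, 10f}));
    }
}
```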