
Author |
Review & Proofreading: Bai Yu
Editing & Typesetting:

After accompanying many companies through their moves to the cloud and to cloud native, we have seen that building an operations and maintenance (O&M) monitoring system is difficult for every one of them. The difficulty comes from rapid business growth and increasingly strict and complex IT requirements, and it shows up not only in O&M team structure and workflow but also in tool selection and platform construction. Companies of different sizes and stages face different practical problems, but there are still best practices to follow. Today, let's talk about the ideas behind tool selection and platform construction, and the key points in practice.

Inevitable trends in tool selection and platform construction

It should be noted that building a monitoring platform is not just a matter of downloading an open source monitoring tool. The tool needs to be integrated and further developed according to the characteristics of the monitored business so that it matches the actual business situation. After a lot of practice, we have found the following common requirements and development directions for enterprise monitoring systems:

  • Automatic discovery and collection

Cloud native brings a technology architecture that spans technology stacks and is highly dynamic. Faced with such a complex and changeable environment, the collector should recognize the environment automatically as far as possible, and autonomous collection of metrics becomes the starting point for everything. If no data can be collected, how can anything be monitored?

  • Continuously strengthened data management capabilities

The emergence of clouds, containers, and microservices has increased the number of monitored objects by several orders of magnitude. When the business is growing rapidly and produces hundreds of millions or even billions of time series, how should that data be managed?

  • Data dashboards have become a hard requirement

With the explosive growth of data volume, traditional displays such as line charts, histograms, and scatter plots make it difficult for O&M personnel to find anomalies or hidden bottlenecks in the data. Finding a more suitable dashboard and display form for each business or monitored object has become a required skill for every O&M engineer.

  • Role as a middle-platform hub

With the rapid development of technology, the monitoring system's role as the middle platform of the overall O&M system has become more and more obvious, and O&M monitoring has changed from process-driven to data-driven. How to integrate more conveniently with the many other O&M subsystems is also something the O&M team needs to consider when the monitoring system is first designed.

The evolution of the enterprise monitoring system

Combining the above characteristics, the evolution of the enterprise monitoring system can be summarized in the following stages.

Promotion period: between 50 and 100 servers

At this stage, because there are few servers and the business is small, the O&M team's monitoring requirements are relatively simple: basic problem notification, and the ability to locate and solve problems quickly. Platform construction at this point is mainly about letting R&D and O&M colleagues gradually become familiar with the product and, through experience and feedback, confirm whether it meets the enterprise's IT O&M needs and business characteristics. Several key features include:

(1) Simple deployment, mature documentation and service system, easy to use;

(2) Stable operation, SLA guarantee;

(3) Alarm notifications do not need to come in many forms, but they must be timely and reliable;

(4) Low cost or free.

Based on the above requirements, many start-ups choose open source tools such as Nagios, Cacti, Zabbix, and Ganglia. Popular open source monitoring products have relatively complete documentation, can be put to use quickly, and have a large number of enterprise practices for reference. The problem is that the performance and supported scenarios of open source products often cannot keep up with evolving business scenarios and growing business volume, so various problems arise. High availability also becomes a critical weakness. After all, the open source community does not always have volunteers available to help us troubleshoot.

Outbreak period: between 200 and 1000 servers

At this stage, as the number of servers increases, the technical architecture changes, and components become more varied, monitoring requirements also become more complex. Faced with many service modules and O&M systems, we need to connect them in batches and in an orderly way and, while ensuring stability, ramp up quickly and unify the technology stack. The monitoring system is mainly used for alarm notification, to find problems and to prevent the same problems from recurring. Key features at this stage include:

(1) Summary and classification of monitoring content

As monitored objects and information grow with the technical architecture and business scale, full-coverage monitoring is needed across different dimensions of data such as software, hardware, and business. Monitoring also needs to be classified and summarized by purpose, for example basic system monitoring data, network monitoring data, and business monitoring data. Cover as much as possible, find important problems as early as possible, and keep the business running stably.

(2) Multiple alerting channels, timely delivery, and no missed alarms

Monitored objects are graded by importance and urgency, and alarm notifications are sent at different levels through different channels such as email, WeChat, SMS, and telephone. Each monitored item has a designated owner, so that every alarm is followed up and handled in time.
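The grading and ownership idea described here can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the severity levels, the channel mapping, the owner table, and the `route_alarm` function are hypothetical names, not part of any particular monitoring product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3
    FATAL = 4


# Hypothetical escalation policy: more severe alarms fan out to more channels.
CHANNELS_BY_SEVERITY = {
    Severity.INFO: ["email"],
    Severity.WARNING: ["email", "wechat"],
    Severity.CRITICAL: ["email", "wechat", "sms"],
    Severity.FATAL: ["email", "wechat", "sms", "phone"],
}

# Hypothetical ownership table: each monitored item maps to one owner.
OWNERS = {"order-service": "alice", "mysql-primary": "bob"}
DEFAULT_OWNER = "oncall"  # fallback so that no alarm is left without an owner


@dataclass(frozen=True)
class Notification:
    owner: str
    channel: str
    message: str


def route_alarm(target: str, severity: Severity, message: str) -> list[Notification]:
    """Build one notification per channel configured for this severity."""
    owner = OWNERS.get(target, DEFAULT_OWNER)
    text = f"[{severity.name}] {target}: {message}"
    return [Notification(owner, channel, text) for channel in CHANNELS_BY_SEVERITY[severity]]
```

For example, `route_alarm("mysql-primary", Severity.CRITICAL, "replication lag")` yields three notifications for owner `bob`, one each for email, WeChat, and SMS. Keeping the policy in plain tables makes it easy to review who is paged for what.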

(3) Alarm strategy optimization and information convergence

As more and more services need to be monitored, the number of alarm messages surges, and thousands of alarm emails may arrive every day. Too many alarms defeat the purpose of accurate notification. Configuring and optimizing the alarm policy to minimize unnecessary alarm emails becomes the core of policy design.
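One common way to converge alarm information is to merge repeated alarms for the same object and rule inside a time window into a single summary. The sketch below shows that idea only; the `Alarm` fields, the `converge` function, and the 5-minute default window are our assumptions, and real systems usually add grouping by topology or root cause on top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    timestamp: float  # seconds since epoch
    target: str       # monitored object, e.g. a host or service name
    rule: str         # name of the alarm rule that fired


def converge(alarms, window_seconds=300.0):
    """Merge repeats of the same (target, rule) that fall within one window.

    Returns one summary dict per group, ordered by first occurrence.
    """
    summaries = []
    latest_group = {}  # (target, rule) -> index of its most recent summary
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.target, alarm.rule)
        idx = latest_group.get(key)
        if idx is not None and alarm.timestamp - summaries[idx]["first_seen"] < window_seconds:
            # Same problem still firing inside the window: count it, do not re-notify.
            summaries[idx]["count"] += 1
            summaries[idx]["last_seen"] = alarm.timestamp
        else:
            # First occurrence, or the previous window has expired: open a new group.
            latest_group[key] = len(summaries)
            summaries.append({
                "target": alarm.target,
                "rule": alarm.rule,
                "first_seen": alarm.timestamp,
                "last_seen": alarm.timestamp,
                "count": 1,
            })
    return summaries
```

With this approach, a rule that fires every minute for an hour produces 12 notifications instead of 60 at the default window, and each summary still records how many times the rule fired.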

Maturity period: more than 1,000 servers

As the business continues to grow, the demand for servers keeps increasing. When the number of servers exceeds 1,000, all core systems need to be connected and a new stability assurance system must be built, including monitoring dashboards, alarm notifications, and emergency on-call duty, in order to keep the business and the technology stack stable as a whole. At this stage, pay attention to the following:

(1) Monitoring delay and alarm lag

As the business grows larger, coupling between components or services makes it likely that a small local fault will paralyze the entire business system. Timely detection of problems therefore becomes the prerequisite for everything else. If you are still relying on an open source product, you may run into significant trouble at this point. Take Zabbix as an example: once the scale reaches a certain level, monitoring data sometimes cannot be displayed in time and alarms are delayed. These issues can indeed be mitigated through various optimizations, but the losses caused by business problems in the meantime cannot be recovered.

(2) SLA of the monitoring system itself

When the volume of collected O&M data grows rapidly, the high availability of the monitoring system itself also becomes an important concern. After all, losing the monitoring system means losing control over how the entire technology stack and business are running.
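Both concerns above, delayed data and the availability of the monitoring system itself, can be watched with a small staleness check that runs outside the monitoring system. The following is a minimal sketch under our own assumptions: `find_stale_targets` and the 120-second threshold are hypothetical, and how the latest sample timestamps are obtained depends on the monitoring backend in use.

```python
import time


def find_stale_targets(last_sample_at, max_age_seconds=120.0, now=None):
    """Return the targets whose newest sample is older than max_age_seconds.

    last_sample_at maps a target name to the Unix timestamp of its latest sample.
    """
    if now is None:
        now = time.time()
    return sorted(
        target
        for target, sampled_at in last_sample_at.items()
        if now - sampled_at > max_age_seconds
    )
```

For instance, `find_stale_targets({"web-1": 100.0, "db-1": 10.0}, max_age_seconds=120.0, now=150.0)` returns `["db-1"]`. If most targets turn stale at the same moment, the likely culprit is the collection pipeline rather than the targets, which is why the check should not depend on the system it is watching.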

A more cost-effective solution: Application Real-Time Monitoring Service (ARMS)

Facing the pain points of the stages above, ARMS is a well-suited solution. Alibaba Cloud has also launched the ARMS 3.0 inclusive program, which aims to give different types of users a more cost-effective observability experience at a more reasonable cost through more flexible billing. The basic version of application monitoring (pay-as-you-go), launching in October 2021, can be used at zero cost: metrics are stored free for 3 days, and basic call chain sampling is stored free for 1 day. Its functions are the same as those of the original basic version. You can pay by usage to extend the storage period or increase link sampling. For details, see the function list of the basic version of application monitoring or the product billing description.

[Image: 图片 1.png]

To match user needs in the stages above, ARMS 3.0 application monitoring introduces a flexible billing strategy:

(1) Trial period: ARMS gives new users 15 days of free use, so they can fully evaluate how well ARMS fits their business.

(2) Promotion period: the ARMS basic version provides a free quota. Application monitoring metrics are stored free for 3 days, and basic call chain sampling is stored free for 1 day. It can be used indefinitely with zero threshold, so cost is not a concern during the promotion period.

(3) Outbreak period: the ARMS basic version supports billing by traffic, and you can adjust the call chain sampling rate of a specified application or extend the storage period as needed.

(4) Maturity period: choose freely between billing by traffic and billing by node, depending on the type of business traffic.

Billing by traffic: pay only for what you use

With the popularity of microservices and Kubernetes, services are split more and more finely, and the traffic of a single Pod keeps shrinking. The node-based billing model is not flexible enough for this: when business traffic stays constant, it is hard to justify a cost that grows rapidly with the number of nodes.

To address observability cost for users with small or elastic traffic, ARMS 3.0 introduced the basic version of application monitoring (pay-as-you-go): basic call chain sampling is stored free for 1 day, and paid link sampling is charged at 0.2 yuan/(million Trace*day). A single Trace can contain up to 10 Span calls, and anything beyond that is converted proportionally. Metric data is stored free for 3 days, and the storage period can be extended on a pay-as-you-go basis, as shown in the following table.

[Table image: 图片 2(2).png]

Take a user of the ARMS basic version as an example. The user has created about 300 Pods, the total original call volume is about 5.4 billion calls/day, the call chain sampling rate is 10%, and the actual storage volume is about 54 million Traces/day. With the original basic version's settings of 1 day of link storage and 3 days of metric storage, upgrading to traffic-based billing can save more than 90% of the cost.
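The traffic-based pricing rule can be turned into a small calculator. The sketch below uses only the two figures quoted in the text (0.2 yuan per million Trace*day, and 10 Spans per Trace before proportional conversion). Everything else is our assumption: the function names are hypothetical, and we read "converted proportionally" as a Trace with 25 Spans counting as 2.5 billable Traces. The free quota is ignored, so the result overestimates a real bill.

```python
PRICE_YUAN_PER_MILLION_TRACE_DAYS = 0.2  # price quoted in the article
SPANS_PER_BILLABLE_TRACE = 10            # Span allowance per Trace quoted in the article


def billable_trace_units(span_count):
    """Convert one Trace into billable units (assumed reading of the rule)."""
    # Up to 10 Spans count as one Trace; larger Traces scale proportionally.
    return max(1.0, span_count / SPANS_PER_BILLABLE_TRACE)


def sampled_units_per_day(span_calls_per_day, sampling_rate, avg_spans_per_trace=10):
    """Estimate the billable Trace units stored per day from Span-level call volume."""
    traces_per_day = span_calls_per_day / avg_spans_per_trace
    return traces_per_day * sampling_rate * billable_trace_units(avg_spans_per_trace)


def trace_cost_yuan_per_day(units_per_day, storage_days=1):
    """Steady-state daily cost of storing each day's Traces for storage_days."""
    return units_per_day / 1_000_000 * PRICE_YUAN_PER_MILLION_TRACE_DAYS * storage_days


if __name__ == "__main__":
    # Figures from the example above; treating 5.4 billion as Span-level calls with
    # 10 Spans per Trace is our assumption. It reproduces the quoted 54 million/day.
    units = sampled_units_per_day(5_400_000_000, 0.10)
    print(f"{units:,.0f} billable Traces/day")
    print(f"{trace_cost_yuan_per_day(units):.2f} yuan/day before the free quota")
```

Because the cost scales with stored Traces rather than with Pod count, adding more Pods at constant traffic does not change the result, which is the point of the traffic-based model.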

[Image: 图片 3.png]

Large traffic: billing by node is more cost-effective

Some consumer-facing (ToC) businesses have very large traffic and high requirements for tracing problems back, which calls for long-term storage. In this case, you can choose the node-based billing model of the ARMS expert version: link storage for 30 days and metric storage for 90 days at an all-inclusive, capped price, which is better suited to high-traffic core applications. Users of Container Service ACK or EDAS also get a half-price discount on the expert version, and with a prepaid data package the price can be as low as 1.308 yuan/(probe*day). For details, see the ARMS product price description.
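To get a feel for when node-based billing wins, the two prices quoted in this article can be compared directly. This is a rough sketch under strong assumptions: it uses the lowest quoted node price (1.308 yuan per probe per day), assumes the 0.2 yuan/(million Trace*day) rate applies linearly to every day of link storage, and ignores metric storage fees, free quotas, and discounts. The function names are hypothetical; check the ARMS price description for actual figures.

```python
TRACE_PRICE_YUAN_PER_MILLION_TRACE_DAYS = 0.2  # traffic-based price quoted in the article
NODE_PRICE_YUAN_PER_PROBE_DAY = 1.308          # lowest node-based price quoted in the article


def traffic_cost_per_day(traces_per_day, storage_days):
    """Daily cost if every stored Trace is billed per day of storage (assumption)."""
    return traces_per_day / 1_000_000 * TRACE_PRICE_YUAN_PER_MILLION_TRACE_DAYS * storage_days


def node_cost_per_day(probe_count):
    """Daily cost when billing by node, independent of traffic."""
    return probe_count * NODE_PRICE_YUAN_PER_PROBE_DAY


def break_even_traces_per_probe(storage_days):
    """Stored Traces per probe per day above which node billing becomes cheaper."""
    yuan_per_trace = TRACE_PRICE_YUAN_PER_MILLION_TRACE_DAYS / 1_000_000 * storage_days
    return NODE_PRICE_YUAN_PER_PROBE_DAY / yuan_per_trace


def cheaper_mode(traces_per_day, probe_count, storage_days=30):
    """Return "node" or "traffic", whichever is cheaper under these assumptions."""
    traffic = traffic_cost_per_day(traces_per_day, storage_days)
    node = node_cost_per_day(probe_count)
    return "node" if node < traffic else "traffic"
```

Under these assumptions, the break-even point moves lower as the storage period grows, which matches the article's advice that high-traffic applications needing long retention are the natural fit for node-based billing.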

Frequently asked questions

Q: How can new and old users upgrade to the new basic version (pay-as-you-go) mode of application monitoring?
A: From October 2021, new users who activate the basic version after their trial period ends enter the pay-as-you-go billing mode by default. Existing basic version users can click the upgrade entry at the top of the application monitoring -> application list page to switch to the new billing mode. Free link sampling in the new basic version requires upgrading the Agent to version 2.7.1.3. You can select the corresponding region and download it from the application monitoring -> Agent list -> Java version description page: https://arms.console.aliyun.com/#/tracing/agentList/cn-hangzhou

Q: Is the new basic version (pay-as-you-go) free by default? How long is it free?
A: Once the new basic version (pay-as-you-go) is activated, it is completely free by default. As long as you do not change the storage period or the call chain sampling rate, you can use it for free indefinitely, which makes it well suited to low-traffic or test applications.

Q: What features does the basic version include? How does it differ from open source tools and the expert version?
A: The basic version supports basic APM functions such as call chains, service monitoring, JVM/host monitoring, and alarms, which are roughly on par with open source capabilities. The expert version is greatly enhanced in memory, thread, and exception diagnosis. It is billed by node, stores call chains for 30 days and metrics for 90 days, and is better suited to high-traffic or core production applications.

Q: In addition to application monitoring, do ARMS front-end monitoring, cloud dial testing, and Prometheus monitoring support pay-as-you-go billing?
A: ARMS front-end monitoring, cloud dial testing, and Prometheus monitoring all support pay-as-you-go billing, and discounts are available through prepayment. For details, see the ARMS product price description.

Related Links:

1) Application monitoring basic version function list:
https://help.aliyun.com/document_detail/65682.html

2) Product billing description:
https://www.aliyun.com/ntms/price/detail/arms_detail

3) ARMS product price description:
https://www.aliyun.com/ntms/price/detail/arms_detail

Click on the link below to learn more about Double Eleven offers!
https://www.aliyun.com/activity/1111/cloudnative?spm=5176.20960838.0.0.7603305eAKBvI9

