This article is part of the "Dev for Dev" column series. The author is Huang Nanxun, a big data algorithm engineer at Shengwang.

01 Introduction to automatic operation and maintenance

In 2016, Gartner proposed the concept of AIOps [1], opening a new chapter in AI-assisted operation and maintenance decision-making.

AIOps stands for Artificial Intelligence for IT Operations, that is, artificial intelligence applied to IT operation and maintenance. Traditional operation and maintenance often relies on a few specialists who monitor services and make decisions for a specific scenario. As a company grows and its business scenarios and volumes increase exponentially, this approach runs into long decision times, difficult decisions, and high labor costs, and a single major decision error can cause huge business losses. Processing massive amounts of data, however, is exactly what machine learning excels at.

A mature set of machine learning algorithms can accumulate judgment experience from past operation and maintenance work, continuously monitor and analyze data, and provide valuable information for operation and maintenance decisions.

02 Automatic O&M in SD-RTN™ Scenario

1. Scenario introduction

SD-RTN™ (Software Defined Real-time Network) is a software-defined real-time network designed by Shengwang for two-way real-time audio and video interaction.

At its core is an audio and video transmission network built from equipment rooms all over the world. Each equipment room sends and receives data as information is transmitted. The quality of all audio and video passing through these equipment rooms is collected and reported for real-time quality monitoring. Once these indicators show unacceptable problems in calls passing through a certain equipment room, corresponding operation and maintenance actions must be taken on that room to ensure a high-quality audio and video experience for users.

The traditional approach monitors equipment room quality with absolute thresholds ("water levels") or logical conditions. Although this kind of monitoring can identify some quality anomalies, it suffers from serious missed alarms and false alarms and looks at only a single dimension. It cannot discriminate between cases that sit close to the threshold, and it cannot recognize abnormal shapes in the transmission quality curve.
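
To make this limitation concrete, the following minimal sketch contrasts an absolute-threshold rule with a rule that compares each sample against its own recent history. It is only an illustration, not Shengwang's detection algorithm: the function names, the sample quality values, and the parameters (`floor`, `window`, `k`, `min_std`) are all assumptions.

```python
from statistics import mean, stdev


def threshold_alarm(series, floor):
    """Absolute-threshold rule: alarm whenever a sample drops below `floor`."""
    return [i for i, value in enumerate(series) if value < floor]


def baseline_alarm(series, window=5, k=3.0, min_std=0.5):
    """Alarm when a sample deviates from the mean of the previous `window`
    samples by more than `k` standard deviations (floored at `min_std`)."""
    alarms = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = max(stdev(history), min_std)
        if abs(series[i] - mu) > k * sigma:
            alarms.append(i)
    return alarms


# Hypothetical quality scores (percent) for one equipment room.
quality = [99.0, 99.1, 98.9, 99.0, 99.2, 99.0, 95.5, 95.4, 95.6, 95.5]

print(threshold_alarm(quality, floor=95.0))
print(baseline_alarm(quality))
```

With these sample values the quality falls by more than three points but stays just above the fixed floor of 95, so the threshold rule reports nothing, while the history-based rule flags index 6, where the drop begins.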

Through the cooperation of the business, algorithm, data, and operation and maintenance teams, Shengwang has built a dedicated SD-RTN™ AIOps framework that gradually replaces manual operation and maintenance with machine learning, creating a fast and reliable automated operation and maintenance process.

2. Full link display

[Figure: the current AIOps process]

The current AIOps process is shown in the figure. Data reported from a large number of equipment rooms is processed and stored by the data center, and the big data algorithm platform reads the data stream to monitor anomalies in real time at both the equipment room level and the regional level. At the same time, quality recovery detection monitors whether an abnormal equipment room has recovered. Records of automatic disable and restore operations are stored in the algorithm platform as samples, which are used to evaluate the algorithm and provide a continuous data source for subsequent training.

At present, the algorithm covers detection on the high-quality transmission rate at second-level and minute-level granularity, equipment room link detection, and equipment room memory overflow risk detection, monitoring a massive number of equipment rooms from multiple dimensions.

When the quality of an equipment room degrades significantly, the algorithm ensures that the entire link responds within tens of seconds, automatically performs the operation and maintenance action on the equipment room, and later restores the room automatically once its quality has recovered. At present, the algorithm performs an average of 50 to 100 automatic operation and maintenance actions per day, which has almost completely replaced manual operations. Precision and recall for perceptible equipment room anomalies exceed 97%, and traffic is fully reconnected within ten minutes after a fault is resolved, reaching the level of fine-grained operation and maintenance.
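
The disable-then-restore cycle described above can be pictured as a small state machine. The sketch below is an assumption-laden illustration rather than the production logic: the `RoomController` class, the `recover_after` parameter, and the per-tick boolean verdicts are hypothetical stand-ins for the real anomaly and recovery detectors.

```python
from dataclasses import dataclass, field


@dataclass
class RoomController:
    """Tracks whether one equipment room is currently serving traffic."""
    recover_after: int = 3      # hypothetical: healthy checks needed before restoring
    enabled: bool = True
    healthy_streak: int = 0
    log: list = field(default_factory=list)   # operation records, reusable as samples

    def observe(self, tick: int, abnormal: bool) -> None:
        if self.enabled:
            if abnormal:
                # Anomaly detected: take the room out of service.
                self.enabled = False
                self.healthy_streak = 0
                self.log.append((tick, "disable"))
        else:
            # Recovery detection: count consecutive healthy checks.
            self.healthy_streak = 0 if abnormal else self.healthy_streak + 1
            if self.healthy_streak >= self.recover_after:
                self.enabled = True
                self.log.append((tick, "restore"))


controller = RoomController()
verdicts = [False, True, True, False, True, False, False, False, False]
for tick, abnormal in enumerate(verdicts):
    controller.observe(tick, abnormal)

print(controller.log)
```

In this run the room is disabled at tick 1 and restored at tick 7, after three consecutive healthy checks. Requiring a streak before restoring is one simple way to avoid flapping; the article does not say how the real recovery detection decides.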

Optimization of the whole link is also in progress this year. The algorithm team is working on automatic deployment and automatic operation and maintenance of the algorithms themselves, in order to speed up model iteration, improve self-recovery when an algorithm fails, and make the work of the operation and maintenance team easier. The data platform will build a high-availability data center to keep data sources available throughout the year. The operation and maintenance platform will become a programmable platform so that operation and maintenance actions form a closed loop. Algorithm decisions will be passed along as an information flow so that every alarm can be traced across the entire link. Together, these efforts aim to create high-performance, highly robust automated operation and maintenance products.

3. Algorithm introduction

The algorithm team and the business side jointly labeled and mined a large amount of abnormal equipment room data on the labeling platform developed by the algorithm team, classified the abnormal quality curves by their characteristics, and developed a specific identification scheme for each type.

Once an anomaly is identified, the algorithm further calculates, from the shape of the curve and other characteristics, the probability that each manufacturer is responsible for the change in the overall quality curve. This avoids misjudgments caused by a single manufacturer that accounts for too large a share of the overall curve.
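
The article does not describe how this probability is computed. As a much simpler stand-in, the sketch below uses a leave-one-out comparison: it recomputes the overall quality with each manufacturer removed and reports how much the result changes. The function names, the manufacturer labels, and the call counts are hypothetical.

```python
def overall_quality(samples, exclude=None):
    """Traffic-weighted share of good calls, optionally leaving one manufacturer out."""
    good = total = 0
    for manufacturer, (good_calls, all_calls) in samples.items():
        if manufacturer == exclude:
            continue
        good += good_calls
        total += all_calls
    return good / total if total else None


def manufacturer_impact(samples):
    """Change in overall quality if each manufacturer's traffic were removed."""
    base = overall_quality(samples)
    impact = {}
    for manufacturer in samples:
        without = overall_quality(samples, exclude=manufacturer)
        impact[manufacturer] = None if without is None else without - base
    return impact


# Hypothetical (good calls, total calls) per manufacturer for one room and time window.
window = {"mfr_a": (700, 1000), "mfr_b": (99, 100), "mfr_c": (198, 200)}
impact = manufacturer_impact(window)

for manufacturer, delta in sorted(impact.items(), key=lambda item: -item[1]):
    print(f"{manufacturer}: {delta:+.3f}")
```

Here removing `mfr_a` lifts the overall quality from about 0.77 to 0.99, which suggests the dip comes from that one manufacturer's traffic rather than from the equipment room as a whole.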

The algorithm also drills down to the regional level. Once users in a certain region who are connected to a specific equipment room show widespread quality anomalies, a dedicated alarm mechanism is triggered for subsequent processing.
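
One way to picture the regional drill-down is to group calls by (region, equipment room) pair and check each group separately. The sketch below is only an illustration under assumed inputs: the `calls` tuples, the region and room labels, and the `min_calls` and `max_bad_ratio` parameters are hypothetical.

```python
from collections import defaultdict


def regional_alarms(calls, min_calls=20, max_bad_ratio=0.3):
    """Return (region, room) pairs whose share of bad calls exceeds `max_bad_ratio`.

    `calls` is an iterable of (region, room, is_bad) tuples.
    """
    stats = defaultdict(lambda: [0, 0])   # (region, room) -> [bad, total]
    for region, room, is_bad in calls:
        entry = stats[(region, room)]
        entry[0] += int(is_bad)
        entry[1] += 1
    return sorted(
        key for key, (bad, total) in stats.items()
        if total >= min_calls and bad / total > max_bad_ratio
    )


# Hypothetical calls: region_x users on room_1 are badly affected, others are fine.
calls = (
    [("region_x", "room_1", True)] * 15
    + [("region_x", "room_1", False)] * 15
    + [("region_y", "room_1", False)] * 40
    + [("region_x", "room_2", False)] * 30
)

print(regional_alarms(calls))
```

Across all regions, `room_1` has 15 bad calls out of 70 (about 21%), which stays under the 30% limit, yet for `region_x` alone the ratio is 50%. A room-level check would miss what the per-region grouping catches.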

Equipment room link detection is packet-based: the health of an equipment room is represented by the health of all packets departing from and arriving at it.

The algorithm team developed an abnormal-state baseline to judge equipment room quality. If the packets entering and leaving the equipment room show a widespread overall anomaly or a localized but severe anomaly, the abnormal value is increased; if they are completely stable, the abnormal value is decreased. When the abnormal value exceeds the system baseline, an alarm is raised and an automatic operation and maintenance action is triggered.
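
The accumulate-and-decay behavior described above can be sketched as a running score. This is a minimal illustration, not the team's actual baseline: the two input signals (`bad_ratio`, `worst_loss`), the cut-offs 0.5 and 0.3, and the parameters `baseline`, `wide`, `severe`, and `decay` are all assumed values.

```python
def run_baseline(observations, baseline=5.0, wide=2.0, severe=3.0, decay=1.0):
    """Accumulate an abnormal value per tick; alarm when it rises above `baseline`.

    Each observation is (bad_ratio, worst_loss):
      bad_ratio   share of the room's packet flows that look unhealthy
      worst_loss  loss rate on the single worst flow
    """
    score = 0.0
    history, alarms = [], []
    for tick, (bad_ratio, worst_loss) in enumerate(observations):
        previous = score
        if bad_ratio >= 0.5:
            score += wide                      # widespread overall anomaly
        elif worst_loss >= 0.3:
            score += severe                    # localized but severe anomaly
        elif bad_ratio == 0.0:
            score = max(0.0, score - decay)    # completely stable: score decays
        history.append(score)
        if score > baseline and previous <= baseline:
            alarms.append(tick)                # alarm only on the upward crossing
    return history, alarms


observations = [
    (0.0, 0.0), (0.6, 0.1), (0.7, 0.2), (0.1, 0.05),
    (0.05, 0.4), (0.0, 0.0), (0.0, 0.0),
]
history, alarms = run_baseline(observations)
print(history)
print(alarms)
```

Two widespread anomalies raise the score to 4, a minor blip leaves it unchanged, and a single severe flow at tick 4 pushes it to 7, above the baseline of 5, which raises the alarm. The score then decays while the room stays stable.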

Equipment room memory detection applies several filtering and smoothing methods, combined with business logic, to find the breakpoint of the memory curve. It then predicts future memory usage from that breakpoint, identifies machines that are about to run out of memory, and issues an alarm notification.
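
A rough sketch of this smooth, find-breakpoint, extrapolate sequence is shown below. The concrete choices are assumptions made for illustration: a centered moving average as the smoother, a sharp drop (such as a process restart) as the breakpoint, and a least-squares line as the predictor. The article does not say which filters or models are actually used.

```python
def centered_average(values, half=1):
    """Centered moving average; returns (index, value) pairs where the full window fits."""
    width = 2 * half + 1
    return [
        (i, sum(values[i - half:i + half + 1]) / width)
        for i in range(half, len(values) - half)
    ]


def last_breakpoint(values, drop=0.2):
    """Index of the most recent sample that fell by more than `drop` (fraction), else 0."""
    for i in range(len(values) - 1, 0, -1):
        if values[i] < values[i - 1] * (1 - drop):
            return i
    return 0


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ticks_until_overflow(memory_gb, capacity_gb):
    """Estimated ticks from the last sample until memory reaches capacity, or None."""
    start = last_breakpoint(memory_gb)
    points = centered_average(memory_gb[start:])
    if len(points) < 2:
        return None                      # too little data after the breakpoint
    xs = [start + i for i, _ in points]
    ys = [value for _, value in points]
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None                      # flat or shrinking: no overflow predicted
    overflow_tick = (capacity_gb - intercept) / slope
    return max(0.0, overflow_tick - (len(memory_gb) - 1))


# Hypothetical memory samples (GB): growth, a restart at index 4, then growth again.
memory_gb = [10.0, 11.0, 12.1, 12.9, 4.0, 5.1, 5.9, 7.0, 8.1, 8.9]
remaining = ticks_until_overflow(memory_gb, capacity_gb=16.0)
print(last_breakpoint(memory_gb), remaining)
```

Fitting only the segment after the breakpoint matters here: a line through the whole series would be dragged down by the restart and would underestimate how quickly memory is growing again.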

03 Automatic operation and maintenance in RTSC scenario

1. Scenario introduction

Real Time Streaming Center (RTSC) is a service that processes real-time media streams in the cloud and publishes them to different platforms. Building on RTC media streams, it supports scenarios such as cloud recording, bypass live streaming, cloud stream mixing, cloud screenshots, and online media stream input.

It also supports external media source input and processing. Services such as RTSC streaming and cloud recording mainly rely on the quality of information transmission between machines and the quality of the machines themselves. If a machine fails, it will affect the media streaming service on the entire link.

2. Algorithm introduction

Anomaly detection for the machine quality of the streaming service follows basically the same idea as transmission quality detection on the large network. The stream-pushing service sits at the end of large network transmission. In data processing, we filter for RTSC-related business scenarios and shift the object of interest from the sender to the receiver. This gives us a large amount of RTSC equipment room transmission quality data on which the algorithm performs anomaly detection.

The cloud recording service involves connections from gateways to edge nodes. Large-scale errors on these links often mean that the equipment rooms or machines behind some gateways or edge nodes are unavailable, which affects the quality of the cloud recording service.

The quality of the cloud recording service is mainly reflected in the number of failed connections from gateways to edge nodes. The business has a relatively clear threshold for this number, so alarms can be driven by the traditional threshold method. However, because the timing and scale of anomalies are unpredictable, traditional methods may fail to respond in time or to locate the source of errors accurately.

The algorithm team and the business side worked together to create the RTSC-AIOps process. This process is centered on a graph algorithm combined with business logic and can quickly locate abnormal machines in an equipment room. It has now completely taken over the disable/enable process for cloud recording edge nodes, detecting and handling anomalies within one minute with an accuracy of more than 95%. This saves more than half of the manpower, effectively improves the efficiency of RTSC operation and maintenance, and keeps the business running stably.
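
The article does not say which graph algorithm is used. As a simple stand-in, the sketch below treats gateways and edge nodes as the two sides of a bipartite graph and flags any node whose incident links are mostly failing. The node names, the `links` data, and the `min_ratio` and `min_links` parameters are hypothetical.

```python
from collections import defaultdict


def suspect_nodes(connections, min_ratio=0.8, min_links=2):
    """Return nodes (gateways or edge nodes) whose incident links mostly fail.

    `connections` maps (gateway, edge_node) -> True if that link is erroring.
    Gateway and edge node names are assumed to be distinct.
    """
    failed = defaultdict(int)
    total = defaultdict(int)
    for (gateway, edge_node), is_bad in connections.items():
        for node in (gateway, edge_node):
            total[node] += 1
            failed[node] += int(is_bad)
    return sorted(
        node for node in total
        if total[node] >= min_links and failed[node] / total[node] >= min_ratio
    )


# Hypothetical link states: every link into edge2 is failing.
links = {
    ("gw1", "edge1"): False, ("gw1", "edge2"): True, ("gw1", "edge3"): False,
    ("gw2", "edge1"): False, ("gw2", "edge2"): True, ("gw2", "edge3"): False,
    ("gw3", "edge1"): False, ("gw3", "edge2"): True, ("gw3", "edge3"): False,
}

print(suspect_nodes(links))
```

A plain count of failed connections would only show that three links are down. Looking at the graph shows that each gateway fails on just one of its three links while `edge2` fails on all of its links, which points to `edge2` as the node to disable.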

04 Conclusion

This article introduced the fast and accurate automatic operation and maintenance service that Shengwang's big data algorithm team has built in close cooperation with other teams: driven by AI, supported by big data, and guided by business needs.

In the era of intelligence, the explosive growth of information has left traditional operation and maintenance, decision-making, analysis, and services unable to keep up, and algorithms exist to solve these problems. Training an algorithm relies on high-quality information providers; the algorithm is a summary and extension of their experience and offers a "god's-eye view" of the overall situation.

As algorithms are deployed in more and more scenarios, Shengwang will devote more energy to exploring unknown fields, using the complementarity of AI and human expertise to provide developers and users with more stable, higher-quality products and services.

[1] "Gartner Says Algorithmic IT Operations Drives Digital Business", https://www.gartner.com/en/newsroom/press-releases/2017-04-11-gartner-says-algorithmic-it-operations-drives-digital-business

About Dev for Dev

Dev for Dev stands for Developer for Developer. The column is an interactive innovation and practice program for developers, jointly initiated by Shengwang and the RTC developer community.

Through technology sharing, exchange of ideas, and joint project building from the engineer's perspective, it brings developers together, uncovers and delivers the most valuable technical content and projects, and gives full play to the creativity of technology.

