Summary: Enterprise operation and maintenance faces new requirements and challenges. Let's see how Huawei AIOps addresses them!
This article is shared from the HUAWEI Cloud Community: "[Cloud Resident Co-creation] AIOps? The new power of enterprise operation and maintenance!", original author: Qiming.
As is customary, let us first introduce the concept of AIOps. AIOps, short for Artificial Intelligence for IT Operations, means intelligent operation and maintenance: it applies artificial intelligence to the operation and maintenance field and uses machine learning on existing operation and maintenance data (logs, monitoring information, application information, etc.) to solve problems that automated operation and maintenance cannot.
Gartner predicts that current IT applications will change drastically, and the way the entire IT ecosystem is managed will also change. The key to these changes is what Gartner calls the AIOps platform.
Today we discuss the requirements and challenges behind AIOps, and how we deal with them.
AIOps needs and challenges
(1) New technologies and new challenges call for highly intelligent telecommunications networks
In recent years, new technologies represented by 5G have been rapidly applied in telecommunication networks. They have brought many benefits, such as massive connectivity, low latency, and high speed. 5G has improved each of these metrics by at least an order of magnitude.
However, this jump in scale comes with greater operation and maintenance difficulty, which brings the following challenges:
1. Network complexity
The jump in scale has made the network more complex: new technologies are applied rapidly, but old technologies are not retired at the same pace. As a result, every new technology adds to the existing complexity, and in some scenarios multiplies it.
For example, in the wireless field, 2G/3G/4G/5G form "four generations in the same house"; in the core network, about ten domains such as PS/CS/MS and Internet of Things coexist. Such high network complexity is bound to bring considerable challenges to operation and maintenance.
2. New 2B demands
The second operation and maintenance challenge is the new To B scenario, that is, enterprise applications. 5G has promoted intelligent manufacturing, and the network is gradually being integrated into enterprises' production and manufacturing processes. Requirements for network reliability inevitably rise: once the network has a problem, production may be affected or even interrupted, and the resulting losses can be very large.
3. Cost pressure
Cost pressure mainly follows from the first two challenges, which leave us facing either a more complex network or stricter requirements. If we handle them with traditional operation and maintenance methods, costs will inevitably rise sharply. Another factor in rising costs is energy consumption: 5G consumes much more energy than 4G.
How do we deal with these challenges? AI technology is the key.
(2) AI is a key technology to enhance the automation and intelligence of telecommunications networks
In terms of operation and maintenance costs, statistics show that 90% of operation and maintenance work requires manual participation, and 70% of the costs are labor costs. A natural idea is to use AI technology to reduce labor costs and improve operation and maintenance efficiency.
Take the 5G energy consumption issue just mentioned: can we use artificial intelligence to reduce energy consumption? Judging from past practical experience, the answer is yes.
Next, we use three examples to illustrate.
1. Base station energy saving
The first example is base station energy saving. Base stations consume a lot of energy. In the early stage of network deployment, a base station has few users and is often nearly idle. In response, operators predict the traffic volume: if traffic can be predicted accurately, a certain number of carriers can be switched off when traffic is low, which saves energy. According to statistics, using an LSTM neural network for traffic prediction can achieve energy savings of more than 10%.
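The idea of "predict traffic, then switch off carriers when traffic is low" can be sketched in a few lines. This is only an illustration under stated assumptions: the predictor here is a simple seasonal average standing in for the LSTM mentioned above, and the function names, the per-carrier capacity, and the 20% headroom are all hypothetical values chosen for the example.

```python
import math
from statistics import mean

def forecast_next_day(hourly_traffic, period=24):
    """Seasonal-average forecast: for each hour of the day, average that hour
    over all complete past days. A simple stand-in for an LSTM predictor."""
    days = [hourly_traffic[i:i + period]
            for i in range(0, len(hourly_traffic), period)]
    days = [d for d in days if len(d) == period]  # ignore an incomplete last day
    return [mean(day[h] for day in days) for h in range(period)]

def carriers_to_keep(forecast, capacity_per_carrier, total_carriers, headroom=1.2):
    """For each forecast hour, keep just enough carriers to serve the predicted
    load plus a safety headroom; at least one carrier always stays on."""
    plan = []
    for load in forecast:
        needed = math.ceil(load * headroom / capacity_per_carrier)
        plan.append(min(total_carriers, max(1, needed)))
    return plan

# Hypothetical traffic pattern: quiet night, busy day, moderate evening.
day = [10] * 6 + [80] * 12 + [30] * 6
history = day * 3  # three identical days of hourly samples

forecast = forecast_next_day(history)
plan = carriers_to_keep(forecast, capacity_per_carrier=50, total_carriers=4)

always_on = 4 * 24         # carrier-hours if nothing is ever switched off
with_plan = sum(plan)      # carrier-hours under the plan
print(plan)
print(f"carrier-hours: {with_plan} instead of {always_on}")
```

The savings in practice depend on how accurate the forecast is, which is why a stronger predictor matters: an underestimate risks congestion, while an overestimate leaves carriers on unnecessarily.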
2. Core network KPI anomaly detection
The second example is anomaly detection. KPI anomaly detection services are deployed in the operator's core network. The original anomaly detection service uses fixed thresholds for alarm notification. AI technology, by contrast, can identify anomalies more intelligently, promptly, and accurately.
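To make the contrast with fixed thresholds concrete, here is a minimal sketch that compares a fixed lower limit with a rolling z-score detector. The rolling z-score is a deliberately simple statistical stand-in, not the algorithm used in the service described here; the function names, window size, limits, and sample data are all assumptions made for illustration.

```python
from statistics import mean, stdev

def fixed_threshold_alarms(series, lower_limit):
    """Classic rule: alarm whenever the KPI drops below a fixed limit."""
    return [i for i, value in enumerate(series) if value < lower_limit]

def rolling_zscore_alarms(series, window=12, z_limit=3.0):
    """Adaptive rule: alarm when a point deviates from the mean of the
    preceding window by more than z_limit standard deviations."""
    alarms = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        sigma = sigma or 1e-9  # avoid division by zero on a flat history
        if abs(series[i] - mu) / sigma > z_limit:
            alarms.append(i)
    return alarms

# Hypothetical success-rate KPI (%): stable around 99, with a dip to 97 at index 12.
kpi = [99.0, 99.2, 98.9, 99.1, 99.0, 99.2, 98.8, 99.1,
       99.0, 98.9, 99.1, 99.0, 97.0, 99.0, 99.1]

fixed = fixed_threshold_alarms(kpi, lower_limit=95.0)
adaptive = rolling_zscore_alarms(kpi)
print("fixed threshold:", fixed)
print("rolling z-score:", adaptive)
```

In this sample the dip to 97 never crosses the fixed limit of 95, so the fixed rule stays silent, while the adaptive rule flags it because it is far outside the recent normal range.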
3. Fault identification and root cause location
Usually, once a fault occurs on the network, a large number of alarms are triggered, and the system dispatches operation and maintenance work orders based on them. If multiple network elements report alarms for the same fault, duplicate dispatch occurs: a single failure may end up generating work orders in several domains (the wireless domain, the transmission domain, and so on). Identifying the fault behind a burst of alarms and locating its root cause is what allows one order to be dispatched instead of many.
(3) The development of AI applications still faces challenges: high development threshold and long cycle
From the above three examples, we can see that AI is relatively reliable. But if AI is so reliable, why hasn't it been applied fully and quickly? Because developing AI applications still faces many challenges, which can be summed up in one phrase: a high threshold and a long cycle.
The picture above is a research report by Gartner. It analyzes the main obstacles to AI applications from four dimensions. The three main points are:
- Personnel skills
- Understanding of benefits and uses
- Data scope and quality
This brings us back to the phrase above: a high threshold and a long cycle.
1. High threshold
The first aspect of the high threshold is the lack of AI algorithm developers. A typical operation and maintenance team does not staff dedicated AI algorithm developers, which inevitably leaves a gap in AI skills.
But this is not the most critical issue, because the shortage of AI personnel can be addressed through training, recruitment, and other means.
The most critical issue is the second aspect: combining algorithms with the business is difficult. To build a good application, it is best to start from the business and choose an algorithm that fits the actual business situation. In practice, this requires a business expert with a deep understanding of operation and maintenance and an algorithm expert proficient in AI. The two also need enough time and willingness to sit down for an in-depth exchange, and time and willingness often become the obstacles.
The third aspect is data, which raises two problems: engineering and labeling. Developing an AI application is a considerable engineering effort: it has to ingest massive multi-modal data, complete model training and inference, and finally present the results, including interfacing with some existing systems. So in addition to the operation and maintenance experts and algorithm experts mentioned above, many engineering developers are also needed.
2. Long cycle
The high development threshold leads to a long development cycle. If the threshold cannot be overcome, the cycle will inevitably be very long. A long development cycle has two consequences:
First, doubts about benefits and uses. If results do not appear for a long time, corporate decision-makers may doubt the value of AI;
Second, the longer the project takes, the higher the expectations for it. Suppose the same result is achieved, for example a 5% reduction in fault repair time: it will be judged very differently if it took two years than if it took one month.
In response to the challenges encountered when putting AIOps into practice, Huawei launched the AIOps service. Now let's look at what the AIOps service is and how it addresses these challenges.
Huawei AIOps Service
The picture above is the overall framework of the AIOps service. AIOps is divided into four layers from bottom to top:
First layer: data collection and management. Data collection and management sounds easy but is hard to do well. Why? Because there are many kinds of data, and neither the interfaces nor the data types are uniform, so merely adapting to the data can be exhausting. Huawei's AIOps service supports common interfaces, ships with presets for some common equipment, and ultimately aims for automatic interconnection and automatic data management.
Second layer: AI atomic capabilities. Huawei AIOps has more than 20 atomic capabilities covering four scenarios: detection, prediction, identification, and diagnosis. An atomic capability is not just an implementation of an AI algorithm. Each one has been verified with real site data and optimized for specific operating scenarios. Each one also incorporates Huawei's accumulated operation and maintenance experience, and some can even be used directly without training.
Third layer: orchestration capabilities. This layer includes process orchestration, large-screen orchestration, and RPA orchestration. Atomic capabilities are the basic building blocks of AIOps. Process orchestration is simple and flexible: you only need to drag components from the component library and combine them with the AI operation and maintenance capabilities to complete the end-to-end graphical orchestration of a scenario. This lowers the development threshold for partners and lets them build AI applications efficiently.
Fourth layer: industry AI apps. These work out of the box for the most typical scenarios. The layer provides rich 2D and 3D visualization components, including more than 30 chart controls (line, topology, list, column, and other styles) as well as map, interactive, and media controls. To build a large operation and maintenance screen, you drag the controls you need from the component library, lay them out freely, and customize each control's style, data, and interaction to configure the reports the application needs for monitoring and analysis. For example, a DIY microservice health monitoring screen can show the average interface success rate, average interface delay, interface failure rate, and number of interface calls, together with a KPI alarm list that gives operators a reference for early fault warning. The back-end data can come from the intermediate data defined during app orchestration. After configuration, you can preview and publish the screen with one click.
(1) RPA helps AIOps connect with existing operation and maintenance systems
Besides being displayed, inference results must also help with failure recovery. At this stage, that generally means interfacing with existing systems, such as the work order system (where staff process orders from a work order mailbox), automatic replies, and problem tickets. Doing this integration manually is time-consuming, laborious, and error-prone, so robotic process automation (RPA) is a natural fit. The RPA service can handle data interconnection, data transfer, and work order issuance, which reduces manpower and the cost of errors.
(2) 10+ out-of-the-box apps that support rapid deployment
For the most typical scenarios, Huawei Cloud AIOps has done the orchestration in advance and provides more than ten out-of-the-box apps. Full scenario coverage: campus networks, DC networks, IT applications, operator networks, and more. Flexible deployment: public cloud, HCS, on-premises, and cloud-ground collaboration are supported. Open ecosystem: partners can develop industry apps and publish AI applications to the AI market, cooperating for mutual benefit to build a network AI ecosystem.
Below we use the "KPI Anomaly Detection" App to demonstrate how to use an out-of-the-box App.
Step 1: Import the list of network elements;
Step 2: Configure performance and alarm data sources;
Step 3: Associate the data source to the App;
Step 4: Start the App;
Step 5: Check the big screen and analyze the fault.
AIOps enables intelligent operation and maintenance of campus networks
So how does AIOps solve real operation and maintenance problems on a campus network?
(1) Campus network construction and maintenance mode
The above picture shows the two construction and maintenance modes of the campus network:
2B and 2C share the OMC of the large network: this is the current mainstream model. The enterprise rents the operator's wireless equipment and some other equipment. One problem with this model is that the terminals are maintained by the enterprise while the network is maintained by the operator, so responsibilities are hard to separate when a problem occurs. Another problem is that the operator's operation and maintenance capabilities and organization are built around the O domain of the large 2C network, which makes it hard to support the high SLA of an enterprise intranet and customers' growing demands.
2B and 2C use separate OMCs (EMS): the enterprise purchases and maintains the 5G CPE, wireless, core network, and other equipment itself, and has an end-to-end view. Judging from the documents issued by the Ministry of Industry and Information Technology, cases such as VDF and Audi Park, and enterprises' SLA guarantee needs, enterprises building their own 5G networks on spectrum rented from operators or on dedicated spectrum will gradually become mainstream.
(2) Business scenario and pain point analysis: Park customers need easy-to-use, multi-domain integrated network operation and maintenance
1. Typical network status
The picture above is a common video detection service in a park. We can see that even for the most common business, about a dozen network elements will participate in it, from 5G wireless to transmission to edge computing, and even the core network.
2. Park application
The above figure lists some common applications in the park, including edge AI detection, intelligent logistics, indoor positioning, etc. All these businesses are actually similar to the previous picture, that is, any simple business involves the participation of multiple domains.
So what is the difference between the park and the operator's operation and maintenance? There are three main points:
Users: they lack professional communication knowledge and have weak network operation and maintenance capabilities;
Network: the network is relatively simple but spans multiple domains, such as wireless, transmission, data communication, and IT;
SLA: the end-to-end SLA requirements for the production network are high: 7x24 hours, 99.99%.
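To see how demanding the 99.99% figure is, it helps to convert it into a downtime budget. The short calculation below is a worked example, assuming a 365-day year and a 30-day month; the function name is made up for the illustration.

```python
def allowed_downtime_minutes(availability_percent, period_hours):
    """Downtime budget in minutes for a given availability over a period."""
    return period_hours * 60 * (1 - availability_percent / 100)

per_year = allowed_downtime_minutes(99.99, 365 * 24)   # about 52.6 minutes
per_month = allowed_downtime_minutes(99.99, 30 * 24)   # about 4.3 minutes

print(f"99.99% over a year:  {per_year:.2f} minutes of downtime")
print(f"99.99% over a month: {per_month:.2f} minutes of downtime")
```

With a budget of only a few minutes per month, a single fault that takes an hour to locate is enough to break the SLA.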
Therefore, a customer who operates and maintains a campus network faces the following pain points:
Skills: 5G 2B makes the network more complicated, and enterprise engineers lack the relevant skills, which makes operation and maintenance difficult;
Tools: effective operation and maintenance tools are lacking. Locating complex network problems requires on-site consultation with cross-domain experts, which is costly and time-consuming.
In summary, cross-domain campus network equipment needs data integration that supports end-to-end analysis and presentation, leading to unified operation and maintenance of the enterprise ICT infrastructure. The campus network involves many devices with blurred boundaries, so it needs a unified cross-domain fault demarcation and location capability to speed up the location of production network problems.
(3) Traditional manual and tool-based operation and maintenance cannot meet the new needs of park networks, and there is an urgent need for intelligent transformation
According to the data in the above figure, we can see:
Passive operation and maintenance: 75% of problems are discovered by users rather than detected proactively, and users who discover a problem are likely to complain;
Low degree of automation: enterprise operation and maintenance costs are dominated by labor costs, and costs have risen sharply;
Troubleshooting difficulties: 90% of the failure recovery time is spent locating the problem, and the actual repair takes very little time.
So whether we look at efficiency or effectiveness, there is a clear demand to introduce artificial intelligence and enable an automated closed loop of prediction, analysis, and decision-making in network operation and maintenance.
(4) Flow of cross-domain fault location algorithm
The above figure is the algorithm flow of cross-domain fault location. The whole process is as follows:
Input:
- Alarm: the alarms reported by the devices;
- Topo: the network topology structure;
- Fault propagation diagram: the influence relationships between alarms.
Process:
- Noise reduction: filter out the large number of invalid alarms among the original alarms, such as transient (flash) and oscillating alarms;
- Aggregation: divide the alarms, separating alarms that are unrelated in the Topo and aggregating alarms that may be related (belonging to the same fault) to obtain multiple alarm groups;
- Identify and locate: analyze each alarm group in combination with the Topo and the fault propagation diagram, and identify how many faults each alarm group contains, along with the root cause network element and root cause alarm of each fault;
- Diagnosis: diagnose the fault type for each fault, for example: power interruption.
Output:
- Root cause of failure
- Alarms involved in the failure
- Fault type
- Failure recovery suggestions
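The four stages above can be sketched as a small pipeline. This is a simplified illustration, not the service's actual algorithm: it assumes one fault per alarm group, treats repeated alarms as noise, and uses made-up alarm types, network element names, and a hypothetical fault catalog.

```python
from collections import defaultdict, deque

def denoise(alarms):
    """Keep the earliest alarm per (network element, type); repeats count as flapping."""
    seen, kept = set(), []
    for alarm in sorted(alarms, key=lambda a: a["time"]):
        key = (alarm["ne"], alarm["type"])
        if key not in seen:
            seen.add(key)
            kept.append(alarm)
    return kept

def aggregate(alarms, topo_edges):
    """Group alarms whose network elements are directly linked in the topology."""
    alarming = {a["ne"] for a in alarms}
    adj = defaultdict(set)
    for u, v in topo_edges:
        if u in alarming and v in alarming:
            adj[u].add(v)
            adj[v].add(u)
    groups, visited = [], set()
    for start in sorted(alarming):
        if start in visited:
            continue
        component, queue = set(), deque([start])
        visited.add(start)
        while queue:  # breadth-first search over alarming neighbours
            ne = queue.popleft()
            component.add(ne)
            for nxt in adj[ne] - visited:
                visited.add(nxt)
                queue.append(nxt)
        groups.append([a for a in alarms if a["ne"] in component])
    return groups

def locate_root(group, propagation):
    """Root alarm: one whose type is not an effect of another alarm type in the group."""
    effects = set()
    for alarm in group:
        effects |= propagation.get(alarm["type"], set())
    roots = [a for a in group if a["type"] not in effects]
    return roots[0] if roots else group[0]

def locate_faults(alarms, topo_edges, propagation, fault_catalog):
    results = []
    for group in aggregate(denoise(alarms), topo_edges):
        root = locate_root(group, propagation)
        fault_type, suggestion = fault_catalog.get(
            root["type"], ("unknown", "escalate to an expert"))
        results.append({"root_ne": root["ne"], "root_alarm": root["type"],
                        "alarms": [(a["ne"], a["type"]) for a in group],
                        "fault_type": fault_type, "suggestion": suggestion})
    return results

# Hypothetical inputs: alarms, Topo, fault propagation diagram, fault catalog.
alarms = [
    {"ne": "transport-1", "type": "POWER_LOSS", "time": 1},
    {"ne": "gnb-1", "type": "LINK_DOWN", "time": 2},
    {"ne": "gnb-1", "type": "LINK_DOWN", "time": 3},   # flapping repeat
    {"ne": "core-1", "type": "HIGH_CPU", "time": 5},   # unrelated alarm
]
topo = [("gnb-1", "transport-1"), ("transport-1", "router-1"), ("router-1", "core-1")]
propagation = {"POWER_LOSS": {"LINK_DOWN"}}
catalog = {"POWER_LOSS": ("power interruption", "check the site power supply")}

results = locate_faults(alarms, topo, propagation, catalog)
for r in results:
    print(r)
```

Here the power loss on the transmission device and the link-down alarm on the base station end up in one group with a single root cause, so one work order would be enough, while the unrelated core network alarm is reported separately.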
(5) AIOps framework implementation algorithm flow
The above explained the entire algorithm flow. Next, let's take a look at how to use the Huawei AIOps framework to implement the algorithm flow.
1. Quickly configure data sources and orchestrate processes
Configure the data source: access alarms from multiple domains such as wireless, transmission, and core networks, and access network topology data;
Process orchestration: use the existing, commonly used atomic capabilities to orchestrate the process quickly.
After the above process, the "event notification" function is complete, and the results can be saved to the record set (i.e., the database) for large-screen display. The renderings are as follows:
Open one of the alarms, and you can see the following information:
AIOps deployment recommendations
Based on the aforementioned practice, we can summarize the following:
1. Select mature scenarios and deploy AIOps step by step
After long-term practice, we have summarized the main reasons why AIOps deployments fail:
Data cannot be collected: data is scattered across independent systems, and there is no comprehensive way to collect and manage it. Missing data and low data quality are the main reasons for poor AIOps performance;
Commands cannot be issued: automated operation and maintenance tools are lacking, so active detection and recovery operations cannot be performed;
The model is not intelligent: annotation information from daily operation and maintenance is not accumulated effectively, so the model cannot learn by itself.
Therefore, based on these failure causes, a successful AIOps deployment requires that we:
- Start from mature scenarios where the conditions are in place, and advance the deployment step by step;
- Make data collectable: collect all kinds of operation and maintenance data comprehensively and improve data quality;
- Make commands issuable: connect the AIOps back end to existing automated operation and maintenance tools, and strengthen diagnosis methods and automatic recovery capabilities;
- Accumulate labeled data effectively, so that the AIOps model continuously receives feedback and can learn by itself.
2. Choose mature AIOps services
For different types of enterprises, the selection of AIOps services is also different, as shown in the following table:
Huawei's AIOps service lowers the threshold for network AI application development and accelerates the implementation of network AI applications. It has accumulated 10+ out-of-the-box smart apps covering application areas such as operator networks, campus networks, data center networks, and IT applications. It pre-integrates rich AI atomic capabilities covering fault prediction, detection, diagnosis, identification, and other steps, and it supports users in developing AI applications with zero coding to improve operation and maintenance efficiency.