Introduction to Network Diagnostic Tool SreCli-Net
1. Background
The SRE operations and maintenance team is committed to improving efficiency through automation, driving the step-by-step shift toward intelligent operations, and solving the pain points of traditional operations and maintenance. Although traditional operations and maintenance has a complete system in place, its methods vary, and the operations themselves are complicated and time-consuming. How to improve operations efficiency for hybrid cloud projects, and how to increase the added value of operations and customer satisfaction, remain difficult problems for us.
The main challenges are as follows:
- As customer services develop and evolve rapidly, the lag of traditional operations and maintenance is magnified
As customer services grow and business models keep evolving, the volume of business data increases year by year. This brings both opportunities and challenges to operations and maintenance. Operations staff must consider how to keep data in the cloud, and business interactions inside and outside the cloud, running stably, securely, and efficiently.
- Each system of the platform is complex to operate, and the learning cost of operations and maintenance keeps rising
With the rapid iteration of cloud platform and cloud product versions, it is increasingly difficult to become familiar with the platform. As product versions change and new functions appear, the learning cost for newcomers rises, and so does the difficulty of mastering the platform's various operation and maintenance procedures. Learning alone cannot fundamentally solve the problem of quickly building up operations capability. All of this can trigger a chain of "butterfly effects" and even lead to high project risk or P-level failures, which directly affect customers' normal use of cloud services.
- The skill level of operations and maintenance personnel is uneven, and the operations are complicated
Current operation and maintenance methods suffer from major problems: judgment based on personal experience, numerous manual operations on the platform, inefficient problem handling, and slow emergency response to faults. Because the system is complex, technicians waste a lot of time giving guidance on basic matters such as logging in to machines and using tools. After logging in, operators face inconsistent command syntax for create, delete, update, and query operations. Over time, this drain leaves on-site operations staff exhausted and unable to concentrate on online operations. With inexperienced on-site staff or customers in particular, it often happens that the target machine cannot be found or a command is typed incorrectly, which makes overall operations inefficient and introduces frequent security risks.
Given these three kinds of problems, concerning customers, platforms, and operations and maintenance itself, the main tasks at present are to improve operations efficiency and to reduce the learning cost for operations staff. Against this background, the SRE-CLI tool was launched. It supports shell functions, command completion, problem diagnosis, and fault hemostasis (stopping the damage quickly once a fault occurs), and is intended to address these problems step by step.
2. Basic introduction to SRE-CLI
The SRE Command Line Interface (SRE CLI) is an operations and maintenance tool that lets you run commands in a command-line shell to operate and maintain the hybrid cloud. With minimal configuration, you can use the SRE CLI to carry out the complex commands of daily operations from the command prompt of a terminal program. The tool distills the experience that veteran experts (the "old Chinese doctors") have accumulated in day-to-day problem handling and emergency fault response, and integrates it into the hybrid cloud as a command-line tool, so that complicated operation and maintenance tasks can be performed with simple commands and without further configuration.
The CLI interaction capability model consists of four parts: the access layer, the interaction layer, the back end, and the infrastructure. After logging in to SRE CLI, the end user enters the interaction layer and completes a given action by selecting the corresponding scenario command and auxiliary functions. The action calls the back-end tool capabilities and the data in the data sources, and the computation is performed by the infrastructure layer. The computation and diagnosis results are output directly to the terminal's command-line ("black screen") interface, which completes one full interaction, as shown in the figure below.
Figure 1
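
The interaction flow described above can be sketched roughly as follows. This is only an illustrative sketch, not the tool's actual implementation: every function name, the `Request` type, the `demo-switch` target, and the packet-loss samples are hypothetical, and the "computation" is a simple average standing in for whatever the infrastructure layer really does.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Request:
    """A parsed command line (hypothetical structure)."""
    user: str
    command: str
    args: List[str]


def access_layer(user: str, raw_line: str) -> Request:
    """Access layer: accept a logged-in user's input and split it into a request."""
    parts = raw_line.split()
    if not parts:
        raise ValueError("empty command line")
    return Request(user=user, command=parts[0], args=parts[1:])


def infrastructure_compute(samples: List[float]) -> float:
    """Infrastructure: stand-in for the computation step (here, an average)."""
    return sum(samples) / len(samples)


def backend_diag(args: List[str]) -> str:
    """Back end: read a data source, then ask the infrastructure to compute."""
    data_source = {"demo-switch": [0.0, 0.0, 1.5]}  # made-up loss samples (%)
    target = args[-1] if args else ""
    if target not in data_source:
        return f"no data for target '{target}'"
    loss = infrastructure_compute(data_source[target])
    return f"{target}: average loss {loss:.1f}%"


# Interaction layer lookup table: command name -> back-end capability.
BACKEND_TOOLS: Dict[str, Callable[[List[str]], str]] = {"ali_diag": backend_diag}


def interaction_layer(request: Request) -> str:
    """Interaction layer: route the selected command to a back-end tool."""
    tool = BACKEND_TOOLS.get(request.command)
    if tool is None:
        return f"unknown command: {request.command}"
    return tool(request.args)


def run(user: str, raw_line: str) -> str:
    """One full interaction: access -> interaction -> back end -> result text."""
    return interaction_layer(access_layer(user, raw_line))


if __name__ == "__main__":
    # The result text is what would be printed to the terminal.
    print(run("sre", "ali_diag network ping switch demo-switch"))
```

Keeping the layers as separate functions mirrors the four-part split in the text: new back-end capabilities can be registered in the lookup table without touching the access or interaction code.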
- Problem diagnosis (ali\_diag)
High-frequency operations are distilled from service orders, work orders, and trouble tickets, and common operations, problems, and failure points are turned into atomic items. In daily operations, you can query product atomic items, problem points, and fault points, and quickly query key indicators to locate the problem.
Figure 2
- Scenario diagnosis (ali\_scene)
Troubleshooting approaches accumulated for specific fault scenarios are captured and delivered in the form of "three sharp axes" (a small set of go-to checks) to locate the problem accurately. On this basis, the fault points are assembled and the fault is located precisely.
Figure 3
- Emergency hemostasis (ali\_cure)
Proven recovery measures for real failures and risks are captured here. Once a failure has occurred and the solution has been determined, fast recovery is required. Recovery actions include restarting, degrading, rate limiting, and switching over, which help customers restore their business quickly.
- Daily query (ali\_query)
Day-to-day queries for displaying related data and obtaining common information. Using precise query methods, you can look up the product, routing, capacity, policy, and other information that corresponds to an IP address in the cloud. Queries currently cover the various IP dimensions of the physical network.
- Intelligent traffic capture (ali\_trace)
Provides the CLI's ability to capture packets at various points in the cloud platform. Customized packet capture command combinations let you reach the capture point quickly and capture network traffic in the specified inbound or outbound direction. Two types are covered: classic network packet capture and VPC network packet capture.
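
To make the idea of a direction-specific capture concrete, the sketch below assembles a plain `tcpdump` command line for one host and one direction. It does not describe how ali\_trace works internally, which the text does not specify; `tcpdump` is swapped in purely for illustration, and the function name, interface name, and IP address are assumptions.

```python
import shlex
from typing import List


def build_capture_command(interface: str, host_ip: str, direction: str,
                          packet_count: int = 100) -> List[str]:
    """Build a tcpdump argument list for one host and one traffic direction.

    Illustrative only: this is not the ali_trace implementation.
    """
    if direction not in ("in", "out"):
        raise ValueError("direction must be 'in' or 'out'")
    return [
        "tcpdump",
        "-i", interface,           # capture on this interface
        "-nn",                     # do not resolve host names or port names
        "-Q", direction,           # direction filter; support depends on platform/libpcap
        "-c", str(packet_count),   # stop after this many packets
        "host", host_ip,           # capture filter: traffic to or from this address
    ]


if __name__ == "__main__":
    # "eth0" and 192.0.2.10 (a documentation address) are placeholders.
    print(shlex.join(build_capture_command("eth0", "192.0.2.10", "in")))
```

Returning an argument list rather than a single string keeps the command safe to pass to a process launcher without shell quoting problems.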
3. Cli-Net concept
Cli-Net is a branch of the CLI system that is mainly responsible for diagnosing and troubleshooting the physical network of the hybrid cloud. Using a unified command format, it diagnoses specific aspects of the physical network environment. Cli-Net covers four aspects of the hybrid cloud physical network: performance diagnosis of general network equipment in the cloud, status diagnosis of the cloud boundary network, status diagnosis of the in-cloud network, and network status diagnosis of physical machines. It involves the operating status of the physical machines and switch networks of all products in the cloud, as well as troubleshooting and diagnosis of access from the Internet and IDC networks outside the cloud to the in-cloud network. The specific diagnosis coverage is shown in the following table.

| Cli-Net | General network equipment performance diagnosis | Cloud boundary network status diagnosis | In-cloud network status diagnosis | Physical machine network status diagnosis |
| --- | --- | --- | --- | --- |
| ISW | ● | ● | | |
| DSW | ● | | ● | ● |
| CSW | ● | ● | | |
| LSW | ● | | ● | ● |
| ASW | ● | | ● | ● |
| | input | input | input | input |

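
The coverage table above can also be read as a small lookup structure. The following sketch transcribes the marked cells into Python so that coverage can be queried in either direction. The area keys and function names are hypothetical labels chosen for this example; only the switch roles and the marked cells come from the table.

```python
from typing import List

# Column order follows the coverage table; the key names are illustrative.
AREAS = (
    "general_device_performance",
    "cloud_boundary_status",
    "in_cloud_status",
    "physical_machine_status",
)

# True where the table marks the cell for that switch role.
COVERAGE = {
    "ISW": (True, True, False, False),
    "DSW": (True, False, True, True),
    "CSW": (True, True, False, False),
    "LSW": (True, False, True, True),
    "ASW": (True, False, True, True),
}


def areas_for(switch_role: str) -> List[str]:
    """Return the diagnosis areas that cover the given switch role."""
    flags = COVERAGE[switch_role.upper()]
    return [area for area, covered in zip(AREAS, flags) if covered]


def roles_for(area: str) -> List[str]:
    """Return the switch roles covered by the given diagnosis area."""
    index = AREAS.index(area)
    return [role for role, flags in COVERAGE.items() if flags[index]]


if __name__ == "__main__":
    print(areas_for("ISW"))
    print(roles_for("cloud_boundary_status"))
```
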
| Chinese name | CLI English name | Meaning |
| --- | --- | --- |
| Single-machine self-check function | device_check | Checks the health of each switch, including its interfaces and routing, then determines and outputs the abnormal items of the network device itself. |
| Core network direction diagnosis | core_network | Checks all interconnected physical paths involving cloud servers, either the overall routing status or a designated physical machine, then determines and outputs the abnormal network items. |
| Dedicated line direction diagnosis | private_direction | Checks the overall status of the physical network involved in the network access of all instance-level VPC users (including resource VPCs), then determines and outputs the abnormal network items. |
| Public network direction diagnosis | internet_direction | Checks the overall condition of the physical network involved in Internet access for all resources in the cloud, then determines and outputs the abnormal network items. |
| Physical virtual direction diagnosis | physics_virtual | Checks the overall condition of the physical network between classic network resources (including all cloud service resources), then determines and outputs the abnormal network items. |

| Scenario | Troubleshooting command | Command result |
| --- | --- | --- |
| Whole-room power outage | ali_diag network ping project {product name} | Checks whether the connectivity of the physical machines in each cluster in the cloud is normal |
| | ali_diag network ping switch {name} | Checks whether in-cloud switch connectivity is normal |
| | ali_diag network hardware power {switch} | Checks whether the power supply status of each switch is normal |
| | … {switch} | Checks the BGP routing status of the switch |
| | ali_scene network device_check | Switch hardware self-check |
| ECS cannot be accessed from outside the cloud | ali_scene network internet_direction | Checks for network problems in the public network direction |
| | ali_scene network private_direction | Checks for link problems in the dedicated line direction |
| Base fails to access the data source | ali_scene network core_network | Device network connectivity check |
| | ali_scene network physics_virtual | Comprehensive access network check |
| Physical machine fails to come online | ali_scene network core_network | Physical machine network check |
| | ali_diag network route bgp {switch} | In-cloud BGP network status check |
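
The scenario table lends itself to a simple runbook lookup: given a fault scenario, list the commands to run in order. The sketch below is one possible way to encode it, not part of the tool. The scenario keys, the `{product}` and `{switch}` placeholder names, and the function name are assumptions made for this example; the command strings are copied from the table, and the one row whose command is not legible is left out.

```python
from typing import Dict, List

# Scenario keys are hypothetical labels; command templates come from the table.
RUNBOOK: Dict[str, List[str]] = {
    "room_power_outage": [
        "ali_diag network ping project {product}",
        "ali_diag network ping switch {switch}",
        "ali_diag network hardware power {switch}",
        "ali_scene network device_check",
    ],
    "ecs_unreachable_from_outside": [
        "ali_scene network internet_direction",
        "ali_scene network private_direction",
    ],
    "base_data_source_unreachable": [
        "ali_scene network core_network",
        "ali_scene network physics_virtual",
    ],
    "physical_machine_online_failure": [
        "ali_scene network core_network",
        "ali_diag network route bgp {switch}",
    ],
}


def commands_for(scenario: str, **params: str) -> List[str]:
    """Return the command lines for a scenario with placeholders filled in.

    Raises KeyError if the scenario is unknown or a placeholder value is missing.
    """
    return [template.format(**params) for template in RUNBOOK[scenario]]


if __name__ == "__main__":
    # "ASW-1" is a made-up switch name used only for the demonstration.
    for line in commands_for("physical_machine_online_failure", switch="ASW-1"):
        print(line)
```

Keeping the commands as data rather than code means a new scenario can be added by extending the dictionary, which matches the table-driven way the scenarios are documented.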