Introduction to Network Diagnostic Tool SreCli-Net


1. Background

The SRE operation and maintenance team is committed to improving operation and maintenance (O&M) efficiency through automation, driving the iterative shift toward intelligent O&M, and resolving the pain points of traditional O&M. Traditional O&M has a complete system, but its methods are inconsistent and its operations are complicated and time-consuming. How to improve the O&M efficiency of hybrid cloud projects, and how to increase the added value of O&M and customer satisfaction, remain tough problems for us.

The main challenges are as follows:

  • As customer services develop and evolve rapidly, the lag of traditional operation and maintenance is magnified

With the development of customer services and the continuous evolution of business models, the volume of business data is also increasing year by year. This brings more opportunities and challenges to operation and maintenance. How to ensure the stable, safe, and efficient operation of data in the cloud and business interactions inside and outside the cloud is a question worth considering for operation and maintenance personnel.

  • Each platform system is complex to operate, and the learning cost of operation and maintenance keeps rising

Cloud product versions on the cloud platform iterate rapidly, which makes the platform harder to become familiar with. As versions change and new functions appear, the learning cost for newcomers rises, and so does the difficulty of mastering the platform's many operation and maintenance procedures. Yet there is still no fundamental solution for rapidly building up operation and maintenance capability. All of this can trigger a series of "butterfly effects" and even cause high project risk or P-level failures, which directly affect customers' normal use of cloud services.

  • The skills of operation and maintenance personnel are uneven, and operations are complicated

Current operation and maintenance practice suffers from reliance on manual, experience-based judgment, numerous manual operations on the platform, inefficient problem handling, and slow emergency response to faults. Because the system is complex, technicians lose a lot of time guiding others through basics such as machine login and tool usage. After logging in, operators face inconsistent instructions for create, delete, update, and query operations. The sustained workload also exhausts on-site personnel, leaving them unable to concentrate on online operations. With inexperienced on-site staff or customers in particular, it often happens that the target machine cannot be found or a command is mistyped, which makes operation and maintenance inefficient overall and creates frequent security risks.

Across these three areas (customers, platforms, and operation and maintenance itself), the main tasks at present are improving operation and maintenance efficiency and reducing the learning cost for personnel. The SRE-CLI tool (srecli) was launched in this context. It supports shell functions, command completion, problem diagnosis, and fault hemostasis (stopping the damage from a failure), and is intended to gradually improve the situation described above.

2. SRE-CLI basic introduction

SRE Command Line Interface (SRE CLI) is an operation and maintenance tool that lets you run commands in a command-line shell to operate and maintain the hybrid cloud. With minimal configuration, you can use the SRE CLI from the command prompt of a terminal program to carry out the complex commands of daily operation and maintenance. It distills the experience that "old Chinese doctors" (veteran engineers) have accumulated in everyday problem handling and failure response. Because it is integrated into the hybrid cloud as a command-line tool, you can run the SRE CLI there without configuration and perform complicated operation and maintenance tasks with simple commands.

The CLI interaction capability model consists of four parts: the access layer, the interaction layer, the back end, and the infrastructure. After logging in to SRECLI, the end user enters the interaction layer and completes a specified action by selecting the corresponding scene command and auxiliary function. The action calls the back-end tool capabilities and the data in the data source, and the computation runs on the infrastructure layer. The result of the computation and diagnosis is output directly to the terminal CLI (the "black screen" interface), completing one full interaction, as shown in the figure below.

Figure 1
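The four-part flow described above can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the actual SRE CLI implementation: the function names, the `BACKEND_TOOLS` registry, and the returned strings are all hypothetical, and the "infrastructure" step simply calls a local function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Request:
    user: str
    command: str  # full command line typed by the user


def access_layer(user: str, command: str) -> Request:
    """Access layer: only a logged-in user may submit a command."""
    if not user:
        raise PermissionError("login required")
    return Request(user=user, command=command.strip())


def interaction_layer(req: Request) -> Tuple[str, str]:
    """Interaction layer: split the scene command from its arguments."""
    family, _, args = req.command.partition(" ")
    return family, args


# Back end: hypothetical tool registry keyed by command family.
BACKEND_TOOLS: Dict[str, Callable[[str], str]] = {
    "ali_diag": lambda args: f"diagnosis requested: {args}",
    "ali_query": lambda args: f"query requested: {args}",
}


def infrastructure_layer(tool: Callable[[str], str], args: str) -> str:
    """Infrastructure layer: run the computation (here, a plain call)."""
    return tool(args)


def run(user: str, command: str) -> str:
    """One full interaction: access -> interaction -> back end -> infrastructure."""
    req = access_layer(user, command)
    family, args = interaction_layer(req)
    tool = BACKEND_TOOLS.get(family)
    if tool is None:
        return f"unknown command family: {family}"
    return infrastructure_layer(tool, args)


# The result would be printed to the terminal in the real tool.
print(run("sre", "ali_diag network route bgp"))
```

Keeping each layer as a separate function mirrors the separation in the model: the access check, command parsing, tool lookup, and execution can each change without touching the others.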

  • Problem diagnosis (ali\_diag)

High-frequency operations are extracted from service orders, work orders, and trouble tickets, and common operations, problems, and failure points are turned into atomic items. During daily operation and maintenance, you can query a product's atomic items, problem points, and fault points, and quickly look up key indicators to locate the problem.

Figure 2

  • Scene diagnosis (ali\_scene)

Troubleshooting approaches accumulated from fault scenarios are distilled and output in the form of "three sharp axes" (a short, fixed sequence of checks) to locate the problem accurately. On this basis, fault points are assembled and the fault is pinpointed.

Figure 3

  • Emergency hemostasis (ali\_cure)

This command collects proven hemostasis and recovery measures from real failures and risks. Once a failure has occurred and its solution has been determined, quick recovery is required. Recovery actions include restarting, downgrading, rate limiting, switching over, and so on, helping customers quickly restore their business.

  • Daily query (ali\_query)

Covers daily queries, display of related data, and retrieval of common information. Using precise query methods, it looks up the product, routing, capacity, policy, and other information associated with an IP address in the cloud. It currently covers the various IP-dimension queries of the physical network.

  • Intelligent stream capture (ali\_trace)

Gives the CLI the ability to capture packets at various points in the cloud platform. Customized combinations of packet capture commands take you quickly to the capture point and capture network traffic in the specified inbound or outbound direction. Two types are covered: classic network packet capture and VPC network packet capture.
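As a rough illustration of how the five command families listed above could be registered and completed by prefix, here is a minimal Python sketch. The family names and their descriptions come from the list above; the registry structure and the `complete` and `describe` helpers are hypothetical and are not taken from the SRE CLI itself.

```python
from typing import Dict, List

# Command families and descriptions as listed in this section.
COMMAND_FAMILIES: Dict[str, str] = {
    "ali_diag": "Problem diagnosis",
    "ali_scene": "Scene diagnosis",
    "ali_cure": "Emergency hemostasis",
    "ali_query": "Daily query",
    "ali_trace": "Intelligent stream capture",
}


def complete(prefix: str) -> List[str]:
    """Return every command family whose name starts with the typed prefix."""
    return sorted(name for name in COMMAND_FAMILIES if name.startswith(prefix))


def describe(name: str) -> str:
    """Return a one-line help string, or completion hints for an unknown name."""
    if name in COMMAND_FAMILIES:
        return f"{name}: {COMMAND_FAMILIES[name]}"
    candidates = complete(name)
    hint = ", ".join(candidates) if candidates else "no match"
    return f"unknown command '{name}' ({hint})"


print(complete("ali_s"))
print(describe("ali_trace"))
```

A flat registry like this keeps completion and help text in one place, which is one simple way to lower the cost of remembering command names.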

3. Cli-Net concept

Cli-Net is a branch of the CLI system that handles diagnosis and troubleshooting for the physical network in the hybrid cloud. Using instructions in a unified format, it diagnoses specific aspects of the physical network environment. Cli-Net covers four aspects of the hybrid cloud physical network: performance diagnosis of general network equipment in the cloud, status diagnosis of the cloud boundary network, status diagnosis of the network inside the cloud, and network status diagnosis of physical machines. It covers the operating status of the physical machines and switch networks of all products in the cloud, as well as troubleshooting of access from the Internet and IDC networks outside the cloud to the network inside the cloud. The specific diagnosis coverage is shown in the following table.

| Cli-Net | General network equipment performance diagnosis | Cloud boundary network status diagnosis | Cloud network status diagnosis | Physical machine network status diagnosis |
| --- | --- | --- | --- | --- |
| ISW | ● | ● | | |
| DSW | ● | | ● | ● |
| CSW | ● | ● | | |
| LSW | ● | | ● | ● |
| ASW | ● | | ● | ● |
| | input | input | input | input |

# 4. Cli-Net main functions

* Quickly log in to network equipment

The CLI tool uses the "space-based" query to obtain the switch IP address, then iterates over the common passwords in its built-in "password library" to log in to the network device. If every common password fails, the tool concludes that the password has been changed to a project-specific (personalized) one. It then prompts the user to apply for that password and, once authorized, to enter it manually before the remaining steps run. This saves the time otherwise spent looking up the switch's IP address and login password and makes logging in to network devices easier.

Figure 4

Demo command: ali\_tools login switch $ switch role name

Figure 5

* General network equipment performance diagnosis

Cli-Net checks the hardware operating indicators of the switch itself, such as CPU, boards, temperature, fans, memory, and power status.

Figure 6

Demo instructions: ali\_diag network hardware COMMAND  [cpu\_usage]  [device] [environment]  [fan]  [memory] [power]

Figure 7

* Cloud border network interconnection status diagnosis

Health check of the interconnected physical links between the cloud platform switches ISW, CSW, DSW, ASW, and LSW. It checks the interconnection status of the classic links between the roles, the interconnection status of the VPC dedicated line links, and the optical attenuation on the interconnections.

Figure 8

Demo instructions: ali\_diag network interface COMMAND  [classic\_link]  [transceiver]  [vpc\_link]

Figure 9

* In-cloud network interconnection status diagnosis

Checks the interconnection status of the routing protocols on the cloud platform switches. It inspects the BGP and OSPF protocol status and directly outputs any abnormal state it finds.

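The "output only what is abnormal" behavior of the routing check can be sketched in Python as follows. This is an assumption-laden illustration, not the tool's code: the neighbor records, peer addresses, and the `abnormal_neighbors` helper are hypothetical. The state names themselves are the standard ones: a BGP session is up when it reaches `Established`, and an OSPF adjacency is complete at `Full`. As a simplification, the sketch also flags OSPF `2-Way`, which is normal between non-designated routers on a broadcast segment.

```python
from typing import Dict, List

# Healthy end states per protocol (simplified, see the note above).
HEALTHY_STATES: Dict[str, set] = {
    "bgp": {"Established"},
    "ospf": {"Full"},
}


def abnormal_neighbors(protocol: str, neighbors: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only the neighbors whose state is not a healthy end state."""
    healthy = HEALTHY_STATES[protocol]
    return [n for n in neighbors if n["state"] not in healthy]


# Hypothetical sample data standing in for what a switch would report.
sample_bgp = [
    {"peer": "10.0.0.1", "state": "Established"},
    {"peer": "10.0.0.2", "state": "Idle"},
    {"peer": "10.0.0.3", "state": "Active"},
]

for neighbor in abnormal_neighbors("bgp", sample_bgp):
    print(f"ABNORMAL bgp peer {neighbor['peer']}: {neighbor['state']}")
```

Filtering to abnormal entries keeps the output short when everything is healthy, which matches the description of printing abnormal status directly.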
Figure 10

Demo instructions: ali\_diag network route [bgp] [ospf]

Figure 11

* Connectivity status diagnosis

Checks the connectivity of physical servers and switches on the cloud platform. Connectivity is tested with ping against a physical machine name, cluster name, switch, and so on.

Figure 12

Demo instructions: ali\_diag network ping COMMAND  [nc]  [project]  [switch]  [virtual\_nc]

Figure 13

# 5. Cli-Net scene diagnosis

The Cli-Net scenarios integrate the checkpoints along the main business data flow directions in the hybrid cloud physical network. Using the troubleshooting instructions defined for each scenario, one-click diagnosis quickly checks the status of the various check items in the physical network environment. The check and diagnosis items are divided into five scenarios: stand-alone self-check, core network direction diagnosis, dedicated line direction diagnosis, public network direction diagnosis, and physical virtual direction diagnosis. The specific functions are shown in the following table:

| Name | CLI English name | Meaning |
| --- | --- | --- |
| Single-machine self-check function | device_check | Checks the health of each switch, including its interfaces and routing, and determines and outputs the abnormal items of the network device itself. |
| Core network direction diagnosis | core-network | Checks all interconnected physical paths involving cloud servers, either the overall routing status or a designated physical machine, and determines and outputs network abnormal items. |
| Dedicated Line Direction Diagnosis | Private direction | Checks the overall status of the physical network involved when instance-level VPC users (including resource VPCs) access the cloud, and determines and outputs network abnormal items. |
| Public Network Direction Diagnosis | Internet Direction | Checks the overall condition of the physical network involved when resources in the cloud are accessed from the Internet, and determines and outputs network abnormal items. |
| Physical Virtual Direction Diagnosis | physics virtual | Checks the overall physical network condition between classic network resources (including all cloud service resources), and determines and outputs network abnormal items. |

# 6. Cli-Net scene structure

* The full self-check scene structure of a single machine is shown in the figure below.

Figure 14

* The Core-network scene structure is shown in the figure below.

Figure 15

* The private direction scene structure is shown in the figure below.

Figure 16

* The Internet Direction scene structure is shown in the figure below.

Figure 17

Diagnostic instructions: ali\_scene network COMMAND  [core\_network]  [device\_check]  [internet\_direction]  [physics\_virtual]

Figure 18

Demo instructions: ali\_scene network COMMAND  [core\_network]  [device\_check]  [internet\_direction]  [physics\_virtual]

Figure 19

Figure 20

# 7. Cli-Net application practice

> A comprehensive access network check

| Scenario | Troubleshooting instruction | Instruction result |
| --- | --- | --- |
| Power failure of the whole machine room | ali_diag network ping project {product name} | Check whether the connectivity of the physical machines in each cluster in the cloud is normal |
| | ali_diag network ping switch {name} | Check whether the connectivity of the switches in the cloud is normal |
| | ali_diag network hardware power {switch} | Check whether the power supply status of each switch is normal |
| | ali_diag network route bgp {switch} | Check the BGP routing status of the switch |
| | ali_scene network device_check | Switch hardware self-check |
| ECS in the cloud cannot be accessed | ali_scene network internet_direction | Check for network problems in the public network direction |
| | ali_scene network private_direction | Check for link problems in the direction of the dedicated line |
| Failed to access the data source in the base | ali_scene network core_network | Device network connectivity check |
| | ali_scene network physics_virtual | |
| Physical machine fails to go online | ali_scene network core_network | Physical machine network check |
| | ali_diag network route bgp {switch} | In-cloud BGP network status check |

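The scenario-to-instruction mapping in this table lends itself to a simple lookup. The Python sketch below encodes two of the scenarios as an illustration. The command strings are copied from the table, while the dictionary keys, the `commands_for` helper, and the switch name `DSW-1` are hypothetical. The sketch only builds the command lines and does not execute anything.

```python
from typing import Dict, List

# Two scenarios from the table; keys are illustrative labels only.
RUNBOOK: Dict[str, List[str]] = {
    "ecs_unreachable": [
        "ali_scene network internet_direction",
        "ali_scene network private_direction",
    ],
    "physical_machine_online_failure": [
        "ali_scene network core_network",
        "ali_diag network route bgp {switch}",
    ],
}


def commands_for(scenario: str, **params: str) -> List[str]:
    """Return the scenario's commands with placeholders such as {switch} filled in."""
    return [template.format(**params) for template in RUNBOOK[scenario]]


# "DSW-1" is a made-up switch name used only for this example.
for command in commands_for("physical_machine_online_failure", switch="DSW-1"):
    print(command)
```

A template that contains a placeholder raises `KeyError` when the matching parameter is missing, so a forgotten switch name is caught before any command line is produced.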
The table above lists the troubleshooting instructions that apply to each scenario. They diagnose the physical environment in the cloud and determine whether anything is abnormal. These checks cover only the physical network environment. To check the status of a specific product, combine them with that product's own diagnostics. Combining the network side with the product side enables rapid diagnosis and troubleshooting.

We are the Alibaba Cloud Intelligent Global Technical Service-SRE team. We are committed to becoming a technology-based, service-oriented engineering team focused on the high availability of business systems. We provide professional, systematic SRE services to help customers make better use of the cloud and build more stable and reliable business systems on it. We hope to share more technologies that help enterprise customers move to the cloud, use it well, and run their business on it more stably and reliably.
