Introduction to Fengshen Core Functions | Dingding Alarm + Data Gateway
1. Development background
1.1 User pain points
①Weak operation and maintenance capability on the tenant side
Problem: On the tenant side, customers have no effective way to obtain instance-level status, performance, and capacity data in a timely manner.
Status: At a fixed time every day, on-site engineers collect the data manually and push it to customers through Dingding.
②Inefficient troubleshooting
Problem: When the application business has a problem but the cloud platform products look normal, the customer does not accept that conclusion, and the team still has to help the customer solve the problem.
Status: The cause often turns out to be that the application instance has run out of performance or capacity headroom, but the troubleshooting process that leads to this finding is lengthy and inefficient.
③Lack of monitoring capability
Problem: Cloud platform monitoring is incomplete, and reporting capabilities such as capacity management and performance management are lacking.
Status: On-site engineers have to carry out a large number of manual inspections or write scripts.
④Monitoring is not timely
Problem: The business side always detects failures before the application and cloud platform teams do, which leaves operation and maintenance very passive.
Status: The customer discovers a problem and informs the application team; after the application is checked, the problem is traced back to the cloud platform. The troubleshooting chain is serial and inefficient.
1.2 Solution
①Ensure business stability
By tracking changes in the service capabilities of cloud products and building business simulation models, Fengshen predicts the customer's business health in advance and triggers an alarm when it falls below the baseline.
②SLA display
Alarms are raised automatically when a threshold is crossed, which makes product health quantifiable.
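The threshold and baseline ideas above can be sketched in a few lines. This is only an illustration: the function names, the choice of a "lower is better" metric such as CPU utilization, and the numbers used are assumptions, since the article does not describe how Fengshen actually computes business health.

```python
from typing import Sequence


def health_ratio(samples: Sequence[float], threshold: float) -> float:
    """Fraction of samples at or below the threshold (1.0 means fully healthy).

    Assumes a "lower is better" metric such as CPU utilization in percent.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    healthy = sum(1 for value in samples if value <= threshold)
    return healthy / len(samples)


def should_alarm(samples: Sequence[float], threshold: float, baseline: float) -> bool:
    """Return True when the quantified health falls below the baseline."""
    return health_ratio(samples, threshold) < baseline


if __name__ == "__main__":
    # Hypothetical CPU utilization samples, a 90% threshold, and a 0.9 baseline.
    cpu = [50.0, 60.0, 95.0, 99.0]
    print(health_ratio(cpu, threshold=90.0))
    print(should_alarm(cpu, threshold=90.0, baseline=0.9))
```

Expressing health as a ratio keeps the alarm decision independent of the metric's unit, so the same baseline check could be reused across products.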
2. Development and design
2.1 System architecture
Figure 1: System architecture diagram
The Fengshen system architecture, shown in Figure 1, is divided into two modules: CLIENT and SERVER.
- CLIENT: Deployed in the classic Tongque container; collects cloud product data under the control of scheduled tasks.
- SERVER: Deployed on an ECS instance in the VPC and built on the Flask framework; it is divided into two parts: data processing and data storage.
①Data processing: provides an API that receives data from CLIENT, then stores the data and displays it on the front end.
②Data storage: persists the data in an Alibaba Cloud RDS database.
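The CLIENT side of this flow, collecting on a schedule and handing each result to the SERVER API, might look roughly like the sketch below. The names `run_collection`, `collect`, and `send` are hypothetical, and the real CLIENT scheduling mechanism is not described in the article; `send` stands in for an HTTP POST to the SERVER.

```python
import time
from typing import Callable, Dict


def run_collection(
    collect: Callable[[], Dict],
    send: Callable[[Dict], None],
    interval_seconds: float,
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Collect and send data `rounds` times, pausing between rounds.

    Returns the number of payloads that were sent.
    """
    sent = 0
    for index in range(rounds):
        payload = collect()   # gather cloud product data (hypothetical collector)
        send(payload)         # hand the payload to the SERVER API (hypothetical sender)
        sent += 1
        if index < rounds - 1:
            sleep(interval_seconds)  # wait until the next scheduled run
    return sent


if __name__ == "__main__":
    # Print instead of posting; a real sender would call the SERVER API.
    run_collection(lambda: {"product": "ECS", "cpu": "95%"}, print, 0.1, 2)
```

Passing `collect`, `send`, and `sleep` in as parameters keeps the loop easy to test without a network or real waiting.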
2.2 Business Architecture
Figure 2: Business architecture diagram
The business architecture of Fengshen, shown in Figure 2, is divided into five major sections.
- Jiang Ziya: Tenant-side alarms, mainly including ECS, RDS and other cloud product instance performance and business-related alarms.
- Shen Gongbao: O&M-side alarms, mainly including cloud product health status, capacity water levels, and other related alarms.
- Lei Zhenzi: Hardware alarms, mainly including bad disks and physical machine out-of-band alarms.
- Bigan: Security alarms, mainly those coming from Cloud Shield.
- Yang Jian: Fault alarms; mainly applies an SLA algorithm to each product's data and sets P0 and P1 fault thresholds.
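The five sections above amount to a routing table from alarm category to robot, plus a threshold check for Yang Jian's fault levels. The sketch below illustrates both. The robot names come from the list above; the category keys and the two SLA threshold values are assumptions, because the article does not give the real thresholds or the SLA algorithm.

```python
from typing import Optional

# Robot names are taken from the article; the category keys are hypothetical labels.
ROBOT_BY_CATEGORY = {
    "tenant": "Jiang Ziya",
    "ops": "Shen Gongbao",
    "hardware": "Lei Zhenzi",
    "security": "Bigan",
    "fault": "Yang Jian",
}

# Assumed availability thresholds; the real P0/P1 values are not published.
P0_THRESHOLD = 0.95
P1_THRESHOLD = 0.99


def pick_robot(category: str) -> str:
    """Return the robot responsible for an alarm category."""
    return ROBOT_BY_CATEGORY[category]


def fault_level(sla: float) -> Optional[str]:
    """Map an SLA ratio (0.0 to 1.0) to a fault level, or None when healthy."""
    if sla < P0_THRESHOLD:
        return "P0"
    if sla < P1_THRESHOLD:
        return "P1"
    return None


if __name__ == "__main__":
    print(pick_robot("security"))
    print(fault_level(0.97))
```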
3. Dingding alarm
3.1 Alarm classification
For how to create a robot, see reference [1].
Jiang Ziya
Shen Gongbao
Lei Zhenzi
Bigan
Yang Jian
3.2 Alarm display
Figure 3: Jiang Ziya
Figure 4: Shen Gongbao
Figure 5: Lei Zhenzi
Figure 6: Bigan
Figure 7: Yang Jian-1
Figure 7: Yang Jian-2
Figure 7: Yang Jian-3
Figure 7: Yang Jian-4
4. Data Gateway
The data gateway is divided into two major modules: obtaining data and receiving data.
- Obtaining data: The data that can be obtained is divided into alarm data, full data, and performance data.
①Alarm data: Corresponds to the alarm information pushed by the Dingding robots, encapsulated in a fixed data format and exposed externally through an API.
②Full data: Data from the database source tables, exposed externally through an API without any processing, which gives callers a high degree of flexibility.
③Performance data: Product performance data is written to the time series database at regular intervals and retained for a long time, so historical performance data can be queried.
- Receiving data: Provides an external API that receives customer-defined monitoring data, encapsulates it in MARKDOWN format, and sends real-time Dingding alarms.
4.1 Get data
4.1.1 Alarm data
4.1.1.1 Request interface
Request method: POST request
URL address: http://{ip}:{port}/api/v1/search/monitor
ip: Fengshen ecs_ip
port: 9170
PARAM: For the parameter list, see reference [2].
4.1.1.2 DEMO
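import requests

# Replace {ip} and {port} with the Fengshen ECS IP and port 9170.
url = "http://{ip}:{port}/api/v1/search/monitor/"
# Query MQ alarms titled "积压告警" (backlog alarm) within a one-minute window.
data = {"product": "MQ", "title": "积压告警", "stime": "2020-01-04 00:00:00", "etime": "2020-01-04 00:01:00"}
res = requests.post(url=url, json=data)
print(res.text)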
curl -H "Content-Type:application/json" \
-X POST -d '{"type":"ALL"}' http://{ip}:{port}/api/v1/search/monitor/
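The request body in the Python demo carries a time window in `stime` and `etime`. A small helper can build that window instead of hard-coding the strings. This is a sketch: the helper name `build_query` is hypothetical, and the timestamp format is inferred from the demo values rather than from the parameter list in [2].

```python
from datetime import datetime, timedelta

# Timestamp format inferred from the demo request ("2020-01-04 00:00:00").
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_query(product: str, title: str, end: datetime, minutes: int) -> dict:
    """Build a request body covering the `minutes` before `end`."""
    start = end - timedelta(minutes=minutes)
    return {
        "product": product,
        "title": title,
        "stime": start.strftime(TIME_FORMAT),
        "etime": end.strftime(TIME_FORMAT),
    }


if __name__ == "__main__":
    # Reproduces the body used in the demo above.
    print(build_query("MQ", "积压告警", datetime(2020, 1, 4, 0, 1, 0), minutes=1))
```

The returned dictionary can be passed as the `json` argument of `requests.post`, as in the demo.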
4.1.1.3 Data return
①An alarm is currently active
{"code":0, "data":[{"info":"0.0.0.0,ecs,95% \n 0.0.0.1,ecs,95% ", "product":"ECS", "title":"Performance warning", "level":"alarm", "robot":"Jiang Ziya", "monitor_time":"2020-01-14 00:00:00", "columns":"IP, product, value"}]}
②No alarm data currently (the alarm has returned to normal)
{"code":0, "data":[{"info":"", "product":"ECS", "title":"performance warning", "level":"warning", "robot":"Jiang Ziya", "monitor_time":"2020-01-14 00:00:00", "columns":"IP, product, value"}]}
③Data not found:
{"code":0, "data":[]}
④ Query exception:
{"code":500, "data":"Exception Information"}
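The four return shapes above can be handled by one small parser. The sketch below pairs each comma-separated row in `info` with the names in `columns`; the function name `parse_alarm_response` and the choice to raise `RuntimeError` on a non-zero `code` are assumptions, not part of the Fengshen API.

```python
from typing import Dict, List


def parse_alarm_response(body: dict) -> List[Dict[str, str]]:
    """Turn an alarm query response into one dictionary per alarm row."""
    if body.get("code") != 0:
        # Case 4: query exception, "data" holds the exception information.
        raise RuntimeError(f"query failed: {body.get('data')}")
    rows = []
    for alarm in body["data"]:  # Case 3: an empty list yields no rows.
        columns = [name.strip() for name in alarm["columns"].split(",")]
        for line in alarm["info"].split("\n"):
            line = line.strip()
            if not line:
                continue  # Case 2: empty info means the alarm has recovered.
            values = [value.strip() for value in line.split(",")]
            row = dict(zip(columns, values))
            row["title"] = alarm["title"]
            rows.append(row)
    return rows


if __name__ == "__main__":
    sample = {"code": 0, "data": [{
        "info": "0.0.0.0,ecs,95% \n 0.0.0.1,ecs,95% ",
        "title": "Performance warning",
        "columns": "IP, product, value",
    }]}
    for row in parse_alarm_response(sample):
        print(row)
```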
4.1.2 Full data
4.1.2.1 Request interface
Request method: POST request
URL address: http://{ip}:{port}/api/v1/search/data/
ip: Fengshen ecs_ip
port: 9170
PARAM: For the parameter list, see reference [2].
4.1.2.2 DEMO
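import requests

# Replace {ip} and {port} with the Fengshen ECS IP and port 9170.
url = "http://{ip}:{port}/api/v1/search/data/"
# Query the unprocessed source-table data for MQ within a one-minute window.
data = {"product": "MQ", "title": "TIME", "stime": "2020-01-04 00:00:00", "etime": "2020-01-04 00:01:00"}
res = requests.post(url=url, json=data)
print(res.text)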
4.1.2.3 Data return
4.1.3 Performance data
4.1.3.1 Request interface
Request method: POST request
URL address: http://{ip}:{port}/api/v1/influxdb_query/
ip: Fengshen ecs_ip
port: 9170
PARAM: sql, the InfluxDB SQL statement to execute.
4.1.3.2 DEMO
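import requests

# Replace {ip} and {port} with the Fengshen ECS IP and port 9170.
url = "http://{ip}:{port}/api/v1/influxdb_query/"
# "sql" carries the InfluxDB query; replace the placeholder with a real statement.
data = {"sql": "influxdb sql"}
# Unlike the other demos, the body is sent as form data rather than JSON.
res = requests.post(url, data=data)
print(res.text)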
4.1.3.3 Data return
4.2 Receive data
4.2.1 Request interface
Request method: POST request
URL address: http://{ip}:{port}/api/v1/insert/third
ip: Fengshen ecs_ip
port: 9170
PARAM:
4.2.2 DEMO
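import requests

url = "http://172.0.0.1:9170/api/v1/insert/third/"
# Sample values: title "ecs性能监控" = ECS performance monitoring, level "预警" = warning,
# source "云监控" = Cloud Monitor, robot "姜子牙" = Jiang Ziya, submitor "高德臣" = the submitter's name,
# details "兄弟 关注下" = "please keep an eye on this".
data = {"title":"ecs性能监控", "level":"预警", "source":"云监控", "product":"ecs", "msg":"ip:10.0.0.1 cpu:98% ip:127.0.0.1 mem:99%", "robot":"姜子牙", "submitor":"高德臣", "monitor_time":"2021-03-10 16:00:00", "details":"兄弟 关注下"}
res = requests.post(url=url, json=data)
print(res.text)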
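The fields posted in the demo are what the gateway then wraps in Markdown before sending the Dingding alarm. A rough sketch of such a wrapper follows. The outer `msgtype`/`markdown` structure follows the DingTalk custom robot message format; the function name `to_dingtalk_markdown` and the layout of the text are assumptions, since the article does not show how Fengshen renders the message.

```python
def to_dingtalk_markdown(alarm: dict) -> dict:
    """Wrap received monitoring fields in a Markdown robot message (assumed layout)."""
    lines = [
        f"### {alarm['title']}",
        f"- level: {alarm['level']}",
        f"- product: {alarm['product']}",
        f"- source: {alarm['source']}",
        f"- time: {alarm['monitor_time']}",
        f"- submitter: {alarm['submitor']}",  # key spelled as in the demo request
        "",
        alarm["msg"],
    ]
    details = alarm.get("details")
    if details:
        lines.extend(["", f"> {details}"])  # optional note rendered as a quote
    return {
        "msgtype": "markdown",
        "markdown": {"title": alarm["title"], "text": "\n".join(lines)},
    }


if __name__ == "__main__":
    demo = {
        "title": "ecs performance monitoring", "level": "warning",
        "source": "Cloud Monitor", "product": "ecs",
        "msg": "ip:10.0.0.1 cpu:98%", "submitor": "tester",
        "monitor_time": "2021-03-10 16:00:00", "details": "please check",
    }
    print(to_dingtalk_markdown(demo)["markdown"]["text"])
```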
4.2.3 Alarm display
Figure 8: Alarm display diagram
References
[1] Pre-check for the deployment of Fengshen: https://yuque.antfin-inc.com/docs/share/d3a743db-af85-47d2-89c5-4f22eb1693c5?
[2] Obtain Fengshen Data-Three-API: https://yuque.antfin-inc.com/docs/share/2037fbb2-35fa-42ad-8476-ec7502e9ed33?#
We are the Alibaba Cloud Intelligent Global Technical Service SRE team. We are committed to becoming a technology-based, service-oriented engineering team focused on the high availability of business systems. We provide professional, systematic SRE services that help customers make better use of the cloud and build more stable and reliable business systems on it. We hope to share more technologies that help enterprise customers move to the cloud, use it well, and run their business on it more stably and reliably. You can scan the QR code below to join the Alibaba Cloud SRE Technical Institute Dingding circle and talk with more cloud experts about the cloud platform.