Author|Liang Yong
Background
Hello (Haro) has evolved into a mobile travel platform that combines two-wheel mobility (Hello Bike, Hello Moped, Hello Electric Bike, Xiaoha battery swapping) and four-wheel mobility (Hello Hitch, ride-hailing, Hello Taxi), and is also exploring local-life services such as hotels and in-store group buying.
As the company's business keeps growing, so does its traffic. We found that major production incidents are often triggered by sudden traffic spikes, which makes governing and protecting traffic, and ensuring the high availability of the system, particularly important.
This article shares the pitfalls Hello has stepped into and the experience it has accumulated in governing message traffic and microservice calls.
About the author
Liang Yong (Lao Liang) is co-author of the column "RocketMQ in Practice and Advanced Topics" and participated in reviewing "RocketMQ Technical Insider". He has been a speaker at the ArchSummit Global Architect Summit and the QCon Case Study Club.
He currently focuses on back-end middleware and has published more than 100 source-code analysis articles on the WeChat official account [Gua Nong Lao Liang], covering the RocketMQ, Kafka, gRPC, Nacos, Sentinel, and Java NIO series. He works at Hello as a senior technical expert.
Talk about governance
Before we start, let’s talk about governance. Here is Lao Liang’s personal understanding:
What is governance trying to do?
- Make our environment better
How do we know what is not good enough?
- Past experience
- Customer feedback
- Industry comparison
How do we know whether it stays good?
- Monitoring and tracking
- Alert notification
How do we make what is not good better?
- Governance measures
- Emergency plan
Contents
- Building a distributed message governance platform
- Pitfalls encountered with RocketMQ in production and how we solved them
- Building a microservice high-availability governance platform
Background
RabbitMQ running without governance
The company previously used RabbitMQ and ran into the following pain points, many of which stem from the flow control of the RabbitMQ cluster.
- Should an excessive backlog be purged or not? That is a question we had to agonize over every time.
- An excessive backlog triggers cluster flow control? That really hurts the business.
- Want to re-consume data from two days ago? Sorry, please resend it.
- Which services are connected? Hold on, I have to go look up the IPs first.
- Are there risks such as large messages? I can only guess.
Services running without governance
We once had a failure where multiple businesses shared one database. During an evening peak, traffic surged and the database was knocked down.
- Upgrading the single-node database to the highest specification still did not solve the problem
- It slowly recovered after a restart, only to be knocked down again a while later
- And so it went, round after round, as we suffered and silently waited for the peak to pass
Reflection: both messages and services need sound governance measures
Building a distributed message governance platform
Design guide
Which are our key metrics and which are secondary? That is the primary question of message governance.
Design goals
The platform is designed to shield the complexity of the underlying middleware (RocketMQ / Kafka) and to route messages dynamically by a unique identifier. At the same time, it is built as an integrated message governance platform that brings together resource management and control, retrieval, monitoring, alerting, inspection, disaster recovery, and visual operations, ensuring that the message middleware runs smoothly and healthily.
Points to consider in the design of the message governance platform
- Provide a simple and easy-to-use API
- What are the key metrics for judging whether client usage carries risks?
- What are the key metrics for measuring the health of the cluster?
- What are the common user and O&M operations that should be visualized?
- What measures are available to deal with the unhealthy cases found?
As simple and easy to use as possible
Design guide
Being able to make complicated things simple is real ability.
Minimalist unified API
We provide a unified SDK that encapsulates both message middlewares (Kafka / RocketMQ). A minimal sketch of what such an interface might look like is shown below.
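As an illustration only (the interface and type names below are hypothetical, not the actual Hello SDK), a unified producer facade might look like this: business code only sees a logical topic identifier, and the implementation decides whether a message is routed to RocketMQ or Kafka.

```java
/**
 * Hypothetical sketch of a unified producer facade. Business code only sees a
 * logical topic identifier; the implementation resolves it to a RocketMQ or
 * Kafka cluster and routes the message there.
 */
public interface UnifiedProducer {

    /** Vendor-neutral result of a send. */
    final class SendResult {
        public final boolean success;
        public final String msgId;

        public SendResult(boolean success, String msgId) {
            this.success = success;
            this.msgId = msgId;
        }
    }

    /** Synchronous send by logical topic identifier. */
    SendResult send(String topicId, byte[] payload);

    /** Asynchronous send with a vendor-neutral callback. */
    void sendAsync(String topicId, byte[] payload, java.util.function.Consumer<SendResult> callback);
}
```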
Apply once
Automatic creation of topics and consumer groups is not suitable for production: it leads to loss of control and is harmful to full life-cycle management and cluster stability. The application process must be controlled, but it should be kept as simple as possible, for example: a single application takes effect in every environment and automatically generates the associated alarm rules.
Client governance
Design guide
Monitor whether clients are used according to the standards, and find appropriate governance measures
Scenario replay
Scenario 1: Instantaneous traffic and cluster flow control
Suppose the cluster is currently handling 10,000 TPS and the rate suddenly jumps to 20,000 or more; such a steep increase in traffic is very likely to trigger cluster flow control. For this kind of scenario, the client's sending speed needs to be monitored, and once the speed and the steep-increase threshold are reached, sending is made smoother. A minimal sketch of this idea follows.
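As a rough sketch of the smoothing idea (not the platform's actual implementation; the 10,000 permits/s cap and the 10-second warm-up window are assumed values), a warm-up style rate limiter such as Guava's RateLimiter can keep a sudden burst from hitting the cluster all at once:

```java
import java.util.concurrent.TimeUnit;

import com.google.common.util.concurrent.RateLimiter;

/**
 * Illustrative smoothing on the sending side: a warm-up rate limiter ramps the
 * allowed send rate up gradually instead of letting a burst hit the cluster at
 * full speed. The cap (10,000 permits/s) and the 10 s warm-up are assumptions.
 */
public class SmoothedSender {

    // At most ~10,000 sends per second, reached gradually over a 10 s warm-up.
    private final RateLimiter limiter = RateLimiter.create(10_000, 10, TimeUnit.SECONDS);

    public void send(byte[] payload) {
        limiter.acquire();   // blocks briefly when the burst is too steep
        doSend(payload);     // delegate to the real RocketMQ / Kafka producer
    }

    private void doSend(byte[] payload) {
        // ... actual middleware send call goes here ...
    }
}
```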
Scenario 2: Large messages and cluster jitter
When a client sends large messages, for example several hundred KB or even several MB, it may cause long I/O times and cluster jitter. To govern this scenario, the size of sent messages needs to be monitored. Through after-the-fact inspection we identify services that send large messages and push their owners to compress or restructure them, keeping messages within 10 KB. One possible client-side guard is sketched below.
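A minimal sketch of such a client-side guard, assuming a 10 KB threshold and gzip compression (the helper class is made up; the real SDK may handle this differently):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

/**
 * Rough sketch of a client-side guard for large messages: record the raw size
 * for inspection and gzip-compress bodies above the threshold. The 10 KB limit
 * matches the goal in the text; the class itself is hypothetical.
 */
public final class MessageSizeGuard {

    private static final int THRESHOLD_BYTES = 10 * 1024; // target upper bound: 10 KB

    public static byte[] prepareBody(byte[] body) throws IOException {
        // a real SDK would also report body.length to the monitoring system here
        if (body.length <= THRESHOLD_BYTES) {
            return body;
        }
        // compress oversized bodies; the consumer side must decompress accordingly
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(body);
        }
        return out.toByteArray();
    }
}
```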
Scenario 3: Client version too low
As features iterate, the SDK version is upgraded, and version changes may introduce risks besides new functionality. Using a version that is too low means, first, that new features are not supported and, second, that there may be security risks. To understand how the SDK is used, each client reports its SDK version, and owners are pushed to upgrade through inspections.
Scenario 4: Consumer traffic removal and recovery
Consumer traffic removal and recovery is usually needed in the following cases: first, traffic must be removed when an application is being released; second, traffic must be removed before troubleshooting when locating a problem. To support these scenarios, the client needs to listen for removal/recovery events and pause or resume consumption accordingly, as in the sketch below.
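A sketch of this idea using RocketMQ's DefaultMQPushConsumer: the event-listener wrapper below is hypothetical, while suspend() and resume() are the client methods used here to pause and resume consumption without shutting the consumer down.

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;

/**
 * Sketch of traffic removal/recovery on the consumer side. The event-listener
 * wrapper is hypothetical; suspend() and resume() pause and resume consumption
 * without shutting the client down.
 */
public class TrafficSwitchListener {

    private final DefaultMQPushConsumer consumer;

    public TrafficSwitchListener(DefaultMQPushConsumer consumer) {
        this.consumer = consumer;
    }

    /** Called when the platform pushes a "remove traffic" event, e.g. before a release. */
    public void onRemove() {
        consumer.suspend();   // stop consuming; connections stay alive
    }

    /** Called when the platform pushes a "recover traffic" event. */
    public void onRecover() {
        consumer.resume();    // resume consuming
    }
}
```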
Scenario 5: Sending/consumption latency detection
How long does it take to send or consume a message? By monitoring latency, inspections can single out low-performance applications and push them to improve, raising overall performance.
Scenario 6: Improving troubleshooting efficiency
When troubleshooting, it is often necessary to retrieve information about a message's life cycle: what was sent, where it is stored, and when it was consumed. Within a message's life cycle this can be connected through the msgId. In addition, by embedding an rpcId/traceId-like link identifier in the message header, a message can be stitched into a single request chain, as illustrated below.
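For illustration, a traceId can be carried as a user property on the message (the property name and helper class here are assumptions, not the platform's actual convention):

```java
import org.apache.rocketmq.common.message.Message;

/**
 * Minimal illustration of carrying a link identifier in the message header so
 * that a message can be stitched into the request chain. The property name and
 * helper class are assumptions for this example.
 */
public final class TraceHeaderSupport {

    private static final String TRACE_ID_KEY = "traceId";

    /** Producer side: copy the current request's traceId into a user property. */
    public static void inject(Message message, String currentTraceId) {
        message.putUserProperty(TRACE_ID_KEY, currentTraceId);
    }

    /** Consumer side: read it back so consumption logs join the same chain. */
    public static String extract(Message message) {
        return message.getUserProperty(TRACE_ID_KEY);
    }
}
```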
Governance measures
Required monitoring information
- Sending/consumption speed
- Sending/consumption latency
- Message size
- Node information
- Link ID
- Version information
Common governance measures
- Regular inspections: with the instrumented data, risky applications can be found through inspections, for example sending/consumption latency greater than 800 ms, message size greater than 10 KB, or a version lower than a specific version.
- Sending smoothing: for example, when an instantaneous rate of 10,000 with a steep increase of more than 2x is detected, the burst can be smoothed out by warming up.
- Consumption rate limiting: when a downstream third-party interface needs to be rate-limited, consumption can be throttled; this can be implemented together with the high-availability framework.
- Consumer removal: the consumer client is shut down and restored by listening for removal events.
Topic/consumer group governance
Design guide
Monitor the resource usage of topics and consumer groups
Scenario replay
Scenario 1: The impact of consumption backlog on the business
Some business scenarios are sensitive to consumption backlog while others are not, as long as consumption catches up later. For example, unlocking a bicycle is a matter of seconds, whereas batch-processing scenarios related to information aggregation are insensitive to backlog. By collecting consumption backlog metrics, applications that cross the threshold trigger real-time notifications to the responsible engineers, so they can grasp the consumption situation in real time.
Scenario 2: The impact of sending/consumption speed
Should a sending/consumption speed that drops to zero raise an alarm? In some scenarios the speed must not drop to zero; if it does, the business is abnormal. By collecting speed metrics, applications that cross the threshold trigger real-time alarms.
Scenario 3: Consumer node offline
When a consumer node goes offline, the engineers responsible for the application need to be notified. This requires collecting registered node information so that an alarm can be triggered in real time when a node goes offline.
Scenario 4: Unbalanced sending/consumption
Unbalanced sending or consumption often hurts performance. I remember that during one consultation, an engineer had set the message key to a constant; by default the partition is selected by hashing the key, so all messages went into a single partition, and performance could not be improved no matter what (see the sketch below). In addition, the consumption backlog of each partition should be monitored, with a real-time alarm triggered when the imbalance becomes excessive.
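The pitfall is easy to reproduce with any key-hashing client; the sketch below uses Kafka's producer (topic and key values are invented), and the same reasoning applies to RocketMQ's key-based queue selection:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/**
 * Illustration of the constant-key pitfall with Kafka's client; topic and key
 * values are invented. With a constant key every record hashes to the same
 * partition, so one partition carries all the traffic.
 */
public class PartitionKeyExample {

    public static void send(KafkaProducer<String, String> producer, String orderId, String payload) {
        // BAD: a constant key always hashes to the same partition.
        // new ProducerRecord<>("ride-order-topic", "CONSTANT_KEY", payload);

        // BETTER: key by a field with good cardinality (e.g. the order id),
        // spreading records across partitions while keeping per-key ordering.
        producer.send(new ProducerRecord<>("ride-order-topic", orderId, payload));
    }
}
```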
Governance measures
Required monitoring information
- Sending/consumption speed
- Sending partition details
- Per-partition consumption backlog
- Consumer group backlog
- Registered node information
Common governance measures
- Real-time alarms: real-time alarm notifications for consumption backlog, sending/consumption speed, node offline, and partition imbalance.
- Performance improvement: when consumption cannot keep up with the backlog, it can be improved by increasing pull threads and consumption threads and by adding partitions (see the consumer-tuning sketch after this list).
- Self-service troubleshooting: provide multi-dimensional retrieval tools, such as retrieving a message's life cycle by time range, msgId, or link ID.
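A hedged example of the performance knobs mentioned above, using RocketMQ's push consumer; the numbers are purely illustrative, and adding partitions/queues is a separate, server-side change:

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;

/**
 * Hedged example of the tuning knobs mentioned above on RocketMQ's push
 * consumer. The numbers are illustrative only; adding queues/partitions is a
 * separate, server-side change.
 */
public class ConsumerTuningExample {

    public static DefaultMQPushConsumer buildConsumer() {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("demo_consumer_group");
        consumer.setConsumeThreadMin(20);            // more consumption threads
        consumer.setConsumeThreadMax(40);
        consumer.setPullBatchSize(32);               // pull more messages per request
        consumer.setConsumeMessageBatchMaxSize(4);   // hand several messages to one listener call
        return consumer;
    }
}
```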
Cluster health governance
Design guide
What are the core indicators for measuring cluster health?
Scenario replay
Scenario 1: Cluster health detection
Cluster health detection answers one question: is this cluster in good shape? It is answered by checking the number of cluster nodes, the heartbeat of each node, the cluster write TPS water level, and the cluster consumption TPS water level.
Scenario 2: Cluster stability
Cluster flow control often reflects insufficient cluster performance, and cluster jitter also causes client send timeouts. By collecting the heartbeat latency of each node and the change rate of the cluster write TPS water level, we can tell whether the cluster is stable.
Scenario 3: Cluster high availability
High availability mainly targets extreme scenarios in which an availability zone becomes unavailable, or certain topics and consumer groups on the cluster become abnormal and need targeted measures. For example: cross-availability-zone master-slave deployment within the same city, dynamic migration of topics and consumer groups to a disaster-recovery cluster, and multi-active deployments.
Governance measures
Required monitoring information
- Number of cluster nodes
- Heartbeat latency of cluster nodes
- Cluster write TPS water level
- Cluster consumption TPS water level
- Change rate of cluster write TPS
Common governance measures
- Regular inspection: regular inspection of the cluster TPS water level and hardware water level.
- Disaster recovery measures: cross-availability-zone master-slave deployment within the same city, dynamic migration of topics and consumer groups to disaster-recovery clusters, and remote multi-active deployment.
- Cluster tuning: system version and parameter tuning, cluster parameter tuning.
- Cluster classification: split by business line and by core/non-core services.
Which core metric matters most
If I had to pick the most important of these key metrics, I would choose the heartbeat detection of each node in the cluster, that is, its response time (RT). Let's look at what can affect RT. A simple probe that measures this RT is sketched below.
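One way to obtain this RT, sketched here as an assumption rather than the platform's actual probe, is to periodically send a tiny message to a dedicated probe topic and time how long the broker takes to acknowledge it:

```java
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

/**
 * Sketch of a heartbeat-style RT probe (the probe topic name is an assumption):
 * send a tiny message synchronously and measure how long the broker takes to
 * acknowledge it. Run periodically per node, this yields the RT discussed above.
 */
public class BrokerRtProbe {

    public static long probeOnceMillis(DefaultMQProducer producer) throws Exception {
        Message probe = new Message("CLUSTER_HEARTBEAT_PROBE", "ping".getBytes());
        long start = System.nanoTime();
        producer.send(probe);                                // waits for the broker ack
        return (System.nanoTime() - start) / 1_000_000;      // RT in milliseconds
    }
}
```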
About alerts
- Most monitoring metrics are detected at second-level granularity
- Alarms that hit a threshold are pushed to the company's unified alarm system for real-time notification
- Risk findings from inspections are pushed to the company's inspection system and summarized in a weekly report
Message platform diagrams
Architecture diagram
Dashboard diagram
- Multi-dimensional: cluster dimension, application dimension
- Full aggregation: key metrics aggregated in one view
Pitfalls encountered with RocketMQ in production and their solutions
Action guide
We will always run into pits; when we do, we fill them in.
1. CPU spikes on the RocketMQ cluster
Problem Description
The RocketMQ slave and master nodes frequently showed high CPU usage with obvious spikes, and the slave nodes often simply crashed.
Only the system log contained error messages:
2020-03-16T17:56:07.505715+08:00 VECS0xxxx kernel: [] ? __alloc_pages_nodemask+0x7e1/0x960
2020-03-16T17:56:07.505717+08:00 VECS0xxxx kernel: java: page allocation failure. order:0, mode:0x20
2020-03-16T17:56:07.505719+08:00 VECS0xxxx kernel: Pid: 12845, comm: java Not tainted 2.6.32-754.17.1.el6.x86_64 #1
2020-03-16T17:56:07.505721+08:00 VECS0xxxx kernel: Call Trace:
2020-03-16T17:56:07.505724+08:00 VECS0xxxx kernel: [] ? __alloc_pages_nodemask+0x7e1/0x960
2020-03-16T17:56:07.505726+08:00 VECS0xxxx kernel: [] ? dev_queue_xmit+0xd0/0x360
2020-03-16T17:56:07.505729+08:00 VECS0xxxx kernel: [] ? ip_finish_output+0x192/0x380
2020-03-16T17:56:07.505732+08:00 VECS0xxxx kernel: [] ?
Tuning various system parameters only mitigated the problem and could not eliminate it; the CPU spikes still exceeded 50%.
Solution
We upgraded all machines in the cluster from CentOS 6 to CentOS 7 and the kernel from 2.6 to 3.10, and the CPU spikes disappeared.
2. Delayed messages failing on an online RocketMQ cluster
Problem Description
RocketMQ Community Edition supports 18 delay levels by default, and each level is consumed precisely at the scheduled time. I had even specifically tested whether the consumption intervals were accurate, and the results were spot on. Yet this accurate feature still ran into trouble: a business engineer reported that delayed messages on one online cluster were not being consumed at all. Strange!
Solution
Move "delayOffset.json" and "consumequeue / SCHEDULE\_TOPIC\_XXXX" to other directories, which is equivalent to deleting; restart the broker nodes one by one. After the restart, after verification, the delayed message function is sent and consumed normally.
Building a microservice high-availability governance platform
Design guide
Which are our core services and which are our non-core services? That is the primary question of service governance.
Design goals
Services must be able to cope with sudden surges in traffic, and in particular the smooth operation of core services must be guaranteed.
Application classification and group deployment
Application classification
Applications are divided into four levels along the two dimensions of user impact and business impact.
- Business impact: the scope of the business affected when the application fails
- User impact: the number of users affected when the app fails
S1: Core products. A failure makes the product unusable for external users or causes large financial losses, for example the core links of the main businesses (bicycle and moped lock/unlock, hitch ride publishing and order taking) and the applications their core links strongly depend on.
S2: Does not directly affect transactions, but is related to maintaining important configurations of front-office business or to back-office business processing functions.
S3: A failure has very little impact on users or core product logic and does not affect the main business, or the service belongs to a small new business; also important tools for internal users that do not directly affect the business, where the related management functions have little impact on front-office business.
S4: Systems for internal users that do not directly affect the business, or systems planned to be taken offline.
Group deployment
S1 services are the company's core services and the key objects of protection; they must not be accidentally impacted by traffic from non-core services.
- S1 services are deployed in groups, split into two environments: Stable and Standalone
- Traffic from non-core services calling S1 services is routed to the Standalone environment
- Calls from S1 services to non-core services must be configured with a circuit-breaker strategy (a sketch follows this list)
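As one possible shape for such a circuit-breaker strategy (shown here with Sentinel, which is only an illustrative choice; the resource name and thresholds are assumptions), a call from an S1 service to a non-core service can be wrapped and degraded when the callee slows down:

```java
import java.util.Collections;

import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;

/**
 * Sketch of a circuit-breaker around an S1 -> non-core call, expressed with
 * Sentinel rules. Resource name and thresholds are assumptions for illustration.
 */
public class NonCoreCallProtection {

    public static void initRule() {
        DegradeRule rule = new DegradeRule();
        rule.setResource("queryNonCoreService");
        rule.setGrade(RuleConstant.DEGRADE_GRADE_RT);   // break the circuit on slow calls
        rule.setCount(200);                             // RT threshold in ms (assumed)
        rule.setTimeWindow(10);                         // stay open for 10 s before probing again
        DegradeRuleManager.loadRules(Collections.singletonList(rule));
    }

    public static String callNonCoreService() {
        Entry entry = null;
        try {
            entry = SphU.entry("queryNonCoreService");
            // ... real RPC to the non-core service goes here ...
            return "real result";
        } catch (BlockException ex) {
            // circuit is open: return a degraded default so the S1 flow keeps working
            return "fallback result";
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }
}
```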
Building rate-limiting and circuit-breaking capabilities
The high-availability platform capabilities we built
Some rate-limiting effect diagrams (a configuration sketch follows the list)
- Warm-up
- Queueing (waiting in line)
- Warm-up + queueing
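For reference, the three behaviors in these diagrams map naturally onto flow-control rules; the sketch below expresses them with Sentinel-style rules, with resource names and numbers invented for illustration:

```java
import java.util.Arrays;

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

/**
 * Sketch of the three behaviors in the diagrams as Sentinel-style flow rules.
 * Resource names and numbers are invented for illustration only.
 */
public class FlowControlBehaviors {

    public static void loadRules() {
        // 1) Warm-up: let the allowed QPS climb gradually to 1,000 over 30 s.
        FlowRule warmUp = new FlowRule();
        warmUp.setResource("bike-unlock-api");
        warmUp.setGrade(RuleConstant.FLOW_GRADE_QPS);
        warmUp.setCount(1000);
        warmUp.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_WARM_UP);
        warmUp.setWarmUpPeriodSec(30);

        // 2) Queueing: pace requests instead of rejecting, queue up to 500 ms.
        FlowRule queueing = new FlowRule();
        queueing.setResource("order-create-api");
        queueing.setGrade(RuleConstant.FLOW_GRADE_QPS);
        queueing.setCount(800);
        queueing.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_RATE_LIMITER);
        queueing.setMaxQueueingTimeMs(500);

        // 3) Warm-up + queueing: combine both behaviors on one resource.
        FlowRule both = new FlowRule();
        both.setResource("payment-api");
        both.setGrade(RuleConstant.FLOW_GRADE_QPS);
        both.setCount(1000);
        both.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_WARM_UP_RATE_LIMITER);
        both.setWarmUpPeriodSec(30);
        both.setMaxQueueingTimeMs(500);

        FlowRuleManager.loadRules(Arrays.asList(warmUp, queueing, both));
    }
}
```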
High-availability platform diagrams
- All middleware is integrated
- Dynamic configuration takes effect in real time
- Detailed traffic for each resource and each IP node
Summary
- Which are our key metrics and which are secondary? That is the primary question of message governance.
- Which are our core services and which are non-core? That is the primary question of service governance.
- Reading source code and hands-on practice are a better way to work and learn.