Author|Liang Yong
Background
Hello (Haro) has evolved into a mobile travel platform that combines two-wheel mobility (Hello Bike, Hello Moped, Hello Electric Bike, Xiaoha battery swapping) and four-wheel mobility (Hello Hitch, ride-hailing, Hello Taxi), and is also exploring local-life services such as hotels and in-store group buying.
As the company's business keeps growing, so does its traffic. We found that major production incidents are often triggered by sudden traffic spikes, which makes governing and protecting traffic, and ensuring the high availability of the system, particularly important.
This article shares the pitfalls Hello has stepped into and the experience it has accumulated in governing message traffic and microservice calls.
About the author
Liang Yong (Lao Liang) is co-author of the column "RocketMQ in Practice and Advanced Topics" and participated in reviewing "RocketMQ Technical Insider". He has been a speaker at the ArchSummit Global Architect Summit and the QCon Case Study Club.
He currently focuses on back-end middleware and has published more than 100 source-code analysis articles on the WeChat official account [Gua Nong Lao Liang], covering the RocketMQ, Kafka, gRPC, Nacos, Sentinel, and Java NIO series. He works at Hello as a senior technical expert.
Talk about governance
Before we start, let’s talk about governance. Here is Lao Liang’s personal understanding:
What is governance trying to do?
- Make our environment better
How do we know what is not good enough?
- Past experience
- Customer feedback
- Industry comparison
How do we know whether it stays good?
- Monitoring and tracking
- Alert notification
How do we make what is not good better?
- Governance measures
- Emergency plan
Contents
- Building a distributed message governance platform
- Pitfalls encountered with RocketMQ in production and how we solved them
- Building a microservice high-availability governance platform
Background
RabbitMQ running without governance
The company previously used RabbitMQ and ran into the following pain points, many of which stem from the flow control of the RabbitMQ cluster.
- Should an excessive backlog be purged or not? That is a question we had to agonize over every time.
- An excessive backlog triggers cluster flow control? That really hurts the business.
- Want to re-consume data from two days ago? Sorry, please resend it.
- Which services are connected? Hold on, I have to go look up the IPs first.
- Are there risks such as large messages? I can only guess.
Services running without governance
We once had a failure where multiple businesses shared one database. During an evening peak, traffic surged and the database was knocked down.
- Upgrading the single-node database to the highest specification still did not solve the problem
- It slowly recovered after a restart, only to be knocked down again a while later
- And so it went, round after round, as we suffered and silently waited for the peak to pass
Reflection: both messages and services need sound governance measures
Building a distributed message governance platform
Design guide
Which are our key metrics and which are secondary? That is the primary question of message governance.
Design goals
The platform is designed to shield the complexity of the underlying middleware (RocketMQ / Kafka) and to route messages dynamically by a unique identifier. At the same time, it is built as an integrated message governance platform that brings together resource management and control, retrieval, monitoring, alerting, inspection, disaster recovery, and visual operations, ensuring that the message middleware runs smoothly and healthily.
Points to consider in the design of the message governance platform
- Provide a simple and easy-to-use API
- What are the key metrics for judging whether client usage carries risks?
- What are the key metrics for measuring the health of the cluster?
- What are the common user and O&M operations that should be visualized?
- What measures are available to deal with the unhealthy cases found?
As simple and easy to use as possible
Design guide
Being able to make complicated things simple is real ability.
Minimalist unified API
We provide a unified SDK that encapsulates both message middlewares (Kafka / RocketMQ). A minimal sketch of what such an interface might look like is shown below.
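As an illustration only (the interface and type names below are hypothetical, not the actual Hello SDK), a unified producer facade might look like this: business code only sees a logical topic identifier, and the implementation decides whether a message is routed to RocketMQ or Kafka.

```java
/**
 * Hypothetical sketch of a unified producer facade. Business code only sees a
 * logical topic identifier; the implementation resolves it to a RocketMQ or
 * Kafka cluster and routes the message there.
 */
public interface UnifiedProducer {

    /** Vendor-neutral result of a send. */
    final class SendResult {
        public final boolean success;
        public final String msgId;

        public SendResult(boolean success, String msgId) {
            this.success = success;
            this.msgId = msgId;
        }
    }

    /** Synchronous send by logical topic identifier. */
    SendResult send(String topicId, byte[] payload);

    /** Asynchronous send with a vendor-neutral callback. */
    void sendAsync(String topicId, byte[] payload, java.util.function.Consumer<SendResult> callback);
}
```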
Apply once
Automatic creation of topics and consumer groups is not suitable for production: it leads to loss of control and is harmful to full life-cycle management and cluster stability. The application process must be controlled, but it should be kept as simple as possible, for example: a single application takes effect in every environment and automatically generates the associated alarm rules.
Client governance
Design guide
Monitor whether clients are used according to the standards, and find appropriate governance measures
Scenario replay
Scenario 1: Instantaneous traffic and cluster flow control
Suppose the cluster is currently handling 10,000 TPS and the rate suddenly jumps to 20,000 or more; such a steep increase in traffic is very likely to trigger cluster flow control. For this kind of scenario, the client's sending speed needs to be monitored, and once the speed and the steep-increase threshold are reached, sending is made smoother. A minimal sketch of this idea follows.
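As a rough sketch of the smoothing idea (not the platform's actual implementation; the 10,000 permits/s cap and the 10-second warm-up window are assumed values), a warm-up style rate limiter such as Guava's RateLimiter can keep a sudden burst from hitting the cluster all at once:

```java
import java.util.concurrent.TimeUnit;

import com.google.common.util.concurrent.RateLimiter;

/**
 * Illustrative smoothing on the sending side: a warm-up rate limiter ramps the
 * allowed send rate up gradually instead of letting a burst hit the cluster at
 * full speed. The cap (10,000 permits/s) and the 10 s warm-up are assumptions.
 */
public class SmoothedSender {

    // At most ~10,000 sends per second, reached gradually over a 10 s warm-up.
    private final RateLimiter limiter = RateLimiter.create(10_000, 10, TimeUnit.SECONDS);

    public void send(byte[] payload) {
        limiter.acquire();   // blocks briefly when the burst is too steep
        doSend(payload);     // delegate to the real RocketMQ / Kafka producer
    }

    private void doSend(byte[] payload) {
        // ... actual middleware send call goes here ...
    }
}
```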
Scenario 2: Large messages and cluster jitter
When a client sends large messages, for example several hundred KB or even several MB, it may cause long I/O times and cluster jitter. To govern this scenario, the size of sent messages needs to be monitored. Through after-the-fact inspection we identify services that send large messages and push their owners to compress or restructure them, keeping messages within 10 KB. One possible client-side guard is sketched below.
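A minimal sketch of such a client-side guard, assuming a 10 KB threshold and gzip compression (the helper class is made up; the real SDK may handle this differently):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

/**
 * Rough sketch of a client-side guard for large messages: record the raw size
 * for inspection and gzip-compress bodies above the threshold. The 10 KB limit
 * matches the goal in the text; the class itself is hypothetical.
 */
public final class MessageSizeGuard {

    private static final int THRESHOLD_BYTES = 10 * 1024; // target upper bound: 10 KB

    public static byte[] prepareBody(byte[] body) throws IOException {
        // a real SDK would also report body.length to the monitoring system here
        if (body.length <= THRESHOLD_BYTES) {
            return body;
        }
        // compress oversized bodies; the consumer side must decompress accordingly
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(body);
        }
        return out.toByteArray();
    }
}
```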
Scenario 3: Client version too low
As features iterate, the SDK version is upgraded, and version changes may introduce risks besides new functionality. Using a version that is too low means, first, that new features are not supported and, second, that there may be security risks. To understand how the SDK is used, each client reports its SDK version, and owners are pushed to upgrade through inspections.
Scenario 4: Consumer traffic removal and recovery
Consumer traffic removal and recovery is usually needed in the following cases: first, traffic must be removed when an application is being released; second, traffic must be removed before troubleshooting when locating a problem. To support these scenarios, the client needs to listen for removal/recovery events and pause or resume consumption accordingly, as in the sketch below.
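A sketch of this idea using RocketMQ's DefaultMQPushConsumer: the event-listener wrapper below is hypothetical, while suspend() and resume() are the client methods used here to pause and resume consumption without shutting the consumer down.

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;

/**
 * Sketch of traffic removal/recovery on the consumer side. The event-listener
 * wrapper is hypothetical; suspend() and resume() pause and resume consumption
 * without shutting the client down.
 */
public class TrafficSwitchListener {

    private final DefaultMQPushConsumer consumer;

    public TrafficSwitchListener(DefaultMQPushConsumer consumer) {
        this.consumer = consumer;
    }

    /** Called when the platform pushes a "remove traffic" event, e.g. before a release. */
    public void onRemove() {
        consumer.suspend();   // stop consuming; connections stay alive
    }

    /** Called when the platform pushes a "recover traffic" event. */
    public void onRecover() {
        consumer.resume();    // resume consuming
    }
}
```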
Scenario 5: Sending/consumption latency detection
How long does it take to send or consume a message? By monitoring latency, inspections can single out low-performance applications and push them to improve, raising overall performance.
Scenario 6: Improving troubleshooting efficiency
When troubleshooting, it is often necessary to retrieve information about a message's life cycle: what was sent, where it is stored, and when it was consumed. Within a message's life cycle this can be connected through the msgId. In addition, by embedding an rpcId/traceId-like link identifier in the message header, a message can be stitched into a single request chain, as illustrated below.
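For illustration, a traceId can be carried as a user property on the message (the property name and helper class here are assumptions, not the platform's actual convention):

```java
import org.apache.rocketmq.common.message.Message;

/**
 * Minimal illustration of carrying a link identifier in the message header so
 * that a message can be stitched into the request chain. The property name and
 * helper class are assumptions for this example.
 */
public final class TraceHeaderSupport {

    private static final String TRACE_ID_KEY = "traceId";

    /** Producer side: copy the current request's traceId into a user property. */
    public static void inject(Message message, String currentTraceId) {
        message.putUserProperty(TRACE_ID_KEY, currentTraceId);
    }

    /** Consumer side: read it back so consumption logs join the same chain. */
    public static String extract(Message message) {
        return message.getUserProperty(TRACE_ID_KEY);
    }
}
```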
Governance measures
Required monitoring information
- Sending/consumption speed
- Sending/consumption latency
- Message size
- Node information
- Link ID
- Version information
Common governance measures
- Regular inspections: with the instrumented data, risky applications can be found through inspections, for example sending/consumption latency greater than 800 ms, message size greater than 10 KB, or a version lower than a specific version.
- Sending smoothing: for example, when an instantaneous rate of 10,000 with a steep increase of more than 2x is detected, the burst can be smoothed out by warming up.
- Consumption rate limiting: when a downstream third-party interface needs to be rate-limited, consumption can be throttled; this can be implemented together with the high-availability framework.
- Consumer removal: the consumer client is shut down and restored by listening for removal events.
Topic/consumer group governance
Design guide
Monitor the resource usage of topics and consumer groups
Scenario replay
Scenario 1: The impact of consumption backlog on the business
Some business scenarios are sensitive to consumption backlog while others are not, as long as consumption catches up later. For example, unlocking a bicycle is a matter of seconds, whereas batch-processing scenarios related to information aggregation are insensitive to backlog. By collecting consumption backlog metrics, applications that cross the threshold trigger real-time notifications to the responsible engineers, so they can grasp the consumption situation in real time.
Scenario 2: The impact of sending/consumption speed
Should a sending/consumption speed that drops to zero raise an alarm? In some scenarios the speed must not drop to zero; if it does, the business is abnormal. By collecting speed metrics, applications that cross the threshold trigger real-time alarms.
Scenario 3: Consumer node offline
When a consumer node goes offline, the engineers responsible for the application need to be notified. This requires collecting registered node information so that an alarm can be triggered in real time when a node goes offline.
Scenario 4: Unbalanced sending/consumption
Unbalanced sending or consumption often hurts performance. I remember that during one consultation, an engineer had set the message key to a constant; by default the partition is selected by hashing the key, so all messages went into a single partition, and performance could not be improved no matter what (see the sketch below). In addition, the consumption backlog of each partition should be monitored, with a real-time alarm triggered when the imbalance becomes excessive.
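The pitfall is easy to reproduce with any key-hashing client; the sketch below uses Kafka's producer (topic and key values are invented), and the same reasoning applies to RocketMQ's key-based queue selection:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/**
 * Illustration of the constant-key pitfall with Kafka's client; topic and key
 * values are invented. With a constant key every record hashes to the same
 * partition, so one partition carries all the traffic.
 */
public class PartitionKeyExample {

    public static void send(KafkaProducer<String, String> producer, String orderId, String payload) {
        // BAD: a constant key always hashes to the same partition.
        // new ProducerRecord<>("ride-order-topic", "CONSTANT_KEY", payload);

        // BETTER: key by a field with good cardinality (e.g. the order id),
        // spreading records across partitions while keeping per-key ordering.
        producer.send(new ProducerRecord<>("ride-order-topic", orderId, payload));
    }
}
```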
Governance measures
Required monitoring information
- Sending/consumption speed
- Sending partition details
- Per-partition consumption backlog
- Consumer group backlog
- Registered node information
Common governance measures
- Real-time alarms: real-time alarm notifications for consumption backlog, sending/consumption speed, node offline, and partition imbalance.
- Performance improvement: when consumption cannot keep up with the backlog, it can be improved by increasing pull threads and consumption threads and by adding partitions (see the consumer-tuning sketch after this list).
- Self-service troubleshooting: provide multi-dimensional retrieval tools, such as retrieving a message's life cycle by time range, msgId, or link ID.
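A hedged example of the performance knobs mentioned above, using RocketMQ's push consumer; the numbers are purely illustrative, and adding partitions/queues is a separate, server-side change:

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;

/**
 * Hedged example of the tuning knobs mentioned above on RocketMQ's push
 * consumer. The numbers are illustrative only; adding queues/partitions is a
 * separate, server-side change.
 */
public class ConsumerTuningExample {

    public static DefaultMQPushConsumer buildConsumer() {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("demo_consumer_group");
        consumer.setConsumeThreadMin(20);            // more consumption threads
        consumer.setConsumeThreadMax(40);
        consumer.setPullBatchSize(32);               // pull more messages per request
        consumer.setConsumeMessageBatchMaxSize(4);   // hand several messages to one listener call
        return consumer;
    }
}
```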
Cluster health governance
Design guide
What are the core indicators for measuring cluster health?
Scenario replay
Scenario 1: Cluster health detection
Cluster health detection answers one question: is this cluster in good shape? It is answered by checking the number of cluster nodes, the heartbeat of each node, the cluster write TPS water level, and the cluster consumption TPS water level.
Scenario 2: Cluster stability
Cluster flow control often reflects insufficient cluster performance, and cluster jitter also causes client send timeouts. By collecting the heartbeat latency of each node and the change rate of the cluster write TPS water level, we can tell whether the cluster is stable.
Scenario 3: Cluster high availability
High availability mainly targets extreme scenarios in which an availability zone becomes unavailable, or certain topics and consumer groups on the cluster become abnormal and need targeted measures. For example: cross-availability-zone master-slave deployment within the same city, dynamic migration of topics and consumer groups to a disaster-recovery cluster, and multi-active deployments.
Governance measures
Required monitoring information
- Number of cluster nodes
- Heartbeat latency of cluster nodes
- Cluster write TPS water level
- Cluster consumption TPS water level
- Change rate of cluster write TPS
Common governance measures
- Regular inspection: regular inspection of the cluster TPS water level and hardware water level.
- Disaster recovery measures: cross-availability-zone master-slave deployment within the same city, dynamic migration of topics and consumer groups to disaster-recovery clusters, and remote multi-active deployment.
- Cluster tuning: system version and parameter tuning, cluster parameter tuning.
- Cluster classification: split by business line and by core/non-core services.
Which core metric matters most
If I had to pick the most important of these key metrics, I would choose the heartbeat detection of each node in the cluster, that is, its response time (RT). Let's look at what can affect RT. A simple probe that measures this RT is sketched below.
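One way to obtain this RT, sketched here as an assumption rather than the platform's actual probe, is to periodically send a tiny message to a dedicated probe topic and time how long the broker takes to acknowledge it:

```java
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

/**
 * Sketch of a heartbeat-style RT probe (the probe topic name is an assumption):
 * send a tiny message synchronously and measure how long the broker takes to
 * acknowledge it. Run periodically per node, this yields the RT discussed above.
 */
public class BrokerRtProbe {

    public static long probeOnceMillis(DefaultMQProducer producer) throws Exception {
        Message probe = new Message("CLUSTER_HEARTBEAT_PROBE", "ping".getBytes());
        long start = System.nanoTime();
        producer.send(probe);                                // waits for the broker ack
        return (System.nanoTime() - start) / 1_000_000;      // RT in milliseconds
    }
}
```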
About alerts
- Most monitoring metrics are detected at second-level granularity
- Alarms that hit a threshold are pushed to the company's unified alarm system for real-time notification
- Risk findings from inspections are pushed to the company's inspection system and summarized in a weekly report
Message platform diagrams
Architecture diagram
Dashboard diagram
- Multi-dimensional: cluster dimension, application dimension
- Full aggregation: key metrics aggregated in one view
Pitfalls encountered with RocketMQ in production and their solutions
Action guide
We will always run into pits; when we do, we fill them in.
1. CPU spikes on the RocketMQ cluster
Problem Description
The RocketMQ slave and master nodes frequently showed high CPU usage with obvious spikes, and the slave nodes often simply crashed.
Only the system log contained error messages:
2020-03-16T17:56:07.505715+08:00 VECS0xxxx kernel: [] ? __alloc_pages_nodemask+0x7e1/0x960
2020-03-16T17:56:07.505717+08:00 VECS0xxxx kernel: java: page allocation failure. order:0, mode:0x20
2020-03-16T17:56:07.505719+08:00 VECS0xxxx kernel: Pid: 12845, comm: java Not tainted 2.6.32-754.17.1.el6.x86_64 #1
2020-03-16T17:56:07.505721+08:00 VECS0xxxx kernel: Call Trace:
2020-03-16T17:56:07.505724+08:00 VECS0xxxx kernel: [] ? __alloc_pages_nodemask+0x7e1/0x960
2020-03-16T17:56:07.505726+08:00 VECS0xxxx kernel: [] ? dev_queue_xmit+0xd0/0x360
2020-03-16T17:56:07.505729+08:00 VECS0xxxx kernel: [] ? ip_finish_output+0x192/0x380
2020-03-16T17:56:07.505732+08:00 VECS0xxxx kernel: [] ?
Tuning various system parameters only mitigated the problem and could not eliminate it; the CPU spikes still exceeded 50%.
Solution
We upgraded all machines in the cluster from CentOS 6 to CentOS 7 and the kernel from 2.6 to 3.10, and the CPU spikes disappeared.
2. Delayed messages failing on an online RocketMQ cluster
Problem Description
RocketMQ Community Edition supports 18 delay levels by default, and each level is consumed precisely at the scheduled time. I had even specifically tested whether the consumption intervals were accurate, and the results were spot on. Yet this accurate feature still ran into trouble: a business engineer reported that delayed messages on one online cluster were not being consumed at all. Strange!
Solution
Move "delayOffset.json" and "consumequeue / SCHEDULE\_TOPIC\_XXXX" to other directories, which is equivalent to deleting; restart the broker nodes one by one. After the restart, after verification, the delayed message function is sent and consumed normally.
Building a microservice high-availability governance platform
Design guide
Which are our core services and which are our non-core services? That is the primary question of service governance.
Design goals
Services must be able to cope with sudden surges in traffic, and in particular the smooth operation of core services must be guaranteed.
Application classification and group deployment
Application classification
Applications are divided into four levels along the two dimensions of user impact and business impact.
- Business impact: the scope of the business affected when the application fails
- User impact: the number of users affected when the app fails
S1: Core products. A failure makes the product unusable for external users or causes large financial losses, for example the core links of the main businesses (bicycle and moped lock/unlock, hitch ride publishing and order taking) and the applications their core links strongly depend on.
S2: Does not directly affect transactions, but is related to maintaining important configurations of front-office business or to back-office business processing functions.
S3: A failure has very little impact on users or core product logic and does not affect the main business, or the service belongs to a small new business; also important tools for internal users that do not directly affect the business, where the related management functions have little impact on front-office business.
S4: Systems for internal users that do not directly affect the business, or systems planned to be taken offline.
Group deployment
S1 services are the company's core services and the key objects of protection; they must not be accidentally impacted by traffic from non-core services.
- S1 services are deployed in groups, split into two environments: Stable and Standalone
- Traffic from non-core services calling S1 services is routed to the Standalone environment
- Calls from S1 services to non-core services must be configured with a circuit-breaker strategy (a sketch follows this list)
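As one possible shape for such a circuit-breaker strategy (shown here with Sentinel, which is only an illustrative choice; the resource name and thresholds are assumptions), a call from an S1 service to a non-core service can be wrapped and degraded when the callee slows down:

```java
import java.util.Collections;

import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;

/**
 * Sketch of a circuit-breaker around an S1 -> non-core call, expressed with
 * Sentinel rules. Resource name and thresholds are assumptions for illustration.
 */
public class NonCoreCallProtection {

    public static void initRule() {
        DegradeRule rule = new DegradeRule();
        rule.setResource("queryNonCoreService");
        rule.setGrade(RuleConstant.DEGRADE_GRADE_RT);   // break the circuit on slow calls
        rule.setCount(200);                             // RT threshold in ms (assumed)
        rule.setTimeWindow(10);                         // stay open for 10 s before probing again
        DegradeRuleManager.loadRules(Collections.singletonList(rule));
    }

    public static String callNonCoreService() {
        Entry entry = null;
        try {
            entry = SphU.entry("queryNonCoreService");
            // ... real RPC to the non-core service goes here ...
            return "real result";
        } catch (BlockException ex) {
            // circuit is open: return a degraded default so the S1 flow keeps working
            return "fallback result";
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }
}
```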
Building rate-limiting and circuit-breaking capabilities
The high-availability platform capabilities we built
Some rate-limiting effect diagrams (a configuration sketch follows the list)
- Warm-up
- Queueing (waiting in line)
- Warm-up + queueing
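For reference, the three behaviors in these diagrams map naturally onto flow-control rules; the sketch below expresses them with Sentinel-style rules, with resource names and numbers invented for illustration:

```java
import java.util.Arrays;

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

/**
 * Sketch of the three behaviors in the diagrams as Sentinel-style flow rules.
 * Resource names and numbers are invented for illustration only.
 */
public class FlowControlBehaviors {

    public static void loadRules() {
        // 1) Warm-up: let the allowed QPS climb gradually to 1,000 over 30 s.
        FlowRule warmUp = new FlowRule();
        warmUp.setResource("bike-unlock-api");
        warmUp.setGrade(RuleConstant.FLOW_GRADE_QPS);
        warmUp.setCount(1000);
        warmUp.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_WARM_UP);
        warmUp.setWarmUpPeriodSec(30);

        // 2) Queueing: pace requests instead of rejecting, queue up to 500 ms.
        FlowRule queueing = new FlowRule();
        queueing.setResource("order-create-api");
        queueing.setGrade(RuleConstant.FLOW_GRADE_QPS);
        queueing.setCount(800);
        queueing.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_RATE_LIMITER);
        queueing.setMaxQueueingTimeMs(500);

        // 3) Warm-up + queueing: combine both behaviors on one resource.
        FlowRule both = new FlowRule();
        both.setResource("payment-api");
        both.setGrade(RuleConstant.FLOW_GRADE_QPS);
        both.setCount(1000);
        both.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_WARM_UP_RATE_LIMITER);
        both.setWarmUpPeriodSec(30);
        both.setMaxQueueingTimeMs(500);

        FlowRuleManager.loadRules(Arrays.asList(warmUp, queueing, both));
    }
}
```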
High-availability platform diagrams
- All middleware is integrated
- Dynamic configuration takes effect in real time
- Detailed traffic for each resource and each IP node
Summary
- Which are our key metrics and which are secondary? That is the primary question of message governance.
- Which are our core services and which are non-core? That is the primary question of service governance.
- Reading source code and hands-on practice are a better way to work and learn.