Author: Wen Ting

Introduction: This article describes best practices for using RocketMQ's observability tools in an online production environment. RocketMQ's observability capabilities lead comparable products in the industry. Its Dashboard and message trace features safeguard core business links and handle scenarios such as capacity planning, troubleshooting of message sending and receiving, and custom monitoring in large-scale production use.

Introduction to Message Queuing

Before getting to the main topic, here is a brief introduction to Alibaba Cloud's message queue products.

Alibaba Cloud provides a rich family of messaging products. The product matrix covers business scenarios such as the Internet, big data, and the Internet of Things, and gives cloud customers a choice of messaging solutions along multiple dimensions. Whichever message queue product is used, its core purpose is to help users make their businesses and systems asynchronous and decoupled, and to smooth traffic peaks through peak shaving and valley filling. All of these products are distributed and offer high throughput, low latency, and high scalability.

Different messaging products nevertheless have different emphases in customer-facing business applications. Simply put, the message queue RocketMQ is the preferred message channel in the business field; Kafka is an indispensable messaging product in the big data field; MQTT is the messaging solution in the IoT field; RabbitMQ focuses on the traditional business messaging field; cloud-native product integration and the event stream channel are handled by the message queue MNS; finally, the event bus EventBridge is the event hub on Alibaba Cloud, which builds an event center in a unified way.

This article focuses on the preferred message channel in the business field: the message queue RocketMQ. RocketMQ was born in Alibaba's e-commerce system. It offers high performance, low latency, and peak shaving and valley filling, and provides a rich set of features for handling instantaneous traffic peaks in business messaging scenarios. It is integrated into users' core business links.

As a messaging product on core business links, RocketMQ needs very strong observability. Users rely on observability to detect and locate abnormal fluctuations in time, and to troubleshoot problems with specific business data. As a result, observability has gradually become one of the core capabilities of the message queue RocketMQ.

So what is observability? The following is a brief introduction to the observability capability.

Observability

When it comes to observability, you may first think of the three elements of observability: Metrics, Tracing, and Logging.

Applied to message queues, the three elements of observability are explained in detail as follows:

Metrics: Dashboard

1) Rich indicator coverage: includes indicators such as message volume, accumulation volume, and the time consumed at each stage. Each indicator is aggregated and displayed along the dimensions of instance, topic, and consumer GroupID;

2) Best-practice templates from the messaging team: provide users with ready-made templates that, especially in complex message consumption scenarios, offer rich indicators to help locate problems quickly, and that are continuously updated;

3) Prometheus + Grafana: data is in the standard Prometheus format and displayed with Grafana. In addition to the templates, users can customize their own display panels.
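The multi-dimensional aggregation mentioned in 1) can be sketched as follows. This is only an illustration under stated assumptions, not RocketMQ's implementation: the sample records, the `consumed` field, and the `aggregate` helper are all hypothetical names made up for the example.

```python
from collections import defaultdict

# Hypothetical samples, one per (instance, topic, group); all values are made up.
samples = [
    {"instance": "inst-a", "topic": "orders", "group": "GID_billing", "consumed": 120},
    {"instance": "inst-a", "topic": "orders", "group": "GID_audit", "consumed": 80},
    {"instance": "inst-a", "topic": "refunds", "group": "GID_billing", "consumed": 30},
    {"instance": "inst-b", "topic": "orders", "group": "GID_billing", "consumed": 50},
]


def aggregate(records, dimension, field="consumed"):
    """Sum one numeric field over all records that share a dimension value."""
    totals = defaultdict(int)
    for record in records:
        totals[record[dimension]] += record[field]
    return dict(totals)


# The same indicator viewed along the three dimensions named in the text.
by_instance = aggregate(samples, "instance")
by_topic = aggregate(samples, "topic")
by_group = aggregate(samples, "group")
print(by_instance, by_topic, by_group)
```

In a Prometheus and Grafana setup this grouping would normally be expressed as a query over labels rather than in application code; the sketch only shows what "aggregated along a dimension" means.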

Tracing: message trace

1) OpenTelemetry tracing standard: the RocketMQ tracing standard has been merged into the OpenTelemetry open source standard, which standardizes and enriches the definitions of messaging tracing scenarios;

2) Display customized for the messaging field: request span data is abstracted along the message dimension, so that one-to-many consumption and repeated consumption of a message are displayed in an intuitive, easy-to-understand way;

3) Upstream and downstream trace links can be connected: a message trace can inherit the calling context and be added to the complete call chain, so the message link information connects the upstream and downstream of an asynchronous link.

Logging: client log standardization

1) Error Code standardization: different errors have unique error codes;

2) Complete Error Message: contains the complete error information and the resource information required for troubleshooting;

3) Error Level standardization: the log levels of the various error messages are refined, so users can configure monitoring and alerts appropriately according to levels such as Error and Warn.

With the basic concepts of message queues and observability in place, let's see what the message queue RocketMQ offers when it comes to observability.

Introduction to the concept of RocketMQ's observability tools

As the introduction above shows, RocketMQ's observability capabilities help users troubleshoot problems in the production and consumption of messages based on error information. To make the features easier to understand, here is a brief introduction to some concepts in the message production and consumption process.

Concepts in the message production and consumption process

First, let's clarify the following concepts:

  • Topic: the message topic, a first-level message type used to classify messages;
  • Message: the carrier of the information transmitted in the message queue;
  • Broker: the message relay role, responsible for storing and forwarding messages;
  • Producer: the message producer, also known as the message publisher, responsible for producing and sending messages;
  • Consumer: the message consumer, also known as the message subscriber, responsible for receiving and consuming messages.

Put simply, the producer sends messages to the topic's MessageQueues for storage, and consumers then consume the messages in these MessageQueues; there may be multiple consumers. So what does the complete life cycle of a message, from production to consumption, look like?

Here we take a timing message as an example. The message sent by the producer reaches the MQ server after a certain sending latency, and MQ stores it in a MessageQueue; this is the storage time in the queue. Because it is a timing message, it can be consumed only after the configured timing interval has elapsed; that moment is the message's ready time. Once the message is ready, the consumer pulls it from the MessageQueue, and it reaches the consumer client after some network latency. It is not processed immediately on arrival: it first waits for a consumer thread to become available, and the actual business processing starts only once a thread has been obtained.
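The life cycle of a timing message can be broken into per-stage durations. The sketch below assumes that a timestamp is available for each step; the `MessageTimeline` fields and the stage names are hypothetical labels chosen for illustration, not fields of any RocketMQ API.

```python
from dataclasses import dataclass


@dataclass
class MessageTimeline:
    """Hypothetical timestamps, in milliseconds, for one timing message."""
    sent_at: int           # producer sends the message
    stored_at: int         # MQ server stores it in the MessageQueue
    ready_at: int          # timing interval has elapsed; message is ready
    pulled_at: int         # consumer pulls the message from the MessageQueue
    arrived_at: int        # message reaches the consumer client
    process_start_at: int  # a consumer thread starts business processing
    acked_at: int          # processing finished and ack returned


def stage_durations(t: MessageTimeline) -> dict:
    """Split the end-to-end latency into consecutive stages."""
    return {
        "send": t.stored_at - t.sent_at,
        "timing_wait": t.ready_at - t.stored_at,
        "queue_wait": t.pulled_at - t.ready_at,
        "network": t.arrived_at - t.pulled_at,
        "thread_wait": t.process_start_at - t.arrived_at,
        "processing": t.acked_at - t.process_start_at,
    }


# Made-up example: a message with a 10-second timing interval.
timeline = MessageTimeline(0, 5, 10_005, 10_020, 10_023, 10_050, 10_250)
durations = stage_durations(timeline)
for stage, ms in durations.items():
    print(f"{stage}: {ms} ms")
```

Because the stages are consecutive, their durations add up to the time between sending and ack, which makes it easy to see which stage dominates the end-to-end latency.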

Business processing of a message takes a certain amount of time, and the ack result is returned to the server only after it completes. In the entire production and consumption process, consumption is the most complicated part: because of consumption time and other factors, messages often accumulate. Let's focus on the meaning of each indicator in the message accumulation scenario.

Message accumulation scenario

As shown in the figure above, the gray messages in the message queue are completed messages, that is, messages that the consumer has processed and acknowledged. The orange messages have been pulled to the consumer client and are being processed, but their processing results have not yet been returned; a very important indicator for these messages is the message processing time. Finally, the green messages have been stored in the MQ queue and are available for consumers to consume; these are called ready messages.

Ready messages:

Meaning: The number of ready messages.

Purpose: This number reflects how many messages have not yet been consumed. When consumers are abnormal, the number of ready messages grows.

Message queue time (Queue time)

Meaning: The difference between the current time and the ready time of the earliest ready message.

Purpose: This duration reflects how long unprocessed messages have been delayed, and is a very important metric for time-sensitive services.
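The two accumulation indicators can be computed from a simple model of the queue. The following is a minimal sketch of the definitions above, assuming each message records its ready time and whether it has been pulled; the `QueuedMessage` type and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueuedMessage:
    ready_at: float       # time (seconds) at which the message became ready
    pulled: bool = False  # True once a consumer has pulled it (in flight or acked)


def ready_messages(queue, now):
    """Messages that are available for consumption but not yet pulled."""
    return [m for m in queue if not m.pulled and m.ready_at <= now]


def queue_time(queue, now):
    """Current time minus the ready time of the earliest ready message."""
    ready = ready_messages(queue, now)
    if not ready:
        return 0.0
    return now - min(m.ready_at for m in ready)


# Made-up queue state.
queue = [
    QueuedMessage(ready_at=100.0, pulled=True),  # completed or in flight
    QueuedMessage(ready_at=105.0, pulled=True),
    QueuedMessage(ready_at=110.0),
    QueuedMessage(ready_at=118.0),
    QueuedMessage(ready_at=130.0),               # timing message, not ready yet
]
now = 120.0
ready_count = len(ready_messages(queue, now))
oldest_wait = queue_time(queue, now)
print(ready_count, oldest_wait)
```

A growing `ready_count` with a stable `oldest_wait` suggests a traffic burst that consumers are keeping up with, while a growing `oldest_wait` suggests that consumers are falling behind.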

Introduction to the functionality of RocketMQ's observability tool

Building on the RocketMQ observability concepts introduced above, the following describes the two core features of RocketMQ's observability tools.

Observability feature introduction - Dashboard

Dashboard lets you view indicator data filtered by various parameters. The main indicator data falls into the following three categories:

1) Overview:

  • View the total number of messages sent and received, TPS, and the distribution of message types by instance.
  • View the current distribution and ranking of each indicator: the topics that send the most messages, the GroupIDs that consume the most messages, the GroupIDs with the most accumulated messages, and the GroupIDs with the longest queue time.

2) Topic (message sending):

  • View the curve of the number of messages sent to the specified topic.
  • View the sending success rate curve of the specified topic.
  • View the sending time curve of the specified topic.

3) GroupID (message consumption):

  • View the message volume curve of the specified group subscribed to the specified topic.
  • View the consumption success rate of the specified group subscribed to the specified topic.
  • View indicators such as the consumption time of the specified group subscribed to the specified topic.
  • View the message accumulation indicators of the specified group subscribed to the specified topic.

Observability feature introduction - message trace

For Tracing, RocketMQ provides the message trace feature, which mainly includes the following three capabilities:

1) Convenient query: traces can be queried by the basic information of a message; in the second phase, the results can also be filtered by result status and elapsed time, so that only the relevant traces remain and the problem can be located quickly.

2) Detailed trace information: in addition to the timestamp and elapsed time of each life cycle stage, a trace includes the account and machine information of the producers and consumers.

3) Optimized display: traces are displayed differently for different message types and for scenarios in which a message is consumed multiple times.
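Filtering traces by result status and elapsed time, as described in 1), amounts to a simple predicate over trace records. The sketch below is an assumption-laden illustration: `TraceRecord`, its fields, and the `"SUCCESS"`/`"FAILED"` labels are hypothetical and do not reflect the actual trace schema or console query interface.

```python
from dataclasses import dataclass


@dataclass
class TraceRecord:
    message_id: str
    consume_status: str   # hypothetical labels: "SUCCESS" or "FAILED"
    consume_cost_ms: int  # elapsed consumption time in milliseconds


def filter_traces(records, status=None, min_cost_ms=0):
    """Keep traces matching the status (if given) and at least min_cost_ms long."""
    return [
        r for r in records
        if (status is None or r.consume_status == status)
        and r.consume_cost_ms >= min_cost_ms
    ]


# Made-up trace data.
records = [
    TraceRecord("msg-1", "SUCCESS", 35),
    TraceRecord("msg-2", "FAILED", 1200),
    TraceRecord("msg-3", "SUCCESS", 900),
    TraceRecord("msg-4", "FAILED", 40),
]

slow = filter_traces(records, min_cost_ms=500)
slow_failures = filter_traces(records, status="FAILED", min_cost_ms=500)
print([r.message_id for r in slow], [r.message_id for r in slow_failures])
```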


Best Practices

Scenario 1: Troubleshooting

1) Goal: message production and consumption health

2) Principle

  • First-level indicators: indicators used for alerting; widely accepted indicators that nobody disputes.
  • Second-level indicators: when a first-level indicator changes, viewing the second-level indicators helps quickly locate the cause of the problem.
  • Third-level indicators: used to locate the causes of fluctuations in the second-level indicators; add them according to the characteristics and experience of each business.
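The tiered principle can be sketched as a drill-down: alert on a first-level indicator, then report which of its second-level indicators are out of range. Everything concrete here is an assumption made for illustration; the indicator names, their assignment to tiers, and the thresholds are hypothetical and are not taken from the article.

```python
# Hypothetical tiering: which second-level indicators explain a first-level one.
DRILL_DOWN = {
    "queue_time_s": ["ready_messages", "consume_cost_ms", "consume_success_rate"],
}

# Hypothetical thresholds: ("max", x) alerts above x, ("min", x) alerts below x.
THRESHOLDS = {
    "queue_time_s": ("max", 60),
    "ready_messages": ("max", 10_000),
    "consume_cost_ms": ("max", 500),
    "consume_success_rate": ("min", 0.99),
}


def breached(name, value):
    """Return True if the value violates the indicator's threshold."""
    kind, limit = THRESHOLDS[name]
    return value > limit if kind == "max" else value < limit


def diagnose(metrics):
    """Map each breached first-level indicator to its breached second-level ones."""
    findings = {}
    for first, seconds in DRILL_DOWN.items():
        if breached(first, metrics[first]):
            findings[first] = [s for s in seconds if breached(s, metrics[s])]
    return findings


unhealthy = {
    "queue_time_s": 180,
    "ready_messages": 25_000,
    "consume_cost_ms": 1_200,
    "consume_success_rate": 0.999,
}
healthy = {
    "queue_time_s": 2,
    "ready_messages": 40,
    "consume_cost_ms": 30,
    "consume_success_rate": 1.0,
}
print(diagnose(unhealthy), diagnose(healthy))
```

Third-level indicators would extend `DRILL_DOWN` with one more layer, keyed by second-level indicator and filled in from each business's own experience.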

Based on the goals and principles, the troubleshooting and analysis methods for producer users and consumer users are as follows:

Scenario 2: Capacity Planning

In the capacity planning scenario, the following three questions need to be answered:

1) Question 1: How to evaluate instance capacity?

Solution:

  • On the instance details page, view the statistics of the specified instance to see the peak TPS of messages sent and received within the selected time period.
  • For Platinum Edition instances, alert monitoring can be added based on this data to help assess the business.
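Evaluating capacity from the peak TPS can be sketched as follows, assuming the statistics are available as (timestamp, TPS) samples. The sample values and the 5000 TPS instance capacity are made up for the example, and `peak_tps` and `headroom` are hypothetical helpers.

```python
def peak_tps(samples, start, end):
    """Highest TPS among samples whose timestamp lies in [start, end]."""
    return max((tps for ts, tps in samples if start <= ts <= end), default=0)


def headroom(peak, capacity_tps):
    """Fraction of the instance capacity still unused at the observed peak."""
    return 1 - peak / capacity_tps


# Made-up (timestamp in seconds, TPS) samples.
samples = [(0, 800), (60, 1500), (120, 4200), (180, 3900), (240, 900)]

peak = peak_tps(samples, start=60, end=180)
remaining = headroom(peak, capacity_tps=5000)  # hypothetical instance capacity
print(peak, round(remaining, 2))
```

An alert rule could then fire when `remaining` drops below a chosen margin, which signals that the instance specification should be reviewed.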

2) Question 2: How to check the usage of Standard Edition instances?

Solution:

  • View the total message volume module in the overview.

3) Question 3: Which resources have gone offline and need to be cleaned up?

Solution:

  • Over a specified period (for example, the past week), sort topics by message sending volume in ascending order and check for topics with a sending volume of 0. The businesses related to these topics may have gone offline.
  • Over a specified period (for example, the past week), sort GroupIDs by message consumption volume in ascending order and check for GroupIDs with a consumption volume of 0. The services related to these GroupIDs may have gone offline.
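The cleanup check can be sketched as a sort followed by a filter on zero volume. The topic and GroupID names and their weekly volumes below are made-up sample data, and the helper names are hypothetical.

```python
def rank_by_volume(volumes):
    """Sort (name, volume) pairs from the smallest volume to the largest."""
    return sorted(volumes.items(), key=lambda item: item[1])


def idle(volumes):
    """Names whose volume over the period is 0: candidates for cleanup."""
    return [name for name, volume in rank_by_volume(volumes) if volume == 0]


# Made-up volumes over the past week.
topic_sent = {"orders": 182_000, "legacy_sync": 0, "refunds": 4_100, "old_promo": 0}
group_consumed = {"GID_billing": 180_500, "GID_old_report": 0}

idle_topics = idle(topic_sent)
idle_groups = idle(group_consumed)
print(idle_topics, idle_groups)
```

A volume of 0 only suggests that the related business may be offline; low-frequency businesses can also be silent for a week, so the candidates should be confirmed with their owners before cleanup.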

Scenario 3: Business Planning

In the business planning scenario, the following three problems are mainly solved:

1) Question 1: How to check the business peak distribution?

Solution:

  • View the daily peak hours of the messages received by a topic.
  • Compare the volume of messages received by the topic on weekends and on weekdays.
  • View how the volume of messages received by the topic changes during holidays.
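The first two checks can be sketched from timestamped message counts. This is an illustration only: the sample data is made up, the helper names are hypothetical, and `peak_hour` assumes a non-empty list of samples.

```python
from collections import defaultdict
from datetime import datetime


def peak_hour(samples):
    """Hour of day (0-23) with the highest total received volume."""
    totals = defaultdict(int)
    for ts, count in samples:
        totals[ts.hour] += count
    return max(totals, key=totals.get)


def weekday_weekend_average(samples):
    """Average daily volume on weekdays and on weekends, as a pair."""
    daily = defaultdict(int)
    for ts, count in samples:
        daily[ts.date()] += count
    weekdays = [v for day, v in daily.items() if day.weekday() < 5]
    weekends = [v for day, v in daily.items() if day.weekday() >= 5]

    def average(values):
        return sum(values) / len(values) if values else 0.0

    return average(weekdays), average(weekends)


# Made-up samples: 2024-01-05 is a Friday, 2024-01-06 is a Saturday.
samples = [
    (datetime(2024, 1, 5, 10), 100),
    (datetime(2024, 1, 5, 20), 300),
    (datetime(2024, 1, 6, 9), 50),
    (datetime(2024, 1, 6, 20), 150),
]

busiest_hour = peak_hour(samples)
weekday_avg, weekend_avg = weekday_weekend_average(samples)
print(busiest_hour, weekday_avg, weekend_avg)
```

The holiday comparison works the same way, with the days grouped by a holiday calendar instead of by day of the week.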

2) Question 2: How to judge which businesses are currently on the rise?

Solution:

  • View the message volume to help determine the change trend of the business volume.

3) Question 3: How to optimize consumer system performance?

Solution:

  • Check the message processing time, and determine whether it is within a reasonable range and whether there is room for improvement.

Through an introduction to message queues, observability, RocketMQ's observability concepts and features, and best practices, this article has presented the visualization capabilities of RocketMQ's observability tools on core business links. We hope it is of some help in troubleshooting and O&M.


