Best practices for large-scale distributed applications on the cloud based on the message queue RocketMQ

Author | Shao
Review & Proofreading: Years, Jiajia
Editing & Typesetting: Wen Yan

Preface

Message queues are a key piece of infrastructure in distributed Internet architectures and play an important role in the following scenarios:

Application decoupling
Peak shaving and valley filling
Asynchronous notification
Distributed transaction
Big data processing

They are also used in interactive live streaming, mobile Internet and Internet of Things, IM real-time communication, cache synchronization, log monitoring, and other fields.

This article focuses on the commercial version of the message queue RocketMQ, compares it with the open source version of RocketMQ, and uses practical scenarios to present best practices for large-scale distributed applications on the cloud.

Core capabilities

Compared with the open source version of RocketMQ and other competing products, the commercial version of the message queue RocketMQ has the following advantages:

Out-of-the-box, feature-rich
High performance, unlimited scalability
Observable and O&M-free
High SLA and stability guarantee

Out-of-the-box, feature-rich

The message queue RocketMQ supports multiple message types, including timed, transactional, and sequential messages, and supports two consumption modes: broadcast and cluster. At the protocol level, it supports both TCP and HTTP, and it provides TAG and SQL attribute filtering. Together these greatly broaden the scenarios in which it can be used.

High performance, unlimited scalability

The message queue RocketMQ has withstood the Double 11 traffic peaks of Alibaba's core e-commerce business over the years. It supports messaging at tens of millions of TPS and the accumulation of billions of messages, and it can guarantee millisecond-level end-to-end latency. It also provides hierarchical storage, which supports storing massive volumes of messages for any length of time.

Observable and O&M-free

The message queue RocketMQ provides an observability dashboard with fine-grained data, tracing and querying across the full life cycle of a message, and monitoring and alerting for a range of metrics. It also provides message backtracking and dead-letter queues, so that users' messages can be consumed again retrospectively at any time.

High SLA and stability guarantee

Stability is an area in which we have invested consistently and continuously. The message queue RocketMQ provides high-availability deployment and multi-replica writes, and it also supports same-city multi-AZ disaster recovery and multi-site active-active deployment across regions.

Product profile

Next, we select several of the core capabilities above and examine them further through concrete scenarios and practices.

Multi-message type support

High-availability sequential messages

The sequential messages used by the commercial version of the message queue RocketMQ are called high-availability sequential messages. Before introducing them, we first briefly introduce the sequential messages of the open source version of RocketMQ.

Sequential messages are divided into two types: global sequential messages and partitioned sequential messages.

Global sequential messages: Only one partition is allocated in the RocketMQ storage layer, so the availability of a global sequential topic is tightly coupled to the availability of a single replica, and the topic cannot scale out.
Partitioned sequential messages: All messages are partitioned by Sharding Key. Messages in the same partition are published and consumed in strict FIFO order. The Sharding Key is the key field that determines which partition a sequential message belongs to.
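The partitioning rule can be illustrated with a minimal sketch. It assumes a CRC32-based hash and in-memory lists standing in for partitions; the function names are hypothetical and are not part of the RocketMQ client API.

```python
import zlib


def select_partition(sharding_key: str, partition_count: int) -> int:
    """Map a Sharding Key to a partition index with a stable hash (assumed scheme)."""
    # zlib.crc32 is deterministic across processes, unlike the built-in hash().
    return zlib.crc32(sharding_key.encode("utf-8")) % partition_count


def publish(partitions, sharding_key, body):
    """Append a message to the partition chosen by its Sharding Key."""
    index = select_partition(sharding_key, len(partitions))
    partitions[index].append((sharding_key, body))
    return index


def steps_for(partitions, sharding_key):
    """Read back the messages of one key in the order they were stored."""
    index = select_partition(sharding_key, len(partitions))
    return [body for key, body in partitions[index] if key == sharding_key]


partitions = [[] for _ in range(4)]
events = [("order-1", "created"), ("order-2", "created"),
          ("order-1", "paid"), ("order-2", "paid"), ("order-1", "shipped")]
for order_id, step in events:
    publish(partitions, order_id, step)

print(steps_for(partitions, "order-1"))  # ['created', 'paid', 'shipped']
```

Because every message with the same order ID lands in the same list, per-order FIFO order is preserved even though different orders may interleave.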

The following figure shows an application scenario for partitioned sequential messages, in which the order ID serves as the Sharding Key.

As can be seen, both global and partitioned sequential messages rely on the natural FIFO property of a single partition to guarantee order, so order is guaranteed only within one partition. When the replica hosting that partition is unavailable, sequential messages cannot be retried against other replicas, and message order is then difficult to guarantee.

In order to solve this problem, we designed and implemented high-availability sequential messages.

Highly available sequential messages have the following characteristics:

A logical sequential partition (PartitionGroup) contains multiple physical partitions.
If any one of the physical partitions is writable, the entire logical partition is writable and ordered.
Following the happened-before principle, we designed a sorting algorithm based on partition offsets.
With this algorithm, a consumer consuming a logical partition pulls messages from each of its physical partitions and merge-sorts them to obtain the correctly ordered message stream.
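The merge step on the consumer side can be sketched as follows. The article does not spell out the actual offset-based sorting algorithm, so this sketch assumes each message carries a hypothetical logical sequence number and that each physical partition is already ordered by it; it shows only the merge-and-sort idea.

```python
import heapq


def merge_logical_partition(physical_partitions):
    """Merge per-partition streams, each sorted by sequence number, into one ordered stream."""
    # heapq.merge performs a lazy k-way merge of already sorted inputs.
    return list(heapq.merge(*physical_partitions, key=lambda message: message[0]))


# Messages are (logical_sequence, body) pairs; the sequence number is an assumption.
# Partition A was briefly unwritable, so messages 3 and 4 were written to partition B.
partition_a = [(1, "created"), (2, "paid"), (5, "delivered")]
partition_b = [(3, "packed"), (4, "shipped")]

ordered = merge_logical_partition([partition_a, partition_b])
print([body for _, body in ordered])
# ['created', 'paid', 'packed', 'shipped', 'delivered']
```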

Through this design, the high-availability sequence message solves the following problems:

Availability: High-availability sequential messages have the same availability as ordinary messages. When a replica is unavailable, sending can quickly be retried against other replicas.
Scalability: Ordinary sequential messages, especially ordinary global sequential messages, do not scale well and are pinned to specific replicas. The logical sequential partition of a high-availability sequential message can spread its physical partitions across multiple replicas.
Hotspots: Ordinary sequential messages hash all messages with the same key into the same partition, so hot keys cause hot partitions. High-availability sequential messages can scale horizontally: more physical partitions can be added to a logical sequential partition to eliminate hotspots.
Single point of failure: An ordinary global sequential message has only a single partition and is therefore highly prone to single points of failure. High-availability sequential messages eliminate this single point.

The hotspot issue deserves special attention. During a sales promotion for an e-commerce business within Alibaba, too many messages with one particular Sharding Key were sent to a sequential topic. One replica in the cluster received a large number of messages for that Sharding Key and exceeded its load limit, which delayed and accumulated messages and affected the business to some extent. After the switch to high-availability sequential messages, load is balanced across multiple physical partitions, which raises the cluster's capacity for sequential messages and avoids hotspots.

Timed messages with second-level precision

A timed message is a message that the client sends now but expects to be received at a certain time in the future. Timed messages are widely used in scheduling systems and business systems. For example, when a payment order is created, a payment message is generated. The system usually needs to process that message after a certain period of time to determine whether the user has paid successfully, and then handle the order accordingly.

The open source version of RocketMQ supports only a few fixed delay levels and does not support timed messages with second-level precision. It cannot meet the diverse needs within the Alibaba Group and on the cloud, so we launched timed messages with second-level precision.

As shown in the figure below, we designed and implemented timed messages with second-level precision on top of a timing wheel. They support any delivery time and have the following characteristics:

Arbitrary delivery time
Long delay durations
Massive numbers of timed messages
Deletion of timed messages
High availability
High performance
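The timing wheel idea can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions (one tick per second, a single wheel, hypothetical class and method names); it is not the commercial implementation and it leaves out persistence, deletion, and high availability.

```python
class TimingWheel:
    """A single-level timing wheel with one slot per second (illustrative only)."""

    def __init__(self, slots=60):
        self.slots = [[] for _ in range(slots)]
        self.current_tick = 0  # seconds elapsed since the wheel started

    def schedule(self, delay_seconds, message):
        """Register a message to fire delay_seconds from now."""
        if delay_seconds < 1:
            raise ValueError("delay must be at least one second")
        due_tick = self.current_tick + delay_seconds
        # Delays longer than one revolution share a slot; the stored due tick
        # tells them apart when the slot is visited.
        self.slots[due_tick % len(self.slots)].append((due_tick, message))

    def advance(self):
        """Move forward one second and return the messages that are now due."""
        self.current_tick += 1
        slot = self.slots[self.current_tick % len(self.slots)]
        due = [message for tick, message in slot if tick <= self.current_tick]
        slot[:] = [(tick, message) for tick, message in slot if tick > self.current_tick]
        return due


wheel = TimingWheel(slots=8)
wheel.schedule(3, "check payment for order-1")
wheel.schedule(11, "check payment for order-2")  # same slot, one revolution later

fired = {}
for _ in range(12):
    due = wheel.advance()
    if due:
        fired[wheel.current_tick] = due

print(fired)
# {3: ['check payment for order-1'], 11: ['check payment for order-2']}
```

Each tick touches only one slot, so the cost of advancing the wheel does not depend on the total number of pending messages.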

One internal user needs timed requests to be processed at the 30th second of a particular minute in the future. The timed messages of the open source version cannot meet this need, whereas timed messages with second-level precision meet it while also guaranteeing high availability and high performance.

Distributed transaction messages

As shown in the figure below, in traditional transaction processing the interactions among multiple systems are coupled into a single transaction. This leads to a long overall response time and a complex rollback process, which can affect system availability. The distributed transaction feature provided by RocketMQ implements distributed transactions while keeping the systems loosely coupled and the data eventually consistent.

The transaction message processing steps provided by the message queue RocketMQ are as follows:

The sender sends a semi-transactional message to the message queue RocketMQ server.
After the server successfully persists the message, it returns an Ack to the sender to confirm that the message was sent successfully. At this point the message is a semi-transactional message.
The sender begins to execute the local transaction logic.
The sender submits a second confirmation (Commit or Rollback) to the server based on the result of the local transaction. On receiving Commit, the server marks the semi-transactional message as deliverable, and the subscriber eventually receives it. On receiving Rollback, the server deletes the semi-transactional message, and the subscriber never receives it.
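The four steps above can be simulated with a toy in-memory broker. The class and function names are hypothetical, and treating an exception in the local transaction as a Rollback is an assumption of this sketch; it is not the RocketMQ client API.

```python
class ToyBroker:
    """In-memory stand-in for the server side of transaction messages."""

    def __init__(self):
        self.half_messages = {}  # semi-transactional messages, invisible to subscribers
        self.deliverable = []    # messages subscribers are allowed to receive
        self._next_id = 0

    def send_half(self, body):
        """Steps 1 and 2: persist the semi-transactional message and return an Ack (its id)."""
        self._next_id += 1
        self.half_messages[self._next_id] = body
        return self._next_id

    def end_transaction(self, message_id, commit):
        """Step 4: Commit makes the message deliverable, Rollback deletes it."""
        body = self.half_messages.pop(message_id)
        if commit:
            self.deliverable.append(body)


def send_in_transaction(broker, body, local_transaction):
    """Run the local transaction between the half send and the second confirmation."""
    message_id = broker.send_half(body)
    try:
        succeeded = bool(local_transaction())  # step 3
    except Exception:
        succeeded = False  # assumption: a failed local transaction means Rollback
    broker.end_transaction(message_id, commit=succeeded)
    return succeeded


broker = ToyBroker()
first = send_in_transaction(broker, "order-1 paid", lambda: True)
second = send_in_transaction(broker, "order-2 paid", lambda: False)
print(broker.deliverable)  # ['order-1 paid']
```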

With this implementation, messages provide distributed transaction semantics: the result of the local transaction ultimately determines whether the subscriber receives the message.

The distributed transaction messages of the message queue RocketMQ are widely used in Alibaba's core transaction links. With distributed transaction messages, the transaction unit is kept as small as possible: the transaction system and the message queue form one transaction, while the downstream systems (shopping cart, points, and others) are isolated from one another and process in parallel.

Hierarchical storage

Background

As the number of customers on the cloud grows, storage has gradually become a major bottleneck for RocketMQ operation and maintenance. The issues include, but are not limited to:

Memory is limited, and the server cannot cache all user data in memory. In a multi-tenant scenario, when one user pulls cold data, the disk comes under heavy IO pressure, which affects the other users of the shared cluster. Hot and cold data urgently need to be separated.
Individual tenants on the cloud want to customize how long their messages are stored. However, the messages of all users in a RocketMQ Broker are stored in one continuous file, so the storage duration cannot be customized for any single user. The existing storage structure cannot meet this need.
A lower-cost way to store massive amounts of data would greatly reduce the disk storage cost of RocketMQ on the cloud.

Against this background, the hierarchical storage solution came into being.

Architecture

The overall architecture of hierarchical storage is as follows:

The connector node synchronizes messages on the broker to OSS in real time.
The historyNode node forwards users' pull requests for cold data to OSS.
In OSS, files are organized at Queue granularity: each Queue is stored in its own file, which allows the message storage duration to be defined per tenant.

With this design, we separate hot and cold message data.
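Two consequences of this layout can be sketched briefly: pull requests are routed by whether the data is still hot, and retention can be evaluated per tenant because each Queue has its own file. The data structure, the offset-based routing rule, and all names below are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class QueueFile:
    """One per-Queue file in cold storage (hypothetical structure)."""
    tenant: str
    queue: str
    last_written_at: float  # seconds since the epoch


def route_pull(requested_offset, oldest_hot_offset):
    """Serve hot data from the broker and send cold reads to the historyNode."""
    return "broker" if requested_offset >= oldest_hot_offset else "historyNode"


def expired_files(files, retention_seconds_by_tenant, now):
    """Select the files whose tenant-specific retention period has passed."""
    return [f for f in files
            if now - f.last_written_at > retention_seconds_by_tenant[f.tenant]]


DAY = 24 * 3600
files = [QueueFile("tenant-a", "orders-0", last_written_at=0.0),
         QueueFile("tenant-b", "audit-0", last_written_at=0.0)]
retention = {"tenant-a": 3 * DAY, "tenant-b": 30 * DAY}

print(route_pull(requested_offset=120, oldest_hot_offset=1000))  # historyNode
print([f.queue for f in expired_files(files, retention, now=7 * DAY)])  # ['orders-0']
```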

Usage scenarios

Based on hierarchical storage, we have further expanded the user's usage scenarios:

Custom storage duration: With hot and cold message data separated, cold data is stored in a storage system such as OSS, which makes user-defined storage durations possible.
Message audit: Once message storage extends from a few days to a customizable duration, messages change from temporary data in transit into a data asset of the user, and the message system changes from a data hub into a data warehouse. Users can build more diverse audit, analysis, and processing functions on top of that data warehouse.
Message replay: In stream computing, message replay is a very important scenario. With a longer message storage duration, stream computing can support richer computation and analysis scenarios.

Stability

Stability is an area in which we have invested consistently and continuously. Before introducing our latest work on stability, let's first review how RocketMQ's high-availability architecture has evolved.

Evolution of the high-availability architecture

In 2012, RocketMQ was born as Alibaba's new-generation messaging engine and was subsequently open sourced to the community, and with it came the first-generation RocketMQ high-availability architecture. As shown in the figure below, the first generation adopted the Master-Slave architecture that was popular at the time. Write traffic goes through the Master node and is synchronized to the Slave node; read traffic also goes through the Master node, and consumption records are synchronized to the Slave node. When the Master node is unavailable, the entire replica group is readable but not writable.

In 2016, RocketMQ officially became a commercial cloud product. Single points of failure are frequent in the cloud era, and cloud products must be designed for failure from the ground up. RocketMQ therefore launched the second-generation multi-replica architecture, which relies on ZooKeeper's distributed lock and notification mechanisms. It introduces a Controller component that monitors Broker status and drives the primary/standby state machine, so that when the primary is unavailable, a standby is automatically promoted to primary. The second-generation architecture was the core high-availability architecture while the messaging cloud products scaled up, and it contributed greatly to that growth.

In 2018, the RocketMQ community was very enthusiastic about introducing distributed consensus protocols such as Paxos and Raft. The RocketMQ R&D team released DLedger, a storage engine based on the Raft protocol, in the open source community; it natively supports multiple replicas through Raft.

The RocketMQ high-availability architecture has thus gone through three generations. In practice across scenarios within the group, on the public cloud, and on proprietary clouds, we found that each of the three has drawbacks:

The first-generation primary/standby architecture serves only as a cold standby, and primary/standby switchover requires manual intervention. At large scale, this wastes considerable resources and incurs high operation and maintenance costs.
The second-generation architecture introduces ZooKeeper and Controller nodes, which makes the architecture more complex. Primary/standby switchover is automated, but failover takes a long time: electing a new primary usually takes about 10 seconds.
The third-generation Raft architecture has not yet been applied at large scale on the cloud or within the Alibaba Group. The Raft protocol requires leader election, and the new leader must then be discovered through client routing, so the overall failover time is still long. In addition, the strongly consistent Raft version does not support flexible downgrade strategies and cannot trade off flexibly between availability and reliability.

To cope with ever-growing business scale on the cloud, stricter SLA requirements, and complex, changeable proprietary cloud deployment environments, the messaging system needs a solution that is simple in architecture, simple to operate and maintain, and built on the current architecture. We call it the second-level RTO multi-replica architecture, that is, a multi-replica architecture whose recovery time objective is measured in seconds.

A new generation: the second-level RTO multi-replica architecture

The second-level RTO multi-replica architecture is a new-generation high-availability architecture designed and implemented by the messaging middleware team. It covers the replica group composition mechanism, the failover mechanism, and intrusive modifications to existing components.

The entire replica group has the following characteristics:

Strong Leader, no election: The Leader is determined at deployment time and is not switched during the entire life cycle, although it can be replaced after a failure.
Only the Leader accepts writes: In each replica group, only the Leader accepts message writes. When the Leader is unavailable, the entire replica group cannot be written.
All replicas support reads: The Leader holds the full set of messages while Followers may hold different amounts, but all replicas support message reads.
Flexible number of replicas: The number of replicas in a group can be chosen freely based on reliability, availability, and cost.
Flexible Quorum: All messages are eventually synchronized to the entire replica group, but the minimum number of replicas that must be written successfully can be configured flexibly within the group. For example, in 2-3 mode with 3 replicas, a write succeeds once 2 replicas have been written successfully. The Quorum can also be downgraded dynamically and automatically when a replica is unavailable.
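The flexible Quorum rule can be expressed as a small decision function. The downgrade policy used here (lower the required acknowledgements to the number of replicas that are currently available) is an assumption made for illustration; the article does not specify the exact rule, and the function names are hypothetical.

```python
def required_acks(configured_quorum, available_replicas, allow_downgrade=True):
    """Return how many replicas must acknowledge a write for it to succeed."""
    if available_replicas < 1:
        raise ValueError("at least the Leader must be available to accept writes")
    if allow_downgrade:
        # Assumed policy: never require more acks than there are live replicas.
        return min(configured_quorum, available_replicas)
    return configured_quorum


def write_succeeds(acks, configured_quorum, available_replicas, allow_downgrade=True):
    """Decide whether a write with the given number of acks counts as successful."""
    return acks >= required_acks(configured_quorum, available_replicas, allow_downgrade)


# 2-3 mode: 3 replicas, a write needs 2 successful copies.
print(write_succeeds(acks=2, configured_quorum=2, available_replicas=3))  # True
print(write_succeeds(acks=1, configured_quorum=2, available_replicas=3))  # False
# Only the Leader is left: the Quorum downgrades to 1 and writes continue.
print(write_succeeds(acks=1, configured_quorum=2, available_replicas=1))  # True
```

Disabling the downgrade favors reliability over availability, which is the trade-off this setting exposes.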

With the replica group concept described above, failover can be accomplished by reusing the current RocketMQ client mechanism, as shown below:

When the Leader is unavailable, producers quickly and flexibly switch to another replica group.
When a replica is unavailable, consumers quickly switch to another replica in the same replica group to continue consuming messages.
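The two client-side failover rules can be sketched as simple selection functions. The availability map, the function names, and the choice to prefer the Leader for reads are all assumptions of this sketch rather than the behavior of the real client.

```python
def pick_group_for_send(groups):
    """Return the first replica group whose Leader is available, or None."""
    for name, group in groups.items():
        if group["leader_up"]:
            return name
    return None


def pick_replica_for_pull(group):
    """Return an available replica of the group, preferring the Leader (assumed)."""
    if group["leader_up"]:
        return "leader"
    for index, follower_up in enumerate(group["followers_up"]):
        if follower_up:
            return f"follower-{index}"
    return None


# Hypothetical cluster state: the Leader of group-a is down.
groups = {
    "group-a": {"leader_up": False, "followers_up": [False, True]},
    "group-b": {"leader_up": True, "followers_up": [True, True]},
}

send_target = pick_group_for_send(groups)
pull_target = pick_replica_for_pull(groups["group-a"])
print(send_target)  # group-b
print(pull_target)  # follower-1
```

Producers move across replica groups because only a Leader accepts writes, while consumers can stay within the group because every replica serves reads.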

Observability

Health dashboard

We have also done a great deal of work on observability and provide users with an observable health dashboard for the messaging system. As shown in the figure below, users can clearly see monitoring data at the instance, topic, and group levels, and can monitor and diagnose problems from every angle.

Message tracing

In addition, based on message traces, we provide full-link tracing for messages. As shown in the figure below, users can see the complete life cycle of a message on the console: the entire link, from sending through storage to consumption, is fully recorded.

Application scenario

Customer pain point: When consumption backlogs build up in their business, users have to sample data from message traces and analyze it comprehensively just to roughly determine the cause, which makes troubleshooting difficult.

Core value: Improve the efficiency of online troubleshooting and the accuracy of problem localization. Users can find the highest-risk Topics and Groups directly on the health dashboard and quickly locate the cause from the changes in each metric. For example, if message processing takes too long, they can add consumer machines or optimize the consumer business logic. If the failure rate is too high, they can check the logs to find and eliminate the cause of the errors.

Event-driven

You are probably familiar with Gartner. In a 2018 report, Gartner listed the Event-Driven Model as one of the top 10 strategic technology trends and made two predictions:

By 2022, more than 60% of new digital business solutions will adopt the event-notification software model.
By 2022, more than 50% of commercial organizations will participate in the EDA ecosystem.

In the same year, the CNCF proposed CloudEvents, which aims to standardize the event communication protocol between different cloud services. CloudEvents has since released a number of binding specifications for message middleware.

Event-driven architecture is clearly an important trend for future business systems, and messages are naturally close to events. The message queue RocketMQ therefore firmly embraces the event-driven approach.

A brief explanation of messages and events: they are two different abstractions, intended for different scenarios:

Messages: A message is a more general abstraction than an event and is often used for asynchronous decoupling between microservice calls. Service calls are typically made asynchronous through messages only when the capacities of the services involved do not match. Message content is often tightly bound to business attributes, and the sender of a message has clear expectations about how the message will be processed.
Events: An event is more concrete than a message and represents the occurrence, condition, or state change of something. Event sources come from different organizations and environments, so an event bus is naturally cross-organizational. An event source has no expectations about how an event will be handled, so application architectures built on events are more thoroughly decoupled and more scalable and flexible.

In 2020, Alibaba Cloud released the event bus EventBridge. Its mission is to serve as the hub for cloud events: it connects cloud products and cloud applications through the standardized CloudEvents 1.0 protocol, provides centralized event governance and event-driven capabilities, and helps users easily build loosely coupled, distributed event-driven architectures. In addition, the cloud market beyond Alibaba Cloud contains a large number of SaaS services in vertical fields. EventBridge will offer strong cross-product, cross-organization, and cross-cloud integration capabilities to help customers build a complete, event-driven, efficient, and controllable new interface for moving to the cloud.

With the event source feature of the event bus EventBridge, we can connect messages to events end to end, giving the message queue RocketMQ event-driven capabilities and bringing it into the wider event ecosystem. Next, we use the case shown in the figure below to demonstrate this feature.

Create a message queue RocketMQ topic

Create target service

We quickly create an event-driven service on the container service. The YAML for the workload Deployment is as follows. The service responds to events and prints the results to standard output.

apiVersion: apps/v1 # for versions before 1.8.0 use apps/v1beta1
kind: Deployment
metadata:
  name: eventbridge-http-target-deployment
  labels:
    app: eventbridge-http-target
spec:
  replicas: 2
  selector:
    matchLabels:
      app: eventbridge-http-target
  template:
    metadata:
      labels:
        app: eventbridge-http-target
    spec:
      containers:
      - name: eb-http-target
        # The image below exposes an HTTP endpoint (/cloudevents) for receiving CloudEvents; source code: https://github.com/aliyuneventbridge/simple-http-target
        image: registry.cn-hangzhou.aliyuncs.com/eventbridge-public/simple-http-target:latest
        ports:
        - containerPort: 8080

Go to the container service console, open the Services page under Services and Routing, create a Service of the private network access type, and configure the port mapping.

Create event bus EventBridge custom bus

We go to the console of the event bus EventBridge and create a custom bus named demo-with-k8s.

Create event bus EventBridge custom bus rules

We create a rule for the bus demo-with-k8s, select HTTP as the event target, select the VPC type, select the corresponding VPC, VSwitch, and security group, and specify the target URL, as shown in the following figure:

Create event bus EventBridge event source

We add a custom event source of the message queue RocketMQ type to this custom event bus.

Send RocketMQ message

Next, we return to the message queue RocketMQ console and use its quick-experience message production feature to send a message with the content hello eventbridge to the corresponding topic.

We can then see that this RocketMQ message is delivered to the corresponding service as a CloudEvent, which completes the link from message to event. Moreover, with the hierarchical storage feature described above, the message queue RocketMQ becomes a data warehouse that can continuously supply events, opening up broader scenarios for the entire event ecosystem.
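To give a sense of what arrives at the target service, the sketch below wraps a message body in a CloudEvents 1.0 envelope. The specversion, id, source, and type attributes are the ones the CloudEvents 1.0 specification requires; the concrete source and type values, the topic name, and the function name are hypothetical and do not reflect the exact fields that EventBridge emits.

```python
import json
import uuid
from datetime import datetime, timezone


def to_cloudevent(body, topic, message_id=None):
    """Wrap a message body in a CloudEvents 1.0 style envelope (illustrative values)."""
    return {
        "specversion": "1.0",
        "id": message_id or str(uuid.uuid4()),
        "source": f"rocketmq:{topic}",    # hypothetical source identifier
        "type": "rocketmq.message.sent",  # hypothetical event type
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "text/plain",
        "data": body,
    }


event = to_cloudevent("hello eventbridge", topic="demo-topic", message_id="msg-1")
print(json.dumps(event, indent=2))
```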

Event-driven is an important trend in future business organizations and business systems, and the message queue RocketMQ will firmly embrace this trend and integrate messages into the event ecosystem.

Summary

We selected several product profiles of the message queue RocketMQ: multiple message types, hierarchical storage, stability, observability, and the future-oriented event-driven approach. By comparing them with the open source version of RocketMQ and analyzing specific application scenarios, we presented best practices for large-scale distributed applications on the cloud based on the message queue RocketMQ.