Design and Implementation of a High-Availability Delay Queue

Delay queue: a message queue with a delay function

delay → an unspecified time in the future
mq → consumption is sequential

With this breakdown, the whole design becomes clear: the goal is delay, and the container that carries it is an mq.

Background

Some scenarios from our day-to-day business:

Delayed schedule → a class is scheduled in advance and the teacher needs a reminder to go to class
Delayed push → push the announcements and assignments that the teacher needs

The simplest and most direct way to solve these problems is to scan a table periodically:

When the service starts, launch a background goroutine → scan the msg table periodically, and call the corresponding handler when an event is due
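As a rough illustration of this polling approach, the sketch below scans an in-memory slice that stands in for the msg table on every tick. The `msg` type, its fields, `scanDue`, and the 100 ms interval are all hypothetical, chosen only for this example.

```go
package main

import (
	"fmt"
	"time"
)

// msg is a hypothetical row of the msg table.
type msg struct {
	ID      int
	Body    string
	FireAt  time.Time
	Handled bool
}

// scanDue calls handle for every unhandled row whose FireAt is not after now,
// marks the row as handled, and returns how many rows fired.
func scanDue(table []msg, now time.Time, handle func(msg)) int {
	fired := 0
	for i := range table {
		if !table[i].Handled && !table[i].FireAt.After(now) {
			handle(table[i])
			table[i].Handled = true
			fired++
		}
	}
	return fired
}

func main() {
	start := time.Now()
	table := []msg{
		{ID: 1, Body: "remind teacher", FireAt: start.Add(50 * time.Millisecond)},
		{ID: 2, Body: "push assignment", FireAt: start.Add(120 * time.Millisecond)},
	}

	// Assumed scan interval; a real service would tune this.
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for handled := 0; handled < len(table); {
		now := <-ticker.C
		handled += scanDue(table, now, func(m msg) {
			late := now.Sub(m.FireAt).Round(time.Millisecond)
			fmt.Printf("job %d (%s) fired %v late\n", m.ID, m.Body, late)
		})
	}
}
```

Because a row is only noticed on the next tick, each job can fire up to one scan interval late.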

This has several disadvantages:

Every service that needs timed/delayed tasks needs its own msg table for extra storage → storage is coupled with business logic
Periodic scanning → the timing is hard to control, and the trigger time may be missed
It burdens the database instance that hosts the msg table: a service that scans repeatedly puts continuous pressure on the database

What is the biggest problem?

The scheduling model is basically the same everywhere, so the same logic should not be reimplemented in every business service.

We can extract this logic from the specific business code and turn it into a shared component.

This scheduling model is the delay queue.

In fact, to put it plainly:

The delay queue model stores events that are to be executed in the future ahead of time, continuously scans that storage, and runs the corresponding task logic when an event's execution time arrives.

So is there a ready-made solution in the open source world? The answer is yes: beanstalkd ( https://github.com/beanstalkd/beanstalkd ) basically meets the above requirements.
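A minimal in-memory sketch of this model follows. It keeps jobs in a min-heap ordered by execution time, which is one common way to make the "scan" cheap and is not necessarily how any particular delay queue stores its jobs. The names `job`, `jobHeap`, `schedule`, and `popDue` are hypothetical, and times are plain integers to keep the example deterministic.

```go
package main

import (
	"container/heap"
	"fmt"
)

// job is an event to be executed at FireAt (an abstract timestamp).
type job struct {
	ID     string
	FireAt int64
}

// jobHeap is a min-heap of jobs ordered by FireAt.
type jobHeap []job

func (h jobHeap) Len() int            { return len(h) }
func (h jobHeap) Less(i, j int) bool  { return h[i].FireAt < h[j].FireAt }
func (h jobHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *jobHeap) Push(x interface{}) { *h = append(*h, x.(job)) }
func (h *jobHeap) Pop() interface{} {
	old := *h
	n := len(old)
	last := old[n-1]
	*h = old[:n-1]
	return last
}

// schedule stores a job for later execution.
func schedule(h *jobHeap, j job) { heap.Push(h, j) }

// popDue removes and returns every job with FireAt <= now, earliest first.
func popDue(h *jobHeap, now int64) []job {
	var due []job
	for h.Len() > 0 && (*h)[0].FireAt <= now {
		due = append(due, heap.Pop(h).(job))
	}
	return due
}

func main() {
	h := &jobHeap{}
	schedule(h, job{ID: "push-assignment", FireAt: 30})
	schedule(h, job{ID: "remind-teacher", FireAt: 10})
	schedule(h, job{ID: "push-announcement", FireAt: 20})

	for _, j := range popDue(h, 20) {
		fmt.Println("run", j.ID)
	}
	fmt.Println("still waiting:", h.Len())
}
```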

Design goals

Consumption behavior: at least once
High availability
Real-time delivery
Support for message deletion

Let's go through the design direction for each of these goals in turn:

Consumption behavior

This concept is borrowed from mq, which offers several delivery guarantees:

at most once → the message may be lost, but it is never delivered twice
at least once → the message is never lost, but it may be delivered more than once
exactly once → the message is neither lost nor repeated; it is consumed once and only once

Exactly once has to be ensured, as far as possible, at both ends: producer and consumer. When the producer cannot guarantee it, the consumer must de-duplicate before consuming, so that a message that has already been consumed is not consumed again. The delay queue takes care of this de-duplication itself.

The simplest approach: use redis SETNX on the job id so that each job is consumed only once

High availability

Multi-instance deployment is supported. When one instance goes down, backup instances continue to provide service.

The externally exposed API uses a cluster model: the cluster encapsulates multiple nodes internally and stores data redundantly across them.
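The idea of a cluster that hides several redundant nodes can be sketched as below. This is only an illustration under assumptions: the `node` interface, the in-memory `memNode`, and the `minWritten` threshold are hypothetical, and the sketch writes to every node in sequence rather than modelling how a real cluster picks or contacts its nodes.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical single storage node.
type node interface {
	put(body []byte) (id string, err error)
}

// memNode is an in-memory node that can be switched off to simulate an outage.
type memNode struct {
	name string
	down bool
	jobs [][]byte
}

func (n *memNode) put(body []byte) (string, error) {
	if n.down {
		return "", errors.New(n.name + " is down")
	}
	n.jobs = append(n.jobs, body)
	return fmt.Sprintf("%s/%d", n.name, len(n.jobs)), nil
}

// cluster hides several nodes behind a single put call.
type cluster struct {
	nodes      []node
	minWritten int // assumed threshold: successful writes needed for overall success
}

// put writes body to every node and succeeds if at least minWritten nodes accepted it.
func (c *cluster) put(body []byte) ([]string, error) {
	var ids []string
	for _, n := range c.nodes {
		if id, err := n.put(body); err == nil {
			ids = append(ids, id)
		}
	}
	if len(ids) < c.minWritten {
		return nil, fmt.Errorf("only %d of %d nodes accepted the job", len(ids), len(c.nodes))
	}
	return ids, nil
}

func main() {
	n1 := &memNode{name: "n1"}
	n2 := &memNode{name: "n2", down: true} // simulated outage
	n3 := &memNode{name: "n3"}
	cl := &cluster{nodes: []node{n1, n2, n3}, minWritten: 2}

	ids, err := cl.put([]byte("remind teacher"))
	fmt.Println(ids, err)
}
```

Redundant storage means the same job can exist on more than one node, which is why the consuming side needs de-duplication.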

Why not Kafka?

We considered storage solutions based on message queues such as kafka/rocketmq, but eventually gave them up because of their storage design model.

For example, suppose Kafka is used as the message queue storage to implement the delay function: each delay duration needs its own topic (such as Q1-1s, Q1-2s, ...). This design works reasonably well when the delay durations are relatively fixed. But if the delays vary widely, the number of topics grows too large, which turns sequential disk reads and writes into random ones and degrades performance. It also brings other problems, such as overly long restart or recovery times.

Too many topics → storage pressure
Topics store the real time; scheduling reads different times (topics), so sequential reads → random reads
Likewise for writes: sequential writes → random writes

Architecture design

API design

producer

producer.At(msg []byte, at time.Time)
producer.Delay(body []byte, delay time.Duration)
producer.Revoke(ids string)

consumer

consumer.Consume(consume handler)

With the delay queue in place, the overall flow of the service and the state transitions of a job in the queue are as follows:

service → producer.At(msg []byte, at time.Time) → insert a delayed job into the tube
Timer fires → the job status is updated to ready
The consumer gets the ready job → takes the job out, starts consuming it, and changes its status to reserved
The handler function passed into the consumer executes the business logic
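The state transitions above can be sketched as a small state machine. The `delayed`, `ready`, and `reserved` states come from the steps above; the `deleted` state and the `delayed → deleted` transition (for a revoked job) are assumptions added to round out the example, and all type and function names are hypothetical.

```go
package main

import "fmt"

type state int

const (
	delayed state = iota
	ready
	reserved
	deleted // assumed terminal state
)

func (s state) String() string {
	return [...]string{"delayed", "ready", "reserved", "deleted"}[s]
}

// allowed lists which states each state may move to.
var allowed = map[state][]state{
	delayed:  {ready, deleted}, // timer fires, or the job is revoked (assumption)
	ready:    {reserved},       // a consumer takes the job
	reserved: {deleted},        // the job is removed once it has been taken
}

type job struct {
	ID    string
	State state
}

// transition moves the job to the target state if the move is allowed.
func (j *job) transition(to state) error {
	for _, s := range allowed[j.State] {
		if s == to {
			j.State = to
			return nil
		}
	}
	return fmt.Errorf("transition %s -> %s not allowed", j.State, to)
}

func main() {
	j := &job{ID: "1", State: delayed}
	for _, to := range []state{ready, reserved, deleted} {
		if err := j.transition(to); err != nil {
			fmt.Println(err)
			return
		}
		fmt.Println("job", j.ID, "is now", j.State)
	}
}
```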

Production Practice

This section describes which specific features of the delay queue we use in daily development.

Production side

When developing delayed tasks on the production side, you only need to determine the task's execution time
1. Pass the time to At(): producer.At(msg []byte, at time.Time)
2. The time difference is calculated internally and the job is inserted into the tube
If the task's time or content is modified
1. In production, you may need an additional logic_id → job_id mapping table
2. Look up the job_id → call producer.Revoke(ids string) to delete the job, then insert it again
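Both steps can be sketched with an in-memory stand-in for the producer. Everything here is hypothetical and for illustration only: `memProducer` is not the real client, the `(string, error)` return values are assumed, and the `scheduler` map plays the role of the logic_id → job_id table.

```go
package main

import (
	"fmt"
	"time"
)

// memProducer is an in-memory stand-in for a delay queue producer.
type memProducer struct {
	seq  int
	jobs map[string]time.Duration // job id -> delay it was inserted with
}

// Delay inserts a job that becomes ready after delay.
func (p *memProducer) Delay(body []byte, delay time.Duration) (string, error) {
	p.seq++
	id := fmt.Sprintf("job-%d", p.seq)
	p.jobs[id] = delay
	return id, nil
}

// At converts the absolute time into a delay relative to now.
func (p *memProducer) At(body []byte, at time.Time) (string, error) {
	delay := time.Until(at)
	if delay < 0 {
		delay = 0 // already due: run as soon as possible
	}
	return p.Delay(body, delay)
}

// Revoke deletes a job that has not been consumed yet.
func (p *memProducer) Revoke(id string) error {
	delete(p.jobs, id)
	return nil
}

// scheduler keeps the logic_id -> job_id mapping.
type scheduler struct {
	p     *memProducer
	jobOf map[string]string
}

func newScheduler() *scheduler {
	return &scheduler{
		p:     &memProducer{jobs: make(map[string]time.Duration)},
		jobOf: make(map[string]string),
	}
}

// upsert revokes the previous job for logicID, if any, and inserts a new one.
func (s *scheduler) upsert(logicID string, body []byte, at time.Time) error {
	if old, ok := s.jobOf[logicID]; ok {
		if err := s.p.Revoke(old); err != nil {
			return err
		}
	}
	id, err := s.p.At(body, at)
	if err != nil {
		return err
	}
	s.jobOf[logicID] = id
	return nil
}

// inSeconds returns the time n seconds from now.
func inSeconds(n int) time.Time { return time.Now().Add(time.Duration(n) * time.Second) }

func main() {
	s := newScheduler()
	_ = s.upsert("class-42", []byte("remind teacher"), inSeconds(60))
	_ = s.upsert("class-42", []byte("remind teacher"), inSeconds(120)) // class was moved
	fmt.Println("current job:", s.jobOf["class-42"], "jobs stored:", len(s.p.jobs))
}
```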

Consumer side

First of all, the framework guarantees exactly-once consumption behavior. However, when the upper-level business logic fails, a network problem occurs, or anything else causes consumption to fail, handling those details is left to business development. The reasons for this:

The framework and basic components only guarantee that the job status flows correctly
The consumer side of the framework only guarantees the uniformity of consumption behavior
Delayed tasks do not behave uniformly across different businesses
1. Where the task must be completed: when consumption fails, keep retrying until the task succeeds
2. Where punctuality matters: when consumption fails, the task can be discarded if the business is not sensitive to the loss
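These two business-level policies can be sketched as wrappers around a handler. The `handler` type, `retrying`, `discarding`, and the `flaky` test helper are hypothetical names. The retry loop here is bounded and has no backoff to keep the example short; a real one would probably wait between attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// handler is a hypothetical business handler that can fail.
type handler func(body []byte) error

// retrying wraps h so that a failed call is repeated, up to maxAttempts times.
func retrying(h handler, maxAttempts int) handler {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	return func(body []byte) error {
		var err error
		for i := 0; i < maxAttempts; i++ {
			if err = h(body); err == nil {
				return nil
			}
		}
		return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
	}
}

// discarding wraps h so that a failure is reported to onDrop and then swallowed.
func discarding(h handler, onDrop func(body []byte, err error)) handler {
	return func(body []byte) error {
		if err := h(body); err != nil {
			onDrop(body, err)
		}
		return nil
	}
}

// flaky returns a handler that fails its first n calls and then succeeds.
func flaky(n int) handler {
	calls := 0
	return func(body []byte) error {
		calls++
		if calls <= n {
			return errors.New("temporary failure")
		}
		return nil
	}
}

func main() {
	mustRun := retrying(flaky(2), 5)
	fmt.Println("must-run task:", mustRun([]byte("remind teacher")))

	bestEffort := discarding(flaky(1), func(body []byte, err error) {
		fmt.Printf("dropped %q: %v\n", body, err)
	})
	fmt.Println("best-effort task:", bestEffort([]byte("push announcement")))
}
```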

Here is how the consumer side of the framework guarantees the uniformity of consumption behavior.

It is split into cluster and node. cluster:

https://github.com/tal-tech/go-queue/blob/master/dq/consumer.go#L45

Inside the cluster, the consume handler is wrapped in another layer
Hash the consume body and use this hash as the redis de-duplication key
If the key already exists, the message is discarded without being processed
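The wrapping described above might look roughly like the sketch below. It is not the code behind the link: an in-memory map stands in for redis, MD5 is an assumed choice of hash used only as a fingerprint, and `consume`, `dedup`, `bodyKey`, and `guarded` are hypothetical names.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"sync"
)

// consume is a hypothetical handler type for a job body.
type consume func(body []byte)

// dedup is an in-memory stand-in for the redis key space.
type dedup struct {
	mu   sync.Mutex
	seen map[string]bool
}

// firstTime records key and reports whether it had not been seen before.
func (d *dedup) firstTime(key string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen[key] {
		return false
	}
	d.seen[key] = true
	return true
}

// bodyKey hashes the body into a de-duplication key.
func bodyKey(body []byte) string {
	sum := md5.Sum(body)
	return hex.EncodeToString(sum[:])
}

// guarded wraps h so that a body whose key already exists is dropped.
func guarded(d *dedup, h consume) consume {
	return func(body []byte) {
		if d.firstTime(bodyKey(body)) {
			h(body)
		}
	}
}

func main() {
	d := &dedup{seen: make(map[string]bool)}
	handle := guarded(d, func(body []byte) {
		fmt.Printf("handling %q\n", body)
	})

	// The same job arriving from two redundant nodes is handled only once.
	handle([]byte("job-1: remind teacher"))
	handle([]byte("job-1: remind teacher"))
	handle([]byte("job-2: push assignment"))
}
```

Since the key is derived from the body alone, two different jobs with byte-identical bodies would share a key, so the body needs to contain something unique to the job.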

node:

https://github.com/tal-tech/go-queue/blob/master/dq/consumernode.go#L36

The consuming node gets a ready job: it first executes Reserve(TTR) to reserve the job for logical processing
The node executes Delete(job), then consumes
1. If consumption fails, the failure is passed up to the business layer, which retries
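The order of operations on a node (reserve, delete, then consume) can be sketched with an in-memory tube. This is an assumption-laden illustration, not the code behind the link: `memTube`, its simplified `Reserve` without a TTR argument, `consumeOne`, and the error values are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errEmpty    = errors.New("no ready job")
	errBusiness = errors.New("business logic failed")
)

type job struct {
	id   uint64
	body []byte
}

// memTube is an in-memory stand-in for one node's tube.
type memTube struct {
	ready    []job
	reserved map[uint64]bool
}

// Reserve takes the oldest ready job and marks it as reserved.
func (q *memTube) Reserve() (uint64, []byte, error) {
	if len(q.ready) == 0 {
		return 0, nil, errEmpty
	}
	j := q.ready[0]
	q.ready = q.ready[1:]
	q.reserved[j.id] = true
	return j.id, j.body, nil
}

// Delete removes a reserved job from the tube.
func (q *memTube) Delete(id uint64) error {
	if !q.reserved[id] {
		return fmt.Errorf("job %d is not reserved", id)
	}
	delete(q.reserved, id)
	return nil
}

// consumeOne reserves a job, deletes it, and only then runs the handler.
// A handler error is returned to the caller, i.e. the business layer.
func consumeOne(q *memTube, handle func(body []byte) error) error {
	id, body, err := q.Reserve()
	if err != nil {
		return err
	}
	if err := q.Delete(id); err != nil {
		return err
	}
	return handle(body)
}

func main() {
	q := &memTube{
		ready:    []job{{id: 1, body: []byte("remind teacher")}},
		reserved: make(map[uint64]bool),
	}
	err := consumeOne(q, func(body []byte) error { return errBusiness })
	fmt.Println("handler result:", err)
	fmt.Println("jobs left in tube:", len(q.ready)+len(q.reserved))
}
```

Because the job is deleted before the handler runs, a failed handler cannot rely on the queue to deliver the job again.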

So on the consumer side, developers need to make consumption idempotent themselves.

Project address

go-queue is built on go-zero. On GitHub, go-zero's "Used by" count is 300+, and it gained 11k+ stars within a year of being open-sourced.

go-zero: https://github.com/zeromicro/go-zero
go-queue: https://github.com/tal-tech/go-queue

You are welcome to use it and star it to support us!

WeChat Exchange Group

Follow the "Practice" public account and click "exchange group" to get the community group QR code.
