Design and Implementation of a High-Availability Delay Queue


A delay queue is a message queue with a delay capability:

  1. Delay → the message takes effect at some uncertain point in the future
  2. mq → consumption is sequential

With this framing, the whole design becomes clear: the purpose is to delay, and the carrying container is an mq.


Typical scenarios in my daily business:

  1. Lesson scheduling → remind the teacher before the class starts
  2. Delayed push → push the announcements and assignments the teacher needs

To solve the above problems, the simplest and most direct approach is to scan a table periodically:

When the service starts, launch a background goroutine that scans the msg table at a fixed interval and calls the corresponding handler when an event fires.

Several disadvantages:

  1. Every service that needs timed/delayed tasks must maintain its own msg table for extra storage → storage and business are coupled
  2. Periodic scanning → the timing is hard to control, and the trigger time may be missed
  3. It burdens the database instance hosting the msg table: a service scanning it repeatedly puts constant pressure on the database

What is the biggest problem?

The scheduling model is essentially the same everywhere, so the same logic should not be rebuilt in every business.

We can extract the scheduling logic out of the specific business logic and turn it into a shared component.

And that scheduling model is the delay queue.

In fact, to put it plainly:

The delay queue model stores events to be executed in the future, continuously scans that storage, and executes the corresponding task logic when the execution time arrives.

So is there a ready-made solution in the open-source world? Yes: Beanstalkd basically meets the above requirements.

Design goals

  1. Consumption behavior: exactly once
  2. High availability
  3. Real-time delivery
  4. Support for message deletion

Let's discuss the design direction for each of these goals in turn:

Consumption behavior

This concept comes from mq, which offers several delivery guarantees:

  • at most once → the message may be lost, but it is never redelivered
  • at least once → the message is never lost, but it may be redelivered
  • exactly once → the message is neither lost nor redelivered; it is consumed exactly once

Exactly once requires cooperation from both ends, producer and consumer. When the producer cannot guarantee uniqueness, the consumer must deduplicate before consuming so that nothing is consumed twice. The delay queue guarantees this directly.

The simplest approach: use Redis SETNX on the job id so that each job is consumed only once.
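To illustrate the SETNX-based dedup idea, here is a minimal sketch that mimics Redis SETNX semantics with an in-memory map; the `dedupStore` type is hypothetical, and a real deployment would use an actual Redis `SET key val NX` with a TTL:

```go
package main

import (
	"fmt"
	"sync"
)

// dedupStore mimics the semantics of Redis SETNX: the first caller
// to claim a key wins, later callers are rejected. In production the
// delay queue would use a real Redis SETNX (with an expiry) instead.
type dedupStore struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newDedupStore() *dedupStore {
	return &dedupStore{seen: make(map[string]bool)}
}

// setNX returns true only for the first call with a given job id.
func (d *dedupStore) setNX(jobID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen[jobID] {
		return false
	}
	d.seen[jobID] = true
	return true
}

func main() {
	store := newDedupStore()
	for _, id := range []string{"job-1", "job-1", "job-2"} {
		if store.setNX(id) {
			fmt.Println("consume", id) // first delivery: process it
		} else {
			fmt.Println("duplicate, drop", id) // redelivery: skip it
		}
	}
}
```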

High availability

Support multi-instance deployment: after one instance goes down, backup instances continue to provide service.

The externally exposed API uses a cluster model that encapsulates multiple nodes internally and stores data redundantly across them.

Why not Kafka?

We considered storage solutions based on message queues such as Kafka/RocketMQ, but ultimately ruled them out because of their storage design model.

For example, suppose Kafka were used as the message storage to implement the delay function: each distinct delay would need its own topic (e.g. Q1-1s, Q1-2s, ...). This design is not much of a problem when the delay times are relatively fixed, but when the delays vary widely the number of topics explodes, turning sequential disk I/O into random I/O and degrading performance. It also brings other problems, such as overly long restart or recovery times.

  1. Too many topics → storage pressure
  2. Topics are stored by real time, and scheduling reads from different topics at different times: sequential reads become random reads
  3. Likewise for writes: sequential writes become random writes

Architecture design

API design


Production side:

  1. producer.At(msg []byte, at time.Time)
  2. producer.Delay(body []byte, delay time.Duration)
  3. producer.Revoke(ids string)

Consumption side:

  1. consumer.Consume(consume handler)

After introducing the delay queue, the overall structure of a service, and the state transitions of a job in the queue, look like this:

  1. service → producer.At(msg []byte, at time.Time) → insert a delayed job into the tube
  2. Timer fires → job status is updated to ready
  3. A consumer fetches the ready job → takes the job and starts consuming, changing its status to reserved
  4. The handler function passed to the consumer executes the business logic
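The four-step lifecycle above can be modeled as a tiny state machine; the state names follow the article, while the `next` helper is illustrative:

```go
package main

import "fmt"

// jobState models the job lifecycle: delayed → ready → reserved → deleted.
// State names follow the article; the transition rules are a sketch.
type jobState int

const (
	delayed  jobState = iota // inserted via producer.At / producer.Delay
	ready                    // timer fired, job is visible to consumers
	reserved                 // a consumer claimed the job (with a TTR lease)
	deleted                  // job consumed and removed
)

var stateNames = []string{"delayed", "ready", "reserved", "deleted"}

// next returns the only legal forward transition for a state.
func next(s jobState) (jobState, error) {
	switch s {
	case delayed:
		return ready, nil
	case ready:
		return reserved, nil
	case reserved:
		return deleted, nil
	default:
		return s, fmt.Errorf("job already deleted")
	}
}

func main() {
	// Walk one job through its whole lifecycle.
	for s := delayed; s != deleted; {
		n, err := next(s)
		if err != nil {
			break
		}
		fmt.Println(stateNames[s], "→", stateNames[n])
		s = n
	}
}
```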

Production Practice

This section introduces the delay queue features we rely on in daily development.

Production side

  1. Producing a delayed task: development only needs to determine the task execution time

    1. Pass it to At(): producer.At(msg []byte, at time.Time)
    2. The time difference is computed internally and the job is inserted into the tube
  2. If the task time or the task content needs to be modified

    1. In production, an extra logic_id → job_id mapping table may be needed
    2. Look up the job_id → producer.Revoke(ids string) to delete it, then insert the job again

Consumer side

First, the framework level guarantees exactly-once consumption behavior. But the upper-level business logic can still fail to consume, whether from handler errors, network problems, or anything else, and all those details are handed over to business development. The reasons:

  1. The framework and base components only guarantee the correctness of job state transitions
  2. The framework's consumer side only guarantees uniform consumption behavior
  3. Delayed tasks behave differently across businesses:

    1. Some emphasize the task's necessity: when consumption fails, keep retrying until it succeeds
    2. Some emphasize the task's punctuality: if the business is not sensitive, a failed consumption can simply be dropped

Here is how the framework's consumer side guarantees uniform consumption behavior, split between the cluster and node layers.

cluster:

  1. Inside the cluster, the consume handler is wrapped in an extra layer
  2. The consume body is hashed, and the hash is used as the Redis dedup key
  3. If the key already exists, the job is dropped without processing

node:

  1. The consuming node fetches a ready job, first executes Reserve(TTR) to claim it, then runs the job's business logic
  2. The node then executes Delete(job) and completes consumption

    1. If consumption fails, the error is thrown up to the business layer for retry
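The cluster-level wrapping described above can be sketched as a handler decorator; `withDedup` and the in-memory `seen` map are stand-ins for the real wrapper and the Redis key space:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// withDedup wraps a business handler the way the cluster layer does:
// hash the consume body, use the hash as the dedup key, and skip the
// job if the key was already claimed. seen is an in-memory stand-in
// for the Redis SETNX key space.
func withDedup(seen map[string]bool, handle func(body []byte) error) func(body []byte) error {
	return func(body []byte) error {
		sum := sha256.Sum256(body)
		key := hex.EncodeToString(sum[:])
		if seen[key] {
			return nil // duplicate delivery: drop without processing
		}
		seen[key] = true
		// If handle fails, the error is returned to the business layer,
		// which decides whether to retry (necessity) or drop (punctuality).
		return handle(body)
	}
}

func main() {
	seen := make(map[string]bool)
	calls := 0
	h := withDedup(seen, func(body []byte) error {
		calls++
		return nil
	})
	_ = h([]byte("job-a"))
	_ = h([]byte("job-a")) // duplicate, skipped
	fmt.Println("handler ran", calls, "time(s)")
}
```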

So on the consumer side, developers need to guarantee consumption idempotence themselves.

Project address

go-queue is implemented on top of go-zero and is available on GitHub. go-zero is used by 300+ companies and has earned 11k+ stars within a year of being open-sourced.

Welcome to use it, and support us with a star!

WeChat Exchange Group

Follow the "Practice" official account and click on the exchange group menu to get the community group QR code.
