Elegant fault handling: quickly create a Pulsar retry queue

About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight function-based computing. Its architecture separates compute from storage and supports multi-tenancy, persistent storage, and multi-datacenter, cross-region data replication. It offers strong consistency, high throughput, low latency, and highly scalable streaming data storage.

This article is a community contribution by Hou Shengxin of Banyu.

In many online business systems, a message may go unacknowledged because the business logic failed while processing it, so we need to be as prepared as possible to handle failures gracefully. Retrying is the common practice, and it generally involves the following three aspects:

Set up redelivery. To allow failed messages to be consumed again, configure the consumer to consume from both the business topic and the retry topic, and enable automatic retry on the consumer.
Set up the retry queue. A message that is not consumed successfully is saved in the retry topic. You can specify a delay, after which the failed message in that topic is automatically consumed again.
Limit the number of retries. By default, if the consumer fails to consume a message (that is, the consumer cannot ack it), the same message keeps being retried.

So can't we simply let this default behavior take over and retry the message until it succeeds? The problem is that the message may never succeed, at least not without some form of manual intervention. The consumer would then never move on to any subsequent messages, and message processing would stall. Therefore, after a certain number of retries, the message is moved to the dead letter queue and treated as acknowledged.

As shown in the figure above, Pulsar uses non-blocking retry queues and dead letter queues (DLQ) to extend an existing event-driven architecture. This gives us decoupled, observable error handling without interrupting real-time traffic.

In Pulsar, however, automatic retry is turned off by default. Setting the enableRetry option (RetryEnable in the Go client used below) to true allows the consumer to retry. In the following example, the consumer also consumes messages from the retry topic:

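package main

import (
    "context"
    "fmt"
    "time"

    "github.com/apache/pulsar-client-go/pulsar"
)

func main() {
    cp := pulsar.ClientOptions{
        URL:              "pulsar://xxx.xxx.xxx.xxx:6650",
        OperationTimeout: 30 * time.Second,
    }

    client, err := pulsar.NewClient(cp)
    if err != nil {
        fmt.Println("create client:", err)
        return
    }
    defer client.Close()

    // Retry and dead letter policy: after MaxDeliveries attempts the
    // message is moved to DeadLetterTopic.
    d := &pulsar.DLQPolicy{
        MaxDeliveries:    3,
        RetryLetterTopic: "persistent://group/server/xxx-RETRY",
        DeadLetterTopic:  "persistent://group/server/xxx-DLQ",
    }

    consumer, err := client.Subscribe(pulsar.ConsumerOptions{
        Topic:               "persistent://group/server/xxx",
        SubscriptionName:    "test",
        Type:                pulsar.Failover,
        RetryEnable:         true, // retry is off by default
        DLQ:                 d,
        NackRedeliveryDelay: 3 * time.Second,
    })
    if err != nil {
        fmt.Println("subscribe:", err)
        return
    }
    defer consumer.Close()

    ctx := context.Background()
    for {
        msg, err := consumer.Receive(ctx)
        if err != nil {
            fmt.Println("receive:", err)
            return
        }
        // Key() returns a string. The comparison is a placeholder that
        // stands in for "the business logic succeeded".
        if msg.Key() == "0" {
            // Processing succeeded: acknowledge the message.
            consumer.Ack(msg)
        } else {
            // Option 1: negative acknowledgement. After NackRedeliveryDelay
            // the message is redelivered on the main topic.
            // consumer.Nack(msg)

            // Option 2: handle it later. After the given delay the message
            // is redelivered through the retry queue.
            consumer.ReconsumeLater(msg, 5*time.Second)

            // Use only one of the two options for a given message.
        }
    }
}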

Retry queue

In the example above, a retry queue is created automatically. A message is redelivered when one of the following two conditions is met:

The Nack() function. The consumer's Nack() function reports that processing of a single message failed. Once a message is negatively acknowledged, it is marked for redelivery later. It is redelivered on the current main topic, the delivery count is not affected, and the redelivery time is controlled by NackRedeliveryDelay.
The AckTimeout parameter. Because of network jitter, a service outage, and so on, a consumer may fail to Nack in time. To cover this case, Pulsar provides the AckTimeout parameter, which defaults to 0 (disabled). Once the consumer's processing time exceeds AckTimeout, the message is redelivered. (In Golang SDK v0.6.0 and earlier, setting AckTimeout is not implemented; watch for updates in later releases.)

Retries through the retry queue are time-based. At present they are triggered mainly by the consumer.ReconsumeLater() method. Each time the retry queue is used, the number of remaining retries decreases accordingly. The RetryLetterTopic in the DLQPolicy structure is a topic that Pulsar creates alongside the original topic for retrying. Its default name is {TopicName}-{Subscription}-RETRY, which keeps retry traffic from interfering with the data in the main topic as far as possible.

The Golang SDK does not implement the rich variety of retry mechanisms found in the Java SDK. Instead, it takes the blunt approach of exposing the underlying delay parameter, NackRedeliveryDelay, directly, which makes it easy to build custom strategies.
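Because the delay is left to the caller, one possible custom strategy is exponential backoff. The following is a minimal sketch under that assumption; the article does not prescribe any particular strategy. The names backoffDelay, baseDelay, and maxDelay are hypothetical, and how the application tracks the attempt number is left open.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical tuning values, not taken from the article.
var (
	baseDelay = 1 * time.Second
	maxDelay  = 30 * time.Second
)

// backoffDelay returns the wait before retry number attempt (1-based).
// The delay doubles on every attempt and is capped at maxDelay.
func backoffDelay(attempt int) time.Duration {
	if attempt < 1 {
		attempt = 1
	}
	d := baseDelay
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	for attempt := 1; attempt <= 7; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, backoffDelay(attempt))
	}
}
```

The computed value could then be passed as the delay argument of consumer.ReconsumeLater(msg, delay) in place of a fixed duration.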

The DLQPolicy.MaxDeliveries parameter determines the maximum number of delivery attempts when a message keeps failing. If that maximum is reached and the message still has not been processed successfully, Pulsar pushes the message to the dead letter queue, that is, DLQPolicy.DeadLetterTopic.
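The routing decision described here can be modeled in a few lines. This is a simplified sketch, not the client's actual accounting: route and the destination type are hypothetical names, and the exact point at which the real client counts an attempt may differ.

```go
package main

import "fmt"

// destination is where a failed message goes next in this simplified model.
type destination string

const (
	retryTopic      destination = "retry"
	deadLetterTopic destination = "dead-letter"
)

// route decides where a message that just failed processing should go.
// failedAttempts includes the failure that just happened; maxDeliveries
// plays the role of DLQPolicy.MaxDeliveries.
func route(failedAttempts, maxDeliveries uint32) destination {
	if failedAttempts >= maxDeliveries {
		return deadLetterTopic
	}
	return retryTopic
}

func main() {
	const maxDeliveries = 3
	for attempt := uint32(1); attempt <= maxDeliveries; attempt++ {
		fmt.Printf("failure %d -> %s\n", attempt, route(attempt, maxDeliveries))
	}
}
```

With a limit of 3, the first two failures in this model go back to the retry topic and the third goes to the dead letter topic.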

Note ⚠️: the RLQ (retry letter queue) is a delay queue and is consumed in Shared mode!

Dead letter queue

When the retries run out, the message is routed to the dead letter queue. Note ⚠️: at this point the message's status becomes acknowledged. The dead letter queue is a non-partitioned persistent queue, and users can process its messages according to their own needs. The SDK provides the DLQPolicy.DeadLetterTopic parameter to set the name of the dead letter queue. By default, the name is {TopicName}-{Subscription}-DLQ.
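The naming convention quoted in this article for the retry and dead letter topics can be written out as two small helpers. This is only an illustration of the pattern as stated here; defaultRetryTopic and defaultDeadLetterTopic are hypothetical names, and the defaults of a given client version may differ, so setting the names explicitly in DLQPolicy is the safer choice.

```go
package main

import "fmt"

// defaultRetryTopic builds a name following the
// {TopicName}-{Subscription}-RETRY pattern quoted in the article.
func defaultRetryTopic(topic, subscription string) string {
	return fmt.Sprintf("%s-%s-RETRY", topic, subscription)
}

// defaultDeadLetterTopic builds a name following the
// {TopicName}-{Subscription}-DLQ pattern quoted in the article.
func defaultDeadLetterTopic(topic, subscription string) string {
	return fmt.Sprintf("%s-%s-DLQ", topic, subscription)
}

func main() {
	topic, sub := "persistent://group/server/xxx", "test"
	fmt.Println(defaultRetryTopic(topic, sub))
	fmt.Println(defaultDeadLetterTopic(topic, sub))
}
```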

Summary

To recap, the process is as follows:
1. In addition to the topic consumed normally, a retry queue is added for retrying, and the SDK subscribes to it automatically;
2. The retry queue is actually a delay queue, and unacknowledged messages are kept in a time-ordered priority queue;
3. When the retries are exhausted, the message enters the dead letter queue and its status becomes acknowledged; the user then consumes the dead letter queue to process the dead-letter messages.
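The time-ordered priority queue mentioned in step 2 can be illustrated with Go's container/heap. This is a conceptual sketch only, not how Pulsar or its client implements delayed delivery; delayedMsg, delayQueue, popReady, and demoQueue are hypothetical names.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// delayedMsg is a hypothetical message plus the time it becomes deliverable.
type delayedMsg struct {
	payload string
	readyAt time.Time
}

// delayQueue is a min-heap ordered by readyAt (earliest first).
type delayQueue []delayedMsg

func (q delayQueue) Len() int            { return len(q) }
func (q delayQueue) Less(i, j int) bool  { return q[i].readyAt.Before(q[j].readyAt) }
func (q delayQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *delayQueue) Push(x interface{}) { *q = append(*q, x.(delayedMsg)) }
func (q *delayQueue) Pop() interface{} {
	old := *q
	n := len(old)
	m := old[n-1]
	*q = old[:n-1]
	return m
}

// popReady removes and returns every message whose delay has elapsed at now.
func popReady(q *delayQueue, now time.Time) []string {
	var ready []string
	for q.Len() > 0 && !(*q)[0].readyAt.After(now) {
		ready = append(ready, heap.Pop(q).(delayedMsg).payload)
	}
	return ready
}

// demoQueue builds a queue with three messages due 1s, 3s, and 5s after start.
func demoQueue() (*delayQueue, time.Time) {
	start := time.Unix(0, 0)
	q := &delayQueue{}
	heap.Push(q, delayedMsg{"c", start.Add(5 * time.Second)})
	heap.Push(q, delayedMsg{"a", start.Add(1 * time.Second)})
	heap.Push(q, delayedMsg{"b", start.Add(3 * time.Second)})
	return q, start
}

func main() {
	q, start := demoQueue()
	// Three seconds in, messages a and b are due; c stays queued.
	fmt.Println(popReady(q, start.Add(3*time.Second)))
	fmt.Println("still waiting:", q.Len())
}
```

Messages come out in order of their due time regardless of the order in which they were inserted, which is the property a delay queue relies on.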

About the Author

My name is Hou Shengxin, also known as Dayun. I currently work on the infrastructure team at Banyu, where I am responsible for maintaining and developing the message queue services. I am a member of the Rust Daily Report team and enjoy studying storage and service governance. When I first came across Pulsar, I was drawn to its architecture, which separates storage from compute. The smooth producer and consumer onboarding and the high throughput made me curious about how the project is implemented, and I hope to contribute to Pulsar's related features in the future.
