The interviewer asked me why a message queue loses messages. Fortunately, I had summed up a full set of eight-part essays

A middle-aged man with a beer belly, a plaid shirt, and a severely receding hairline walks toward you, a thermos in one hand and a MacBook tucked under his arm. He looks like an architect.

The interview begins, straight to the point.

Interviewer: I saw on your resume that your project uses message queues, specifically Kafka. Have you ever run into a situation where the message queue lost messages?

Me: [Puzzled] Message queues can lose messages? Then who would still use them? Are you sure? I have never run into lost messages and have never thought about it.

Interviewer: Hmm... young man, it seems there are some interview routines you still don't quite understand. That's all for today's interview. Here is your resume back; I'll walk you downstairs.

Come on! What kind of interview is this?
Could we have a few less tricks and a bit more sincerity?
Do I really have to memorize the "eight-part essay" (the stock interview answers) before I can go to an interview?
Fine, I'll go read the interview eight-part essay that Yideng put together.

Me: Sending and consuming a message through a message queue involves three stages: production, server-side persistence, and consumption, as shown in the figure below.

All three processes have the potential to lose messages.

Interviewer: Well, what exactly causes the messages to be lost? And how can message loss be prevented?

Me: Let me go through each stage:

1. Lost messages in the production process

Reason for loss: usually a network failure that prevents the message from reaching the broker.

Solution: resend the message.

To improve performance, Kafka sends messages asynchronously, so we only know that a message was delivered once we have the send result.
There are two ways to get the send result.

One is that Kafka wraps the send result in a Future object, and we can call the Future's get method to block synchronously until the result is available.

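Future<RecordMetadata> future = producer.send(new ProducerRecord<>(topic, message));
try {
    // get() blocks until the broker acknowledges the record or the send fails
    RecordMetadata recordMetadata = future.get();
    if (recordMetadata != null) {
        System.out.println("Send succeeded");
    }
} catch (Exception e) {
    // The send failed or the wait was interrupted; retry or record the failure here
    e.printStackTrace();
}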

The other is to use Kafka's callback function to get the result.

producer.send(new ProducerRecord<>(topic, message), new Callback() {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        // exception is null when the send succeeded
        if (exception == null) {
            System.out.println("Send succeeded");
        } else {
            System.out.println("Send failed");
        }
    }
});

If sending fails, there are two retry options:

Manual retry
In the catch block or the else branch, call the send method again. What if it still fails? Create an exception message table in the database, store the failed message there, and let an asynchronous task retry it, controlling the number of retries and the interval between them.

Automatic retry
Kafka supports automatic retry with the parameters below. When a partition leader election is in progress or there are not enough in-sync followers, the producer retries automatically.
```
# Retry up to 3 times
retries = 3
# Wait 100 ms between retries
retry.backoff.ms = 100
```
In general we do not rely on Kafka's automatic retry alone: once the retry limit is exceeded the send still fails, and we have to retry manually anyway.
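
The manual-retry idea above (store the failed message, then let a background task retry it a bounded number of times) could be sketched roughly like this. It is only an illustration under assumptions: `MessageSender` is a hypothetical stand-in for a blocking Kafka send such as `producer.send(record).get()`, and the in-memory queue stands in for the exception message table in the database. None of these class names come from Kafka.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a blocking Kafka send, e.g. producer.send(record).get().
interface MessageSender {
    void send(String message) throws Exception;
}

// A failed message plus the total number of send attempts made so far.
class FailedMessage {
    final String payload;
    int attempts;

    FailedMessage(String payload, int attempts) {
        this.payload = payload;
        this.attempts = attempts;
    }
}

class RetryingSender {
    private final MessageSender sender;
    private final int maxAttempts;
    // In-memory stand-in for the "exception message table" (an assumption for this sketch).
    private final Deque<FailedMessage> failedTable = new ArrayDeque<>();

    RetryingSender(MessageSender sender, int maxAttempts) {
        this.sender = sender;
        this.maxAttempts = maxAttempts;
    }

    // First attempt: on failure, record the message instead of dropping it.
    boolean send(String message) {
        try {
            sender.send(message);
            return true;
        } catch (Exception e) {
            failedTable.add(new FailedMessage(message, 1));
            return false;
        }
    }

    // One pass of the asynchronous retry task; a scheduler would call this
    // at a fixed interval. Returns how many messages were delivered in this pass.
    int retryPending() {
        int delivered = 0;
        int pending = failedTable.size();
        for (int i = 0; i < pending; i++) {
            FailedMessage failed = failedTable.poll();
            try {
                sender.send(failed.payload);
                delivered++;
            } catch (Exception e) {
                failed.attempts++;
                if (failed.attempts < maxAttempts) {
                    failedTable.add(failed); // try again on the next pass
                }
                // Otherwise give up; a real system would raise an alert here.
            }
        }
        return delivered;
    }

    int pendingCount() {
        return failedTable.size();
    }
}
```

In a real service, `retryPending()` would probably be driven by a scheduled task, and the interval between passes plays the role of the retry interval mentioned above.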

2. Lost messages in the server-side persistence process

To keep performance high, Kafka flushes to disk asynchronously. If a broker goes down after a message has been sent successfully but before it has been flushed to disk, that message is lost.

Of course, we can also set the flush frequency:

# Flush to disk every 1000 messages
flush.messages = 1000
# Flush to disk once per second
flush.ms = 1000
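
To see why asynchronous flushing can lose acknowledged messages, here is a toy model of a single broker. It is a sketch under assumptions, not Kafka's storage layer: `ToyBrokerLog` and its methods are made up for illustration, `flushEveryMessages` only plays the role of `flush.messages`, and `crash()` models a whole-machine failure in which unflushed data disappears.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of asynchronous flushing on one broker; not Kafka's storage layer.
class ToyBrokerLog {
    private final int flushEveryMessages;                      // plays the role of flush.messages
    private final List<String> pageCache = new ArrayList<>();  // acknowledged, not yet on disk
    private final List<String> disk = new ArrayList<>();       // survives a crash

    ToyBrokerLog(int flushEveryMessages) {
        this.flushEveryMessages = flushEveryMessages;
    }

    // Append returns immediately ("send succeeded") before the data reaches disk.
    void append(String message) {
        pageCache.add(message);
        if (pageCache.size() >= flushEveryMessages) {
            flush();
        }
    }

    // Move everything buffered so far onto "disk".
    void flush() {
        disk.addAll(pageCache);
        pageCache.clear();
    }

    // A machine crash discards whatever was not flushed; returns how many messages were lost.
    int crash() {
        int lost = pageCache.size();
        pageCache.clear();
        return lost;
    }

    int durableCount() {
        return disk.size();
    }
}
```

With a flush threshold of 3 and five appended messages, three are durable and a crash at that point loses the remaining two, even though all five sends were reported as successful.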

Let's first go over the architecture of a Kafka cluster:

A Kafka cluster consists of multiple brokers; each broker is a node (machine).
A topic has multiple partitions, which are distributed across different brokers to make full use of the machines in the cluster. To expand capacity, you only need to add machines and partitions.

A partition has multiple replicas: one leader replica (primary) and multiple follower replicas (secondaries). Replication is what keeps the data safe.

Messages are both produced to and consumed from the leader, while the followers regularly pull messages from the leader. When the producer waits for acknowledgement from the followers, the send succeeds only after they have pulled the message from the leader.

To speed up persistence, Kafka keeps the followers that are keeping up with the leader in an ISR list (in-sync replicas) and those that have fallen behind in an OSR list (out-of-sync replicas). ISR + OSR = AR (assigned replicas).
If a follower has not pulled messages from the leader for a while and falls too far behind, it is removed from the ISR and placed in the OSR.
If a follower catches up with the leader again, it is put back into the ISR.
If the leader fails, a new leader is elected from the followers in the ISR.
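
The ISR rules above can be sketched as a small toy model. This is an illustration under assumptions, not Kafka's implementation: `ToyPartition` and its methods are hypothetical, time is passed in explicitly as milliseconds, and picking the first ISR member as the new leader is an arbitrary simplification.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model of ISR bookkeeping for one partition; not Kafka's actual implementation.
class ToyPartition {
    private final long maxLagMs;                                   // lag threshold for staying in the ISR
    private final Set<String> assigned = new LinkedHashSet<>();    // AR: all replicas
    private final Set<String> isr = new LinkedHashSet<>();         // in-sync replicas, leader included
    private final Map<String, Long> lastCaughtUpMs = new LinkedHashMap<>(); // followers only
    private String leader;

    ToyPartition(String leader, long maxLagMs, long nowMs, String... followers) {
        this.leader = leader;
        this.maxLagMs = maxLagMs;
        assigned.add(leader);
        isr.add(leader);
        for (String f : followers) {
            assigned.add(f);
            isr.add(f);
            lastCaughtUpMs.put(f, nowMs);
        }
    }

    // Called when a follower has pulled everything from the leader.
    void followerCaughtUp(String follower, long nowMs) {
        lastCaughtUpMs.put(follower, nowMs);
        isr.add(follower); // rejoins the ISR if it had been removed
    }

    // Periodic check: followers that have lagged for too long leave the ISR.
    void shrinkIsr(long nowMs) {
        for (Map.Entry<String, Long> e : lastCaughtUpMs.entrySet()) {
            if (nowMs - e.getValue() > maxLagMs) {
                isr.remove(e.getKey());
            }
        }
    }

    // On leader failure, promote the first remaining ISR member (a simplification).
    String failLeader() {
        isr.remove(leader);
        if (isr.isEmpty()) {
            throw new IllegalStateException("no in-sync replica available");
        }
        leader = isr.iterator().next();
        lastCaughtUpMs.remove(leader); // the new leader is no longer tracked as a follower
        return leader;
    }

    Set<String> isr() {
        return isr;
    }

    // OSR = AR minus ISR.
    Set<String> osr() {
        Set<String> result = new LinkedHashSet<>(assigned);
        result.removeAll(isr);
        return result;
    }
}
```

Because only ISR members are eligible in this model, a follower that has fallen behind can never become leader, which is the property that protects already replicated messages.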

To balance persistence performance against reliability, we can tune a few settings:

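# Remove a follower from the ISR if it has not pulled messages from the leader for more than 1 second
replica.lag.time.max.ms = 1000
# Remove a follower from the ISR if it falls 1000 messages behind the leader
# (this setting was removed in Kafka 0.9; newer versions use only the time-based check)
replica.lag.max.messages = 1000

# Require at least 3 in-sync replicas (the leader counts as one) for a write with acks = all to succeed
min.insync.replicas = 3

# Asynchronous send: no leader acknowledgement is needed and success is returned to the producer immediately; high chance of losing messages
acks = 0
# The leader writes the message to its local log and returns success without waiting for the followers; small chance of losing messages
acks = 1
# The leader waits for all followers in the ISR to acknowledge before returning success; the safest setting
# (acks = -1 and acks = all are equivalent)
acks = all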
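
Note that `acks`, `retries` and `retry.backoff.ms` are producer settings, while `min.insync.replicas` and `replica.lag.time.max.ms` are set on the broker (or topic) side. A producer configuration along these lines might look like the sketch below. It uses plain `java.util.Properties` with string keys so it compiles without the Kafka client; the class name `ReliableProducerConfig` and the choice of string serializers are assumptions for illustration.

```java
import java.util.Properties;

// Sketch of producer-side settings; the class name and serializer choice are assumptions.
class ReliableProducerConfig {
    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Wait for every replica in the ISR before the send counts as successful.
        props.put("acks", "all");
        // Automatic retry settings from the production section.
        props.put("retries", "3");
        props.put("retry.backoff.ms", "100");
        return props;
    }
}
```

The resulting `Properties` object would then be passed to the `KafkaProducer` constructor.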

3. Lost messages during consumption

Kafka has the concept of an offset. A consumer pulls messages from a partition and, after processing them locally, commits the offset to mark them as consumed, so they will not be pulled again.
With automatic offset commits, an offset can be committed before processing has finished; if the service goes down at that point, the message is never processed and is effectively lost. We therefore turn off automatic commits and commit manually once processing is complete.

 enable.auto.commit = false
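
The difference between committing before and after processing can be shown with a toy simulation. This is a sketch under assumptions and does not use the Kafka consumer API: `OffsetCommitDemo` is a made-up class, the list stands in for a single partition, and the consumer crashes exactly once while handling the message at index `crashAt`.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of offset commits for a single partition; not the Kafka consumer API.
class OffsetCommitDemo {
    // Consumes `log` from offset 0, crashing once while processing the message at
    // `crashAt`, then restarting from the last committed offset.
    // Returns the messages whose processing completed.
    static List<String> consume(List<String> log, int crashAt, boolean commitBeforeProcessing) {
        List<String> processed = new ArrayList<>();
        int committed = 0;
        int position = 0;
        boolean crashed = false;
        while (position < log.size()) {
            if (commitBeforeProcessing) {
                committed = position + 1;   // offset committed as soon as the message is pulled
            }
            if (!crashed && position == crashAt) {
                crashed = true;
                position = committed;       // restart: resume from the last committed offset
                continue;
            }
            processed.add(log.get(position));
            if (!commitBeforeProcessing) {
                committed = position + 1;   // manual commit after processing succeeds
            }
            position++;
        }
        return processed;
    }
}
```

Committing first skips the message that was in flight during the crash, while committing after processing pulls it again after the restart. The price of the second approach is that a message can be delivered more than once, so the processing logic should tolerate duplicates.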

Interviewer: Impressive. Even I couldn't have summed it up that completely. Come to work tomorrow, at double the salary.

Summary of the knowledge points of this article:

The article is continuously updated. Search for "一灯架构" (Yideng Architecture) on WeChat to read more technical articles as soon as they are published.
