Author: vivo internet server team - Luo Mingbo
1. Kafka cluster deployment architecture
So that readers can follow the later problem analysis more easily, let us first walk through the deployment architecture of our Kafka clusters and the process a business goes through to access them.
To avoid a single over-sized cluster, we split the Kafka deployment that handles ten trillion messages per day into multiple clusters along business lines. If the split is too coarse, an individual cluster becomes too large; sudden traffic changes, weak resource isolation, and rate limiting then easily hurt its stability and availability, and the cluster has little capacity to absorb emergencies.
Because Kafka stores data and serves traffic on the same nodes, scaling a cluster in or out takes a long time, and capacity cannot be added quickly enough to absorb sudden traffic when it hits. We therefore split the clusters along the business dimension, the importance of the data, and whether it affects commercialization, and added a logical layer called the "resource group" on top of each cluster. Broker nodes within a resource group are shared, while node resources in different resource groups are isolated from each other, so that a failure in one resource group does not trigger an avalanche effect across groups.
2. The process for a business to access the Kafka cluster
- Register a business project with the Kafka platform.
- If the project's business data is important or directly affects commercialization, apply to create a dedicated resource group for the project. If the data volume is small and the integrity requirements are not strict, the public resource group provided by the cluster can be used directly, with no application needed.
- Bind the project to a logical resource group.
- Create topics through the interface provided by the Kafka platform, which strictly constrains the topic's partition distribution to the broker nodes managed by the resource group bound to the project (see the sketch after this list).
- Authorize read and write operations on the topic.
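The platform's topic-creation interface is internal to vivo, but the constraint it enforces can be approximated with the standard Java AdminClient by passing an explicit replica assignment restricted to the brokers of the project's resource group. The broker ids, topic name, and bootstrap address below are hypothetical; this is only a sketch of the idea, not the platform's actual implementation.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateTopicInResourceGroup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-cluster:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // Explicit assignment: partition id -> replica broker ids.
            // Only brokers 101, 102, 103 (the project's resource group) are used,
            // so the topic's partitions never land outside the resource group.
            Map<Integer, List<Integer>> assignment = new HashMap<>();
            assignment.put(0, Arrays.asList(101, 102));
            assignment.put(1, Arrays.asList(102, 103));
            assignment.put(2, Arrays.asList(103, 101));

            NewTopic topic = new NewTopic("projectA-topic", assignment); // hypothetical topic name
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```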
With the deployment architecture and the access process described above, we should now be on the same page for the analysis that follows.
The deployment architecture diagram shows that our smallest unit of resource isolation is the "resource group": a failure affects multiple broker nodes within the same resource group, while broker nodes under different resource groups are logically isolated from each other.
With this background aligned, let us start the troubleshooting journey.
3. Failure overview
When the failure occurred, the traffic of almost every topic in the resource group containing the faulty node dropped to zero. In the production environment we monitor and alert on the Kafka cluster's disk metrics READ, WRITE, IO.UTIL, AVG.WAIT, READ.REQ, and WRITE.REQ. Because the failure happened in the early morning, handling it took a long time, so topic traffic on the business side was lost for an extended period, which had a considerable impact on the business.
4. Monitoring indicators
4.1 Traffic monitoring
1. When the fault occurred, the network idle rate of the faulty node briefly dropped to zero, which matches the production traffic monitoring: whenever production traffic rose, the faulty node's network idle rate dropped to zero in step.
2. In the Grafana monitoring, the production traffic of almost all topics dropped to zero.
3. The Kafka platform's project-level monitoring likewise shows the production traffic of multiple topics in the project dropping to zero.
4.2 Disk metrics monitoring
The IO.UTIL metric of the sdf disk reached 100%, whereas roughly 80% is what we treat as the threshold for stable service operation.
The AVG.WAIT metric of the sdf disk reached minute-level latency, whereas a delay of roughly 400 ms is generally treated as the threshold for stable service operation.
4.3 Kafka server logs and system logs
The log of the Kafka cluster's controller node contains Input/Output error entries.
The Linux system log contains Buffer I/O error entries.
5. Fault conjecture and analysis
From the monitoring above it is clear that the failure was caused by a fault on the sdf disk of a Kafka broker node, and service could be restored simply by removing the sdf disk and restarting the corresponding broker. Is that the end of it? Of course not.
Readers with some knowledge of Kafka will know that when a topic is created, its partitions are distributed evenly across different broker nodes in the cluster. Even if one broker fails, the partitions on the other brokers should still be able to produce and consume normally, and if they can, the traffic of the whole topic should not drop to nearly zero.
As shown in the figure above, the three partitions of topicA are distributed on three physical host nodes, brokerA, brokerB, and brokerC, respectively.
When a producer sends messages to topicA, it establishes long-lived connections to each of the three physical hosts brokerA, brokerB, and brokerC. If brokerB fails and can no longer serve requests, our expectation is that brokerA and brokerC should be unaffected and should continue to accept messages from the producer.
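As a concrete illustration, here is a minimal producer sketch for this scenario. The broker host names and topic name are placeholders taken from the figure; this only shows the expected behavior, not the business team's actual code.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TopicAProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The bootstrap list only seeds metadata discovery; the producer then opens
        // a long-lived connection to every broker that leads a partition of topicA.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "brokerA:9092,brokerB:9092,brokerC:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                // No explicit partition and no key: the partitioner may route each record
                // to any partition, so a single dead broker should, in theory, only affect
                // the partitions it leads.
                producer.send(new ProducerRecord<>("topicA", "message-" + i), (metadata, exception) -> {
                    if (exception != null) {
                        System.err.println("send failed: " + exception);
                    }
                });
            }
        }
    }
}
```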
However, the monitoring data shows that when brokerB failed, the traffic of the whole topic dropped to zero, which is very different from what we expected.
Since something like a service avalanche caused the overall traffic of some topics to drop to almost zero, it makes sense to reason about the cause from the angle of resource isolation and look for which link in the whole sending path might be missing it.
On the Kafka server side we already isolate brokers logically through resource groups, and the Grafana monitoring shows that not every topic's traffic dropped to zero, so we temporarily shifted the focus of the analysis to the Kafka client side: is there some shared resource in the Kafka producer's send path whose exhaustion could cause an avalanche across the whole topic?
6. Partitioning rules for Kafka's default partitioner
Readers familiar with Kafka's produce path will know that Kafka, as a message middleware for massive data in the big-data ecosystem, was designed from the start to buffer messages on the client and send them in batches, so that a single network I/O ships a whole batch to the Kafka server. The design of the producer's client-side buffer deserves an in-depth article of its own; for reasons of space it is not analyzed in detail here.
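The buffering behavior is governed by a handful of producer configurations. The fragment below extends the producer sketch above; the values quoted in the comments are the vanilla Java client's defaults or illustrative settings and may differ across client versions. It is shown only to make the buffer-and-batch mechanism concrete.

```java
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "brokerA:9092"); // placeholder

// Total memory the producer may use to buffer records not yet sent (default ~32 MB).
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 32 * 1024 * 1024L);
// Records for the same partition are packed into batches up to this size (default 16 KB).
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16 * 1024);
// How long a batch may linger waiting for more records before being sent anyway.
props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
// How long a request to a broker may wait for a response before being considered failed.
props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);
// How long send() may block when the buffer is full before giving up (default 60 s).
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 60_000);
```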
Based on this analysis, we can make the following conjectures about how the client might handle a batch of messages destined for a faulty node:
- Fail fast: record the faulty node, route subsequent messages only to healthy nodes, and quickly free the message buffer memory.
- Fail fast: record the faulty node, report an error immediately whenever a message is routed to the faulty node again, and quickly free the buffer memory.
- Wait for a timeout: messages keep being routed to the faulty node, and the resources they occupy are only released after each message waits out the timeout period.
In the first case, messages are only ever routed to healthy nodes, so the client buffer cannot be exhausted and there is no avalanche effect;
in the second case, a message routed to the faulty node is refused buffer allocation outright, so again there is no avalanche effect;
in the third case, the buffer space occupied by batches for the faulty node is only released after one or more timeouts. Under massive send rates, the messages accumulated for the faulty node within a single timeout period are enough to exhaust the client buffer, so the other, available partitions can no longer allocate buffer space either, which produces the avalanche effect.
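Continuing the producer sketch above, the fragment below illustrates the third case. It is a hypothetical illustration: the exact exception type, and whether it is thrown from send() or delivered through the returned future, depends on the client version.

```java
// Batches destined for the failed broker keep holding buffer.memory until they
// time out, so even a record aimed at a healthy partition may find no free space.
try {
    // Partition 0 is assumed to live on a healthy broker.
    producer.send(new ProducerRecord<>("topicA", 0, null, "value-for-healthy-partition"))
            .get();
} catch (Exception e) {
    // Typically surfaces after max.block.ms as a TimeoutException (or
    // BufferExhaustedException in newer clients): the healthy partition is
    // starved because the faulty one is hogging the shared buffer.
    System.err.println("send to healthy partition failed: " + e);
}
```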
With these conjectures in mind, we opened the source code of the Kafka client producer and examined the partitioning rules of the DefaultPartitioner, which boil down to the following logic (a simplified sketch follows this list):
- If a partition is specified when sending, the message is sent directly to that partition and no routing is performed.
- If the message specifies a key, the hash of the key is taken modulo the total number of partitions of the topic to obtain the target partition (this matches the third conjecture).
- If the message specifies neither a partition nor a key, a self-incrementing counter is taken modulo the number of available partitions of the topic to obtain the target partition (this matches the first conjecture).
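For reference, here is a simplified paraphrase of the classic DefaultPartitioner logic in the Java client (before the sticky partitioner introduced in 2.4). It is not the verbatim source, but it captures the two branches described above.

```java
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class DefaultPartitionerSketch {

    private final AtomicInteger counter = new AtomicInteger(0);

    public int partition(String topic, byte[] keyBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();

        if (keyBytes == null) {
            // No key: round-robin over the *available* partitions only, so
            // partitions whose leader broker is down are skipped (first conjecture).
            List<PartitionInfo> available = cluster.availablePartitionsForTopic(topic);
            int next = Utils.toPositive(counter.getAndIncrement());
            if (!available.isEmpty()) {
                return available.get(next % available.size()).partition();
            }
            return next % numPartitions;
        }

        // Key present: hash over *all* partitions, so records keep being routed to
        // partitions on the failed broker as well (third conjecture).
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }
}
```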
7. Summary
- The source-code analysis shows that if a key is specified when sending and the Kafka producer's default partitioner is used, a broker failure can exhaust the producer's client buffer and trigger the avalanche effect across all partitions of the topic.
- The business-system team confirmed that their sending logic does specify a key and uses the Kafka producer's default partitioner.
- The root cause is thus confirmed.
8. Recommendations
- Do not specify a key when sending messages unless it is necessary; otherwise a failure may cascade into an avalanche across all partitions of the topic.
- If you really do need to send keyed messages, it is recommended not to use the Kafka producer's default partitioner, because with a key it routes over all partitions regardless of their availability and can therefore trigger the avalanche effect (see the sketch after this list).
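One possible workaround, sketched below, is a custom partitioner that hashes keyed records over the available partitions only. This is a hypothetical example rather than an official Kafka recommendation, and it sacrifices the stable key-to-partition mapping (and therefore strict per-key ordering) whenever partitions become unavailable.

```java
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

import java.util.List;
import java.util.Map;

public class AvailablePartitionsPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // Prefer partitions whose leader is currently reachable; fall back to
        // all partitions only when none are reported available.
        List<PartitionInfo> available = cluster.availablePartitionsForTopic(topic);
        List<PartitionInfo> candidates = available.isEmpty()
                ? cluster.partitionsForTopic(topic) : available;
        byte[] bytes = (keyBytes != null) ? keyBytes : new byte[0];
        int index = Utils.toPositive(Utils.murmur2(bytes)) % candidates.size();
        return candidates.get(index).partition();
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}
```

It would be enabled on the producer with `props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, AvailablePartitionsPartitioner.class.getName());`.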
9. Further questions to think about
- Why does the Kafka producer's default partitioner take the modulo over all partitions of the topic when a key is specified, but use a self-incrementing counter modulo the available partitions when no key is given?
- The problem analyzed in this article exists because the client buffer is shared at the producer-instance level; could the buffer granularity instead be lowered to the partition level?
We will explore this series of questions in subsequent articles, so stay tuned.