The implementation of Kafka load balancing in vivo

vivo Internet Server Team-You Shuo

Replica migration is the most frequent operation in running Kafka. For a cluster with hundreds of thousands of replicas, completing replica migration by hand is very difficult. Cruise Control is an operation and maintenance tool for Kafka that covers bringing brokers online and offline, load balancing within the cluster, increasing and decreasing replica counts, repairing missing replicas, and demoting nodes. Cruise Control makes it much easier for us to operate and maintain large-scale Kafka clusters.
Note: This article is based on Kafka 2.1.1.

1. Kafka load balancing

1.1 Producer Load Balancing

The Kafka client uses a partitioner to choose the partition for each message. If no key is specified when sending a message, the default partitioner assigns partitions to messages with a round-robin algorithm.

Otherwise, the hash of the key is computed with the murmur2 algorithm, and the partition number is that hash modulo the number of partitions.
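The two branches can be sketched in Python. This is a minimal illustration, not the Kafka client code: SketchPartitioner is a hypothetical name, and zlib.crc32 stands in for murmur2 only to keep the sketch short, so the partition numbers it produces will not match those of a real producer.

```python
import itertools
import zlib


class SketchPartitioner:
    """Hypothetical stand-in for the default partitioner's two branches."""

    def __init__(self):
        # Counter that drives round-robin assignment for messages without a key.
        self._counter = itertools.count()

    def partition(self, key, num_partitions):
        if key is None:
            # No key: hand out partitions in turn.
            return next(self._counter) % num_partitions
        # Key present: hash the key bytes and take the result modulo the
        # partition count. crc32 is an assumption standing in for murmur2.
        return (zlib.crc32(key) & 0x7FFFFFFF) % num_partitions


partitioner = SketchPartitioner()
keyless = [partitioner.partition(None, 3) for _ in range(4)]
keyed = [partitioner.partition(b"order-42", 6) for _ in range(3)]
```

Messages without a key cycle through the partitions, while messages with the same key always land on the same partition.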

Obviously, this is not the Kafka load balancing we want to discuss, because producer load balancing is not that complicated.

1.2 Consumer Load Balancing

To handle consumers going offline and changes in the number of topic partitions, KafkaConsumer is also responsible for interacting with the server to redistribute partitions, so that consumers consume topic partitions in a more balanced way and consumption performance improves.

Kafka currently has two mainstream partition assignment strategies (the default is range; the strategy is set with the partition.assignment.strategy parameter):

range: on the premise of ensuring balance, assigns consecutive partitions to each consumer; the corresponding implementation is RangeAssignor.
round-robin: on the premise of ensuring balance, assigns partitions to consumers in turn; the corresponding implementation is RoundRobinAssignor.
Version 0.11.0.0 introduced a new partition assignment strategy, StickyAssignor. Its advantage is that it keeps the previous assignment as far as possible while still ensuring balance, which avoids many redundant reassignments and shortens the time spent on partition reassignment.
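The difference between the range and round-robin strategies can be sketched for a single topic as follows. The function names are hypothetical and the code only mirrors the idea of the two assignors, not their actual implementation.

```python
def range_assign(consumers, num_partitions):
    """Give each consumer a consecutive block of partitions."""
    consumers = sorted(consumers)
    per_consumer, extra = divmod(num_partitions, len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        # The first `extra` consumers each take one additional partition.
        size = per_consumer + (1 if i < extra else 0)
        assignment[consumer] = list(range(start, start + size))
        start += size
    return assignment


def round_robin_assign(consumers, num_partitions):
    """Deal partitions to consumers one at a time, in turn."""
    consumers = sorted(consumers)
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


by_range = range_assign(["c0", "c1"], 5)
by_round_robin = round_robin_assign(["c0", "c1"], 5)
```

With one topic both strategies are equally balanced. The difference shows up with several topics, because range assignment is applied topic by topic, so the first consumers keep receiving the extra partition of every topic.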

Whether it is a producer or a consumer, the Kafka client has already done load balancing for us, so do we still need to discuss load balancing? The answer is yes, because the main problem with Kafka's uneven load lies on the server side, not the client side.

2. Why does the Kafka server need load balancing

Let's first look at the traffic distribution of a Kafka cluster (Figure 1) and the traffic distribution of the cluster after new machines were brought online (Figure 2):

Figure 1 shows that traffic is not well balanced across the brokers in the resource group. Because the partitions of some topics are concentrated on certain brokers, a sudden increase in a topic's traffic causes a spike on only those brokers.

In this case, we need to add partitions to the topic or migrate partitions manually.

Figure 2 shows the traffic distribution of a resource group in our Kafka cluster after the expansion. The traffic cannot be automatically allocated to the newly expanded nodes. At this point, we need to manually trigger data migration so that traffic can be directed to the newly expanded node.

2.1 Kafka storage structure

Why does the above problem occur? To answer that, we need to start with Kafka's storage mechanism.

The following figure shows the storage structure of a Kafka topic. Its hierarchy is as follows:

Each broker node can specify multiple log directories through the log.dirs configuration item. Each of our online machines has 12 disks, and each disk corresponds to one log directory.
Each log directory contains several [topic]-[x] directories, each storing the data of one partition of one topic. Correspondingly, if the topic has 3 replicas, two more directories with the same name exist on other broker nodes in the cluster.
The data written by clients to Kafka ends up in .index, .timeindex, .snapshot and .log files, generated as matching sets in chronological order and saved in the corresponding topic partition directory.
To achieve high availability, our online topics generally have 2 or 3 replicas, and the replicas of a topic partition are placed on different broker nodes. Sometimes, to reduce the risk of a rack failure, the replicas of a topic partition are also required to be placed on broker nodes in different racks.

With the storage mechanism in mind, it is clear that the data clients write to Kafka is routed to different log directories on the brokers according to topic partition. Unless we intervene manually, the routing result never changes. And because the routing result never changes, a problem arises:

As the number of topics keeps growing and the number of partitions differs from topic to topic, the topic partitions eventually become unevenly distributed across the Kafka cluster.

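For example: topic1 has 10 partitions, topic2 has 15 partitions, topic3 has 3 partitions, and our cluster has 6 machines. Even when each topic is spread as evenly as possible over the 6 brokers, 4 brokers hold two topic1 partitions while the other 2 hold only one, 3 brokers hold three topic2 partitions while the other 3 hold two, and only 3 brokers hold a topic3 partition.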
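The arithmetic behind this example can be checked with a short Python sketch. It assumes, purely for illustration, that partition p of every topic is placed on broker p modulo 6 and counts one replica per partition; Kafka's real placement starts each topic at a different broker, so actual totals differ, but the per-topic unevenness is the same.

```python
from collections import Counter


def spread(num_partitions, num_brokers):
    """Partitions per broker when partition p is placed on broker p % num_brokers."""
    counts = Counter(p % num_brokers for p in range(num_partitions))
    return [counts[b] for b in range(num_brokers)]


topics = {"topic1": 10, "topic2": 15, "topic3": 3}
per_topic = {name: spread(n, 6) for name, n in topics.items()}
# Total partitions hosted by each of the 6 brokers.
totals = [sum(col) for col in zip(*per_topic.values())]
```

Under these assumptions the busiest brokers host twice as many partitions as the least busy ones, before any difference in per-partition traffic is taken into account.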

As a result, the inbound and outbound traffic on a broker with many partitions may be higher than on other brokers. The picture gets more complicated once we also consider that traffic differs between partitions of the same topic and between topics, and that we have 7,000 topics, 130,000 partitions and 270,000 replicas online.

In such a complex situation, the cluster will always contain some brokers with a particularly high load and some with a very low load. When a broker's load reaches a certain level, our operations colleagues need to step in and relieve the pressure on these brokers, which indirectly improves the overall load capacity of the cluster.

When the overall load of the cluster is very high and business traffic keeps growing, we add machines to the cluster. Some readers may ask: adding machines is a good thing, so what is the problem? The problem is the same as above: the routing result of data sent to topic partitions does not change. Without manual intervention, the traffic on the newly added machines stays at 0, and the load on the original brokers in the cluster is not reduced.

3. How to load balance Kafka

3.1 Manually generate migration plan and migration

As shown in the figure below, we simulate a simple scenario in which T0-P0-R0 denotes topic-partition-replica. We assume that every partition of a topic carries the same traffic and that the R0 replica of each partition is the leader.

There are two topics, T0 and T1. T0 has 5 partitions and 2 replicas (inbound and outbound traffic of 10 and 5), and T1 has 3 partitions and 2 replicas (inbound and outbound traffic of 5 and 1). If rack placement is strictly considered, the topic replicas may be distributed as follows:

Suppose we now add a new broker3 (Rack2), as shown in the following figure. Because rack distribution was taken into account when the topics were placed, the overall load of broker2 is higher.

We now want to migrate some replicas on broker2 to the newly added broker3. Considering racks, traffic, and the number of replicas, we migrate the four replicas T0-P2-R0, T0-P3-R1, T0-P4-R0 and T1-P0-R1 to broker3.

The result is still not very balanced, so let's switch the leader of partition T1-P2:

After these adjustments, the entire cluster is much more balanced. The commands for migrating the replicas and switching the leader are as follows:

Kafka replica migration script

# Replica migration script: kafka-reassign-partitions.sh
# 1. Write the reassignment file ("replicas" lists broker ids; brokerN is assumed to have id N)
$ vi topic-reassignment.json
{"version":1,"partitions":[
{"topic":"T0","partition":2,"replicas":[3,1]},
{"topic":"T0","partition":3,"replicas":[0,3]},
{"topic":"T0","partition":4,"replicas":[3,1]},
{"topic":"T1","partition":0,"replicas":[2,3]},
{"topic":"T1","partition":2,"replicas":[2,0]}
]}
# 2. Run the reassignment (--throttle is in bytes/sec; 73400320 = 70 MiB/s)
bin/kafka-reassign-partitions.sh --throttle 73400320 --zookeeper zkurl --execute --reassignment-json-file topic-reassignment.json
# 3. Check the reassignment status / clear the throttle configuration
bin/kafka-reassign-partitions.sh --zookeeper zkurl --verify --reassignment-json-file topic-reassignment.json
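
Writing the reassignment file by hand is error-prone once more than a few partitions are involved. As a minimal sketch, the file can be generated from a list of moves in Python. The helper build_reassignment is hypothetical, and broker ids 0 to 3 are assumed to correspond to broker0 to broker3 in the figures.

```python
import json


def build_reassignment(moves):
    """Build the reassignment document from (topic, partition, replicas) tuples."""
    partitions = []
    for topic, partition, replicas in moves:
        # A broker may hold at most one replica of a given partition.
        if len(set(replicas)) != len(replicas):
            raise ValueError(f"duplicate broker in {topic}-{partition}: {replicas}")
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": list(replicas)}
        )
    return {"version": 1, "partitions": partitions}


# Broker ids 0..3 are assumed to correspond to broker0..broker3 in the figures.
moves = [
    ("T0", 2, [3, 1]),
    ("T0", 3, [0, 3]),
    ("T0", 4, [3, 1]),
    ("T1", 0, [2, 3]),
    ("T1", 2, [2, 0]),
]
plan_json = json.dumps(build_reassignment(moves), indent=2)
```

The resulting string can be saved as topic-reassignment.json and passed to the --reassignment-json-file option.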

3.2 Use load balancing tool - cruise control

Having looked at the Kafka storage structure and at manually adjusting the distribution of topic partitions, we can see that Kafka is cumbersome to operate and maintain. Are there tools that can help us solve these problems?

The answer is yes.

Cruise control is a project developed by LinkedIn to address the difficulty of operating and maintaining Kafka clusters. It can dynamically balance the load on various resources of a Kafka cluster, including CPU, disk usage, inbound traffic, outbound traffic and replica distribution. Cruise control also supports preferred leader switching and topic configuration changes.

3.2.1 cruise control architecture

Let's briefly introduce the architecture of cruise control.

As shown in the figure below, it consists mainly of the Monitor, Analyzer, Executor and Anomaly Detector:

(Source: cruise control official website)

(1) Monitor

Monitor is divided into client-side Metrics Reporter and server-side Metrics Sampler:

Metrics Reporter implements Kafka's metric reporting interface MetricsReporter and reports native Kafka metrics to the topic __CruiseControlMetrics in a specific format.
Metrics Sampler reads the native metrics from __CruiseControlMetrics and aggregates them at broker level and partition level. The aggregated metrics include statistics such as the mean and maximum load of brokers and partitions. These intermediate results are sent to the topics __KafkaCruiseControlModelTrainingSamples and __KafkaCruiseControlPartitionMetricSamples.
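
The aggregation step can be pictured with a small Python sketch. It is not cruise control's sampler code: the aggregate function, the sample layout and the values are all hypothetical, and it only shows how raw samples reduce to a mean and a maximum per partition.

```python
from collections import defaultdict
from statistics import mean


def aggregate(samples):
    """Reduce raw (entity, value) samples to mean and max per entity."""
    grouped = defaultdict(list)
    for entity, value in samples:
        grouped[entity].append(value)
    return {e: {"mean": mean(v), "max": max(v)} for e, v in grouped.items()}


# Hypothetical bytes-in samples keyed by (topic, partition).
raw = [(("T0", 0), 10.0), (("T0", 0), 14.0), (("T0", 1), 6.0)]
partition_stats = aggregate(raw)
```

The same function works at broker level if the samples are keyed by broker id instead of by topic partition.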

(2) Analyzer

As the core part of cruise control, Analyzer generates migration plans based on optimization goals provided by users and the cluster load model generated by Monitor.

In cruise control, the "optimization goals provided by users" include hard goals and soft goals. Hard goals are goals that the Analyzer must satisfy when it computes the migration plan (for example, after migration the replicas must still satisfy the principle of rack dispersion). Soft goals are goals to be achieved as far as possible. If a replica can satisfy only one of a hard goal and a soft goal after migration, the hard goal takes precedence. If a hard goal cannot be satisfied, the analysis fails.
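
The precedence of hard goals over soft goals can be sketched in Python. This is an illustration under stated assumptions, not the Analyzer's code: rack dispersion serves as the hard goal, "prefer the least loaded broker" as the soft goal, and all function names, broker ids, racks and loads are hypothetical.

```python
def satisfies_rack_goal(replica_brokers, rack_of):
    """Hard goal: every replica of a partition sits in a different rack."""
    racks = [rack_of[b] for b in replica_brokers]
    return len(set(racks)) == len(racks)


def pick_destination(replicas, source, candidates, rack_of, load):
    """Return the best broker to receive the replica on `source`, or None."""
    allowed = []
    for dest in candidates:
        if dest in replicas:
            continue  # the broker already holds a replica of this partition
        proposed = [dest if b == source else b for b in replicas]
        if satisfies_rack_goal(proposed, rack_of):  # hard goal: must hold
            allowed.append(dest)
    if not allowed:
        return None  # no candidate satisfies the hard goal
    # Soft goal: among the allowed brokers, prefer the least loaded one.
    return min(allowed, key=lambda b: load[b])


rack_of = {0: "rack0", 1: "rack1", 2: "rack2", 3: "rack2", 4: "rack1"}
load = {0: 50, 1: 20, 2: 90, 3: 30, 4: 5}
dest = pick_destination([2, 1], source=2, candidates=[0, 3, 4], rack_of=rack_of, load=load)
blocked = pick_destination([2, 1], source=2, candidates=[4], rack_of=rack_of, load=load)
```

Broker 4 has the lowest load, so the soft goal alone would choose it, but it shares a rack with the other replica and is excluded by the hard goal. When it is the only candidate, the function returns None, which corresponds to the analysis failing.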

Where the Analyzer may need improvement:

Monitor generates a load model for the entire cluster, but our Kafka platform divides a Kafka cluster into multiple resource groups whose resource utilization differs greatly, so the native cluster load model no longer suits our scenario.
Most businesses do not specify keys when producing, so the load deviation between partitions is small. If topic partition replicas are evenly distributed within a resource group, then the resource group is also balanced.
Native cruise control balances at the cluster level. Once resource groups can be specified, balancing can be done at the resource-group level, but that still does not cover migration across resource groups.

(3) Executor

The Executor executes the migration plan produced by the Analyzer. It submits the migration plan to the Kafka cluster in batches through Kafka's interfaces, and Kafka then performs the replica migration according to the submitted plan.

Where the Executor may need improvement:

When cruise control performs operations such as replica migration, it cannot trigger a preferred leader switch for the cluster. Sometimes a machine is shut down and restarted during cluster balancing; for partitions whose preferred leader is on the failed machine, leadership cannot automatically switch back afterwards. The pressure on other nodes in the cluster then rises sharply, and a chain reaction often follows.

(4) Anomaly Detector

Anomaly Detector is a scheduled task. It periodically checks whether the Kafka cluster is unbalanced or has anomalies such as missing replicas. When such a situation occurs, Anomaly Detector automatically triggers load balancing within the cluster.

In the main function descriptions that follow, I mainly introduce the processing logic of the Monitor and Analyzer.

3.2.2 Balance the incoming and outgoing traffic of the broker / balance the machine online and offline

We have already described the causes, diagrams and manual solution for uneven traffic load among the brokers of a Kafka cluster. So how does cruise control solve this problem?

In fact, cruise control balances the cluster in basically the same way as we do by hand, except that it needs detailed metric data about the Kafka cluster's resources and analyzes that data to arrive at the final migration plan.

Take the number of topic partition leader replicas as the example resource:

After the server receives the balancing request, the Monitor will first build a model that can describe the load distribution of the entire cluster based on the cached cluster indicator data.

The following figure briefly describes how the load information for the entire cluster is generated. The sample fetcher thread converts the native metrics it has fetched into more readable Metric Samples and processes them further into statistical metrics carrying information such as broker id and partition. These metrics are stored in the load attribute of the corresponding broker and replica, so each broker and replica holds information such as traffic load, storage size, and whether the replica is currently the leader.

Analyzer will traverse the brokers we specify (by default, all the brokers in the cluster). Since each broker and its underlying topic partition replicas have detailed indicator information, the analysis algorithm directly sorts the brokers based on these indicators and the specified resources.

The resource in this example is the number of topic partition leader replicas. The Analyzer determines, from the maximum/minimum thresholds, dispersion factors and other settings we configured in advance, whether the number of leader replicas of a topic on the current broker needs to be increased or decreased. If it needs to be increased, the Analyzer updates the cluster model so that leader replicas of that topic move from heavily loaded brokers to the current broker, and vice versa. The section on our modifications below briefly describes the Analyzer's working process.
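
The threshold check can be sketched in Python for one topic. This is a simplified illustration rather than the Analyzer's algorithm: the thresholds are assumed to be the average leader count plus or minus a margin, and rebalance_leaders and its margin parameter are hypothetical names.

```python
def rebalance_leaders(leaders, margin=0.1):
    """Move leaders from brokers above the upper threshold to those below the lower one.

    leaders: {broker: leader count}. Returns (new counts, list of (src, dst) moves).
    """
    counts = dict(leaders)
    average = sum(counts.values()) / len(counts)
    upper, lower = average * (1 + margin), average * (1 - margin)
    moves = []
    while True:
        src = max(counts, key=counts.get)
        dst = min(counts, key=counts.get)
        # Stop once no broker is outside the allowed band.
        if counts[src] <= upper and counts[dst] >= lower:
            break
        # Stop if moving one more leader could not reduce the gap.
        if counts[src] - counts[dst] <= 1:
            break
        counts[src] -= 1
        counts[dst] += 1
        moves.append((src, dst))
    return counts, moves


balanced, moves = rebalance_leaders({0: 6, 1: 2, 2: 1})
```

Each entry in moves corresponds to one change applied to the cluster model; nothing is sent to Kafka at this stage.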

After traversing all the brokers and analyzing all the resources we specified, the Analyzer obtains the final version of the cluster model. Comparing it with the cluster model originally generated yields the topic migration plan.

Cruise control submits the topic migration plan to the Kafka cluster in batches, according to the migration strategy we specify.
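
Comparing the two cluster models and batching the result can be sketched as follows. The model here is reduced to a hypothetical mapping from (topic, partition) to a replica list, and a fixed batch size stands in for the migration strategy; the real cluster model and strategies are richer.

```python
def diff_models(before, after):
    """Return the partitions whose replica list differs between two models."""
    return [
        {"topic": t, "partition": p, "replicas": after[(t, p)]}
        for (t, p) in sorted(after)
        if before.get((t, p)) != after[(t, p)]
    ]


def batches(proposals, size):
    """Split the proposals into consecutive batches of at most `size`."""
    return [proposals[i:i + size] for i in range(0, len(proposals), size)]


before = {("T0", 2): [2, 1], ("T0", 3): [0, 2], ("T1", 1): [1, 0]}
after = {("T0", 2): [3, 1], ("T0", 3): [0, 3], ("T1", 1): [1, 0]}
proposals = diff_models(before, after)
submitted = batches(proposals, size=1)
```

Partitions whose replica list is unchanged, such as T1-1 here, produce no proposal and therefore no data movement.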

The migration plan diagram is as follows:

3.2.3 Preferred leader switching

To switch the non-preferred leader copy, the migration plan diagram is as follows:

3.2.4 Topic configuration changes

To change the number of topic replicas, the migration plan diagram is as follows:

3.3 Retrofit cruise control

3.3.1 Specify resource groups for balancing

When the cluster is very large, balancing the entire cluster becomes very difficult. A single round of balancing often takes half a month or longer, which quietly adds to the pressure on our operations colleagues.

For this scenario, we modified cruise control. We logically divide the Kafka cluster into multiple resource groups so that each business has its own resource group. When the traffic of one business fluctuates, other businesses are not affected.

By specifying resource groups, we only need to balance a small part or multiple parts of the cluster each time, which greatly shortens the balancing time and makes the balancing process more controllable.

The modified cruise control can do the following:

Through the balancing parameters, we can balance only the brokers of one or more resource groups.
When changing a topic's configuration, such as adding replicas, the newly added replicas must be in the same resource group as the topic's existing replicas.
Within a resource group, analyze whether resources should be moved onto or off a broker. For each type of resource goal, cruise control computes the statistics within the scope of the resource group and then combines them with thresholds and dispersion factors to decide whether the broker should move resources out or take resources in.
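
The effect of computing statistics within a resource group can be sketched in Python. The function classify_brokers, the margin-based thresholds and the traffic figures are hypothetical and only illustrate the scoping idea.

```python
def classify_brokers(load, group_of, group, margin=0.1):
    """Label each broker of one resource group as move_out, move_in or balanced."""
    # Only brokers of the requested resource group enter the statistics.
    members = {b: v for b, v in load.items() if group_of[b] == group}
    average = sum(members.values()) / len(members)
    upper, lower = average * (1 + margin), average * (1 - margin)
    result = {}
    for broker, value in members.items():
        if value > upper:
            result[broker] = "move_out"
        elif value < lower:
            result[broker] = "move_in"
        else:
            result[broker] = "balanced"
    return result


# Hypothetical inbound traffic per broker and resource-group membership.
load = {0: 100, 1: 60, 2: 80, 3: 10, 4: 12}
group_of = {0: "groupA", 1: "groupA", 2: "groupA", 3: "groupB", 4: "groupB"}
labels = classify_brokers(load, group_of, "groupA")
```

If the average were taken over the whole cluster instead (52.4 in this data), every broker in groupA would be above the upper threshold and marked as moving resources out, which is why the statistics have to be scoped to the resource group.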

As shown in the figure below, we save the metadata of clusters, resource groups, and topics under the resource group in the database, and the Analyzer can perform balanced analysis on each broker according to the resource distribution target within the scope of the specified resource group.

For example, when doing balance analysis on broker-0, the Analyzer traverses the goals list, in which each goal is responsible for one type of resource load target (CPU, inbound traffic, etc.). When the analysis reaches goal-0, goal-0 determines whether the load of broker-0 exceeds the upper threshold. If it does, some topic replicas on broker-0 need to be migrated to brokers with lower load; otherwise, replicas on other brokers need to be migrated to broker-0.

The recheck goals in the figure are the goals that come later in the list. When one of them performs balance analysis, before updating the cluster model it checks whether the proposed migration conflicts with the goals processed earlier. If there is a conflict, the cluster model is not updated, and the current goal keeps trying other brokers until it finds a suitable migration target; only then is the cluster model updated.
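
The recheck can be sketched in Python as a search for the first destination that no earlier goal rejects. The names first_accepted and replica_capacity_goal, and the replica-count limit used as the earlier goal, are assumptions made for illustration.

```python
def first_accepted(source, destinations, model, earlier_goals):
    """Return the first destination that every earlier goal accepts, or None."""
    for dest in destinations:
        if all(accepts(model, source, dest) for accepts in earlier_goals):
            return dest
    return None


def replica_capacity_goal(model, source, dest, limit=3):
    """Earlier goal: a broker may hold at most `limit` replicas."""
    return model[dest] + 1 <= limit


# Replica count per broker; the current goal wants to move one replica off broker 0.
model = {0: 4, 1: 3, 2: 1}
dest = first_accepted(0, [1, 2], model, [replica_capacity_goal])
if dest is not None:
    # The cluster model is updated only after the recheck has passed.
    model[0] -= 1
    model[dest] += 1
```

Broker 1 is tried first but rejected by the earlier goal, so the replica goes to broker 2 instead.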

3.3.2 Migrating topic/topic partitions to the specified broker

Consider these scenarios:

There are several resource groups under a project. Due to business changes, the business wants to migrate topics under resource group A to resource group B.
The business wants to migrate the topics of the public resource group to the C resource group.
After the balance is completed, it is found that there are always several topics/partitions that are not very evenly distributed.

The resource-group balancing described above cannot cover these scenarios. For them, the modified cruise control can do the following:

Balance only the specified topics or topic partitions;
Migrate the topics or topic partitions being balanced only to the specified brokers.
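
A plan that moves one topic onto a fixed set of brokers can be sketched as follows. The helper plan_topic_move is hypothetical, and rotating the starting broker for each partition is an assumption made here to spread the leaders; it is not necessarily how our modified cruise control chooses placements.

```python
def plan_topic_move(assignment, topic, target_brokers):
    """Reassign every partition of `topic` onto `target_brokers` only.

    assignment: {(topic, partition): [broker ids]}.
    """
    plan = {}
    start = 0
    for (t, p), replicas in sorted(assignment.items()):
        if t != topic:
            continue  # other topics are left untouched
        if len(replicas) > len(target_brokers):
            raise ValueError("not enough target brokers for the replication factor")
        # Rotate the starting broker per partition so leaders are spread out.
        plan[(t, p)] = [
            target_brokers[(start + i) % len(target_brokers)]
            for i in range(len(replicas))
        ]
        start += 1
    return plan


assignment = {("T0", 0): [0, 1], ("T0", 1): [1, 2], ("T1", 0): [0, 2]}
plan = plan_topic_move(assignment, "T0", [4, 5, 6])
```

Only the partitions of the named topic appear in the plan, and every replica in it sits on one of the specified brokers.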

3.3.3 New target analysis - topic partition leader replica dispersion

Most businesses do not specify a key when sending data, so the traffic and storage of the partitions of one topic are similar. That is, when the leader replicas of each topic's partitions are distributed as evenly as possible across the brokers of the cluster, the load on the cluster is also very even.

Some readers will ask: the number of topic partitions is not always divisible by the number of brokers, so won't the load on the brokers still differ in the end?

The answer is yes. Balancing the partition leader replicas alone cannot achieve the final balance.

For this scenario, the modified cruise control can do the following:

A new type of resource analysis is added: topic partition leader replica dispersion.
First, ensure that the leader replicas and follower replicas of each topic are distributed as evenly as possible across the brokers of the resource group.
On that basis, replicas are placed on brokers with lower load wherever possible.

As shown in the figure below, for each topic replica the Analyzer checks in turn whether the number of that topic's leaders on the current broker exceeds the upper threshold. If it does, the Analyzer selects a follower replica in the AR (assigned replicas) as the new leader, based on factors such as load, and switches leadership to it. If no broker in the AR meets the requirements, a broker outside the AR list is selected.
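
The selection inside the AR can be sketched in Python. This is an illustration under assumptions: pick_new_leader is a hypothetical name, eligibility is reduced to "the broker stays within the upper threshold after taking the leader", and load is a single number per broker.

```python
def pick_new_leader(assigned_replicas, leader, leader_count, load, upper):
    """Choose a follower in the AR to take over leadership, or None.

    Only followers whose broker stays within `upper` leaders after the switch
    qualify; among those, the one on the least loaded broker wins.
    """
    followers = [b for b in assigned_replicas if b != leader]
    eligible = [b for b in followers if leader_count[b] + 1 <= upper]
    if not eligible:
        return None  # the caller would have to look outside the AR
    return min(eligible, key=lambda b: load[b])


leader_count = {0: 4, 1: 2, 2: 1, 3: 0}
load = {0: 90, 1: 70, 2: 40, 3: 10}
new_leader = pick_new_leader([0, 1, 2], leader=0, leader_count=leader_count, load=load, upper=3)
none_in_ar = pick_new_leader([0, 1], leader=0, leader_count={0: 4, 1: 3}, load=load, upper=3)
```

Switching leadership to a follower inside the AR moves no data, which is why it is tried before choosing a broker outside the AR list.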

3.3.4 Final equalization effect

The following figure shows the traffic distribution of a resource group after balancing. The traffic deviation between nodes is very small. This strengthens the cluster's ability to withstand abnormal traffic surges, improves the overall resource utilization and service stability of the cluster, and reduces costs.

3.4 Install/deploy cruise control

3.4.1 Client Deployment: Indicator Collection

[Step 1]: Create a Kafka account for later production and consumption of indicator data

[Step 2]: Create three Kafka internal topics: topic a stores the native JMX metrics of the Kafka service; topics b and c store the partition metrics and model metrics processed by cruise control, respectively;

[Step 3]: Grant read/write and cluster operation permissions to the account created in step 1 for reading/writing the topic created in step 2;

[Step 4]: Modify kafka's server.properties and add the following configuration:

Configure the collection program on the Kafka service

# Modify Kafka's server.properties
metric.reporters=com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
cruise.control.metrics.reporter.bootstrap.servers=domain-name:9092

cruise.control.metrics.reporter.security.protocol=SASL_PLAINTEXT
cruise.control.metrics.reporter.sasl.mechanism=SCRAM-SHA-256
cruise.control.metrics.reporter.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username=\"ys\" password=\"ys\";

[Step 5]: Add the cruise-control-metrics-reporter jar to Kafka's lib directory: mv cruise-control-metrics-reporter-2.0.104-SNAPSHOT.jar kafka_dir/lib/;

[Step 6]: Restart the Kafka service.

3.4.2 Server Deployment: Indicator Aggregation/Balanced Analysis

(1) Download the zip file from https://github.com/linkedin/cruise-control and unzip it;

(2) Replace the cruise control jar with the one built from your own local cruise control submodule: mv cruise-control-2.0.xxx-SNAPSHOT.jar cruise-control/build/libs;

(3) Modify the cruise control configuration file, focusing on the following settings:

# Modify the cruise control configuration file
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-256
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username=\"ys\" password=\"ys\";
bootstrap.servers=domain-name:9092
zookeeper.connect=zkURL

(4) Modify the database connection configuration:

# cluster id
cluster_id=xxx  
db_url=jdbc:mysql://hostxxxx:3306/databasexxx
db_user=xxx
db_pwd=xxx

4. Summary

From the discussion above, we can see that Kafka has two obvious shortcomings:

Each partition replica in Kafka is bound to a disk on a machine, and a partition replica consists of a series of segments. A single partition can therefore occupy a large amount of disk space and put heavy pressure on that disk.
When brokers are added to the cluster, a rebalance must be performed, and it needs a well-controlled execution process to bring the load of every broker into balance without causing failures.

Cruise control was created to address the difficulty of operating and maintaining Kafka clusters, and it solves that problem well.

Reference article: