
1. Kafka application

This article summarizes the capabilities we need when Kafka cluster traffic reaches the level of trillions of records per day, tens of trillions of records per day, or even higher, in order to ensure the cluster's high availability, high reliability, high performance, high throughput, and safe operation.

The summary here is based on Kafka version 2.1.1 and covers cluster version upgrades, data migration, traffic limiting, monitoring and alerting, load balancing, cluster expansion/shrinkage, resource isolation, cluster disaster recovery, cluster security, performance optimization, platformization, defects of the open source version, community trends, and so on. This article focuses on the core outline without going into too much detail. First, let's look at some core application scenarios of Kafka as a data hub.

The following figure shows some mainstream data processing flows, with Kafka acting as the data hub.

Next, let's look at the overall architecture of our Kafka platform.

1.1 Version upgrade

1.1.1 How to upgrade and roll back the open source version

Official website address: http://kafka.apache.org

1.1.2 How to upgrade and roll back the modified (source-transformed) version

During the upgrade process, old and new code logic will inevitably coexist: some nodes in the cluster run the open source version while others run the modified version. Therefore, we need to consider how old and new code mix during the upgrade, how to keep them compatible, and how to roll back when a failure occurs.

1.2 Data migration

Due to the architectural characteristics of a Kafka cluster, traffic load across the cluster will inevitably become unbalanced, so we need to perform data migration to balance traffic among the nodes in the cluster. The open source version of Kafka provides the script tool bin/kafka-reassign-partitions.sh for data migration; if you have not implemented automatic load balancing, you can use this script.

Generating migration plans with the open source script requires entirely manual intervention. When the cluster is very large, migration becomes very inefficient and is generally measured in days. Of course, we can implement an automated balancing program; once load balancing is automated, we basically call internally provided APIs, and the program generates migration plans and executes migration tasks for us. Note that there are two types of migration plans: those that specify the destination data directory and those that do not, and plans that specify the data directory require ACL security authentication to be configured. (An example of submitting and verifying a plan follows the JSON samples in 1.2.1.)

Official website address: http://kafka.apache.org

1.2.1 Data migration between brokers

Without specifying the data directory:

// Migration plan without a specified destination data directory
{
    "version":1,
    "partitions":[
        {"topic":"yyj4","partition":0,"replicas":[1000003,1000004]},
        {"topic":"yyj4","partition":1,"replicas":[1000003,1000004]},
        {"topic":"yyj4","partition":2,"replicas":[1000003,1000004]}
    ]
}

With a specified data directory:

// Migration plan with specified destination data directories
{
    "version":1,
    "partitions":[
        {"topic":"yyj1","partition":0,"replicas":[1000006,1000005],"log_dirs":["/data1/bigdata/mydata1","/data1/bigdata/mydata3"]},
        {"topic":"yyj1","partition":1,"replicas":[1000006,1000005],"log_dirs":["/data1/bigdata/mydata1","/data1/bigdata/mydata3"]},
        {"topic":"yyj1","partition":2,"replicas":[1000006,1000005],"log_dirs":["/data1/bigdata/mydata1","/data1/bigdata/mydata3"]}
    ]
}
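For reference, a plan like the ones above is submitted and verified with the open source script. A typical invocation against Kafka 2.1.1 might look like the following sketch (the ZooKeeper address, file name, and throttle value are placeholders):

# Submit the migration plan, throttling replica sync to ~50MB/s
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181/kafka \
  --reassignment-json-file migration-plan.json \
  --execute --throttle 52428800

# Check migration progress; this also clears the throttle once all
# partitions report "completed successfully"
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181/kafka \
  --reassignment-json-file migration-plan.json \
  --verify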

1.2.2 Data migration between disks within a broker

Servers in production generally mount multiple hard disks, for example 4 or 12. In a Kafka cluster it can happen that traffic between brokers is fairly balanced, but within a broker the traffic between disks is not, leaving some disks overloaded, which affects cluster performance and stability and makes poor use of hardware resources. In this case, we need to balance the traffic of the multiple disks within the broker so that traffic is distributed more evenly across them.

1.2.3 Concurrent data migration

The replica migration tool bin/kafka-reassign-partitions.sh provided by the current open source version of Kafka (2.1.1) can only run migration tasks serially within the same cluster. For clusters where multiple resource groups have already been physically isolated, the resource groups cannot affect each other, yet migration tasks cannot be submitted for them in parallel, so migration efficiency is rather low. This deficiency was not resolved until version 2.6.0. If you need concurrent data migration, you can either upgrade to a newer Kafka version or modify the Kafka source code.

1.2.4 Terminate data migration

The replica migration tool bin/kafka-reassign-partitions.sh in the current open source version (2.1.1) cannot terminate a migration once the task has started. When a migration task affects the stability or performance of the cluster, there is nothing to do but wait for it to finish (succeed or fail). This deficiency was likewise not resolved until version 2.6.0. If you need to terminate a data migration, you can either upgrade the Kafka version or modify the Kafka source code.

1.3 Traffic limiting

1.3.1 Producer and consumer traffic limiting

Sudden, unpredictable abnormal production or consumption traffic often puts huge pressure on the cluster's IO and other resources, ultimately affecting the stability and performance of the whole cluster. We can therefore limit the traffic of users' production, consumption, and inter-replica data synchronization. This traffic limiting mechanism is not meant to restrict users, but to prevent sudden traffic from affecting the stability and performance of the cluster, and thus to provide users with better service.

As shown in the figure below, the node's inbound traffic jumped from about 140MB/s to 250MB/s, and its outbound traffic from about 400MB/s to 800MB/s. Without a traffic limiting mechanism, multiple nodes in the cluster would be at risk of being brought down by such abnormal traffic, possibly even causing a cluster avalanche.

Production/consumption traffic limiting official website address: http://kafka.apache.org

For producer and consumer traffic limiting, the official website provides the following combinations of dimensions (of course, this quota mechanism has certain flaws, which we will mention later in "Functional defects of the open source version"); a CLI example follows the list:

/config/users/<user>/clients/<client-id> // limit by user + client ID combination
/config/users/<user>/clients/<default>
/config/users/<user> // limit by user; this is the approach we use most often
/config/users/<default>/clients/<client-id>
/config/users/<default>/clients/<default>
/config/users/<default>
/config/clients/<client-id>
/config/clients/<default>
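For user-level quotas (the mode we use most), the built-in kafka-configs.sh tool in 2.1.1 can set and inspect thresholds; in this sketch the ZooKeeper address, user name, and rates are placeholders:

# Limit user "user1" to ~10MB/s produce and ~20MB/s fetch per broker
bin/kafka-configs.sh --zookeeper zk1:2181/kafka --alter \
  --add-config 'producer_byte_rate=10485760,consumer_byte_rate=20971520' \
  --entity-type users --entity-name user1

# Inspect the current quota settings for that user
bin/kafka-configs.sh --zookeeper zk1:2181/kafka --describe \
  --entity-type users --entity-name user1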

When starting the Kafka broker service, enable the JMX parameters so that other applications can collect Kafka's various JMX metrics for service monitoring. When a user asks to adjust a quota threshold, we assess automatically, based on the traffic a single broker can withstand, whether the adjustment is possible, without manual intervention. For user traffic limiting, the main metrics to reference are the following two (a JMX reading sketch follows them):

(1) Consumption traffic metric: ObjectName: kafka.server:type=Fetch,user=<ACL-authenticated user name>; attributes: byte-rate (the user's outbound traffic on the current broker), throttle-time (how long the user's outbound traffic on the current broker was throttled)
(2) Production traffic metric: ObjectName: kafka.server:type=Produce,user=<ACL-authenticated user name>; attributes: byte-rate (the user's inbound traffic on the current broker), throttle-time (how long the user's inbound traffic on the current broker was throttled)
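A minimal Java sketch for reading these two metrics over JMX (the broker host, JMX port 9999, and user name "user1" are assumptions; the broker must have been started with a JMX port as described above):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class QuotaMetricsReader {
    public static void main(String[] args) throws Exception {
        // Broker JMX endpoint, enabled via JMX_PORT when starting the broker
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Production (inbound) traffic of ACL user "user1" on this broker
            ObjectName produce = new ObjectName("kafka.server:type=Produce,user=user1");
            double byteRate = (Double) conn.getAttribute(produce, "byte-rate");
            double throttleTime = (Double) conn.getAttribute(produce, "throttle-time");
            System.out.printf("byte-rate=%.2f B/s, throttle-time=%.2f ms%n",
                    byteRate, throttleTime);
        }
    }
}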

1.3.2 Follower-to-leader synchronization / data migration traffic limiting

Replica migration / data synchronization traffic limiting official website address: http://kafka.apache.org

The parameters involved are as follows (a usage example follows the list):

// Replica synchronization throttling involves the following four parameters
leader.replication.throttled.rate
follower.replication.throttled.rate
leader.replication.throttled.replicas
follower.replication.throttled.replicas
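A hedged example of applying these parameters with kafka-configs.sh (broker ID, topic name, and rates are placeholders; the reassignment tool's --throttle option sets the two rate parameters the same way):

# Cap replica-sync traffic at ~30MB/s on broker 1000003
bin/kafka-configs.sh --zookeeper zk1:2181/kafka --alter \
  --entity-type brokers --entity-name 1000003 \
  --add-config 'leader.replication.throttled.rate=31457280,follower.replication.throttled.rate=31457280'

# Declare which replicas of topic yyj4 the throttle applies to ("*" = all)
bin/kafka-configs.sh --zookeeper zk1:2181/kafka --alter \
  --entity-type topics --entity-name yyj4 \
  --add-config 'leader.replication.throttled.replicas=*,follower.replication.throttled.replicas=*'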

The auxiliary indicators are as follows:

(1) Replica synchronization outbound traffic metric: ObjectName: kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesOutPerSec
(2) Replica synchronization inbound traffic metric: ObjectName: kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec

1.4 Monitoring and alerting

There are some open source tools available for Kafka monitoring, such as the following:

Kafka Manager

Kafka Eagle

Kafka Monitor

KafkaOffsetMonitor

We have embedded Kafka Manager in our platform as a tool for viewing some basic metrics. However, these open source tools cannot be integrated into our own business systems or platforms, so we need to build a system with finer granularity, smarter monitoring, and more accurate alerting. Its monitoring coverage should include the underlying hardware, the operating system (occasionally the OS hangs at the process level, leaving the broker half-dead and unable to serve requests normally), the Kafka broker service, Kafka client applications, the ZooKeeper cluster, and the entire upstream and downstream pipeline.

1.4.1 Hardware Monitoring

Network monitoring:

Core indicators include network inbound traffic, network outbound traffic, network packet loss, network retransmissions, the number of TCP connections in TIME_WAIT, switches, data center bandwidth, and DNS server monitoring (if the DNS server fails, traffic black holes may occur and cause large-scale business failures).

Disk monitoring:

Core indicators include disk write, disk read (if consumption has no delay, or only a small delay, there are generally no disk read operations), disk ioutil, disk iowait (if this indicator is too high, the disk load is heavy), disk storage space, failed disks, and bad blocks/bad sectors (bad sectors or bad blocks can leave the broker half-dead, with consumers stuck on CRC check failures), etc. Disk ioutil and iowait can be spot-checked as shown below.
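For example, assuming the sysstat package is installed:

# %util corresponds to disk ioutil, await is average I/O latency; refresh every second
iostat -x 1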

CPU monitoring:

Monitor CPU idle rate/load, motherboard failures, etc. In general the CPU is not Kafka's bottleneck and its usage stays low.

Memory/swap area monitoring:

Memory usage and memory failures. In general, apart from the heap memory allocated when the Kafka broker starts, almost all other memory on the server is used as PageCache.

Cache hit rate monitoring:

Since whether reads hit the disk has a great impact on Kafka's performance, we need to monitor the Linux PageCache hit rate. A high cache hit rate means consumers are basically reading from the cache.

For details, please read the article "Linux Page Cache Tuning Application in Kafka".

System log:

We need to monitor and alert on the operating system's error logs to discover hardware failures in time.

1.4.2 Broker service monitoring

Broker service monitoring mainly works by specifying the JMX port when the broker service starts, and then collecting the JMX metrics with a metrics collection program we implement ourselves. (Server metrics official website address)

Broker-level monitoring: broker process, broker inbound traffic byte size/record count, broker outbound traffic byte size/record count, replica synchronization inbound traffic, replica synchronization outbound traffic, traffic deviation between brokers, broker connection count, broker request queue length, broker network idle rate, broker production latency, broker consumption latency, broker production requests, broker consumption requests, number of leaders on the broker, number of replicas on the broker, per-disk traffic on the broker, broker GC, etc.

Topic-level monitoring: topic inbound traffic byte size/record count, topic outbound traffic byte size/record count, zero-traffic topics, sudden topic traffic changes (surges/drops), topic consumption delay.

Partition-level monitoring: partition inbound traffic byte size/record count, partition outbound traffic byte size/record count, missing topic partition replicas, partition consumption delay records, partition leader switching, partition data skew (specifying a message key when producing can easily cause data skew, which seriously affects Kafka's service performance), partition storage size (topics whose single partitions are too large can be governed accordingly).

User-level monitoring: user inbound/outbound traffic byte size, time a user's inbound/outbound traffic was throttled, sudden user traffic changes (surges/drops).

Broker service log monitoring: monitor and alert on error logs printed on the server side to discover service abnormalities in time.

1.4.3 Client monitoring

For client monitoring we implement our own metrics reporting program, which needs to implement the org.apache.kafka.common.metrics.MetricsReporter interface. Then add the metric.reporters configuration item to the producer or consumer configuration, as shown below; a minimal reporter sketch follows the snippet:

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, IntegerSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
 
//ClientMetricsReporter implements the org.apache.kafka.common.metrics.MetricsReporter interface
props.put(ProducerConfig.METRIC_REPORTER_CLASSES_CONFIG, ClientMetricsReporter.class.getName());
...
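A minimal sketch of such a reporter (the println sink is a placeholder; a real implementation would forward metrics to an internal collection service):

import java.util.List;
import java.util.Map;
import org.apache.kafka.common.metrics.KafkaMetric;
import org.apache.kafka.common.metrics.MetricsReporter;

public class ClientMetricsReporter implements MetricsReporter {
    @Override
    public void configure(Map<String, ?> configs) {
        // read custom settings here, e.g. the address of the reporting endpoint
    }

    @Override
    public void init(List<KafkaMetric> metrics) {
        // called once with the metrics that already exist at startup
        metrics.forEach(m -> System.out.println("init: " + m.metricName()));
    }

    @Override
    public void metricChange(KafkaMetric metric) {
        // called whenever a metric is added or updated
        System.out.println("change: " + metric.metricName());
    }

    @Override
    public void metricRemoval(KafkaMetric metric) {
        System.out.println("removed: " + metric.metricName());
    }

    @Override
    public void close() {
        // flush buffered metrics / stop the reporting thread here
    }
}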

Client metrics official website addresses:

http://kafka.apache.org/21/documentation.html#selector_monitoring

http://kafka.apache.org/21/documentation.html#common_node_monitoring

http://kafka.apache.org/21/documentation.html#producer_monitoring

http://kafka.apache.org/21/documentation.html#producer_sender_monitoring

http://kafka.apache.org/21/documentation.html#consumer_monitoring

http://kafka.apache.org/21/documentation.html#consumer_fetch_monitoring

The client monitoring process architecture is shown in the following figure:

1.4.3.1 Producer client monitoring

Dimensions: user name, client ID, client IP, topic name, cluster name, broker IP.

Indicators: connection count, IO wait time, production traffic size, production record count, request count, request latency, send errors/retries, etc.

1.4.3.2 Consumer client monitoring

Dimensions: user name, client ID, client IP, topic name, cluster name, consumer group, broker IP, topic partition.

Indicators: connection count, IO wait time, consumption traffic size, consumption record count, consumption delay, topic partition consumption delay records, etc.

1.4.4 ZooKeeper monitoring

  1. ZooKeeper process monitoring;
  2. ZooKeeper leader switch monitoring;
  3. ZooKeeper service error log monitoring;

1.4.5 Full link monitoring

When the data pipeline is very long (for example: business application -> tracking SDK -> data collection -> Kafka -> real-time computing -> business application), finding out where a problem lies usually requires repeated communication and investigation across multiple teams, which makes troubleshooting inefficient. In this case, we need to work with upstream and downstream teams to sort out monitoring of the entire pipeline, so that when a problem occurs we can locate it as quickly as possible and shorten the time for problem location and fault recovery.

1.5 Resource isolation

1.5.1 Physical isolation of different business resources in the same cluster

We physically isolate resource groups for different businesses in all clusters to prevent businesses from affecting each other. Here, assume the cluster has 4 broker nodes (Broker1/Broker2/Broker3/Broker4) and 2 businesses (Business A/Business B), with topic partitions distributed as shown in the figure below: the topics of both businesses are scattered across all brokers in the cluster and also overlap at the disk level.

Imagine one of the businesses behaving abnormally, for example a traffic surge that makes broker nodes abnormal or brings them down. The other business would then also be affected, which would greatly reduce the availability of our services, cause failures, and widen the blast radius.

In response to these pain points, we can physically isolate the businesses within the cluster so that each business has exclusive resources, dividing the brokers into resource groups (here the 4 brokers are divided into two resource groups, Group1 and Group2), as shown in the figure below. Topics are distributed within their own resource group; when one business is abnormal, the other is unaffected. This effectively narrows the failure scope and improves service availability.

1.6 Cluster Classification

We split clusters by business characteristics into log clusters, monitoring clusters, billing clusters, search clusters, offline clusters, online clusters, and so on. Services for different scenarios are placed in different clusters to prevent different businesses from affecting each other.

1.7 Expansion/shrinkage

1.7.1 Topic partition expansion

As the amount of topic data grows, the number of partitions we specified when the topic was first created may no longer meet the traffic requirements, so we need to expand the topic's partitions. The following points need to be considered when expanding partitions (a CLI example follows them):

Topic partition leaders and followers must be distributed round-robin across all brokers in the resource group so that traffic is spread more evenly; at the same time, different replicas of the same partition should be placed across racks to improve disaster tolerance;

When the number of topic partition leaders divided by the number of resource group nodes leaves a remainder, the remainder partition leaders should be placed preferentially on brokers with lower traffic.
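For reference, partition expansion itself is done with the built-in tool in 2.1.1 (topic name, partition count, and ZooKeeper address are placeholders). Note that the built-in tool spreads the new partitions with its default assignment and is unaware of traffic, so the traffic-aware placement described above requires generating our own reassignment plan:

# Expand topic yyj4 to 12 partitions
bin/kafka-topics.sh --zookeeper zk1:2181/kafka --alter \
  --topic yyj4 --partitions 12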

1.7.2 Bringing brokers online

As business volume grows and data volume keeps increasing, our cluster needs broker node expansion. For expansion, we need to achieve the following:

Intelligent expansion assessment: assess programmatically and intelligently, based on cluster load, whether expansion is needed;

Intelligent expansion: once expansion is deemed necessary, platformize the expansion process and the subsequent traffic balancing.

1.7.3 Taking brokers offline

In some scenarios, we need to take brokers offline, mainly including the following:

Some aging servers need to be decommissioned, so node offlining is platformized;

A server fails and the broker cannot be recovered, so the failed server must be taken offline, again via the platform;

Existing broker nodes are replaced with better-configured servers, with the offlining platformized.

1.8 Load balancing

Why do we need load balancing? First, look at the first picture below, which shows the traffic distribution of a resource group in our cluster just after expansion: traffic is not automatically distributed to the newly added nodes. At this point we need to manually trigger data migration and move some replicas to the new nodes to balance the traffic.

Next, look at the second picture. The traffic distribution there is very unbalanced, with minimum and maximum traffic differing by several times or more. This is related to Kafka's architecture: once cluster size and data volume reach a certain level, the problem inevitably appears. In this case, we also need to perform load balancing.

The third picture shows traffic surging on only some nodes. This is because the topic's partitions are not spread widely enough within the cluster and are concentrated on a few brokers. In this case, we need to expand partitions and rebalance.

Our ideal traffic distribution is shown in the figure below: traffic deviation between nodes is very small. This not only strengthens the cluster's ability to absorb abnormal traffic surges, but also improves the cluster's overall resource utilization and service stability, and lowers costs.

For load balancing, we need to achieve the following effects:

1) Generating replica migration plans and executing migration tasks should be platformized, automated, and intelligent;

2) After balancing, traffic between brokers is fairly even, and each topic's partitions are evenly distributed across all broker nodes;

3) After balancing, traffic among the multiple disks within each broker is fairly balanced.

To achieve this, we need to develop our own load balancing tool, for example through secondary development of the open source Cruise Control. The core of this tool is the strategy for generating migration plans, which directly determines the final cluster load balancing effect. Reference material:

1. linkedIn/cruise-control

2. Introduction to Kafka Cruise Control

3. Cloudera Cruise Control REST API Reference

The Cruise Control architecture diagram is as follows:

When generating the migration plan, we need to consider the following points:

1) Select core indicators as the basis for generating the migration plan, such as outbound traffic, inbound traffic, rack, and single-topic partition dispersion;

2) Optimize the metric samples used to generate the migration plan, for example by filtering out abnormal samples such as traffic surges, sudden drops, and drops to zero;

3) The samples used for each resource group's migration plan come only from within that resource group; no other resource groups are involved and there is no crossover;

4) Govern topics whose single partitions are too large, so that partitions are spread more widely and traffic is not concentrated on a few brokers; a smaller per-partition data volume also reduces the amount of data to migrate and speeds up migration;

5) Topics already evenly spread within the resource group are added to a migration blacklist and not migrated, which reduces the amount of migrated data and increases migration speed;

6) Regularly govern topics to eliminate the interference of long-term zero-traffic topics on balancing;

7) When creating a new topic or expanding topic partitions, all partitions should be distributed round-robin across all broker nodes, and after the round-robin, remainder partitions should be placed preferentially on brokers with lower traffic;

8) When load balancing is enabled after adding broker nodes, for high-traffic topics (high traffic rather than large storage; think of it as throughput per second) with multiple partition leaders on the same broker, migrate some of those partition leaders to the new broker nodes;

9) When submitting migration tasks, the data size deviation of partitions within the same batch of the migration plan should be as small as possible, to avoid small partitions finishing quickly while large partitions keep migrating for a long time, skewing the task;

1.9 Security authentication

Can everyone access our cluster at will? Of course not. For the security of the cluster, we need permission authentication to block illegal operations. Security authentication is mainly required in the following aspects:

(1) Producer permission authentication;

(2) Consumer permission authentication;

(3) Security authentication for migrations with specified data directories;

Official website address: http://kafka.apache.org

1.10 Cluster disaster recovery

Cross-rack disaster recovery:

Official website address: http://kafka.apache.org

Cross-cluster/data center disaster recovery: if you have business scenarios such as remote active-active, you can refer to MirrorMaker 2.0 in Kafka 2.7.

GitHub address: https://github.com

Specific KIP address: https://cwiki.apache.org

Kafka metadata recovery on the ZooKeeper cluster: we regularly back up the permission information data on ZooKeeper and use it for recovery when cluster metadata becomes abnormal.

1.11 Parameter/configuration optimization

Service parameter optimization: here I only list some of the core parameters that affect performance.

num.network.threads
# Number of Processor threads for handling network requests; recommended: the broker's CPU core count * 2. If this value is too low, the network idle ratio often drops too low and replicas go missing.
 
num.io.threads
# Number of KafkaRequestHandler threads for processing requests; recommended: the broker's disk count * 2
 
num.replica.fetchers
# Recommended: CPU core count / 4. Raising it appropriately improves CPU utilization and the parallelism of followers synchronizing data from leaders.
 
compression.type
# lz4 compression is recommended; compression improves CPU utilization while reducing the amount of data transferred over the network.
 
queued.max.requests
# In production, configure at least 500; the default is 500.
 
log.flush.scheduler.interval.ms
log.flush.interval.ms
log.flush.interval.messages
# These parameters control the policy for flushing log data to disk. Keep the default configuration and leave flushing to the operating system, letting the OS decide when to flush data to disk;
# setting these parameters may have a very large impact on throughput;
 
auto.leader.rebalance.enable
# Whether automatic leader rebalancing is enabled; default true. We should set this parameter to false, because automatic rebalancing is uncontrollable and may affect cluster performance and stability;

Producer optimization: here I only list some of the core parameters that affect performance.

linger.ms
# How long the client waits before sending messages to the server, in milliseconds; used together with batch.size. Raising it appropriately improves throughput, but if the client goes down there is a risk of data loss;
 
batch.size
# Size of the message batches the client sends to the server; used together with linger.ms. Raising it appropriately improves throughput, again with a risk of data loss if the client goes down;
 
compression.type
# lz4 compression is recommended for its high compression ratio and throughput. Since Kafka does not demand much of the CPU, compression makes full use of CPU resources to improve network throughput;
 
buffer.memory
# Client buffer size. If the topic is large and memory is ample, this can be raised appropriately; the default is only 33554432 (32MB)
 
retries
# Number of retries after a production failure; default 0, can be increased appropriately. If the business requires high data accuracy once retries are exhausted, fault-tolerant handling is recommended.
 
retry.backoff.ms
# Interval between retries after a production failure; default 100ms; avoid setting it too large or too small.

In addition to optimizing the core parameters, we also need to consider the number of topic partitions and the topic retention time. If the number of partitions is too small and the retention time too long while the write volume is very large, the following problems may occur:

1) topic partitions concentrate on a few broker nodes, leading to unbalanced traffic between replicas;

2) some disks on broker nodes become overloaded with reads and writes, and their storage fills up.

1.11.1 Consumption optimization

The biggest and most frequent problem with consumption is consumption delay, which means pulling historical data. When a large amount of historical data is pulled, there are massive disk reads that pollute the PageCache, which increases disk load and affects cluster performance and stability.

How can we reduce or avoid large consumption delays?

1) When the topic's data volume is very large, it is recommended to use one consumer thread per partition (see the sketch after this list);

2) Add monitoring and alerting for topic consumption delay, so it can be discovered and handled in time;

3) When topic data can be discarded and a large delay is encountered, for example a single partition's delayed records exceed tens of millions or even hundreds of millions, the topic's consumption offset can be reset as emergency handling; [this solution is generally only used in extreme scenarios]

4) Avoid resetting a topic partition's offset to a very early position, which may pull a large amount of historical data.
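A minimal sketch of point 1), one consumer thread per partition via manual assignment (the bootstrap server, group, topic name, and partition count are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PerPartitionConsumer {
    public static void main(String[] args) {
        int partitionCount = 12; // assumed partition count of topic "yyj4"
        for (int p = 0; p < partitionCount; p++) {
            final int partition = p;
            new Thread(() -> {
                Properties props = new Properties();
                props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
                props.put(ConsumerConfig.GROUP_ID_CONFIG, "yyj4-group");
                props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
                props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    // manual assignment: this thread owns exactly one partition
                    consumer.assign(Collections.singletonList(new TopicPartition("yyj4", partition)));
                    while (true) {
                        consumer.poll(Duration.ofMillis(500)).forEach(r -> process(r.value()));
                    }
                }
            }, "consumer-partition-" + p).start();
        }
    }

    private static void process(String value) { /* business logic */ }
}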

1.11.2 Linux server parameter optimization

We need to optimize Linux file handles, PageCache, and other parameters; refer to "Linux Page Cache Tuning Application in Kafka". A sketch of typical settings follows.
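A hedged sketch of typical starting points (the values are illustrative assumptions, not taken from the referenced article; tune them to your hardware and workload):

# Raise file-handle limits for the Kafka service user (illustrative values)
cat >> /etc/security/limits.conf <<'EOF'
kafka  soft  nofile  204800
kafka  hard  nofile  204800
EOF

# Start background writeback of dirty pages earlier so flushes stay smooth
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=60
# Avoid swapping so memory stays available for PageCache
sysctl -w vm.swappiness=1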

1.12 Hardware optimization

Disk optimization

If conditions permit, use SSDs instead of HDD mechanical drives to solve the problem of poor mechanical disk IO performance. If SSDs are not available, you can build hardware RAID over the server's multiple disks (usually RAID 10) to balance the broker node's IO load. With HDDs, one broker can mount multiple drives, for example 12 * 4TB.

Memory

Since Kafka is a high-frequency read/write service and Linux read/write requests mostly go through the Page Cache, larger per-node memory significantly improves performance. Generally choose 256GB or more.

Network

Increase network bandwidth: where conditions permit, the more network bandwidth the better, so that bandwidth never becomes the performance bottleneck; at least a 10 Gigabit network (10Gb, full-duplex NIC) is needed for relatively high throughput. For a single-duplex link, the theoretical upper limit of the sum of outbound and inbound network traffic is 1.25GB/s; for a full-duplex dual channel, inbound and outbound traffic can each theoretically reach 1.25GB/s.

Network isolation tagging: a single data center may host both offline clusters (such as HBase, Spark, Hadoop) and real-time clusters (such as Kafka). If real-time and offline clusters are mounted under the same switches, they compete for network bandwidth and the offline cluster may affect the real-time cluster. We therefore isolate at the switch level so that offline and real-time machines do not share switches. Even where they do share a switch, we tag network traffic priorities (gold, silver, copper, iron) so that when bandwidth is tight, real-time traffic passes first.

CPU

Kafka's bottleneck is not the CPU; a single node with 32 CPU cores is generally more than enough.

1.13 Platformization

Now the question: we have mentioned many monitoring and optimization methods, but do administrators or business users have to log in to the cluster servers for every cluster operation? Of course not. We need rich platform features to support this work, both to improve operational efficiency and to improve cluster stability and reduce the chance of errors.

Configuration management

With black-screen (command-line) operation, every modification of a broker's server.properties leaves no traceable change record. Sometimes a failure is caused by someone modifying the cluster configuration, yet no related record can be found. With configuration management implemented on the platform, every change is traceable, and the risk of erroneous changes is reduced.

Rolling restart

When we need to make online changes, we sometimes have to roll-restart multiple nodes in the cluster. Doing this from the command line is very inefficient and requires manual intervention, wasting manpower. We need to platformize this repetitive work to improve operational efficiency.

Cluster management

Cluster management mainly implements on the platform the series of operations originally done on the command line, so users and administrators no longer need to operate the Kafka cluster through a black screen. This has the following advantages:

Improved operational efficiency;

Fewer operational errors and a safer cluster;

All operations traceable and auditable;

Cluster management mainly includes: broker management, topic management, production/consumption permission management, user management, etc.

1.13.1 Mock functionality

On the platform we provide sample-data production and sampled consumption for users' topics, so users can test whether a topic is usable and whether their permissions are normal without writing any code;

On the platform we also provide production/consumption permission verification for users' topics, so users can check whether their account has read/write permission for a given topic.

1.13.2 Permission management

Platformize operations such as user read/write permission management.

1.13.3 Expansion/shrinkage

Put broker node onlining and offlining on the platform, so that none of it requires command-line operation anymore.

1.13.4 Cluster governance

1) Governance of zero-traffic topics: clean up zero-traffic topics in the cluster to reduce the pressure that excessive useless metadata puts on the cluster;

2) Topic partition data size governance: identify topics whose per-partition data volume is too large (for example, more than 100GB/day in a single partition) and assess whether partition expansion is needed, to avoid data concentrating on a few nodes of the cluster;

3) Topic partition data skew governance: prevent clients from specifying message keys that are too concentrated when producing, which puts messages into only a few partitions and causes data skew;

4) Topic partition dispersion governance: distribute a topic's partitions across as many brokers in the cluster as possible; this avoids the risk of a topic traffic surge concentrating on a few nodes, and also avoids a single broker failure having an outsized impact on the topic;

5) Topic partition consumption delay governance: there are generally two causes of heavy consumption delay, degraded cluster performance or insufficient consumption concurrency on the business side; if concurrency is insufficient, the business should be contacted to increase it.

1.13.5 Monitoring alarm

1) Make all metric collection configurable, provide a unified platform for metric collection, metric display, and alerting, and realize integrated monitoring;

2) Link upstream and downstream services into full-pipeline monitoring;

3) Users can configure monitoring alerts for topic or partition traffic, such as delays and sudden changes;

1.13.6 Business dashboard

Main indicators of the business dashboard: number of clusters, number of nodes, daily inbound traffic size, daily inbound records, daily outbound traffic size, daily outbound records, inbound traffic size per second, inbound records per second, outbound traffic size per second, outbound records per second, number of users, production delay, consumption delay, data reliability, service availability, data storage size, number of resource groups, number of topics, number of partitions, number of replicas, number of consumer groups, and other indicators.

1.13.7 Traffic limiting

User traffic limiting is moved onto the platform, which performs intelligent limiting.

1.13.8 Load balancing

Implement the automatic load balancing function on the platform, with scheduling and management done through the platform.

1.13.9 Resource budget

When the cluster reaches a certain scale and traffic keeps growing, where do the expansion machines come from? With a business resource budget, multiple businesses in the cluster share the hardware cost of the whole cluster according to their share of its traffic; of course, independent clusters and independently isolated resource groups can be budgeted separately.

1.14 Performance evaluation

1.14.1 Single broker performance evaluation

The purpose of our single broker performance evaluation includes the following aspects:

1) Provide a basis for evaluating resource requests;

2) Help us understand the cluster's read/write capacity and locate its bottleneck, so we can optimize for the bottleneck;

3) Provide a basis for setting traffic limiting thresholds;

4) Provide a basis for evaluating when to expand capacity;

1.14.2 topic partition performance evaluation

1) Provide a reasonable basis for how many partitions to specify when creating a topic;

2) Provide a basis for evaluating topic partition expansion;

1.14.3 Single Disk Performance Evaluation

1) Provide a basis for understanding the disks' real read/write capacity and for choosing a disk type better suited to Kafka;

2) Provide a basis for setting disk traffic alert thresholds;

1.14.4 Exploring the cluster size limit

1) We need to understand the upper limit of a single cluster's scale, or of its metadata scale, and explore how these affect cluster performance and stability;

2) Based on these findings, evaluate a reasonable range for cluster node scale, predict risks in time, and carry out work such as splitting super-large clusters;

1.15 DNS+LVS network architecture

When our cluster reaches a certain node scale, for example a single cluster with hundreds of broker nodes, how should we specify the bootstrap.servers configuration for producer and consumer clients? Should we pick a few of these brokers, or configure all of them?

Neither approach is really appropriate. If only a few IPs are configured, our applications will fail to connect to the Kafka cluster once those broker nodes are taken offline; configuring all of them is even more unrealistic with hundreds of IPs. So what should we do?

Solution: adopt a DNS+LVS network architecture, so that producer and consumer clients only need to configure the domain name. Note that when a new node joins the cluster, a mapping must be added; when a node goes offline, it must be removed from the mapping, otherwise, if those machines are reused elsewhere with the same port as Kafka, part of the cluster's requests will still be sent to the offline server, causing a major failure in the production environment.
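A hedged example of the resulting client configuration (the domain name is a placeholder for the DNS record pointing at the LVS VIP):

# Clients resolve one stable domain name instead of hundreds of broker IPs
bootstrap.servers=kafka-cluster01.example.com:9092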

2. Functional defects of the open source version


2.1 Replica migration

Incremental migration cannot be achieved; [we have implemented incremental migration based on the 2.1.1 source code]

Concurrent migration cannot be achieved; [the open source version did not implement concurrent migration until 2.6.0]

A migration cannot be terminated; [we have implemented terminating replica migration based on the 2.1.1 source code] [the open source version did not implement cancelling an in-flight migration until 2.6.0, which differs somewhat from termination in that the metadata is not rolled back]

When a migration data directory is specified and the topic retention time is shortened during migration, the new retention time does not take effect for the topic partitions being migrated, so their expired data cannot be deleted; [open source version bug, currently unfixed]

When a migration data directory is specified and the migration plan matches the scenario below, the migration task can never finish and stays stuck; [open source version bug, currently unfixed]

If a broker node is restarted during migration, all leader partitions on that broker cannot switch back, so all of its traffic is transferred to other nodes; the leaders will not switch back until all replicas have finished migrating. [open source version bug, currently unfixed]

In the native Kafka version, migrations that specify data directories cannot complete in the following scenario; we have also decided not to fix this bug in our version:
 
1. For the same topic partition, if some target replicas differ from the original replicas in which broker they belong to, while other target replicas differ only in which data directory they belong to within the same broker,
then the target replicas whose broker changed can finish migrating normally, but the target replicas whose data directory changed within the broker cannot complete the migration.
However, the old replicas can still serve production and consumption normally, and this does not affect submitting the next migration task. The next migration task only needs to change the broker list of this topic partition's replica list; it can then complete normally and clean up the previously unfinished target replicas.
 
Assume topic yyj1's initial replica distribution is as follows:
 
{
"version":1,
"partitions":[
{"topic":"yyj","partition":0,"replicas":[1000003,1000001],"log_dirs":["/kfk211data/data31","/kfk211data/data13"]}
]
}
// Migration scenario 1:
{
"version":1,
"partitions":[
{"topic":"yyj","partition":0,"replicas":[1000003,1000002],"log_dirs":["/kfk211data/data32","/kfk211data/data23"]}
]
}
 
// Migration scenario 2:
{
"version":1,
"partitions":[
{"topic":"yyj","partition":0,"replicas":[1000002,1000001],"log_dirs":["/kfk211data/data22","/kfk211data/data13"]}
]
}
Given the above distribution of topic yyj1, if our migration plan is "migration scenario 1" or "migration scenario 2", some replicas will be unable to finish migrating.
This does not affect the old replicas handling production and consumption requests, however, and we can still submit other migration tasks normally.
To clean up the old unfinished replicas, we only need to modify the migration plan once [the new target replica list just has to be completely different from the partition's currently assigned replica list] and submit the migration again.
 
Here, continuing with the example above, we modify the migration plan as follows:
{
"version":1,
"partitions":[
{"topic":"yyj","partition":0,"replicas":[1000004,1000005],"log_dirs":["/kfk211data/data42","/kfk211data/data53"]}
]
}
In this way the migration can complete normally.

2.2 Traffic limiting

The traffic limiting granularity is relatively coarse; it is not flexible, precise, or intelligent enough.

Traffic limiting dimension combinations

/config/users/<user>/clients/<client-id>
/config/users/<user>/clients/<default>
/config/users/<user>
/config/users/<default>/clients/<client-id>
/config/users/<default>/clients/<default>
/config/users/<default>
/config/clients/<client-id>
/config/clients/<default>

Problems

When multiple users on the same broker produce and consume heavily at the same time, keeping the broker running normally requires that the sum of all users' traffic thresholds not exceed the broker's throughput limit when setting quotas; if the sum exceeds the broker's limit, the broker risks being brought down. Moreover, even if user traffic stays below the broker's limit, if all user traffic concentrates on a few disks, it can exceed those disks' read/write capacity, all production and consumption requests will be blocked, and the broker may end up half-dead.

Solutions

(1) Modify the source code to implement a per-broker traffic cap: as soon as traffic reaches the broker's cap, throttle immediately; all users writing to this broker can be throttled, or users can be prioritized, letting high priority through and restricting low priority;

(2) Modify the source code to implement a per-disk traffic cap on the broker (in many cases traffic concentrates on a few disks, so the broker's traffic cap is not reached even though a single disk's read/write capacity is exceeded): as soon as a disk's traffic reaches the cap, throttle immediately; all users writing to that disk can be throttled, or users can be prioritized, letting high priority through and restricting low priority;

(3) Modify the source code to implement topic-level traffic limiting and the ability to forbid writes to specific topic partitions;

(4) Modify the source code to achieve precise traffic limiting on combined dimensions such as user, broker, disk, and topic.

3. The development trend of Kafka

3.1 Kafka community iteration plan

3.2 Gradually abandoning ZooKeeper (KIP-500)

3.3 Separating the controller from the broker and introducing the Raft protocol as the controller's quorum mechanism (KIP-630)

3.4 Tiered storage (KIP-405)

3.5 Support for reducing topic partitions (KIP-694)

3.6 Exactly-once MirrorMaker2 (KIP-656)

3.7 Downloads and feature descriptions of each version

3.8 All Kafka KIP addresses

4. How to contribute to the community

4.1 Areas where you can contribute

http://kafka.apache.org/contributing

4.2 Wiki contribution address

https://cwiki.apache.org/confluence/dashboard.action#all-updates

4.3 Issues address

1)https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-10444?filter=allopenissues

2)https://issues.apache.org/jira/secure/BrowseProjects.jspa?selectedCategory=all

4.4 Main committers

http://kafka.apache.org/committers

Author: vivo internet server team-Yang Yijun
