Kafka plays the role of unified data caching and distribution in the Meituan data platform. As data volume grows and cluster scale expands, the challenges Kafka faces become increasingly severe. This article shares the practical challenges Meituan has encountered with Kafka and the targeted optimization work it has done, in the hope of offering help or inspiration to engineers working on related problems.
1. Current status and challenges
1.1 Status
Kafka is an open source stream processing platform, and many Internet companies in the industry are also using this product. Let's first take a look at the current status of Kafka in the Meituan data platform.
As shown in Figure 1-1, the blue part describes Kafka's positioning in the data platform as the stream storage layer. Its main responsibility is to cache and distribute data: it distributes collected logs, which come from system logs, client logs, and business databases, to different downstream data systems. The downstream consumers include offline computing via ODS ingestion, direct real-time computing, synchronization to the log center through DataLink, and OLAP analysis.
Meituan's Kafka clusters total more than 15,000 machines, with the largest single cluster reaching 2,000+ machines. In terms of data scale, the daily message volume exceeds 30+ PB, with a peak of 400+ million messages per second. However, as cluster scale and data volume grow, the challenges Kafka faces become increasingly severe. The specific challenges are discussed below.
1.2 Challenges
As shown in Figure 1-2, the specific challenges can be summarized in two parts:
The first part is that slow nodes affect reads and writes. Here, "slow node" is a concept borrowed from HDFS; specifically, it refers to a broker whose read/write latency TP99 exceeds 300ms. There are three causes of slow nodes:
- Unbalanced cluster load leads to local hotspots: the disk space or I/O utilization of the cluster as a whole is fine, but some individual disks are nearly full or their I/O utilization is saturated.
- Insufficient PageCache capacity. For example, an 80GB PageCache can only hold about 8 minutes of data at a write rate of 170MB/s (see the quick calculation after this list), so a consumer reading data older than 8 minutes may trigger slow disk access.
- A defect in the Consumer client's threading model distorts end-to-end latency metrics. For example, when a consumer consumes multiple partitions on the same broker, TP90 may be below 100ms, but when those partitions are spread across different brokers, TP90 may exceed 1000ms.
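As a quick sanity check of the 8-minute figure quoted above, this is purely arithmetic on the numbers given in this article:

```latex
\frac{80\,\text{GB}}{170\,\text{MB/s}} \approx 480\,\text{s} \approx 8\,\text{minutes}
```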
The second part is the complexity of large-scale cluster management, which involves four types of problems:
- Different topics affect each other: a sudden traffic increase on an individual topic, or backtracking reads by individual consumers, can affect the stability of the whole cluster.
- Kafka's native broker-granularity metrics are not comprehensive enough, which makes problem location and root cause analysis difficult.
- Fault detection is not timely, and the processing cost is high.
- Rack-level failures can make some partitions unavailable.
2. Read and write latency optimization
Next, let's look at the optimizations the Meituan data platform has made to address read and write latency. At a high level, we divide the contributing factors into the application layer and the system layer; we then describe the problems at each layer in detail and give the corresponding solutions, including pipeline acceleration, Fetcher isolation, migration cancellation, and Cgroup resource isolation. The implementation of each optimization is introduced in detail below.
2.1 Overview
Figure 2-1 is an overview of the problems encountered with read and write delays and the corresponding optimization solutions. We divide the affected factors into application layer and system layer.
The application layer mainly includes three types of problems:
1) Unbalanced load on the broker side, such as unbalanced disk usage or unbalanced I/O utilization; an increase in the load of an individual disk affects requests to the entire broker.
2) Broker data migration suffers from efficiency and resource-contention problems, specifically at the following three levels:
- Migration can only be submitted serially in batches. There may be a small number of partitions in each batch that are slow to migrate, and the next batch cannot be submitted, which affects the migration efficiency.
- Migration is usually performed at night. If it drags on into the daytime peak, read and write requests may be significantly affected.
- Migration requests and real-time pulls share the same fetcher thread, so partition migration requests may affect real-time consumption requests.
3) A defect in the Consumer's single-threaded model distorts operational metrics, and the number of partitions a single consumer consumes is not limited. If consumption capacity is insufficient, the consumer cannot keep up with the latest data, and as the number of consumed partitions grows, this can trigger backtracking reads.
The system layer also mainly includes three types of problems:
1) PageCache pollution. Kafka uses the ZeroCopy technology provided by the kernel layer to improve performance, but the kernel layer cannot distinguish between real-time read and write requests and retrospective read requests, resulting in disk reads that may pollute PageCache and affect real-time read and write.
2) HDD has poor performance under random read and write loads. HDDs are friendly to sequential reads and writes, but for random reads and writes in mixed load scenarios, the performance drops significantly.
3) Competition for system resources such as CPU and memory in the co-location scenario. In the Meituan big data platform, to improve resource utilization, IO-intensive services (such as Kafka) are co-located with CPU-intensive services (such as real-time computing jobs).
For the problems above, we adopted targeted strategies: at the application layer, disk balancing, migration pipeline acceleration, support for migration cancellation, and asynchronous consumption; at the system layer, RAID card acceleration and Cgroup isolation optimization. In addition, for the insufficient random read/write performance of HDDs, we designed and implemented an SSD-based cache architecture.
2.2 Application layer
① Disk Balance
Disk hotspots cause two problems:
- Real-time read and write latency increases. For example, the processing time of TP99 requests exceeds 300ms, which may lead to consumption delay problems in real-time jobs and data collection congestion problems.
- The overall utilization of the cluster is insufficient. Although the cluster capacity is very sufficient, some disks are already full. At this time, some partitions may even stop serving.
In response to these two problems, we adopted a partition migration scheme based on free-disk priority. The whole scheme is divided into 3 steps, coordinated by a component called Rebalancer (a simplified sketch of the plan-generation step follows the list):
- Generate a migration plan. The Rebalancer continuously generates a specific partition migration plan based on the target disk usage and the current disk usage (reported through Kafka Monitor).
- Submit the migration plan. The Rebalancer submits the generated migration plan to the reassignment node in Zookeeper. After receiving the reassignment event, the Kafka Controller dispatches it to the entire Kafka broker cluster.
- Check the migration plan. Kafka Broker is responsible for executing data migration tasks, and Rebalancer is responsible for checking task progress.
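The plan-generation step can be pictured roughly as follows. This is a minimal, hypothetical sketch (class and method names such as `RebalancePlanner`, `DiskStat`, and `MoveProposal` are illustrative, not Meituan's actual Rebalancer code), assuming per-disk usage has already been reported by Kafka Monitor:

```java
import java.util.*;

public class RebalancePlanner {

    public record DiskStat(String broker, String disk, double usedRatio, List<String> partitions) {}
    public record MoveProposal(String partition, String fromDisk, String toDisk) {}

    public List<MoveProposal> plan(List<DiskStat> disks, double targetRatio) {
        // Sort by free-disk priority: most loaded first, least loaded last.
        List<DiskStat> sorted = new ArrayList<>(disks);
        sorted.sort(Comparator.comparingDouble(DiskStat::usedRatio).reversed());

        List<MoveProposal> proposals = new ArrayList<>();
        int hot = 0, cold = sorted.size() - 1;
        while (hot < cold) {
            DiskStat src = sorted.get(hot);
            DiskStat dst = sorted.get(cold);
            if (src.usedRatio() <= targetRatio) break;   // no disk above target remains
            if (dst.usedRatio() >= targetRatio) break;   // no disk below target remains
            if (src.partitions().isEmpty()) { hot++; continue; }
            // Move one partition from the hottest disk to the coolest disk.
            proposals.add(new MoveProposal(src.partitions().remove(0), src.disk(), dst.disk()));
            hot++;   // simplistic pass over the disks; a real planner would
            cold--;  // re-evaluate usage after accounting for partition sizes
        }
        return proposals;
    }
}
```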
As shown in Figure 2-2, when each disk holds 3 partitions, the cluster is in a relatively balanced state. If some disks hold 4 partitions (such as Broker1-Disk1 and Broker4-Disk4) and some hold 2 (such as Broker2-Disk2 and Broker3-Disk3), the Rebalancer migrates the extra partitions from Broker1-Disk1 and Broker4-Disk4 to Broker2-Disk2 and Broker3-Disk3 respectively, keeping overall disk utilization as balanced as possible.
② Migration optimization
Although free-disk-priority partition migration achieves disk balance, migration itself still has efficiency and resource-contention problems. Below we describe our targeted strategies in detail:
- Pipeline acceleration to address the efficiency problem caused by slow migrations blocking a whole batch.
- Support migration cancellation to solve the problem that read and write requests are affected by slow migration of long-tail partitions.
- Fetcher isolation is adopted to alleviate the problem that data migration requests and real-time read and write requests share Fetcher threads.
Optimization 1, pipeline acceleration
As shown in Figure 2-3, native Kafka (above the arrow) only supports batch submission: for example, four partitions are submitted in one batch, and when partition TP4 gets stuck, none of the subsequent partitions can proceed. With pipeline acceleration, new partitions can be submitted even while TP4 is still unfinished. In the same period during which the original scheme is still blocked on TP4, the new scheme has already progressed to partition TP11. The dotted line in the figure represents an unordered window used mainly to control concurrency, so that the number of in-flight migrations stays the same as in the original batch submission and excessive migration does not affect read/write services.
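To make the idea concrete, here is a minimal sketch of pipelined submission, assuming hypothetical `submitReassignment()`/`waitUntilDone()` helpers (this is not Kafka's or Meituan's actual migration code): at most a window-sized number of partition migrations are in flight at once, so a single slow partition occupies only one slot instead of blocking a whole batch.

```java
import java.util.*;
import java.util.concurrent.*;

public class PipelinedMigrator {
    private final Semaphore inFlight;   // window size = same concurrency as the old batch size

    public PipelinedMigrator(int windowSize) {
        this.inFlight = new Semaphore(windowSize);
    }

    public void migrate(List<String> partitions, ExecutorService pool) throws InterruptedException {
        for (String tp : partitions) {
            inFlight.acquire();                        // blocks only when the window is full
            pool.submit(() -> {
                try {
                    submitReassignment(tp);            // hand one partition to the controller
                    waitUntilDone(tp);                 // a slow partition holds one slot, not the whole batch
                } finally {
                    inFlight.release();                // free the slot so the next partition can start
                }
            });
        }
    }

    private void submitReassignment(String tp) { /* write the reassignment for tp (hypothetical) */ }
    private void waitUntilDone(String tp)      { /* poll reassignment status for tp (hypothetical) */ }
}
```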
Optimization 2, migration cancellation
As shown in Figure 2-4-1, the left side of the arrow depicts three kinds of problems caused by migration. First, migration triggers reads from the oldest offsets and synchronizes a large amount of data; this data is flushed into the PageCache, polluting it, which causes cache misses for partitions being read in real time, triggers disk reads, and affects read/write requests. Second, when abnormal nodes cause a migration to hang, some operations cannot be performed, such as automatic topic partition expansion triggered by traffic growth, because such operations are forbidden while a Kafka migration is in progress. The third case is similar to the second: when the target node crashes, topic expansion cannot complete, and users may have to endure the impact on read/write requests indefinitely.
For the three problems above, we added support for migration cancellation. An administrator can call a cancel command to interrupt the partitions being migrated. In the first scenario, the PageCache is no longer polluted and real-time reads are protected; in the second and third scenarios, partition expansion can complete because the migration is cancelled. Migration cancellation deletes the partitions whose migration has not completed, and deletion can create disk I/O bottlenecks that affect reads and writes, so we also support smooth (throttled) deletion to avoid performance problems caused by deleting a large amount of data at once.
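For reference only: since Apache Kafka 2.4 (KIP-455), the upstream AdminClient also exposes reassignment cancellation by passing an empty reassignment for a partition. Meituan's internal cancellation described above is its own implementation and predates this, but the upstream API gives a feel for the operation:

```java
import java.util.*;
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.TopicPartition;

public class CancelReassignmentExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address

        try (Admin admin = Admin.create(props)) {
            // Passing an empty Optional for a partition cancels its in-progress reassignment.
            Map<TopicPartition, Optional<NewPartitionReassignment>> cancel =
                Map.of(new TopicPartition("demo-topic", 4), Optional.<NewPartitionReassignment>empty());
            admin.alterPartitionReassignments(cancel).all().get();
        }
    }
}
```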
Optimization 3, Fetcher isolation
As shown in Figure 2-5, green represents real-time reads and red represents lagging (delayed) reads. When a follower's real-time reads and lagging reads share the same fetcher, the lagging reads affect the real-time reads: each lagging read fetches significantly more data than a real-time read, and lagging reads easily trigger disk reads because the data may no longer be in the PageCache, which significantly slows down the fetcher.
To address this, our strategy is Fetcher isolation: all in-sync (ISR) followers share one set of fetchers, and all non-ISR followers share another, which ensures that real-time reads in the ISR are not affected by non-ISR backtracking reads.
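Conceptually, the isolation boils down to routing replica fetch work to one of two separate thread pools depending on ISR membership. The sketch below is an illustration of the idea only, not the broker's actual fetcher manager:

```java
import java.util.concurrent.*;

public class IsolatedFetcherPools {
    private final ExecutorService isrFetchers    = Executors.newFixedThreadPool(4); // real-time replication
    private final ExecutorService nonIsrFetchers = Executors.newFixedThreadPool(4); // lagging / catch-up replication

    public void submitFetch(boolean followerInIsr, Runnable fetchTask) {
        // Catch-up fetches tend to read cold data from disk; keeping them on a
        // separate pool prevents them from stalling in-sync replication.
        (followerInIsr ? isrFetchers : nonIsrFetchers).submit(fetchTask);
    }
}
```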
③ Asynchronous consumption
Before talking about Consumer asynchrony, it is necessary to explain the Kafka-Broker staged delay statistical model shown in Figure 2-6 below. The Kafka-Broker side is a typical event-driven architecture, and each component communicates through a queue. When the request flows through different components, the timestamps will be recorded in turn, and finally the execution time of the request in different stages can be counted.
Specifically, when a Kafka Producer or Consumer request enters Kafka-Broker, the Processor component writes the request to the RequestQueue, and the RequestHandler pulls the request from the RequestQueue for processing. The waiting time in the RequestQueue is RequestQueueTime, and the specific execution time of the RequestHandler is LocalTime. When the RequestHandler is executed, it will pass the request to the DelayedPurgatory component, which is a delay queue.
When a certain delay condition is triggered, the request is written to the ResponseQueue; the time spent in DelayedPurgatory is RemoteTime. The Processor continuously pulls data from the ResponseQueue and sends it to the client. The ResponseTime (red in the figure) can be affected by the client: if the client's receiving capacity is insufficient, ResponseTime keeps increasing. From the Kafka-Broker's perspective, the total time taken by each request, RequestTotalTime, is the sum of all the staged timings above (see the breakdown below).
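Putting the stages together, the per-request breakdown described here can be summarized as follows (using the stage names from this section; upstream Kafka's request metrics expose a very similar decomposition):

```latex
\text{RequestTotalTime} \approx \text{RequestQueueTime} + \text{LocalTime} + \text{RemoteTime} + \text{ResponseTime}
```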
The main cause of continuously increasing ResponseTime is a flaw in the NIO-based single-threaded model of Kafka's native Consumer. As shown in Figure 2-7, in Phase 1 the user first issues a poll request, and the Kafka client sends requests to Broker1, Broker2, and Broker3 simultaneously. When Broker1's data is ready first, the Kafka client writes that data to the CompleteQueue and returns immediately, instead of continuing to pull data from Broker2 and Broker3. Subsequent poll requests read data directly from the CompleteQueue until it is emptied; before then, even if the data on Broker2 and Broker3 is ready, it is not pulled in time. As shown in Phase 2, because of this single-threaded model flaw, the WaitFetch portion of the time grows, which drives up the Kafka-Broker's ResponseTime latency metric, and the server's processing bottleneck can no longer be accurately monitored or subdivided.
To address this, our improvement is to introduce an asynchronous pull thread. The asynchronous thread pulls ready data promptly, which avoids inflating the server-side latency metrics. Since native Kafka does not limit the number of partitions pulled at the same time, we also add rate limiting here to avoid GC problems and OOM. The asynchronous thread continuously pulls data in the background and puts it into the CompleteQueue.
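A minimal sketch of this asynchronous-pull idea is shown below, assuming a hypothetical `fetchFrom()` network helper (not the actual Kafka client code); the bounded CompleteQueue doubles as the rate limit that prevents GC pressure and OOM:

```java
import java.util.List;
import java.util.concurrent.*;

public class AsyncFetcher<Rec> {
    private final BlockingQueue<Rec> completeQueue;   // bounded: acts as the speed limit
    private final List<String> brokers;

    public AsyncFetcher(List<String> brokers, int maxBufferedRecords) {
        this.brokers = brokers;
        this.completeQueue = new ArrayBlockingQueue<>(maxBufferedRecords);
        Thread t = new Thread(this::pullLoop, "async-fetcher");
        t.setDaemon(true);
        t.start();
    }

    private void pullLoop() {
        while (!Thread.currentThread().isInterrupted()) {
            for (String broker : brokers) {
                for (Rec rec : fetchFrom(broker)) {    // hypothetical network fetch
                    try {
                        completeQueue.put(rec);        // blocks when the buffer is full (back-pressure)
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        }
    }

    /** poll() only drains the local buffer; it never blocks on a slow broker. */
    public Rec poll(long timeoutMs) throws InterruptedException {
        return completeQueue.poll(timeoutMs, TimeUnit.MILLISECONDS);
    }

    private List<Rec> fetchFrom(String broker) { return List.of(); } // placeholder
}
```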
2.3 System layer
① Raid card acceleration
HDDs have insufficient random write performance, which manifests as increased latency and reduced throughput. To address this, we introduced RAID card acceleration. The RAID card has its own cache, similar to the PageCache: at the RAID layer, data is merged into larger blocks before being written to disk, which makes fuller use of the HDD's sequential write bandwidth and ensures random write performance with the help of the RAID card.
② Cgroup isolation optimization
To improve resource utilization, the Meituan data platform co-locates IO-intensive applications with CPU-intensive applications; here, the IO-intensive application is Kafka and the CPU-intensive applications are Flink and Storm. However, the original isolation strategy had two problems. First, there is resource competition within a physical core: the hyperthreads of the same physical core share the L1 and L2 caches, so when the real-time platform's CPU usage spikes, Kafka's read/write latency is affected. Second, Kafka's hyperthreads spanned NUMA nodes, which increases memory access time: as shown in Figure 2-10, cross-NUMA access is remote access via QPI, and this remote access takes 40ns.
To address these two problems, we improved the isolation strategy. For physical-core contention, the new co-location strategy guarantees Kafka exclusive physical cores; that is, under the new isolation strategy, no physical core is used by both Kafka and Flink at the same time. We also ensure that all of Kafka's hyperthreads sit on the same NUMA node, avoiding the access latency Kafka would incur across NUMA. With the new isolation strategy, Kafka's read/write latency is no longer affected by Flink's CPU spikes.
2.4 Hybrid layer: new SSD cache architecture
Background and Challenges
Kafka uses the operating system's ZeroCopy technology to process data read requests. When PageCache capacity is sufficient, data is copied directly from the PageCache to the network card, which effectively reduces read latency. In practice, however, PageCache capacity is often insufficient, since it cannot exceed the machine's memory; when capacity is insufficient, ZeroCopy triggers disk reads, which are not only significantly slower but also pollute the PageCache and affect other reads and writes.
As shown in the left half of Figure 2-11, when a lagging consumer pulls data and finds that it is not in the PageCache, a disk read is triggered. After the disk read, the data is written back into the PageCache, polluting it; this not only delays the lagging consumer further but also slows down other real-time consumers, because real-time consumption always reads the latest data, which normally should never trigger disk reads.
Selection and decision
To address this problem, we considered two solutions during the selection process:
Option 1: do not write the data back to the PageCache when reading from disk, for example by using DirectIO; however, Java does not support this.
Option 2: introduce an intermediate layer, such as SSD, between memory and HDD. As we all know, SSD has good random read/write performance compared with HDD, which fits our usage scenario very well. For the SSD solution, we again had two options, compared in the table below:
| Option | Advantages | Disadvantages |
|---|---|---|
| Implementation at the operating system kernel layer | 1. Data routing is transparent to the application layer, so application code changes are small. <br/> 2. The robustness of the open-source software is maintained by the community, and usability is good (provided the community is active). | 1. In every mode of FlashCache/OpenCAS, data is flushed back to the SSD cache; as with the PageCache, cache pollution occurs. <br/> 2. On a cache miss there is one extra device access, increasing latency. <br/> 3. All metadata is maintained by the operating system, increasing kernel memory consumption; in co-location scenarios this reduces the memory other services can request. |
| Implementation inside the Kafka application | 1. The caching strategy is designed around Kafka's read/write characteristics, ensuring all near-real-time consumption requests land on the SSD for low latency, while data read from the HDD is not flushed back to the SSD, preventing cache pollution. <br/> 2. Since each log segment has a unique, well-defined state, the query path for each request is the shortest, with no extra overhead from cache misses. | 1. The server-side code must be modified, which involves a large amount of development and testing work. <br/> 2. The modified code must be maintained across community major-version upgrades; however, it can be contributed back to the community to resolve this. |
Solution 1 is implemented at the operating system kernel layer. In this solution, the storage space of the SSD and HDD is divided into fixed-size blocks and a mapping between SSD and HDD blocks is established; based on the principle of data locality, data fetched after a cache miss replaces some of the data on the SSD according to LRU/LFU policies. Typical industry solutions include OpenCAS and FlashCache. The advantage is that data routing is transparent to the application layer, application code changes are small, and the community is active and usable. The problem is that the locality principle does not match Kafka's read/write characteristics, and the cache-pollution problem is not fundamentally solved, because data on the SSD is still replaced according to LRU/LFU.
Option 2 is implemented at the Kafka application layer. Specifically, Kafka's data is stored on different devices according to the time dimension: near-real-time data is placed directly on the SSD, while older data is placed directly on the HDD, and the leader reads data from the corresponding device directly based on the offset. The advantage of this solution is that its caching strategy fully takes Kafka's read/write characteristics into account: all near-real-time consumption requests land on the SSD, ensuring low latency for those requests, while data read from the HDD is not flushed back to the SSD, preventing cache pollution. And since each log segment has a unique state, the target of each request is unambiguous, with no extra performance overhead caused by cache misses. The disadvantage is also obvious: it requires changes to the server-side code, which involves a large amount of development and testing work.
Implementation
Let's introduce the specific implementation of the new SSD cache architecture.
- First, the new cache architecture stores the multiple segments of a log on different storage devices according to the time dimension, as shown in red circle 1 in Figure 2-14. Data in the new cache architecture has three typical states: OnlyCache means the data has just been written to the SSD and has not yet been synchronized to the HDD; Cached means the data has been synchronized to the HDD and a copy is still cached on the SSD; WithoutCache means the data has been synchronized to the HDD and is no longer cached on the SSD (see the sketch after this list).
- The background asynchronous thread then continuously synchronizes the SSD data to the HDD.
- As the SSD keeps being written to and its storage space reaches a threshold, the oldest data is deleted in chronological order, since SSD space is limited.
- Each replica can flexibly choose whether to write to the SSD according to its availability requirements.
- The data read from the HDD will not be flushed back to the SSD, preventing cache pollution.
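The segment states and read routing described above can be sketched as follows (illustrative only; `readFromSsd()`/`readFromHdd()` are hypothetical placeholders, not Kafka's actual log code):

```java
public class TieredSegment {

    enum State { ONLY_CACHE, CACHED, WITHOUT_CACHE }   // SSD only / SSD + HDD / HDD only

    private volatile State state = State.ONLY_CACHE;

    /** Every segment has exactly one state, so the read path is unambiguous. */
    public byte[] read(long offset, int maxBytes) {
        switch (state) {
            case ONLY_CACHE:
            case CACHED:
                return readFromSsd(offset, maxBytes);   // near-real-time reads stay on SSD
            case WITHOUT_CACHE:
            default:
                return readFromHdd(offset, maxBytes);   // old data; never flushed back to SSD
        }
    }

    /** Called by the background sync thread once the HDD copy is complete. */
    public void markSynced()  { state = State.CACHED; }

    /** Called by SSD space reclamation when the oldest cached data is evicted. */
    public void markEvicted() { state = State.WITHOUT_CACHE; }

    private byte[] readFromSsd(long offset, int maxBytes) { return new byte[0]; } // placeholder
    private byte[] readFromHdd(long offset, int maxBytes) { return new byte[0]; } // placeholder
}
```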
Detail optimization
After introducing the specific implementation, let's take a look at the detailed optimization.
- The first concerns log segment synchronization: only inactive log segments (those not currently being written to) are synchronized, which solves the data-consistency problem at low cost.
- The second is synchronization rate limiting: when data is synchronized from SSD to HDD, the rate is limited, which protects both devices and avoids affecting the processing of other I/O requests (a rate-limiting sketch follows).
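A minimal sketch of the rate-limited synchronization, assuming Guava's `RateLimiter` and a hypothetical `InactiveSegment` handle (not Meituan's actual implementation):

```java
import com.google.common.util.concurrent.RateLimiter;

public class SsdToHddSyncer {
    private final RateLimiter limiter;                 // permits = bytes per second

    public SsdToHddSyncer(long maxBytesPerSecond) {
        this.limiter = RateLimiter.create(maxBytesPerSecond);
    }

    public void syncSegment(InactiveSegment segment) {
        byte[] chunk;
        while ((chunk = segment.readChunkFromSsd()) != null) {
            limiter.acquire(chunk.length);             // throttle before each write
            segment.appendToHdd(chunk);
        }
        segment.markSynced();                          // flips the segment state to "Cached"
    }

    /** Hypothetical handle to a log segment that is no longer being written. */
    interface InactiveSegment {
        byte[] readChunkFromSsd();                     // returns null when the segment is fully copied
        void appendToHdd(byte[] chunk);
        void markSynced();
    }
}
```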
3. Large-scale cluster management optimization
3.1 Isolation Policy
Meituan's big data platform Kafka serves multiple businesses. If the topics of these businesses were mixed together, different topics of different businesses would very likely affect each other. In addition, if the Controller node also handles data read/write requests, then when its load becomes significantly higher the Controller may fail to handle control requests, such as metadata change requests, in time, which can eventually bring down the whole cluster.
To address these mutual-interference problems, we isolate along three dimensions: business, role, and priority.
- The first point is business isolation. As shown in Figure 3-1, each large business will have an independent Kafka cluster, such as takeout, store arrival, and selection.
- The second point is role isolation: Kafka's Brokers, Controllers, and their dependent component Zookeeper are deployed on different machines to avoid mutual interference.
- The third point is priority. Some business topics have particularly high availability requirements, so we assign them to a VIP cluster and give them more resource redundancy to guarantee availability.
3.2 Full link monitoring
As the scale of the cluster grows, cluster management encounters a series of problems, mainly including two aspects:
- Broker-side latency metrics cannot reflect user problems in a timely manner:
  - As the request volume increases, the TP99 and even TP999 latency metrics currently provided by Kafka at broker granularity may fail to reflect long-tail latency.
  - Broker-side latency metrics are not end-to-end metrics and may not reflect users' real problems.
- Faults are not perceived and handled in a timely manner.
In response to these two problems, our strategy is full-link monitoring. Full-link monitoring collects and monitors metrics and logs of Kafka's core components. The full-link monitoring architecture is shown in Figure 3-2. When a client's read and write requests become slow, we can quickly locate the specific link through full-link monitoring. The full-link indicator monitoring is shown in Figure 3-3.
Figure 3-4 is an example of locating a request bottleneck based on full-link metrics: the server-side RemoteTime accounts for the largest share, which shows that the time is mainly spent on data replication. The log and metric parsing service can automatically detect faults and slow nodes in real time. Most faults (memory, disk, RAID card, network card, etc.) and slow nodes are already handled automatically; another class of faults, unplanned ones, such as unavailability caused by multiple replicas of a partition going down, hung migrations, or unexpected error logs, still requires manual intervention.
3.3 Service Lifecycle Management
Meituan's online Kafka fleet is at the scale of tens of thousands of servers. As the service grows, our management of both the service and the machines themselves keeps iterating. Our automated operations system can handle most machine failures and slow service nodes, but machine management and service management are separated, which leads to two types of problems:
- State semantics are ambiguous and cannot truly reflect the system state; logs and metrics are often needed to determine whether the system is actually healthy or abnormal.
- The states are not comprehensive, abnormal cases require manual intervention, and the risk of misoperation is high.
To solve these two types of problems, we introduced a lifecycle management mechanism to ensure that the system state is truly reflected. Lifecycle management covers the whole process from service startup to machine decommissioning, and links service state with machine state so that manual synchronization of changes is no longer needed. Moreover, under the new mechanism, state transitions are triggered only by specific automated operations, and manual changes are forbidden.
3.4 TOR disaster recovery
TOR disaster recovery ensures that different replicas of the same partition are not placed under the same rack. As shown in Figure 3-7, even if the whole Rack1 fails, all partitions remain available.
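For illustration, the rack spread of an existing topic can be checked with the standard AdminClient (this is a verification sketch, not Meituan's disaster-recovery tooling; it assumes brokers have `broker.rack` configured and uses a placeholder topic name):

```java
import java.util.*;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.Node;

public class RackDiversityCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address

        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("demo-topic"))
                                         .all().get().get("demo-topic");
            desc.partitions().forEach(p -> {
                Set<String> racks = p.replicas().stream()
                                     .map(Node::rack)            // null if broker.rack is not set
                                     .collect(Collectors.toSet());
                if (racks.size() < 2) {
                    System.out.printf("Partition %d has all replicas in rack(s) %s%n",
                                      p.partition(), racks);
                }
            });
        }
    }
}
```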
4 Future prospects
Over the past period we have done a great deal of optimization to reduce the server's read/write latency, but there is still work to be done on high availability. In the coming period we will focus on improving robustness and narrowing fault domains through isolation mechanisms at various granularities: for example, having clients proactively avoid faulty nodes, isolating abnormal requests on the server side via multiple queues, supporting server-side hot-disk handling, and adding active back-pressure and rate limiting at the network layer.
In addition, with the overall development of Meituan's real-time computing business, the co-located deployment model of the real-time computing engine (typically Flink) and the streaming storage engine (typically Kafka) increasingly struggles to meet business needs. We therefore need to deploy Kafka independently while keeping costs unchanged, which means carrying the same business traffic with fewer machines (in our business model, about 1/4 of the original machines). How to handle business requests with fewer machines while ensuring service stability is another challenge we face.
Finally, with the advent of the cloud-native trend, we are also exploring the way to the cloud for streaming storage services.
5 About the author
Haiyuan, Shilu, Sean, Hongluo, Qifan, Hu Rong, Li Jie, etc. are all from the Data Science and Platform Department of Meituan.
| This article is produced by Meituan technical team, and the copyright belongs to Meituan. Welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication, please indicate "The content is reproduced from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial activities, please send an email to tech@meituan.com to apply for authorization.