1. What is Kafka
Kafka is a distributed, multi-partition, multi-replica, multi-producer, multi-consumer message queue based on the publish/subscribe model. Today Kafka is positioned as a distributed stream-processing platform. It persists messages in order, supports message backtracking, and offers high-performance reads and writes; it is widely used thanks to properties such as high throughput, persistence, horizontal scalability, and support for stream-data processing.
2. Kafka Architecture
The overall Kafka architecture involves the following concepts:
(1) ZooKeeper: Zookeeper is responsible for saving broker cluster metadata and electing controllers.
(2) Producer: the message producer, i.e. the client that sends messages to the Kafka broker.
(3) Broker: An independent Kafka server is called a broker. A cluster consists of multiple brokers, and one broker can accommodate multiple topics. The broker is responsible for receiving messages from producers, setting offsets for messages, and storing messages on disk. The broker serves consumers, responding to requests to read partitions, returning messages that have been committed to disk.
(4) Consumer: the message consumer, i.e. the client that fetches messages from the Kafka broker.
(5) Consumer Group: a consumer group contains one or more consumers. Each consumer in the group consumes from different partitions, and a given partition can be consumed by only one consumer in the group. Consumer groups do not affect each other: every consumer belongs to some consumer group, so a consumer group is logically one subscriber. Combining multiple partitions with multiple consumers greatly improves downstream processing speed; consumers in the same group never consume a message twice, and consumers in different groups consume independently of one another. Through consumer groups, Kafka supports both point-to-point (P2P) and broadcast messaging.
(6) Topic: Messages in Kafka are divided into Topic units, which can be understood as a queue. The producer sends the message to a specific topic, and the consumer is responsible for subscribing to the topic's message and consuming it.
(7) Partition: for scalability, a very large topic can be spread across multiple brokers (servers). A topic can be divided into multiple partitions, each of which is an ordered queue; different partitions of the same topic hold different messages. At the storage level, a partition can be regarded as an append-only log (Log) file: a message is assigned a specific offset when it is appended to the partition's log file.
(8) Offset: each message in a partition is assigned an ordered id, the offset. Offsets do not span partitions, which means Kafka guarantees ordering within a partition rather than across a topic.
(9) Replica: to ensure that partition data is not lost when a cluster node fails and that Kafka can keep working, Kafka provides a replication mechanism. Each partition of a topic has several replicas: one leader and several followers. Normally only the leader replica serves reads and writes; when the broker hosting the leader crashes or a network fault occurs, Kafka, under the Controller's management, elects a new leader replica to serve reads and writes.
(10) Record: The message record that is actually written to Kafka and can be read. Each record contains key, value and timestamp.
(11) Leader: the "primary" replica among each partition's multiple replicas. Producers send data to the leader, and consumers consume data from the leader.
(12) Follower: a "secondary" replica among each partition's multiple replicas. It synchronizes data from the leader in real time and stays in sync with the leader's data. When the leader fails, one of the followers becomes the new leader.
(13) ISR (In-Sync Replicas): the replica synchronization set, i.e. the set of replicas (including the leader itself) that are in sync with the leader. If a follower fails to synchronize with the leader for too long, it is kicked out of the ISR. When the leader fails, a new leader is elected from the ISR.
(14) OSR (Out-of-Sync Replicas): replicas that have been kicked out of the ISR because their synchronization lag is too high.
(15) AR (Assigned Replicas): the set of all replicas, i.e. AR = ISR + OSR.
3. With so many publish/subscribe messaging systems available, why choose Kafka? (Kafka's features)
(1) Multiple producers
Kafka seamlessly supports multiple producers, whether the clients use one topic or many topics. This makes Kafka well suited to collecting data from many front-end systems and exposing it to downstream consumers in a uniform format.
(2) Multiple consumers
Kafka supports multiple consumers reading the same message stream without affecting one another. This differs from many other queuing systems, where a message, once read by one client, is no longer available to others. Multiple consumers can also form a consumer group that shares a message stream, with each given message consumed by the group only once.
(3) Disk-based data storage (persistence, reliability)
Kafka allows consumers to read messages non-real-time, because Kafka commits messages to disk and keeps them according to configured retention rules, without worrying about message loss.
(4) Scalability
Kafka scales out across brokers: users can start with a single broker and later expand to many brokers.
(5) High performance (high throughput, low latency)
Kafka can easily handle millions of messages while keeping message latency below one second. It provides message persistence at O(1) time complexity and guarantees constant-time access performance even for terabytes of data. Even on very cheap commodity machines, a single node can support transmitting more than 100K messages per second. Kafka writes to disk sequentially, so it is very efficient; it has been verified that sequential disk writes are more efficient than random memory writes, which is a very important guarantee of Kafka's high throughput.
4. How does Kafka achieve high throughput/high performance?
Kafka achieves high throughput and performance mainly through the following:
1. Page caching technology
Kafka writes files via the operating system's page cache. The operating system maintains its own in-memory cache layer, the page cache, which we can also call the os cache, meaning the cache managed by the operating system itself. When Kafka writes a disk file, it writes directly into this os cache, that is, into memory only, and the operating system decides when to actually flush the data in the os cache to the file on disk. This single step dramatically improves disk-write performance, because in effect it is a memory write, not a disk write.
2. Disk sequential write
The other key technique is that Kafka writes data sequentially on disk: it only appends data to the end of the log file and does not modify data at random positions in the file. For the same disk, sequential writes can reach 600M/s while random writes manage only 100K/s, a consequence of the disk's mechanical nature. Sequential writing is fast because it saves a great deal of head-seek time.
Based on the above two points, Kafka achieves ultra-high performance for writing data.
3. Zero copy
Data is constantly consumed from Kafka, and consuming really means reading a piece of data from a Kafka disk file and sending it to a downstream consumer.
If data were read from disk and sent to consumers naively, two unnecessary copies would occur:
one from the operating system's cache into the application process's buffer, and one from the application buffer back into the operating system's socket buffer. To perform these two copies, several context switches also happen in between: for a while the application executes, then the context switches to the operating system. Reading data this way therefore costs a lot of performance.
In order to solve this problem, Kafka introduces zero-copy technology when reading data.
That is to say, the data in the operating system's cache is sent directly to the network card and on to the downstream consumer, skipping the two data-copy steps. Only a descriptor is copied into the socket buffer; no message data is copied there.
With zero-copy technology there is no need to copy the data from the os cache into the application buffer and from there into the socket buffer; both copies are omitted, hence the name zero copy. The socket buffer only receives the descriptor of the data, while the data itself is sent straight from the os cache to the network card. This process greatly improves the performance of reading file data during consumption. Furthermore, when Kafka reads from disk it first checks whether the data is in the os cache; if so, the read is effectively served from memory. On a well-tuned Kafka cluster, data is written into the os cache and also read back from the os cache, so writing and reading are both essentially memory-based, making overall performance extremely high.
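The kernel primitive behind this is sendfile, which Java exposes as FileChannel.transferTo. A minimal sketch of such a transfer, assuming an illustrative file name and destination endpoint (this shows the general mechanism, not Kafka's internal code):

import java.io.FileInputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = new FileInputStream("00000000000000000000.log").getChannel();
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("consumer-host", 9000))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                // transferTo asks the kernel to move bytes from the page cache
                // straight to the socket, without copying through user space.
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}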
5. The relationship between Kafka and Zookeeper
Kafka uses Zookeeper to store the cluster's metadata and consumer information (offsets); without Zookeeper, Kafka cannot work. Zookeeper keeps a dedicated node that records the broker server list, at the path /brokers/ids.
When each broker server starts, it registers with Zookeeper, i.e. creates a node /brokers/ids/[0-N] and writes its IP, port and other information into it. The broker creates an ephemeral node, so when the broker goes offline the corresponding node is deleted; the availability of broker servers can therefore be tracked dynamically through changes to the broker nodes on Zookeeper.
6. The execution process of the producer sending a message to Kafka
(1) When a producer wants to send a message to Kafka, it first creates a ProducerRecord. A minimal example (the producer setup and broker address below are illustrative):
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);

ProducerRecord<String, String> record =
        new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
    producer.send(record);
} catch (Exception e) {
    e.printStackTrace();
}
(2) The ProducerRecord object contains the target topic, the partition content, and the specified key and value. When sending the ProducerRecord, the producer first serializes the key and value objects into byte arrays so they can be transmitted over the network.
(3) When a producer sends a message to a topic, it needs to go through an interceptor, a serializer, and a partitioner.
(4) If the message ProducerRecord does not specify the partition field, then it needs to rely on the partitioner to calculate the value of the partition according to the key field. The role of the partitioner is to assign partitions to messages.
If no partition is specified and the message key is not empty, the murmur2 hash algorithm (a non-cryptographic hash function with high performance and a low collision rate) is used to compute the partition assignment.
If no partition is specified and the message key is also empty, a partition is selected round-robin.
(5) After the partition is selected, the message will be added to a record batch, and all messages in this batch will be sent to the same topic and partition. Then there will be a separate thread responsible for sending these batches of records to the corresponding broker.
(6) After the leader receives the message, it writes it to its local log. If the write to Kafka succeeds, a RecordMetadata object is returned containing the topic and partition information as well as the offset recorded in the partition.
(7) If the write fails, an error is returned. The producer retries the message upon receiving the error; if it still fails after several retries, the error is returned to the application.
(8) Followers pull messages from the leader, write them to the local log, and send ACK to the leader. After the leader receives ACKs from all replicas in the ISR, it increases the high water mark and sends ACKs to the producer.
7. How does Kafka ensure that messages of the corresponding type are written to the same partition?
This is achieved through the message key and the partitioner. The partitioner hashes the key and takes the result modulo the number of topic partitions to select the partition for the message, which ensures that messages containing the same key are written to the same partition.
If the ProducerRecord does not specify a partition, and the key of the message is not empty, the Hash algorithm (non-encrypted hash function, with high computing performance and low collision rate) is used to calculate the partition allocation.
If the ProducerRecord does not specify a partition, and the key of the message is also empty, a partition is selected in a round-robin manner.
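As a sketch, the default key-based assignment amounts to hashing the serialized key with murmur2 and taking the result modulo the partition count. Utils below is org.apache.kafka.common.utils.Utils from the Kafka clients library; the wrapper method itself is illustrative:

import org.apache.kafka.common.utils.Utils;

// Illustrative helper: choose a partition for a non-null key the way the
// default partitioner does: murmur2 hash, forced positive, modulo count.
static int partitionForKey(byte[] keyBytes, int numPartitions) {
    return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
}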
8. Kafka file storage mechanism
In Kafka, a topic is divided into multiple partitions, and each partition is composed of multiple smaller segments. On the server, a partition is represented as a directory, under which there are multiple groups of segments (a logical grouping, not a physical one). Each segment corresponds to three files: a .log file, an .index file, and a .timeindex file. A topic is a logical concept, while a partition is a physical one: each partition corresponds to a set of log files that store the data produced by producers. Data produced to a partition is continuously appended to the end of the log file, and each record has its own offset. Each consumer in a consumer group records in real time which offset it has consumed up to, so that after a failure it can resume from its last position.
Kafka determines the maximum size of a single segment (log) file from the log.segment.bytes configuration; when the written data reaches this size, a new segment is created.
9. How to find the corresponding Message according to the offset?
Each index entry occupies 8 bytes and is divided into two parts:
(1) relativeOffset : relative offset, indicating the offset of the message relative to baseOffset, occupying 4 bytes (relativeOffset = offset - baseOffset), the file name of the current index file is the value of baseOffset.
For example: if a log segment's baseOffset is 32, its file name is 00000000000000000032.log, and in the index file the message with offset=35 has relativeOffset = 35 - 32 = 3.
(2) position: Physical address, that is, the physical location of the message in the log segment file, occupying 4 bytes.
(1) First locate the segment file containing the message with offset=3 (using binary search): check whether the offset (baseOffset) in an .index file's name is less than 3;
if it is less, compare against the next .index file name's offset;
if it is greater, go back to the last .index file whose name offset was less than 3. Here this finds the first segment file.
(2) Within that segment's .index file, subtract the file name's baseOffset from the target offset (relativeOffset = offset - baseOffset); in the 00000.index file, the message with offset 3 that we want has index 3. (The index is stored sparsely: an entry is not created for every message, but roughly one every 4 KB, to keep the index file from growing too large. The drawback is that an offset without its own index entry cannot be located in one step; a short sequential scan is needed, but the scan range is very small.)
(3) According to the found index whose relative offset is 3, determine that the physical offset address of message storage is 756.
(4) Using that physical offset address, read the corresponding message from the .log file.
Similarly, what if I want to find the Message data corresponding to offset=8?
(1) First, use binary search to find the segment whose index file is 00000000000000000006.index
(2) Look up offset=8 in that index file, which stores a position of 326; then read the corresponding message, Message-8, from the 00000000000000000006.log file at position 326.
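To make the lookup concrete, here is a minimal sketch of a sparse index: a sorted map from relative offset to file position, queried with a floor search. Class and field names are hypothetical, not Kafka internals:

import java.util.Map;
import java.util.TreeMap;

class SparseIndexSketch {
    private final long baseOffset;                                     // encoded in the file name
    private final TreeMap<Integer, Integer> entries = new TreeMap<>(); // relativeOffset -> position

    SparseIndexSketch(long baseOffset) { this.baseOffset = baseOffset; }

    void add(int relativeOffset, int position) { entries.put(relativeOffset, position); }

    // Return the .log position to start scanning from for targetOffset.
    int lookup(long targetOffset) {
        int relative = (int) (targetOffset - baseOffset);
        Map.Entry<Integer, Integer> floor = entries.floorEntry(relative);
        // The caller scans the .log file forward from this position until it
        // reaches the exact offset; the range is small because an entry exists
        // roughly every 4 KB.
        return floor == null ? 0 : floor.getValue();
    }
}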
Kafka's message storage achieves high efficiency through partitioning, sequential disk reads and writes, segmentation, and sparse indexing. Since version 0.9, consumer offsets have been maintained directly in the internal Kafka topic __consumer_offsets.
10. What information is included in a message sent by the Producer?
A message consists of a variable-length header, a variable-length opaque key byte array, and a variable-length opaque value byte array.
A RecordBatch is the storage unit of Kafka data: one RecordBatch contains multiple Records (i.e. what we usually call messages), and each Record can carry multiple Header entries in key-value form.
11. How Kafka implements message ordering
1. Producer side
Kafka's producer sends messages in batches. With default parameters and no network jitter, messages arrive at the Kafka server in the order they were sent; but once the network fluctuates, messages can arrive out of order.
Therefore, to strictly ensure that Kafka sends messages in an orderly manner, we must first consider sending messages synchronously.
There are two ways to send messages synchronously:
1. Set the message response parameter acks > 0, preferably -1.
Then, set
max.in.flight.requests.per.connection = 1
With this setting, after the producer sends a message, the response must satisfy the acks setting before the next message is sent. So although the API is still used asynchronously, underneath the messages are effectively sent one at a time.
2. After calling the send method of KafkaProducer, call the get method of the Future object returned by the send method to block and wait for the result. After the result is returned, continue to call the send method of KafkaProducer to send the next message.
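A minimal sketch combining both options, assuming an illustrative broker address and topic:

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
props.put(ProducerConfig.ACKS_CONFIG, "all");                         // acks = -1
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);   // option 1: one request in flight
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
try {
    // Option 2: block on the Future so the next send starts only after the ack.
    RecordMetadata meta = producer.send(new ProducerRecord<>("orders", "key", "value")).get();
} catch (Exception e) {
    e.printStackTrace();
}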
In addition to sending messages synchronously, the problem of message retransmission should also be considered.
When there is a problem with sending, the Kafka sender can determine whether the problem can be automatically recovered. If it is a problem that can be automatically recovered, Kafka can automatically retry by setting retries > 0.
Depending on the Kafka version: since version 0.11, the producer has supported the idempotence feature. With idempotence, we can set
enable.idempotence = true
The idempotence feature attaches a sequence number to each message, incremented by 1 with every send.
After enabling the idempotent feature of the Kafka sender, we can set
max.in.flight.requests.per.connection = 5
In this way, because every message carries a sequence number, when a send error occurs the Kafka client compares the sequence numbers of the last few records on the server with those of the messages to be resent; if they are consecutive, sending can continue and message order is preserved.
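Put together, a sketch of the idempotent-producer settings described above:

props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.ACKS_CONFIG, "all");                        // required by idempotence
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);  // ordering still preserved
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);         // retries are now duplicate-free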
2. Broker side
Kafka's Topic is just a logical concept. The partitions that make up the Topic are where the real messages are stored.
Kafka only guarantees that messages within the same partition are ordered. Therefore, if you want to ensure that the business is strictly ordered globally, you must set the Topic as a single partition.
However, our business often does not need global ordering; we only need ordering among messages of the same business type. Different types of messages can be given different keys, and the key is hashed modulo the partition count; since messages of the same type share a key, they are assigned to the same partition, which preserves their order.
There is a catch, though: when we change the number of partitions, messages that previously mapped to the same partition may be mapped elsewhere, and message order is no longer guaranteed.
Faced with this situation, it is necessary to consider the impact on the business when dynamically changing the partition. It may be necessary to reclassify messages based on business and current partition requirements.
In addition, if a topic has multiple partitions and the number of in-sync replicas falls below min.insync.replicas, this situation arises: new messages can no longer be written to the affected partition, while consumers can still consume existing messages.
At that point we often favor availability and consider switching messages to another partition; once we do, the order of messages may become inconsistent.
Therefore, it is necessary to ensure that the min.insync.replicas parameter is properly configured to ensure the orderliness of message writing to the greatest extent possible.
3. Consumer
On the consumer side, according to the Kafka model, each partition under a topic can only belong to a certain consumer in the consumer group that listens to the topic, ensuring orderly consumption within the partition.
Suppose the topic has P partitions and the consumer group has C consumers. If P < C, some consumers are idle; if P > C, some consumers are assigned multiple partitions.
Therefore, when we use the poll method on the consumer side, we must pay attention: the records obtained by the poll method are likely to be in multiple partitions or even multiple topics.
To truly keep production and consumption consistent in order, the records in ConsumerRecords need to be further grouped by TopicPartition and processed per partition, as sketched below.
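A minimal sketch of that per-partition grouping, assuming a subscribed consumer and a hypothetical handle() method:

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
for (TopicPartition tp : records.partitions()) {
    for (ConsumerRecord<String, String> record : records.records(tp)) {
        // Within this inner loop, records arrive in offset order for tp.
        handle(record); // assumed business handler
    }
}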
Another point to pay attention to is the consumer's Rebalance. Rebalance is the process of letting all consumer instances under a consumer group reach a consensus on how to consume all partitions of a subscription topic.
The Rebalance mechanism is one of Kafka's most criticized aspects:
- Every rebalance suspends consumption for all consumers in the group.
- Rebalance has historically been buggy; for example, after a rebalance a consumer may suddenly crash, or some consumers in the group may stop working.
- Since a rebalance redistributes partitions across the consumer group, the partitions a consumer holds before and after the rebalance may differ; with different partitions, the consumption order naturally cannot stay consistent.
Therefore, we will try our best not to let Rebalance happen.
There are three situations that trigger a rebalance of Kafka consumers:
1. Group membership changes: the number of consumers in the group is increased or decreased, or a consumer crashes and is kicked out of the group.
2. The number of subscribed topics changes: a Kafka consumer group can use regular expressions to fuzzy-match topics. This creates a problem: when topics are added to Kafka, the set of topics the consumer group listens to may change.
3. The number of partitions of the subscription topic changes: Sometimes, we may want to dynamically change the number of partitions of the topic online.
Therefore, whenever one of these three situations triggers a rebalance, problems arise, and inconsistent consumption order is only a minor side effect.
Kafka as a whole does not guarantee ordering. To guarantee global order in Kafka, set up one producer, one partition, and one consumer.
12. What partitioning algorithms does Kafka have?
Kafka includes three partitioning algorithms:
(1) Round-robin strategy
Sequential allocation: for example, if a topic has 3 partitions, the first message is sent to partition 0, the second to partition 1, the third to partition 2, and so on, starting over with the fourth message.
The round-robin strategy is the default partitioning strategy of the Kafka Java producer API. It has excellent load-balancing behavior, always distributing messages as evenly as possible across all partitions, which makes it the most reasonable default and one of the most commonly used partitioning strategies.
(2) Random strategy
Also known as the randomness strategy: a message is placed on any partition at random (see the custom-partitioner sketch after this list).
(3) Allocation strategy by key
Kafka allows each message to carry a message key, abbreviated key. Once messages have keys, all messages with the same key are guaranteed to enter the same partition; and since messages within a partition are processed in order, per-key ordering follows.
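As referenced above, a sketch of the random strategy implemented as a custom Partitioner; the class is illustrative and would be registered via the partitioner.class producer property:

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class RandomPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        return ThreadLocalRandom.current().nextInt(numPartitions); // any partition, at random
    }
    @Override public void configure(Map<String, ?> configs) { }
    @Override public void close() { }
}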
13. Kafka's default message retention policy
The broker's default message retention policy comes in two forms:
Retention by time: expired log segments are deleted according to log.retention.hours (default 168 hours, i.e. 7 days).
Retention by size: log.retention.bytes caps how much data is retained per partition. (Segment files themselves are rolled when they reach log.segment.bytes, 1 GB by default.)
14. How does Kafka implement message replication between a single cluster?
Kafka's replication mechanism only replicates within a single cluster, not between clusters.
For cross-cluster replication, Kafka provides a core component called MirrorMaker, which consists of a consumer and a producer connected by a queue: the consumer reads messages from one cluster, and the producer sends them to another cluster.
15. Kafka message confirmation (ack response) mechanism
To ensure that data sent by the producer reliably reaches the specified topic, Kafka provides a message acknowledgment mechanism: when sending messages to a broker's topic, the producer can configure how many replicas must receive a message before the send counts as successful. This is set via the acks parameter when creating the producer, which supports the following three values:
(1) acks = 0: The producer will not wait for any response from the broker.
Features: Low latency, high throughput, data may be lost.
If there is a problem and the broker does not receive the message, then the producer has no way of knowing, and the message will be lost.
(2) acks = 1 (default): As long as the leader node of the partition in the cluster receives the message, the producer will receive a successful response from the server.
If the leader fails before the followers are synchronized, data will be lost.
Throughput at this setting depends mainly on whether messages are sent synchronously or asynchronously, and is also limited by the number of in-flight messages, i.e. how many messages the producer can send before receiving a response from the broker.
(3) acks = -1 (equivalent to all): the producer receives a successful response from the server only when all replicas participating in replication have received the message.
This mode is the safest and ensures that more than one server receives the message, and even if a server crashes, the entire cluster can still run.
According to the actual application scenario, choose to set different acks to ensure the reliability of the data.
In addition, the Producer can also choose synchronous or asynchronous mode for sending messages. If it is set to asynchronous, although the performance of message sending will be greatly improved, it will increase the risk of data loss. If you need to ensure message reliability, you must set producer.type to sync.
# synchronous mode
producer.type=sync
# asynchronous mode
producer.type=async
16. What is a replica?
To ensure that data is not lost, Kafka has provided a partition replica mechanism since version 0.8.0. The replication factor is specified with replication-factor when creating a topic and is commonly set to 3.
Replicas are relative to partitions. A partition contains one or more replicas, one of which is a leader replica, and the rest are follower replicas. Each replica is located in a different broker node.
All reads and writes go through the leader, while followers periodically replicate data from the leader. When the leader goes down, one of the followers becomes the new leader. Partition replicas introduce data redundancy and thereby provide Kafka's data reliability.
Kafka's partitioned multi-replica architecture is the core of Kafka's reliability assurance. Writing messages to multiple replicas enables Kafka to ensure message durability even in the event of a crash.
17. Kafka's ISR mechanism
In a partition, all replicas are collectively referred to as AR, and the leader maintains a dynamic in-sync replica (ISR). ISR refers to the replica set that is synchronized with the leader replica. Of course the leader replica itself is also a member of this set.
When the followers in the ISR finish synchronizing the data, the leader sends an ack to the producer. If a follower fails to synchronize with the leader for too long, it is kicked out of the ISR set; the time threshold is set by the replica.lag.time.max.ms parameter. When the leader fails, a new leader is re-elected from the ISR set.
18. What do LEO, HW, LSO and LW stand for?
LEO: short for LogEndOffset, the offset of the next message to be written in the current log file; each replica has its own LEO.
HW: high watermark. The term watermark usually appears in stream processing (Flink, Spark) to characterize the progress of elements or events in terms of time. In Kafka, the HW is per-partition and has nothing to do with time; it is positional information, i.e. an offset. The HW is the smallest LEO among the partition's ISR replicas, and a consumer can consume at most up to the message before the HW.
LSO: short for LastStableOffset. When there is an unfinished transaction, the LSO equals the position of the transaction's first message (firstUnstableOffset); when all transactions are complete, its value is the same as the HW.
LW: low watermark, the smallest logStartOffset among the replicas in the AR set.
19. Partition management: partition reassignment
When a node in the cluster suddenly goes offline, any single-replica partitions on that node become unavailable and their data is lost until the node recovers; for multi-replica partitions whose leader was on that node, leadership transfers to a follower replica elsewhere in the cluster. Either way, the partition replicas on that node are now in a failed state, and Kafka does not automatically migrate failed replicas to the remaining available brokers. Left alone, this not only unbalances the cluster load but also hurts the availability and reliability of the overall service.
When a node in the cluster needs to be offline in a planned manner, in order to ensure the reasonable allocation of partitions and replicas, we also hope that the partition replicas on the node can be migrated to other available nodes in some way.
When a broker node is added to the cluster, only newly created topic partitions can be assigned to it; existing topic partitions are not automatically moved there, since the node did not exist when they were created. This leaves the new node's load severely unbalanced relative to the original nodes.
To solve the above problems, the partition replicas need to be re-allocated in a reasonable way, the so-called partition reassignment. Kafka provides the kafka-reassign-partitions.sh script for this purpose; it can migrate partitions in cluster-expansion and broker-failure scenarios.
The use of the kafka-reassign-partitions.sh script is divided into 3 steps:
(1) First create a JSON file containing the list of topics to move;
(2) Second, generate a reassignment plan based on the topic list and the broker node list;
(3) Finally, execute the specific reassignment action according to this plan.
Partition reassignment has a large impact on cluster performance and consumes extra resources such as network and disk. In practice we reduce its granularity and run it in multiple small batches to minimize the negative impact, similar to preferred-replica election.
It should also be noted that if a broker is to be taken offline, it is best to shut down or restart the broker before performing the partition reassignment action. In this way, the broker is no longer the leader node of any partition, and its partition can be assigned to other brokers in the cluster. This reduces traffic duplication between brokers, thereby improving redistribution performance and reducing the impact on the cluster.
20. How to conduct partition leader election?
The election of the partition leader replica is implemented by the controller (Controller).
1. Leader election must run when a partition is created (both creating a topic and adding partitions create partitions) or when a partition comes online (for example, the partition's original leader replica went offline and a new leader must be elected so the partition can serve requests again). The corresponding strategy is OfflinePartitionLeaderElectionStrategy; its basic idea is to scan the AR set in order and pick the first surviving replica that is also in the ISR set. A partition's AR set is fixed at assignment time, and as long as no reassignment occurs, the order of replicas within it does not change, whereas the order of replicas in the partition's ISR set may change.
Note: the election here follows AR order, not ISR order.
If no replica in the ISR set is available, the configured unclean.leader.election.enable parameter (default false) is also checked at this point. If it is set to true, electing a leader from outside the ISR is allowed: the first surviving replica found in the AR list becomes the leader.
2. When the partition is reassigned, the leader election action also needs to be performed, and the corresponding election strategy is ReassignPartitionLeaderElectionStrategy. The idea of this election strategy is relatively simple: find the first surviving replica from the reallocated AR list, and this replica is in the current ISR list.
3. When preferred-replica election occurs, the corresponding strategy is PreferredReplicaPartitionLeaderElectionStrategy: simply make the preferred replica the leader; the first replica in the AR set is the preferred replica.
4. Leader election also occurs in one more situation: when a node is gracefully shut down (i.e. ControlledShutdown is executed), the leader replicas on that node go offline, so the affected partitions need new leaders. The corresponding strategy is ControlledShutdownPartitionLeaderElectionStrategy: find the first surviving replica in the AR list that is in the current ISR list and that is not located on the node being shut down.
21. Kafka transactions
Kafka introduced transaction support in version 0.11. Transactions provide exactly-once semantics: production and consumption can span partitions and sessions, and either all operations succeed or all fail.
Producer Transaction
To implement transactions across partitions and sessions, a globally unique Transaction ID is introduced and bound to the PID the producer obtains. When the producer restarts, it can recover its original PID through the ongoing Transaction ID.
To manage transactions, Kafka introduces a new component, Transaction Coordinator. The Producer obtains the task status corresponding to the Transaction ID by interacting with the Transaction Coordinator. The Transaction Coordinator is also responsible for writing the transaction to an internal topic of Kafka, so that even if the entire service is restarted, since the transaction state is saved, the transaction state in progress can be restored and the process can continue.
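A minimal sketch of the producer-side transaction API; the transactional id, topics and values are illustrative:

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-tx-1"); // the global Transaction ID
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(props);

producer.initTransactions();       // registers with the Transaction Coordinator, recovers state
producer.beginTransaction();
try {
    producer.send(new ProducerRecord<>("topic-a", "k1", "v1"));
    producer.send(new ProducerRecord<>("topic-b", "k2", "v2"));
    producer.commitTransaction();  // both writes become visible atomically
} catch (Exception e) {
    producer.abortTransaction();   // neither write is exposed to read_committed consumers
}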
Consumer Transaction
The transaction mechanism above is mainly considered from the producer's perspective. For the consumer, the transaction guarantee is relatively weak; in particular, it cannot be guaranteed that committed messages are consumed exactly. This is because the consumer can access any message via its offset, different segment files have different life cycles, and messages belonging to the same transaction may have been deleted after a restart.
22. What is the relationship between Kafka's consumer groups and partitions?
(1) In Kafka, consumers are managed through consumer groups. Assuming that a topic contains 4 partitions, there is only one consumer in a consumer group. That consumer will receive messages for all 4 partitions.
(2) If there are two consumers, then the four partitions will be allocated two consumers according to the partition allocation strategy.
(3) If there are four consumers, they will be distributed equally, and each consumer consumes one partition.
(4) If there are 5 consumers, there will be more consumers than the number of partitions, and the redundant consumers will be idle and will not receive any information.
23. How to ensure that each application can get all the messages in the Kafka topic, not some of the messages?
Create a consumer group for each application, and then add consumers to the group to scale reading and processing capabilities. Each group consumes messages in the topic without interfering with each other.
24. How to realize that Kafka consumers only consume a specified number of messages at a time?
Wrap the consumer in a queue-like class as an attribute, increment a counter for each consumed message, and close the consumer once the specified count is reached, as sketched below.
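A minimal sketch of that counter approach, with the group id and topic assumed:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;

public class BoundedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "bounded-group"); // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        int limit = 100, consumed = 0;
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // assumed topic
            while (consumed < limit) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    System.out.println(rec.value());
                    if (++consumed >= limit) break;
                }
            }
        } // the consumer is closed here
    }
}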
25. How does Kafka implement multi-threaded consumption?
Kafka allows one consumer in a group to consume multiple partitions, but does not allow one partition to be consumed by multiple consumers in the same group.
The steps to implement multithreading are as follows:
Producers submit data across partitions (for example with a custom random partitioner).
Consumers switch from single-threaded to multi-threaded consumption; the threads together must cover all partitions, otherwise only some partitions would be consumed. A sketch follows below.
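As referenced above, a sketch of the common one-consumer-per-thread pattern. KafkaConsumer is not thread-safe, so each thread owns its own instance; props (with a shared group.id), the topic and the thread count are assumptions:

int threads = 3; // at most the partition count; extra threads would sit idle
ExecutorService pool = Executors.newFixedThreadPool(threads);
for (int i = 0; i < threads; i++) {
    pool.submit(() -> {
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // assumed topic
            while (!Thread.currentThread().isInterrupted()) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    // handle rec; the group protocol spreads partitions across the threads
                }
            }
        }
    });
}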
26. How many consumption modes does Kafka consumption support?
Kafka supports three modes when consuming messages:
(1) at most once mode
At most once. Ensure that each message's offset is committed before the message is processed. Messages may be lost but are never duplicated. If the producer does not retry when the ack times out or returns an error, the message may never be written to Kafka and therefore never delivered to the consumer. In most cases this is done to avoid the possibility of duplication, and the business must accept that deliveries may be lost.
(2) at least once mode
At least once. Commit only after each message has been processed successfully. Messages are never lost but may be duplicated. If the producer receives an ack from the Kafka broker with acks = all, the message has been written to Kafka. But if the producer's ack times out or returns an error, it may retry a message it believes was not written. If the broker failed after the message was successfully written but before the ack was sent, this retry writes the message twice, so it is delivered to the final consumer more than once; this strategy can lead to duplicate work and incorrect results.
(3) exactly once mode
Delivered exactly once. Treat the offset as a unique id and process it together with the message, guaranteeing the atomicity of the combined processing. Messages are processed exactly once, neither lost nor duplicated, but this is hard to achieve. Even if the producer retries sending a message, it is delivered to the final consumer exactly once. This semantics is ideal but difficult, because it requires the messaging system itself to cooperate with the applications producing and consuming the messages. For instance, if after successfully consuming a message we rewind the Kafka consumer's offset, we will receive messages from that offset again. This shows that the messaging system and the client application must cooperate to achieve exactly-once.
Kafka's default mode is at least once, but this mode may cause repeated consumption problems, so idempotent design must be done in business logic.
In business scenarios, using the INSERT INTO ... ON DUPLICATE KEY UPDATE syntax when saving data (insert when absent, update when present) naturally provides idempotence.
27. How does Kafka ensure non-duplication and non-loss of data (Exactly Once semantics)?
1. Exactly once mode
Delivered exactly once. Treat the offset as a unique id and process it together with the message, guaranteeing the atomicity of the processing. Messages are processed exactly once, neither lost nor duplicated, but this is hard to achieve.
Kafka's default mode is at least once, but this mode may cause repeated consumption problems, so idempotent design must be done in business logic.
2. Idempotence
When the producer produces and sends messages, duplicate sends are inevitable: when the producer retries, the retry mechanism can send the same message repeatedly. With idempotence enabled, repeated sends produce only one effective message.
Concretely, idempotence is implemented as follows: each producer is assigned a unique PID at initialization; this PID is transparent to the application and not exposed to users at all. For a given PID, sequence numbers increment from 0, and the producer stamps every message with a sequence number, which the broker uses to check whether the data is duplicated. The PID is globally unique, and a producer that restarts after a failure is assigned a new PID, which is one reason idempotence cannot span sessions. On the broker, each topic-partition also maintains a pid-seq mapping and updates lastSeq on every commit. When a RecordBatch arrives, the broker checks it before saving the data: if the batch's baseSeq (the sequence number of its first message) is exactly 1 greater than the broker's maintained lastSeq, the data is saved; otherwise it is not.
3. At Least Once + idempotence = Exactly Once, which ensures that data is neither duplicated nor lost. A sketch of the broker-side check follows below.
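A hedged sketch of the broker-side check described in point 2; class and field names are illustrative, not Kafka internals:

import java.util.HashMap;
import java.util.Map;

class ProducerDedupSketch {
    // (PID + partition) -> last committed sequence number
    private final Map<String, Integer> lastSeq = new HashMap<>();

    boolean shouldAppend(String pidAndPartition, int baseSeq) {
        int last = lastSeq.getOrDefault(pidAndPartition, -1);
        if (baseSeq == last + 1) {       // exactly the next batch: accept it
            lastSeq.put(pidAndPartition, baseSeq);
            return true;
        }
        return false;                    // duplicate or gap: reject it
    }
}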
28. How does Kafka clean up expired data?
Kafka persists data to the hard disk, allowing you to configure certain policies for data cleaning. There are two cleaning strategies, deletion and compression.
How to clean data
1. Delete
log.cleanup.policy=delete enables the delete policy.
Deleted messages cannot be recovered. The following two policies can be configured:
# delete messages older than the specified time:
log.retention.hours=16
# after the specified size is exceeded, delete old messages:
log.retention.bytes=1073741824
To avoid blocking read operations during deletion, a copy-on-write approach is used: while the delete operation runs, the binary search of read operations is performed on a static snapshot copy, similar to Java's CopyOnWriteArrayList.
2. Compression
Compress the data, keeping only the data of the last version of each key.
First set log.cleaner.enable=true in the broker's configuration to enable the cleaner, which is disabled by default.
Set log.cleanup.policy=compact in topic's configuration to enable compaction policy.
29. Kafka and CAP Theory
As the basic theory of distributed systems, CAP theory describes that a distributed system can only satisfy at most two of the three: Consistency, Availability, and Partition tolerance.
1. The meaning of CAP
Consistency
Meaning: All nodes access the same latest copy of the data (the same at the same time).
Any read operation started after the write operation has completed must return this value, or the result of a subsequent write operation. That is, in a consistent system, once a client writes a value to any one server and gets a response, then what the client reads from any other server is the data that was just written.
Availability
Meaning: every request received by a non-faulty node in the system must produce a response.
In an available system, if a client sends a request to a server and the server has not crashed, the server must eventually respond to the client; the server is not allowed to ignore the client's request.
Partition tolerance
Meaning: when a node or network partition failure occurs in the distributed system, the system as a whole can still provide services that satisfy consistency and availability; that is, partial failures do not affect overall use.
In fact, when designing a distributed system we must consider failures caused by bugs, hardware, networks and other factors, so even when some nodes or network links fail, we require the system as a whole to remain usable (if it could not, that would be equivalent to having only one partition, and there would be no consistency or availability to discuss).
2. Note:
(1) It is not the case that one of C and A must be abandoned at all times. When there is no partition problem, a distributed system should offer full data consistency and availability.
(2) The choice between C and A need not apply to the whole system; it can be made per subsystem and per moment. For example, a payment subsystem tied to account flows must choose strong consistency, while subsystems for user name, avatar, level and the like can choose availability (A).
(3) The three CAP properties are not boolean, binary opposites, black or white. All three are a matter of degree; for example, emphasizing consistency does not mean completely abandoning availability.
(4) How to weigh the three CAP properties
CA systems: focused on consistency and availability, they require very strict consensus protocols such as the two-phase commit protocol (2PC). A CA system cannot tolerate network errors or node errors; once such a problem occurs, the whole system refuses write requests, because it cannot tell whether the peer node is down or merely unreachable. The only safe thing it can do is make itself read-only.
Sadly, such systems hardly exist: in a distributed system, network partitions are inevitable. Giving up P means giving up the distributed system itself, and CAP no longer applies. P is the premise of a distributed system, so this case does not arise.
For example, typical relational databases such as MySQL or Oracle guarantee consistency and availability, but they are not distributed systems. In this sense the three CAP properties are not interchangeable: we cannot improve P by sacrificing C and A. Partition tolerance can only be improved by making the infrastructure more stable; that is to say, it is not a software problem.
CP systems: focused on consistency and partition tolerance, they rely on majority consensus protocols, such as the Paxos algorithm (a Quorum-style algorithm). Such a system only needs to ensure that most nodes hold consistent data; a minority of nodes that have not yet synchronized to the latest version simply become unavailable, which preserves some availability.
A CP system guarantees consistency and partition tolerance while giving up availability: in extreme cases the system is allowed to be inaccessible, and user experience is often sacrificed by making users wait until data is consistent again before service is restored.
For some systems, consistency is a matter of survival: for distributed storage such as HBase and Redis, data consistency is the most basic requirement, and storage that cannot satisfy consistency will obviously not be adopted by users.
The same goes for ZooKeeper: any read from ZK yields a consistent result. Its responsibility is to keep the services under its jurisdiction synchronized and consistent, and it clearly cannot give up consistency. In extreme cases, however, ZK may drop some requests, and the consumer must re-request to get a result.
AP systems: such systems care about availability and partition tolerance and therefore cannot achieve consistency; data conflicts must be accepted, and when they are, data versions must be maintained.
This is the design of most distributed systems, ensuring high availability and partition fault tolerance, but at the expense of consistency. For example, Taobao shopping and 12306 ticket purchases, etc. As mentioned earlier, Taobao can achieve an ultra-high level of five 9s in annual availability, but at this time, data consistency cannot be guaranteed.
For example, we often encounter this when buying tickets on 12306: when we clicked buy, the system did not report that tickets were sold out, yet after entering the verification code we are told at payment time that no tickets remain. This is because when we clicked buy, the data had not reached consensus; the shortage of tickets was only checked during payment verification. This design sacrifices some user experience, but it guarantees high availability, so users are never blocked or kept waiting for a long time, which is also a trade-off.
The key point of weighing the three depends on the business.
If consistency is given up and partition tolerance is kept, nodes may lose contact with each other. For high availability, each node can then serve only with its local data, which easily leads to global data inconsistency. For Internet applications (such as Sina or NetEase), machines are numerous, nodes are scattered, and network failures are the norm, so the right scenario is to guarantee AP and give up C: for a portal site, occasional inconsistency is acceptable, but being unreachable is a huge problem.
For banks, strong consistency must be guaranteed, that is, C must hold, leaving only CA and CP. When strong consistency and availability (CA) are guaranteed, a communication failure makes the system completely unavailable; when strong consistency and partition tolerance (CP) are guaranteed, partial availability remains. The choice must be weighed by the business scenario (it is not always the case that CP is better than CA; when users can only view but not update information, refusing service outright is sometimes the better option).
3. CAP mechanism in Kafka
Within the CAP theorem, Kafka satisfies C and A, and uses certain mechanisms to ensure partition tolerance as far as possible; here C stands for data consistency and A for data availability.
Kafka first writes data to different partitions, and each partition may have multiple replicas. Data is written first to the leader partition, and all reads and writes communicate with the leader partition, which guarantees data consistency (C). Kafka then ensures availability (A) through the partition replica mechanism. But another problem remains: how to handle the divergence between the data in replica partitions and the data in the leader, which is the partition tolerance (P) problem.
In order to solve the problem of partition tolerance, Kafka uses the synchronization strategy of ISR to minimize the problem of partition tolerance.
Each leader maintains an ISR list (in-sync replica set, replicas that are basically synchronized). Its main function is to determine which replica partitions are usable, i.e. to which replicas the data in the leader partition can be synchronized. A replica's membership is determined by two conditions:
- replica.lag.time.max.ms=10000: if the replica partition's heartbeat/fetch lag behind the primary partition exceeds this time, it is kicked out of the ISR
- replica.lag.max.messages=4000 means that the number of messages behind the leader of a replica exceeds the value of this parameter, then the leader will delete the follower from the ISR (this parameter was removed in version 0.10.0)
The acks value determines when a produce request is considered complete, e.g. request.required.acks=0.
30. Why does Kafka not support read-write separation?
In Kafka, both the producer's message writes and the consumer's message reads interact with the leader replica, realizing a write-to-leader, read-from-leader production and consumption model. Kafka does not support writing to the master while reading from slaves, because that model has two obvious shortcomings:
(1) Data consistency problem: there is inevitably a delay window while data propagates from the master node to the slave nodes, and this window causes data inconsistency between master and slave. At some moment the value of datum A is X on both master and slave; the master then updates A to Y, and before the change is propagated to the slave, an application reading A from the slave sees the stale value X rather than the latest Y, which is a data inconsistency problem.
(2) Latency problem: in a system like Redis, a write must travel network → master node memory → network → slave node memory before it is visible on the slave, and the whole process takes time. In Kafka, master-slave synchronization is even more expensive: network → master node memory → master node disk → network → slave node memory → slave node disk. For latency-sensitive applications, writing to the master while reading from slaves is a poor fit.
In practice, combined with an ecosystem of monitoring, alerting and operations tooling, Kafka achieves a large degree of load balancing in most cases. Kafka's write-leader, read-leader design brings many advantages:
(1) It can simplify the implementation logic of the code and reduce the possibility of errors;
(2) The load granularity is refined and evenly distributed; compared with writing to the master and reading from slaves, it not only performs better under load but is also controllable by users;
(3) Replication lag has no effect on reads;
(4) When the replicas are stable, no data inconsistency arises.
Why, then, should Kafka implement a write-master read-slave feature that gives it nothing? All of this is owed to Kafka's excellent architectural design; in a sense, write-master read-slave is merely a stopgap for design shortcomings elsewhere.
31. Kafka's data offset reading process
(1) Connect to the ZK cluster, and get the partition information of the corresponding topic and the relevant information of the leader of the partition from ZK
(2) Connect to the broker corresponding to the corresponding Leader
(3) The consumer sends the offset saved by itself to the Leader
(4) The leader locates the segment (index file and log file) according to the offset and other information
(5) Using the index file contents, it locates the start position corresponding to the offset within the log file, reads data of the corresponding length, and returns it to the consumer
32. Kafka message data backlog, how to deal with the lack of Kafka consumption capacity?
Messages pile up often because the rate at which the producer produces does not match the rate at which the consumer consumes. It may be caused by repeated retries due to the failure of message consumption, or it may be due to the weak consumption ability of consumers, and gradually the messages will be backlogged.
Therefore, first locate the cause of the slow consumption. If it is a bug, fix the bug. If the consumption capacity is weak and downstream processing is not timely, optimize the consumption logic and increase the number of records pulled per batch: if each batch pulls too little data, i.e. pulled data / processing time < production speed, processing falls behind production and a backlog also builds up. For example, instead of consuming and processing messages one by one, process them in batches; with database inserts, one-by-one insertion and batch insertion differ greatly in efficiency.
Suppose the logic has been optimized and consumption is still slow. If Kafka's consumption capacity is insufficient, consider increasing the number of partitions of the topic and, at the same time, increasing the number of consumers in the consumer group so that consumer count = partition count (both steps are indispensable), as sketched below.
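As referenced above, the partition count can be grown with the AdminClient (topic name and target count are illustrative); afterwards, add consumers until the group size equals the partition count:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class ExpandTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            admin.createPartitions(
                    Collections.singletonMap("my-topic", NewPartitions.increaseTo(8)) // assumed topic and count
            ).all().get(); // partitions can only be increased, never decreased
        }
    }
}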
33. Kafka's offset maintenance
Before Kafka version 0.9, the consumer saved the offset in Zookeeper by default.
Starting from version 0.9, the consumer saves the offset in a built-in Kafka topic named: __consumer_offsets by default.
In actual development scenarios, with Spark or Flink you can commit Kafka's offsets manually, or commit the offsets automatically after Flink's two-phase commit. A manual-commit sketch follows below.
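A minimal sketch of manual offset commits with the plain Java consumer; the topic and props are assumed, and Spark/Flink wrap the same idea in their own APIs:

props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false); // take over offset management
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("my-topic"));  // assumed topic
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> rec : records) {
        // process rec ...
    }
    consumer.commitSync(); // commit only after the whole batch is handled
}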