Kafka is an open-source stream processing platform managed by the Apache Software Foundation (official website: http://kafka.apache.org/), but most people in China know it as a message queue, so let's first understand what a message queue is.
Message queue
The message queue, as its name implies, is a queue for storing messages. The message producer stores messages in the message queue, and the consumer reads the contents of the message queue. Each message in the queue has a position, just like an index in an array, which Kafka calls the offset. For producers, there is a special offset called the LEO (log end offset), which points to the next position in the message queue where a message will be stored.
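The idea of offsets and the LEO can be sketched in a few lines of Python. This is only a toy in-memory model, not Kafka's actual storage; the class name `MessageQueue` and its methods are made up for illustration.

```python
class MessageQueue:
    """A toy append-only queue (hypothetical, not Kafka's real log)."""

    def __init__(self):
        self._messages = []

    @property
    def log_end_offset(self):
        # LEO: the offset that the next produced message will receive.
        return len(self._messages)

    def produce(self, message):
        # Store the message at the current LEO and return that offset.
        offset = self.log_end_offset
        self._messages.append(message)
        return offset

    def read(self, offset):
        # Reading does not remove the message; the offset is just an index.
        return self._messages[offset]


queue = MessageQueue()
first = queue.produce("order-created")
second = queue.produce("order-paid")
print(first, second, queue.log_end_offset)  # prints: 0 1 2
print(queue.read(1))                        # prints: order-paid
```

Because reading never deletes anything, the same message can be read again later by anyone who knows its offset.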
Here we focus on the consumer. A message queue can of course be read by multiple consumers. Each consumer has a unique group-id to distinguish it, and Kafka also records the position (offset) that each consumer has read up to.
Question: Why does Kafka record each consumer's offset instead of leaving that to the consumers themselves?
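As a rough illustration of the queue side remembering a read position per group-id, here is a small sketch. The class `OffsetTrackingQueue` is hypothetical, and it advances the position automatically on every poll, which is a simplification: a real Kafka consumer commits its offset explicitly or on a timer.

```python
class OffsetTrackingQueue:
    """Toy queue that remembers, per group-id, the next offset to read."""

    def __init__(self, messages):
        self._messages = list(messages)
        self._group_offsets = {}  # group-id -> next offset to read

    def poll(self, group_id):
        # Unknown groups start from offset 0 in this sketch.
        offset = self._group_offsets.get(group_id, 0)
        if offset >= len(self._messages):
            return None  # nothing new for this group
        self._group_offsets[group_id] = offset + 1
        return self._messages[offset]

    def position(self, group_id):
        return self._group_offsets.get(group_id, 0)


shared = OffsetTrackingQueue(["m0", "m1", "m2"])
billing = [shared.poll("billing"), shared.poll("billing")]
audit = [shared.poll("audit")]
print(billing, shared.position("billing"))  # prints: ['m0', 'm1'] 2
print(audit, shared.position("audit"))      # prints: ['m0'] 1
```

The two groups read the same messages independently, each at its own pace.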
Topic
Above, we used a single stream of data as an example to explain what a message queue is. What if there are multiple kinds of data (multiple queues)? That is also very simple. In Kafka, we can use different topics to distinguish different data. Different producers can store data in different topics, and different consumers can read data from different topics.
Partition
What should we do when one stream of data becomes very large? Split it, of course. In Kafka, you can split a topic into multiple partitions, and then manage, produce, and consume data at the partition level. The most obvious benefit of splitting is improved throughput, because multiple partitions work in parallel and do not interfere with each other.
As for how to split, Kafka provides several default partitioning strategies: round-robin (polling), random, and hashing. You can also implement your own partitioning strategy, so I won't go into detail here.
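The three strategies can be sketched as follows. These are simplified stand-ins with made-up function names, not Kafka's own partitioner classes; in particular, `zlib.crc32` is used here only as a convenient stable hash, whereas Kafka's Java client hashes keys with murmur2.

```python
import itertools
import random
import zlib


def round_robin_partitioner(num_partitions):
    counter = itertools.count()

    def choose(key=None):
        # Ignore the key and cycle through partitions in order.
        return next(counter) % num_partitions

    return choose


def random_partitioner(num_partitions, seed=None):
    rng = random.Random(seed)

    def choose(key=None):
        # Ignore the key and pick any partition.
        return rng.randrange(num_partitions)

    return choose


def hash_partitioner(num_partitions):
    def choose(key):
        # The same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % num_partitions

    return choose


rr = round_robin_partitioner(3)
print([rr() for _ in range(5)])  # prints: [0, 1, 2, 0, 1]

by_key = hash_partitioner(3)
print(by_key("user-42") == by_key("user-42"))  # prints: True
```

Hashing matters when messages with the same key must stay in order, since ordering is only kept within a single partition.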
Consumer-group
After a topic is partitioned, how do consumers consume it? Here we have to introduce the concept of the Consumer-group. In Kafka, to ensure data consistency, a partition can only be consumed by one consumer instance of a group at a time. To improve consumer throughput, multiple consumer instances are generally set up to consume different partitions. These instances together form a Consumer-group, and they share one group-id.
Note:
- Since a partition can only be consumed by one consumer instance of a group at a time, having more consumer instances than partitions serves no purpose, and the surplus consumer instances will sit idle.
- If the instances in a Consumer-group change (an instance goes online or offline), or the number of partitions changes, a consumer group rebalance is triggered.
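Both notes can be seen in a small sketch of how partitions might be handed out inside one group. The function `assign_partitions` is hypothetical and uses a plain round-robin over sorted names; Kafka's real assignors are more elaborate, and a rebalance is modeled here simply as calling the function again with the new membership.

```python
def assign_partitions(partitions, consumers):
    """Give every partition exactly one owner within the group."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    ordered = sorted(consumers)
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2]

# Two consumers share three partitions.
before = assign_partitions(partitions, ["c1", "c2"])
print(before)  # prints: {'c1': [0, 2], 'c2': [1]}

# Two more consumers join: a "rebalance" recomputes the assignment.
after = assign_partitions(partitions, ["c1", "c2", "c3", "c4"])
print(after)   # prints: {'c1': [0], 'c2': [1], 'c3': [2], 'c4': []}
```

With four consumers and three partitions, `c4` ends up with nothing to do, which is the idle case described above.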
Replication
How does Kafka make data highly available? In a distributed environment, if you want to avoid losing data as far as possible, the only way is to make multiple copies of the data and place them on different machines. The copied data is called a replica (Replication).
There are a few key terms here.
HW: high watermark, a special offset. Only messages below this offset can be read by consumers. The specific value of the high watermark depends on the state of master-slave replica synchronization, which will not be expanded on here.
ISR: in-sync replicas, the set of replicas in a synchronized state. It refers to the replicas whose data lags behind the master replica (called the leader in Kafka) by no more than a certain range (a time range or a message-count range). The master replica itself is of course always in the ISR. When the master replica fails, a new master replica is selected from the ISR to take over its work.
OSR: out-of-sync replicas, the counterpart of the ISR. It simply refers to the replicas that are not in the ISR.
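To make the three terms concrete, here is a rough sketch that derives the ISR, the OSR, and the high watermark from each replica's LEO. It assumes a message-count lag threshold (`max_lag`, a made-up parameter; current Kafka versions judge lag by time) and takes the high watermark as the smallest LEO among the in-sync replicas. The function names and broker names are hypothetical.

```python
def in_sync_replicas(leader, leo_by_replica, max_lag):
    """Replicas whose LEO is within max_lag messages of the leader's LEO."""
    leader_leo = leo_by_replica[leader]
    return {
        replica
        for replica, leo in leo_by_replica.items()
        if replica == leader or leader_leo - leo <= max_lag
    }


def high_watermark(isr, leo_by_replica):
    # Messages below this offset exist on every in-sync replica.
    return min(leo_by_replica[replica] for replica in isr)


leo = {"broker-1": 10, "broker-2": 8, "broker-3": 3}  # broker-1 is the leader
isr = in_sync_replicas("broker-1", leo, max_lag=4)
osr = set(leo) - isr
hw = high_watermark(isr, leo)

print(sorted(isr))  # prints: ['broker-1', 'broker-2']
print(sorted(osr))  # prints: ['broker-3']
print(hw)           # prints: 8
```

In this example the leader already holds offsets 8 and 9, but consumers cannot read them yet because `broker-2` has not copied them.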
Replica master-slave synchronization
Once there is more than one copy of the data, synchronization between the master and slave replicas is inevitably involved: the slave replicas periodically pull the latest data from the master replica. It should also be noted that in Kafka only the master replica serves external reads and writes (newer versions of Kafka let slave replicas serve limited reads), and the only role of a slave replica is to act as a backup for the master replica.
Speaking of master-slave synchronization, let me mention Kafka's ack setting by the way.
The producer can set the level of data reliability through the request.required.acks parameter (named acks in the newer Java producer):
- 0 : The producer does not wait for confirmation from the master replica; the message is considered sent successfully as soon as it is sent. This is the most efficient setting, but there is a risk of data loss.
- 1 : (Default in the Java producer before Kafka 3.0) After sending the data, the producer waits for the master replica to confirm receipt before the message is considered sent successfully. In this case, the message may be lost if the master replica goes down.
- -1 (or all) : The producer waits for all replicas in the ISR to confirm receipt before the message is considered sent successfully. Reliability is the highest, but because the slave replicas must pull the data and confirm it, efficiency is the lowest.
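The three levels differ only in whose confirmation the producer waits for, which can be sketched as a single decision function. This is an illustrative model with made-up names (`send_succeeded`, `f1`, `f2`), not the producer's real code path.

```python
def send_succeeded(acks, leader_confirmed, isr_followers, confirmed_followers):
    """Decide whether a send counts as successful under a given acks level."""
    if acks == 0:
        # Fire and forget: no confirmation is awaited at all.
        return True
    if acks == 1:
        # Only the master replica (leader) has to confirm.
        return leader_confirmed
    if acks in (-1, "all"):
        # The leader and every follower in the ISR have to confirm.
        return leader_confirmed and set(isr_followers) <= set(confirmed_followers)
    raise ValueError(f"unsupported acks value: {acks!r}")


# The leader has the message, but only one of two ISR followers has copied it.
state = dict(
    leader_confirmed=True,
    isr_followers={"f1", "f2"},
    confirmed_followers={"f1"},
)
print(send_succeeded(0, **state))   # prints: True
print(send_succeeded(1, **state))   # prints: True
print(send_succeeded(-1, **state))  # prints: False
```

If the leader crashed at this moment, the message would survive only if `f1` became the new leader, which is why levels 0 and 1 can lose data while -1 does not report success until every ISR member holds the message.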
Broker
Kafka manages data at the replica level, and managing this data requires a manager. That manager is the Broker. Brokers place different replicas of the same data on different machines and generate new replicas when the number of replicas is insufficient, so that data is not lost even after some Brokers go down.
All Brokers also synchronize some metadata with each other, such as where the master replica of a given piece of data is located, and which Broker a slave replica should pull its data from...
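Why spreading replicas across Brokers protects data can be shown with a toy placement function. The rotation scheme below is an assumption made for illustration (Kafka's real assignment also randomizes the starting Broker and can take racks into account), and the names `place_replicas` and `surviving_replicas` are hypothetical.

```python
def place_replicas(num_partitions, brokers, replication_factor):
    """Put each partition's replicas on distinct brokers by simple rotation."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    placement = {}
    for partition in range(num_partitions):
        # The first broker in each list plays the role of the master replica.
        placement[partition] = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
    return placement


def surviving_replicas(placement, failed_broker):
    """Replicas that remain for each partition after one broker goes down."""
    return {
        partition: [b for b in replicas if b != failed_broker]
        for partition, replicas in placement.items()
    }


placement = place_replicas(3, ["broker-1", "broker-2", "broker-3"], 2)
print(placement[0], placement[1], placement[2])

survivors = surviving_replicas(placement, "broker-2")
print(all(replicas for replicas in survivors.values()))  # prints: True
```

With two replicas per partition on different Brokers, losing any single Broker still leaves at least one copy of every partition.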
Concluding remarks
This is my first attempt at explaining introductory Kafka knowledge in a hand-drawn style. It is very superficial, and many details have not been expanded on; please forgive me.