1. What is Kafka?
Kafka is a distributed publish/subscribe messaging system developed at LinkedIn and written in Scala. It is widely used for its horizontal scalability and high throughput.
2. Background
Kafka was built as the messaging system underlying LinkedIn's activity stream and operational data processing pipeline. Activity stream data is the most common kind of data that almost every site uses when reporting on website usage.
Activity data includes page views, information about the content being viewed, and search queries. Such data is usually processed by first writing each activity to a log file and then running statistical analysis over those files periodically.
Operational data refers to server performance data (CPU and IO utilization, request times, service logs, and so on). It can be analyzed with a wide variety of statistical methods.
3. Basic structure diagram
4. Explanation of basic concepts
Broker
A Kafka cluster contains one or more servers, which are called brokers. A broker does not track the consumption state of its data, which improves performance. Messages are stored directly on disk and are read and written sequentially, which is fast: this avoids copying data between JVM memory and system memory, and reduces expensive object creation and garbage collection.
Producer
The message producer, responsible for publishing messages to the Kafka broker.
Consumer
The message consumer: the client that reads messages from the Kafka broker. The consumer pulls data from the broker and processes it.
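The pull model, combined with a broker that keeps no consumption state, can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `PartitionLog` and `PullingConsumer` are hypothetical names invented for illustration, not part of any Kafka client API, and a Python list stands in for the on-disk log.

```python
class PartitionLog:
    """Toy stand-in for broker-side storage: an append-only list of messages."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return its offset (its position in the log)."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages):
        """Return up to max_messages starting at offset.

        The log stores no per-consumer state; the caller supplies the offset.
        """
        return self._messages[offset:offset + max_messages]


class PullingConsumer:
    """Toy consumer that pulls from the log and remembers its own position."""

    def __init__(self, log):
        self._log = log
        self.offset = 0  # consumption state lives on the consumer side

    def poll(self, max_messages=2):
        """Pull the next batch and advance the locally tracked offset."""
        batch = self._log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = PartitionLog()
for text in ["page_view", "search", "click"]:
    log.append(text)

consumer = PullingConsumer(log)
first = consumer.poll()   # two messages
second = consumer.poll()  # the remaining message
third = consumer.poll()   # nothing new yet, so an empty list
```

Because the position is held by the consumer, a second consumer could read the same log from offset 0 without the log having to track either of them.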
Topic
Every message published to the Kafka cluster has a category, called a Topic. (Physically, messages of different Topics are stored separately. Logically, although a Topic's messages are stored on one or more brokers, users only need to specify the Topic to produce or consume data, without worrying about where the data is stored.)
Partition
A Partition is a physical concept; each Topic contains one or more Partitions.
Consumer Group
Each Consumer belongs to a specific Consumer Group. You can specify a group name for each Consumer; if no group name is specified, the Consumer belongs to the default group.
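In practice, the point of a group is that the partitions of a topic are divided among the group's members, while each group reads the topic independently of other groups. The sketch below illustrates that idea with a simple round-robin split. The function name `assign_partitions` and the consumer names are hypothetical, and round-robin is a simplification chosen for illustration; Kafka's actual assignment strategy is configurable.

```python
def assign_partitions(partitions, consumers):
    """Spread partition indexes over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Two groups subscribe to the same four-partition topic.
group_a = assign_partitions([0, 1, 2, 3], ["a1", "a2"])
group_b = assign_partitions([0, 1, 2, 3], ["b1"])

print(group_a)  # {'a1': [0, 2], 'a2': [1, 3]}
print(group_b)  # {'b1': [0, 1, 2, 3]}
```

Within group A each partition has exactly one owner, so the two consumers share the work; group B has a single member, which therefore reads every partition.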
Topic & Partition
A Topic can logically be thought of as a queue. Every message must specify its Topic, which simply means stating which queue the message should be put into. So that Kafka's throughput can scale linearly, a Topic is physically divided into one or more Partitions. Each Partition corresponds to a folder on disk, which holds all of that Partition's message and index files.
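One common way to decide which Partition a message lands in is to hash a message key and take the result modulo the partition count. The following is a minimal sketch of that idea, not Kafka's own partitioner: `choose_partition` is a hypothetical helper, and `zlib.crc32` is a stand-in hash picked only because it is deterministic and in the standard library.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a message key to a partition index in [0, num_partitions)."""
    # crc32 is a stand-in hash for illustration; it is not the hash Kafka uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Three in-memory "partitions" of one topic, each an ordered list of messages.
partitions = {index: [] for index in range(3)}

messages = [("user-1", "view"), ("user-2", "search"), ("user-1", "click")]
for key, value in messages:
    partitions[choose_partition(key, 3)].append((key, value))
```

Messages with the same key always map to the same partition, so their relative order is preserved inside that partition, while different keys can be spread across partitions and processed in parallel.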
If you create two topics, topic1 and topic2, with 13 and 19 partitions respectively, a total of 32 folders will be created across the cluster (the cluster used in this article has 8 nodes, and the replication-factor of both topic1 and topic2 is 1).
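The folder count can be checked with a short sketch. It assumes a replication factor of 1, as in the example above, and uses Kafka's `<topic>-<partition index>` naming for partition folders; the helper name `partition_directories` is hypothetical.

```python
def partition_directories(topics):
    """List one folder name per partition, in the form <topic>-<partition index>."""
    return [
        f"{topic}-{index}"
        for topic, partition_count in topics.items()
        for index in range(partition_count)
    ]


directories = partition_directories({"topic1": 13, "topic2": 19})
print(len(directories))                 # 32
print(directories[0], directories[-1])  # topic1-0 topic2-18
```

With a higher replication factor, each partition would have additional replica folders on other brokers, so the total would grow accordingly.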
5. Applicable scenarios
Messaging
As a conventional messaging system, Kafka is a good choice: partitions, replication, and fault tolerance give Kafka good scalability and performance. So far, however, we should be clear that Kafka does not provide JMS-style enterprise features such as "transactions", "message delivery guarantees" (a message acknowledgment mechanism), or "message grouping". Kafka can only be used as a "regular" messaging system; to a certain extent, it does not yet guarantee that sending and receiving messages is absolutely reliable (for example, messages may be retransmitted or lost in transit).
Website activity tracking
Kafka is very well suited to "website activity tracking": information such as page views and user actions can be sent to Kafka and then used for real-time monitoring, offline statistical analysis, and so on.
Metrics
Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
Log Aggregation
Kafka's characteristics make it well suited to serve as a "log collection center". Applications can send their operation logs to the Kafka cluster in batches and asynchronously, instead of storing them locally or in a database. Kafka can submit messages in batches and compress them, so the producer side sees almost no performance overhead. The consumer side can then feed the logs into Hadoop or other systems for storage and analysis.
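The benefit of batching and compressing log messages can be sketched with the standard library alone. This is an illustration of the general idea, not Kafka's wire format: `build_batch` and `read_batch` are hypothetical helpers, and gzip over newline-delimited JSON is an assumed encoding chosen for simplicity.

```python
import gzip
import json


def build_batch(records):
    """Serialize a list of log records into one gzip-compressed payload."""
    raw = "\n".join(json.dumps(record) for record in records).encode("utf-8")
    return gzip.compress(raw)


def read_batch(payload):
    """Decompress a payload produced by build_batch back into records."""
    lines = gzip.decompress(payload).decode("utf-8").splitlines()
    return [json.loads(line) for line in lines]


records = [{"level": "INFO", "msg": f"request {i} handled"} for i in range(100)]
payload = build_batch(records)

# Size of the same records if each were sent uncompressed on its own.
raw_size = sum(len(json.dumps(record)) for record in records)
```

Log lines tend to be repetitive, so one compressed batch is typically much smaller than the sum of the individual messages, and sending it costs a single request instead of one hundred.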