
1. What is Kafka?

Kafka is a distributed publish/subscribe messaging system developed by LinkedIn and written in Scala. It is widely used because it scales horizontally and offers high throughput.

2. Background

Kafka is the messaging system that underlies LinkedIn's activity stream and its operational data processing pipeline. Activity stream data is the most common kind of data that almost every site uses when reporting on how its website is used.

Activity data includes page views, information about the content being viewed, and search queries. The usual way of processing this kind of data is to first write each activity to a file as a log entry, and then run statistical analysis on these files periodically.

Operational data refers to server performance data (CPU and IO utilization, request times, service logs, and so on). There are many statistical methods for analyzing operational data.

3. Basic structure diagram

4. Explanation of basic concepts

Broker

A Kafka cluster contains one or more servers, which are called brokers. The broker does not track which messages have been consumed, which improves performance. Messages are stored directly on disk and are read and written sequentially, which is fast: it avoids copying data between JVM memory and system memory, and it reduces costly object creation and garbage collection.
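The two ideas in this paragraph (sequential appends and no consumption state on the broker) can be sketched with a toy log. This is only an illustration, not Kafka's implementation: `PartitionLog` is a hypothetical class, and an in-memory list stands in for the on-disk file.

```python
class PartitionLog:
    """Toy append-only log: records are only ever added at the end."""

    def __init__(self):
        self._records = []  # stand-in for a file that is written sequentially

    def append(self, message: bytes) -> int:
        """Store a message and return its offset (its position in the log)."""
        self._records.append(message)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> list:
        """Return up to max_records messages starting at offset.

        The log keeps no per-reader state: the caller supplies the offset.
        """
        return self._records[offset:offset + max_records]


log = PartitionLog()
for text in (b"page_view", b"search", b"click"):
    log.append(text)

# Two independent readers each remember their own position.
reader_a_offset = 0
reader_b_offset = 2
batch_a = log.read(reader_a_offset, max_records=2)
batch_b = log.read(reader_b_offset)
```

Because the log never records who has read what, adding another reader costs the log nothing; each reader simply keeps its own offset.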

Producer

The message producer, the client responsible for publishing messages to the Kafka broker.

Consumer

The message consumer, the client that reads messages from the Kafka broker. The consumer pulls data from the broker and processes it.
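The pull model can be sketched as a loop in which the consumer asks for the next batch and advances its own position. This is a simplified illustration under stated assumptions: `poll` and `consume_all` are hypothetical helpers, and a plain list plays the role of the broker's stored messages.

```python
def poll(log, offset, max_records):
    """Pull up to max_records messages from log, starting at offset."""
    return log[offset:offset + max_records]


def consume_all(log, batch_size=2):
    """Pull batches until the log is exhausted; return what was processed."""
    offset = 0  # the position is tracked by the consumer, not by the log
    processed = []
    while True:
        batch = poll(log, offset, batch_size)
        if not batch:
            break  # nothing new; a long-running consumer would wait and poll again
        for message in batch:
            processed.append(message.upper())  # stand-in for real processing
        offset += len(batch)
    return processed, offset


broker_log = ["view:/home", "search:kafka", "view:/docs"]
processed, next_offset = consume_all(broker_log)
```

Pulling lets the consumer choose its own pace and batch size, rather than having data pushed at a rate it cannot control.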

Topic

Every message published to a Kafka cluster has a category, and this category is called a Topic. (Physically, messages of different Topics are stored separately. Logically, although the messages of a Topic are stored on one or more brokers, users only need to specify the Topic to produce or consume data, without worrying about where the data is stored.)

Partition

A Partition is a physical concept: each Topic contains one or more Partitions.

Consumer Group

Each Consumer belongs to a specific Consumer Group. You can specify a group name for each Consumer; a Consumer without a group name belongs to the default group.
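As general background that the definition above leaves out: within one group, each partition of a subscribed topic is read by a single consumer, while separate groups read the same topic independently. The sketch below illustrates that idea with a simple round-robin split. It is an assumption-laden illustration, not Kafka's actual assignment logic, and `assign_partitions` is a hypothetical helper.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment


# One topic with 4 partitions, read by two separate groups.
group_a = assign_partitions(range(4), ["c1", "c2"])  # work is shared
group_b = assign_partitions(range(4), ["solo"])      # one consumer reads all
```

Here `group_a` splits the four partitions between `c1` and `c2`, while the single consumer in `group_b` receives every partition.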

Topic & Partition

Logically, a Topic can be thought of as a queue. Every message must specify its Topic, which can be understood simply as specifying which queue the message goes into. So that Kafka's throughput can scale linearly, a Topic is physically divided into one or more Partitions. Each Partition corresponds to a folder on disk, which holds all the message files and index files of that Partition.
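One common way to decide which Partition a message goes to is to hash its key and take the result modulo the number of Partitions. The sketch below assumes that scheme. `choose_partition` is a hypothetical helper, and `zlib.crc32` is used only because it ships with Python; Kafka's own clients use a different hash function.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index in [0, num_partitions)."""
    return zlib.crc32(key) % num_partitions


# A topic with 3 partitions, each modelled as an ordered list of messages.
partitions = [[] for _ in range(3)]
messages = [(b"user-1", "view"), (b"user-2", "search"), (b"user-1", "click")]
for key, value in messages:
    partitions[choose_partition(key, 3)].append(value)
```

Messages with the same key always land in the same Partition, so their relative order is preserved there, while different keys can be spread across Partitions and handled in parallel.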

If you create two topics, topic1 and topic2, with 13 and 19 partitions respectively, a total of 32 folders will be created across the cluster (the cluster used in this article has 8 nodes, and the replication-factor of both topic1 and topic2 is 1).
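The folder count can be checked with a few lines of code. Kafka names each partition folder `<topic>-<partition index>`; the round-robin placement across the 8 nodes below is a simplifying assumption made for illustration, and both helper functions are hypothetical.

```python
def partition_folders(topic, num_partitions):
    """Folder names for one replica of each partition, e.g. 'topic1-0'."""
    return [f"{topic}-{index}" for index in range(num_partitions)]


def spread_over_nodes(folders, num_nodes):
    """Place folders on nodes round-robin (an assumed, simplified policy)."""
    nodes = {node: [] for node in range(num_nodes)}
    for index, folder in enumerate(folders):
        nodes[index % num_nodes].append(folder)
    return nodes


# replication-factor is 1, so each partition has exactly one folder.
folders = partition_folders("topic1", 13) + partition_folders("topic2", 19)
layout = spread_over_nodes(folders, 8)
total = sum(len(node_folders) for node_folders in layout.values())
```

With 13 + 19 = 32 folders and 8 nodes, an even spread puts 4 folders on each node. A higher replication-factor would multiply the total accordingly.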

5. Applicable scenarios

Messaging

For conventional messaging workloads, Kafka is a good choice: partitions, replication, and fault tolerance give it good scalability and performance. However, we should be clear that, so far, Kafka does not provide JMS-style enterprise features such as "transactions", "message delivery guarantees" (a message acknowledgment mechanism), or "message grouping". Kafka can only be used as a "regular" messaging system. To a certain extent, it does not yet guarantee that sending and receiving messages is absolutely reliable (for example, messages may be resent or lost in transit).

Website activity tracking

Kafka is an excellent tool for "website activity tracking": page views, user actions, and similar information can be sent to Kafka and then used for real-time monitoring or offline statistical analysis.

Metrics

Kafka is often used for operational monitoring data. This includes aggregating statistics from distributed applications to produce centralized feeds of operational data.

Log Aggregation

Kafka's characteristics make it very well suited to serve as a "log collection center". Applications can send their operational logs to the Kafka cluster in batches and asynchronously, instead of storing them locally or in a database. Kafka can submit messages in batches and compress them, so the producer side incurs almost no performance overhead. The consumer side can then be Hadoop or another system for storage and analysis.
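The batching and compression idea can be sketched with the Python standard library alone. This is only an illustration of why batching helps, not how Kafka encodes messages: the helper names are hypothetical, gzip and JSON are assumed formats, and the asynchronous sending step is left out.

```python
import gzip
import json


def make_batches(records, batch_size):
    """Split records into consecutive batches of at most batch_size."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def encode_batch(batch):
    """Serialize a batch as newline-separated JSON and gzip it as one unit."""
    payload = "\n".join(json.dumps(record, sort_keys=True) for record in batch)
    return gzip.compress(payload.encode("utf-8"))


def decode_batch(blob):
    """Reverse encode_batch: decompress and parse each line."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.split("\n")]


log_lines = [{"level": "INFO", "msg": f"request {n} handled"} for n in range(100)]
batches = make_batches(log_lines, 25)
encoded = [encode_batch(batch) for batch in batches]

raw_size = sum(len(json.dumps(r, sort_keys=True)) for r in log_lines)
compressed_size = sum(len(blob) for blob in encoded)
```

Log lines tend to be repetitive, so compressing a whole batch at once usually shrinks it far more than compressing each line separately, and one send per batch replaces many small sends.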


Link: https://blog.csdn.net/code52/article/details/50475511


民工哥