For operation and maintenance personnel, installing and maintaining a monitoring system, or selecting a technical stack, has never been the focus of the job. The top priority is using the tools to monitor the required applications and components, and to find and solve problems.
As Prometheus gradually becomes the observability standard of the cloud-native era, the Alibaba Cloud cloud-native team regularly publishes a Prometheus best-practice series to help more operation and maintenance personnel make good use of it. In the first issue, we covered "Best Practices | How Spring Boot Applications Connect to Prometheus Monitoring". Today we bring you best practices for monitoring the message queue product Kafka.
This article covers three parts: an overview of Kafka, an interpretation of its key monitoring metrics, and how to build a corresponding monitoring system.
What is Kafka
Kafka origin
Kafka is a distributed publish-subscribe messaging system developed by LinkedIn and donated to the Apache Software Foundation. Kafka aims to unify online and offline message processing through Hadoop's parallel loading mechanism, and to provide real-time messaging through clusters.
Kafka was born to solve LinkedIn's data pipeline problem and served as the basis for LinkedIn's Activity Stream and operational data processing pipeline. At first LinkedIn used ActiveMQ for data exchange, but ActiveMQ at the time could not meet LinkedIn's requirements for a data transmission system: messages were frequently blocked and services frequently became unreachable. LinkedIn therefore decided to develop its own message queue, and Jay Kreps, then LinkedIn's chief architect, formed a team to build one.
Kafka features
Compared with other message queue products, Kafka has the following features:
- Persistence: messages are persisted to the local disk, and data backup is supported to prevent data loss;
- High throughput: Kafka can process millions of messages per second;
- Scalable: Kafka cluster supports hot expansion;
- Fault tolerance: allow nodes in the cluster to fail (if the number of replicas is n, allow n-1 nodes to fail);
- High Concurrency: Support thousands of clients to read and write at the same time.
At the same time, unlike other message queue products, Kafka does not use AMQP or any other pre-existing protocol; it communicates over a custom binary protocol on top of TCP, and provides strong ordering semantics and durability guarantees.
Kafka application scenarios
Based on the above characteristics, Kafka can meet various demand scenarios by processing large amounts of data in real time:
- Big data field: such as website behavior analysis, log aggregation, application monitoring, streaming data processing, online and offline data analysis and other fields.
- Data integration: import messages into offline data warehouses such as ODPS, OSS, RDS, Hadoop, and HBase.
- Stream computing integration: Integrate with stream computing engines such as StreamCompute, E-MapReduce, Spark, and Storm.
Kafka technical architecture
A Kafka cluster consists of Producers, Kafka Brokers, Consumer Groups, and ZooKeeper.
- Producer: The message publisher, also known as the message producer, sends messages to the Broker in the Push mode. The messages sent can be website page visits, server logs, or system resource information related to CPU and memory.
- Broker: A server used to store messages. Broker supports horizontal scaling. The higher the number of broker nodes, the higher the cluster throughput.
- Consumer Group: Consumers, also called message subscribers, are responsible for reading messages from the server and consuming them. A Consumer Group is a set of Consumers that usually receive the same type of messages and share the same consumption logic; they subscribe to and consume messages from the Broker in Pull mode.
- Zookeeper: Manages cluster configuration, elects the leader partition, and rebalances when a Consumer Group changes. It is worth noting that a Kafka deployment cannot run without ZooKeeper: ZooKeeper is the glue that holds everything together.
- Publish/subscribe model: Kafka adopts a publish/subscribe model. The relationship between Consumer Groups and Topics is N:N, that is, a Consumer Group can subscribe to multiple Topics at the same time, and a Topic can be subscribed to by multiple Consumer Groups at the same time. However, within a single Consumer Group, each message of a Topic is consumed by only one of the group's Consumers.
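The within-group exclusivity described above can be sketched as a toy partition-assignment model (a hypothetical illustration only; real Kafka uses pluggable partition assignors, not this function):

```python
# Simplified sketch of Kafka's publish/subscribe semantics: every
# consumer group sees all partitions of a topic, but inside a group
# each partition is assigned to exactly one consumer.
def assign_partitions(partitions, consumers):
    """Round-robin assignment of topic partitions to the consumers
    of a single group (simplified illustration)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3, 4, 5]
group_a = assign_partitions(partitions, ["a1", "a2"])
group_b = assign_partitions(partitions, ["b1", "b2", "b3"])
# Both groups together consume every partition of the topic, but no
# partition is shared by two consumers of the same group.
```

Both groups receive the full stream independently, which is the N:N relationship between Consumer Groups and Topics.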
Monitoring key metrics for Kafka
Here we discuss two different scenarios: managed Kafka cloud services and self-built Kafka.
If you use a managed Kafka service provided by a cloud vendor, the externally exposed metrics are relatively limited, and ZooKeeper-related metrics can be ignored. Taking Alibaba Cloud Kafka as an example, the main monitoring items per resource type are:
1. Instance monitoring items
- Instance message production traffic (bytes/s)
- Instance message consumption traffic (bytes/s)
- Instance disk usage (%) - the maximum value of disk usage in each node of the instance
2. Topic monitoring items
- Topic message production traffic (bytes/s)
- Topic message consumption traffic (bytes/s)
3. Group monitoring items
- Group: total number of unconsumed messages
If you use self-built Kafka, you need to pay attention to a lot of indicators, mainly including the following four directions: Broker, Producer, Consumer, and Zookeeper.
Broker metrics
Since all messages must pass through a broker to be consumed, monitoring and alerting on brokers is critical. Broker metrics fall into three groups: Kafka-emitted metrics, host-level metrics, and JVM garbage collection metrics.
- Broker - Kafka-emitted metrics
- Number of unreplicated partitions: UnderReplicatedPartitions (availability) kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
In a healthy cluster, the number of in-sync replicas (ISRs) should equal the total number of replicas. If a follower replica falls far behind the leader, it is removed from the ISR pool. If a broker becomes unavailable, the UnderReplicatedPartitions metric rises sharply. Tips: if UnderReplicatedPartitions stays above zero for a long time, investigate.
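The relationship between ISR size and under-replication can be sketched in a few lines (a simplified model of the broker's bookkeeping, not Kafka code):

```python
# Sketch: how UnderReplicatedPartitions can be derived from partition
# state. A partition is under-replicated when its in-sync replica set
# (ISR) is smaller than its full replica set.
def under_replicated_count(partitions):
    """partitions: list of dicts with 'replicas' and 'isr' sets."""
    return sum(1 for p in partitions if len(p["isr"]) < len(p["replicas"]))

healthy = {"replicas": {1, 2, 3}, "isr": {1, 2, 3}}
lagging = {"replicas": {1, 2, 3}, "isr": {1, 2}}  # one follower dropped out
print(under_replicated_count([healthy, lagging]))  # → 1
```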
- The rate at which the synchronous replica (ISR) pool shrinks/expands: IsrShrinksPerSec / IsrExpandsPerSec (availability) kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
Tips: If a replica has not contacted the leader for a period of time, or a follower's offset falls far behind the leader's, it is removed from the ISR pool, so relative fluctuations of IsrShrinksPerSec / IsrExpandsPerSec deserve attention. An increase in IsrShrinksPerSec should not be accompanied by an increase in IsrExpandsPerSec. Apart from special cases such as scaling the broker cluster or deleting partitions, the number of in-sync replicas for a given partition should remain relatively stable.
- Number of offline partitions (controller only): OfflinePartitionsCount (availability) kafka.controller:type=KafkaController,name=OfflinePartitionsCount
As the name implies, this counts partitions without an active leader. Tips: since all reads and writes are performed only on the partition leader, any non-zero value here needs attention to prevent service interruption.
- Number of active controllers in the cluster: ActiveControllerCount (availability) kafka.controller:type=KafkaController,name=ActiveControllerCount
Tips: The sum of ActiveControllerCount across all brokers should always equal 1; alert promptly on any deviation. The first node started in the Kafka cluster automatically becomes the Controller, and there is only one. The Controller is responsible for maintaining the partition leader list and coordinating leader transitions (for example, when a partition leader becomes unavailable).
- Number of UncleanLeader elections per second: UncleanLeaderElectionsPerSec (availability) kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
Between availability and consistency, Kafka defaults to availability. When the broker acting as a partition's leader goes offline, a new leader is normally elected from that partition's ISR set; an unclean leader election occurs when no in-sync replica is available and an out-of-sync replica is elected instead. Tips: an unclean leader election implies data loss, so it should be alerted on.
- Specific request (production/fetch) time: TotalTimeMs (performance) kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}
TotalTimeMs is a metric family measuring the time spent serving a request (a produce request, a consumer fetch, or a follower fetch). It is the sum of four segments: Queue, the time spent waiting in the request queue; Local, the time spent processing at the leader; Remote, the time spent waiting for the follower response (only when requests.required.acks=-1); and Response, the time spent sending the reply.
Tips: Under normal circumstances, TotalTimeMs should be approximately static and only have very small fluctuations. If an anomaly is found, the individual queue, local, remote, and response values need to be checked to locate the exact request segment that is causing the slowdown.
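The "locate the slow segment" step can be sketched as simple arithmetic over the four components (a hypothetical helper, with illustrative millisecond values):

```python
# Sketch: TotalTimeMs is the sum of the per-segment times Kafka reports
# for a request. Finding the dominant segment tells you where a slow
# request spends its time.
def dominant_segment(queue_ms, local_ms, remote_ms, response_ms):
    segments = {
        "Queue": queue_ms,        # waiting in the request queue
        "Local": local_ms,        # processing at the leader
        "Remote": remote_ms,      # waiting for follower acks (acks=-1)
        "Response": response_ms,  # sending the reply
    }
    total = sum(segments.values())
    slowest = max(segments, key=segments.get)
    return total, slowest

total, slowest = dominant_segment(2, 5, 120, 3)
print(total, slowest)  # → 130 Remote
# A spike dominated by "Remote" points at slow follower replication.
```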
- Incoming/Outgoing Byte Rate: BytesInPerSec / BytesOutPerSec (performance) kafka.server:type=BrokerTopicMetrics,name={BytesInPerSec|BytesOutPerSec}
Tips: Disk throughput and network throughput can become Kafka's performance bottleneck, for example when messages are sent across data centers, when there are a large number of topics, or when a replica happens to be the leader. In such cases, consider optimizations such as end-to-end message compression.
- Requests per second: RequestsPerSec (performance) kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower},version=([0-9]+)
Through RequestsPerSec, understand the request rate of Producer, Consumer, and Followers to ensure efficient communication of Kafka.
Tips: The request rate rises as Producers send more traffic or as the cluster expands, adding Consumers or Followers that need to fetch messages. If RequestsPerSec stays persistently high, consider batching more messages per request from Producers, Consumers, and Followers: reducing the number of requests cuts unnecessary per-request overhead and increases throughput.
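The effect of batching on request rate is simple arithmetic; the numbers below are illustrative only:

```python
# Sketch: for the same message throughput, batching more records per
# request reduces RequestsPerSec and the per-request overhead with it.
def requests_per_sec(messages_per_sec, records_per_batch):
    return messages_per_sec / records_per_batch

print(requests_per_sec(100_000, 1))    # → 100000.0 (unbatched)
print(requests_per_sec(100_000, 500))  # → 200.0 (batched)
```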
- Broker - Host Basic Metrics & JVM Garbage Collection Metrics
In addition to host-level metrics, note that Kafka is written in Scala and runs on the JVM, so it relies on Java's garbage collection to free memory; as cluster activity increases, garbage collection frequency rises accordingly.
- Disk space used vs. free disk space: Disk usage (availability). Since Kafka persists all data to disk, it is necessary to monitor the amount of disk space available to Kafka.
- Page cache reads to disk reads ratio: Page cache reads ratio (performance). Similar to a database's cache-hit ratio: the higher it is, the faster reads are and the better the performance. The metric drops briefly when a replica is catching up with the leader (for example, after spawning a new broker).
- CPU usage: CPU usage (performance). CPU is rarely the root cause of performance problems, but if CPU usage spikes, it is worth investigating.
- Network bytes sent/received: Network throughput (performance). Especially when the broker hosts other network services, high network usage can be a precursor to performance degradation.
- The total number of garbage collection processes performed by the JVM: CollectionCount (performance) java.lang:type=GarbageCollector,name=G1 (Young|Old) Generation
Young-generation garbage collection happens relatively often. All application threads are paused during collection, so fluctuations in this metric can cause fluctuations in Kafka performance.
- JVM execution time of garbage collection process: CollectionTime (performance) java.lang:type=GarbageCollector,name=G1 (Young|Old) Generation
Old-generation garbage collection frees unused memory in the old generation; it also pauses application threads, but runs only intermittently. If collections take a long time or occur frequently, consider whether the broker has enough memory.
Producer metrics
Producer pushes messages to Broker for consumption. If the Producer fails, the Consumer will have no new messages. Therefore, we need to monitor the following metrics to ensure a stable incoming data flow.
- Average number of responses received per second: Response rate (performance) kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
For Producers, the response rate is the rate of responses received from Brokers: after receiving data, the Brokers respond to the Producer. Depending on the request.required.acks setting, "received" has different meanings, for example: the leader has written the message to disk, or the leader has received confirmation from all replicas that the data was written to disk. Producer data is not available for consumption until the acknowledgment is received.
- Average number of requests sent per second: Request rate (performance) kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower},version=([0-9]+)
The request rate refers to the rate at which the Producer sends data to the Brokers. The rate trend is an important indicator to ensure service availability.
- Average request waiting time: Request latency average (performance) kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
The amount of time from when KafkaProducer.send() is called to when the Producer receives the response from the Broker. The Producer's linger.ms value determines the maximum time it will wait before sending a batch of messages, which allows it to accumulate a large number of messages before sending them in a single request. If increasing linger.ms improves Kafka throughput, you should focus on request latency to ensure that the limit is not exceeded.
- Average outgoing/incoming bytes per second: Outgoing byte rate (performance) kafka.producer:type=producer-metrics,client-id=([-.\w]+)
Understand Producer efficiency and locate possible causes of transport delays.
- Average I/O thread waiting time: I/O wait time (performance) kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
- Average bytes sent per request per partition: Batch size (performance) kafka.producer:type=producer-metrics,client-id=([-.\w]+)
To improve network resource usage, the Producer tries to group messages before sending them. The Producer will wait to accumulate the amount of data defined by batch.size, and the waiting time is bounded by linger.ms.
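The batch.size / linger.ms trade-off above can be sketched as a single flush decision (a simplified model, not the real producer; the default values shown are illustrative assumptions):

```python
# Sketch of the producer batching rule: a batch is flushed when it
# reaches batch.size bytes, or when linger.ms elapses first, whichever
# comes sooner. Simplified vs. the real Kafka producer.
def should_send(batch_bytes, waited_ms, batch_size=16_384, linger_ms=5):
    """Return True when the producer would flush the current batch."""
    return batch_bytes >= batch_size or waited_ms >= linger_ms

print(should_send(20_000, 1))   # full batch: send immediately
print(should_send(1_000, 5))    # linger.ms expired: send a partial batch
print(should_send(1_000, 1))    # neither bound hit: keep accumulating
```

Raising linger.ms lets batches fill up further (larger Batch size, fewer requests) at the cost of added latency, which is why the two metrics should be read together.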
Consumer metrics
- The number of messages that the Consumer lags the Producer on this partition: Records lag / Records lag max (performance) kafka.consumer:type=consumer-fetch-manager-metrics,partition="{partition}",topic="{topic}",client-id="{client-id}"
This metric records the difference between the Consumer's current log offset and the Producer's current log offset. If the Consumer processes real-time data, a consistently high lag value may indicate an overloaded consumer; in that case, provisioning more consumers and splitting topics across more partitions can increase throughput and reduce lag.
- Average number of bytes consumed per second for a specific topic: bytes consumed rate (performance) kafka.consumer:type=consumer-fetch-manager-metrics,client-id="{client-id}",topic="{topic}"
- Average number of records consumed per second for a specific topic: records consumed rate (performance) kafka.consumer:type=consumer-fetch-manager-metrics,client-id="{client-id}",topic="{topic}"
- The number of requests fetched by the consumer per second: fetch rate (performance) kafka.consumer:type=consumer-fetch-manager-metrics,client-id="{client-id}",topic="{topic}"
This metric directly reflects the Consumer's overall health. A fetch rate close to zero indicates a problem with the Consumer; if the metric drops, it may be that the Consumer is failing to consume messages.
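The lag computation behind Records lag can be sketched directly from offsets (a simplified model; real tooling such as kafka-consumer-groups.sh reads these from the broker):

```python
# Sketch: per-partition consumer lag is the broker's log-end offset
# minus the group's committed offset; topic lag is the sum.
def total_lag(log_end_offsets, committed_offsets):
    """Both arguments map partition -> offset."""
    return sum(
        log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    )

log_end = {0: 1_500, 1: 1_480}     # latest offsets written by Producers
committed = {0: 1_200, 1: 1_480}   # offsets committed by the group
print(total_lag(log_end, committed))  # → 300
```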
For related metrics, please refer to the official Kafka documentation. Metric names, definitions, and MBean names are subject to the latest version of the documentation at the time of actual operation.
Build a monitoring system
Monitoring through self-built Prometheus
The process of building open source Prometheus is not covered here (it is relatively involved, but the technical community has detailed step-by-step tutorials you can search for). We only briefly introduce the related Kafka Exporter: the latest version is v1.4.2, released on 2021-09-16, and kafka_exporter.go was last updated 3 months ago.
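As a rough sketch of the self-built path (host names below are placeholders; kafka_exporter is started separately and pointed at your brokers, listening on its default port 9308), Prometheus scrapes the exporter with an ordinary scrape job:

```yaml
# prometheus.yml fragment -- hypothetical example; adjust the target to
# wherever your kafka_exporter instance is running.
scrape_configs:
  - job_name: "kafka"
    static_configs:
      - targets: ["kafka-exporter-host:9308"]
```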
But if, like me, you have encountered one or more of the following scenarios:
- You are just getting started and cannot handle deploying open source Prometheus yourself;
- You do not want the daily burden of maintaining a Prometheus stack, including component updates and overall capacity expansion;
- The business launch is urgent, and a monitoring system needs to be in place immediately;
- As an enterprise user, you want a low-cost Prometheus service with unlimited storage, high performance, and high availability.
Then the Alibaba Cloud Prometheus monitoring service is the best choice: you no longer need to worry about the problems above, and with one-click integration it really is ready to use out of the box.
Monitoring through Alibaba Cloud Prometheus Monitoring
Log in to the Prometheus console. Select the target region in the upper left corner of the page, and then click the name of the Prometheus instance of Container Service, Kubernetes or ECS type as required. Click Component Monitoring in the left navigation bar.
- Add a Kafka type component
- On the Component Monitoring page, click Add Component Monitoring in the upper right corner, then click the Kafka component icon in the Access Center panel. Enter the parameters on the Configuration tab in the STEP2 area of the Access Kafka panel and click OK. You can then view monitoring metrics on the Metrics tab in the STEP2 area of the same panel.
- Collect relevant metrics by default
- View related metrics
On the Component Monitoring page, connected component instances are displayed. Click the dashboard in the Dashboard column of a component instance to view its monitoring metric data, presented comprehensively through Grafana.
If you purchased a managed Kafka cloud product, you can monitor it with "Prometheus for cloud services".
Log in to the Prometheus console. In the upper left corner of the page, select the target region, and then select New Prometheus Instance. On the pop-up page, click Prometheus Instance for Cloud Service.
- Add Alibaba Cloud Kafka monitoring
On the pop-up page, select Add Alibaba Cloud Kafka, and then click the OK button to start Kafka cloud product monitoring.
- Collect relevant metrics by default
- View related metrics
On the Prometheus cloud monitoring details list page, the connected Kafka instances are displayed. Click the CMS-KAFKA dashboard in the Dashboard column of a component instance to view its monitoring metric data, presented comprehensively through Grafana.
Compared with open source Prometheus, Alibaba Cloud Prometheus monitoring is fully managed, works out of the box, and requires no day-to-day maintenance.
References and citations:
Kafka official documentation:
https://kafka.apache.org/documentation/#monitoring
Kafka Exporter Github address: