About the FAQ Knowledge Base

In the Pulsar community groups, we often see people run into the same problems when they first encounter and use Pulsar. To answer these high-frequency questions more efficiently, and to thank the friends who raise high-quality questions, we have set up an FAQ knowledge base to collect and answer your questions.

We regularly collect and screen the high-frequency questions raised in the community, and community experts pick out the high-quality ones to answer. After being consolidated and edited, the answers are shared with community members as a first reference when problems come up. We hope this helps everyone solve problems when using Pulsar.

Here's a look at the questions collected in this issue:

Counting the number of messages under a topic

Question 1: How can I count all messages (not Entries) under a topic?

Answer: There is no such feature at present. A message-level Backlog could be supported on top of the current Message Index feature. Relatively speaking, Exclusive and Failover subscription modes would be simpler to handle, while subscriptions that allow individual messages to be acknowledged out of order would be more complicated.

Configuration for millions of topics

Question 2: How should Bookies, Brokers, and hardware be configured for millions of topics?

Answer: Capacity evaluation has to take into account factors such as message size and the batch size that can actually be achieved in the business workload. If a single client sends 100-byte messages, batching and sending every 2 ms, and two copies of the data are written, then two machines are basically enough, but this obviously does not match real application scenarios. For Pulsar, capacity evaluation mainly considers the following factors:

  1. Message throughput: directly determines the requirements for network cards and disks. The maximum throughput the hardware supports must be >= the expected throughput;
  2. Actual batch size: determines the demand for CPU. Base this on concrete test data: run benchmarks with different batch sizes, then evaluate against the actual business workload;
  3. Number of topics: determines the demand on the Metadata Service. With a few thousand or tens of thousands of topics, this has little impact. At larger scales, the main thing to optimize is the disk, for example by separating Log Data from Data and using NVMe SSDs where possible.
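
As a rough illustration of the first two factors, the arithmetic can be sketched as below. The function names and the workload numbers are assumptions made up for this example, replication is treated as a simple multiplier, and batching and protocol overhead are ignored. It is not a substitute for a benchmark.

```python
def estimate_load(msgs_per_sec, msg_bytes, batch_size, replicas):
    """Back-of-the-envelope load figures (hypothetical helper, not a Pulsar API)."""
    ingress_bytes = msgs_per_sec * msg_bytes      # client -> Broker traffic per second
    storage_bytes = ingress_bytes * replicas      # bytes written across all Bookies per second
    batches_per_sec = msgs_per_sec / batch_size   # per-batch work, which drives CPU demand
    return ingress_bytes, storage_bytes, batches_per_sec


def hardware_is_sufficient(max_bytes_per_sec, expected_bytes_per_sec):
    # Factor 1: maximum hardware throughput must be >= expected throughput.
    return max_bytes_per_sec >= expected_bytes_per_sec


# Assumed workload: 1,000,000 msg/s of 100-byte messages, batches of 100, 2 copies.
ingress, storage, batches = estimate_load(1_000_000, 100, 100, 2)
print(ingress, storage, batches)
```

Changing `batch_size` while keeping the other inputs fixed shows why the batch size matters for CPU: the byte rate stays the same, but the number of batches the cluster must process per second changes.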

Related configuration:

  • Broker configuration: The main considerations are the network card and CPU. A 10 Gigabit network card is recommended. The Broker writes multiple copies to Bookies in parallel, so the write bandwidth a single Broker can carry should be divided by the number of copies.
  • Bookie configuration: The main considerations are the network card and disk. A 10 Gigabit network card is recommended. If you use an HDD for the Journal disk, you must turn off synchronous disk flushing, otherwise the flush latency will directly limit write throughput.
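
The Broker bandwidth rule above can be made concrete with a small calculation. This is only a sketch: it assumes the network card is the sole bottleneck and ignores protocol overhead, and the function name is hypothetical.

```python
def broker_write_capacity(nic_bits_per_sec, copies):
    """Upper bound on the client write rate one Broker can carry, in bytes/s.

    The Broker sends every message to `copies` Bookies in parallel, so its
    outbound bandwidth is shared between the copies.
    """
    nic_bytes_per_sec = nic_bits_per_sec / 8
    return nic_bytes_per_sec / copies


# A 10 Gigabit card with 2 copies: the outbound bandwidth is split two ways.
capacity = broker_write_capacity(10_000_000_000, 2)
print(f"{capacity / 1e6:.0f} MB/s")
```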

The above are only general considerations and suggestions; capacity evaluation should be based on actual test results. The community also has simple, easy-to-use tools, such as pulsar-perf, which ships with Pulsar, or the OpenMessaging Benchmark, that make it easy to run performance tests against a given Pulsar cluster.

How to clean up files when the disk is full

Question 3: What files need to be cleaned up after the Pulsar disk is full?

Answer: A Bookie cannot be cleaned up by deleting files directly. Doing so leaves the state of the whole Bookie inconsistent, and once the files are deleted the damage may be irreversible.

How to handle a full disk depends on the environment. In a test environment, or any other environment where data loss does not matter, you can use the admin tool to delete all topics in the cluster and then trigger a manual GC with curl on each Bookie node, one by one. The Bookies will then gradually release disk space.

In a production environment, deleting all topics is not an option. First, expand capacity so that the service is not interrupted. Then use the storage size of each topic to find out which topics occupy a large amount of space. Pulsar does not delete messages that a subscription has not yet acknowledged. So if you find that topic data cannot be deleted because of unused subscriptions, check whether those subscriptions are still in use or can be cleaned up. In short, find the topics whose data can be cleaned up, let the unused data expire or be deleted, and then trigger a manual GC on the Bookie nodes to reclaim the space.
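
The manual GC step can be scripted instead of running curl by hand on each node. The sketch below assumes that the BookKeeper HTTP admin server is enabled on every Bookie (`httpServerEnabled=true`) and that it listens on port 8000; the host name is a placeholder. The `/api/v1/bookie/gc` endpoint is taken from the BookKeeper admin REST API as we understand it, so please verify it against the documentation for your version before relying on it.

```python
import urllib.request


def build_gc_request(bookie_host, http_port=8000):
    """Build (but do not send) the HTTP request that asks one Bookie to run GC."""
    url = f"http://{bookie_host}:{http_port}/api/v1/bookie/gc"
    return urllib.request.Request(url, method="PUT")


def trigger_gc(bookie_hosts, http_port=8000, timeout=10):
    """Trigger GC on each Bookie in turn and return the HTTP status per host."""
    statuses = {}
    for host in bookie_hosts:
        request = build_gc_request(host, http_port)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            statuses[host] = response.status
    return statuses


# Placeholder host name; only the request is built here, nothing is sent.
request = build_gc_request("bookie-1.example.internal")
print(request.get_method(), request.full_url)
```

Calling `trigger_gc` with the list of your Bookie hosts would send the requests one node at a time, which matches the "one by one" advice above.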

Consumption Position and Ack

Question 4: Do newly connected Pulsar consumers start consuming from the latest message?

Answer: If the subscription does not exist yet, consumption starts from the latest message.

Question 5: Can the consumption position be specified?

Answer: Yes, with ConsumerBuilder#subscriptionInitialPosition. The default is Latest, which can be changed to Earliest. If the subscription already exists, changing this setting has no effect and consumption continues from the subscription's existing position. If necessary, use Consumer#seek to change the position.
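
To make the behaviour described in Questions 4 and 5 easier to picture, here is a toy in-memory model. It is not the Pulsar client API: the `Topic` class and its methods are invented for illustration, positions are plain list indexes, and receiving a message is treated as acknowledging it.

```python
class Topic:
    """Toy in-memory topic: a message list plus one cursor per subscription."""

    def __init__(self):
        self.messages = []
        self.cursors = {}  # subscription name -> index of the next message to read

    def publish(self, message):
        self.messages.append(message)

    def subscribe(self, name, initial_position="latest"):
        # The initial position only matters when the subscription is created.
        if name not in self.cursors:
            if initial_position == "earliest":
                self.cursors[name] = 0
            else:
                self.cursors[name] = len(self.messages)
        return name

    def receive(self, name):
        # Return the next message for this subscription, or None if caught up.
        index = self.cursors[name]
        if index >= len(self.messages):
            return None
        self.cursors[name] = index + 1
        return self.messages[index]

    def seek(self, name, index):
        # Explicitly move an existing subscription's position.
        self.cursors[name] = index


topic = Topic()
topic.publish("m0")
topic.publish("m1")

topic.subscribe("sub-a")          # new subscription, default position: latest
topic.publish("m2")
first = topic.receive("sub-a")    # only messages published after subscribing

topic.subscribe("sub-a", initial_position="earliest")  # already exists: ignored
second = topic.receive("sub-a")

topic.seek("sub-a", 0)            # explicit rewind to the start
third = topic.receive("sub-a")
print(first, second, third)
```

In this model, re-subscribing with a different initial position changes nothing, and only `seek` moves the cursor back, which mirrors the roles of subscriptionInitialPosition and Consumer#seek in the answer above.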

Question 6: After a message is Acked, if there is no retention policy, is it deleted immediately so that it cannot be replayed?

Answer: Being marked for deletion does not mean being deleted immediately. You can still seek back to a message as long as it has not actually been deleted.

BookKeeper Ledger full

Question 7: A single Ledger directory is full, but the other directories still have free space.

Answer: BookKeeper PR-2121 resolved this issue by preventing entry log compaction from stopping. Please make sure to use BookKeeper version 4.10 or above.

Microservice Scenario - Instance Subscription

Question 8: In a microservice scenario, there may be multiple instances of a service. Do these instances all use different subscription names?

Answer: In Pulsar, each subscription can read the complete data of a topic. It depends on your usage scenario: if every instance should receive the complete topic data, each instance should use a different subscription name; if the instances should share the topic data between them, they should use the same subscription name. Most scenarios are the latter.
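
The difference between the two choices can be shown with a toy dispatcher. This is a simplified model rather than Pulsar code: the `deliver` function and all names are invented for illustration, and the round-robin split only loosely resembles a Shared subscription. Note also that in Pulsar, attaching several consumers to one subscription requires a subscription type that allows it (for example Shared or Failover), since the default Exclusive type accepts only one consumer.

```python
from itertools import cycle


def deliver(messages, instances_by_subscription):
    """Toy dispatcher: each subscription sees every message, and the instances
    attached to one subscription split its messages round-robin."""
    received = {}
    for instances in instances_by_subscription.values():
        for instance in instances:
            received[instance] = []
        turn = cycle(instances)
        for message in messages:
            received[next(turn)].append(message)
    return received


messages = ["m0", "m1", "m2", "m3"]

# Same subscription name: the two instances share the topic data.
shared = deliver(messages, {"order-service": ["instance-1", "instance-2"]})

# Different subscription names: every instance gets the complete topic data.
separate = deliver(messages, {"sub-1": ["instance-1"], "sub-2": ["instance-2"]})
print(shared)
print(separate)
```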

That concludes the second issue of the community FAQ. Thank you to everyone who takes part in the community's daily questions and answers. We look forward to seeing you in the next FAQ!


