Original author Jack Vanlightly, translated Sijia@StreamNative, if you need to reprint, please leave a message in the background.
English link: https://jack-vanlightly.com/blog/2018/10/21/how-to-not-lose-messages-on-an-apache-pulsar-cluster
Apache Pulsar
Apache Pulsar is the top-level project of the Apache Software Foundation. It is the next-generation cloud-native distributed message flow platform. It integrates messaging, storage, and lightweight functional computing. It uses a separate computing and storage architecture design and supports multiple Tenants, persistent storage, and cross-regional data replication in multiple computer rooms have streaming data storage features such as strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/

It takes about 15 minutes to read this article.

Message system

working principle of the Apache Pulsar describes the protocol for storage and Pulsar model, the main emphasis of (. 1) separated Pulsar computing and storage; (2) Pulsar ledger and the topic into fragments, without re-balance, to automatically write bookie Features of importing new data. This article will use the Blockade tool to shut down the node, reduce the network speed, and lose network data packets for testing. In how the message is lost in RabbitMQ cluster and how the message is lost in Kafka cluster article, I used the same automated testing tools, test herein includes not only data loss, and also includes a message ordering duplicate messages.

I have uploaded the code used for the test to GitHub, and you can view it in the Pulsar folder ChaosTesingCode

In each test scenario, we created a new blockade cluster and performed the same configuration:

  • Number of Apache Pulsar brokers
  • Number of Apache BookKeeper nodes (Bookie)
  • Ensemble size (E)
  • Write quorum size (Qw)
  • Ack quorum size (Qa)

The client communicates with the broker through the proxy.

图片

In each test, we send messages and add interference operations, such as closing the Pulsar broker to which the topic belongs, closing a bookie in the current ledger ensemble, and isolating nodes from ZooKeeper.

After the message is sent, start the reader to read and count the number of messages, to ensure that all ack messages are received, and to ensure that the messages are read in the correct order.

If you want to see the test method, please read the full text; if you only want to see the test results, it is recommended to read the conclusion directly.

Test anatomy

In the test, the producer sends messages at the fastest speed. I set the sending message rate to 20,000-30,000 messages per second. Some tests take a long time, so I increased the total number of sent messages without reducing the producer's sending rate.

Disable message deduplication.

Test by calling the pulsar-test.py python script. The test requires the following parameters:

  • Test type (no-fail, kill-broker, kill-bookie, kill-bookies[n], isolate-broker-from-zk, isolate-bookie-from-zk, custom-isolation[partition|partition]). The test type will be explained in the test section.
  • Test name
  • Number of test runs
  • Number of messages sent
  • The location of the message when the interfering operation started. For example, 50000 means that the interference operation starts when the 50,000th ack is received. It usually takes a period of time to enable interference operations, and usually starts at the hundreds of thousands of messages after activation.
  • Ledger configuration (E, Qw, Qa), the format is E-Qw-Qa, such as 2-2-1
  • Node count, the format is brokers-bookies, such as 3-3
  • Whether to enable data deduplication (yes/no)

blockade.yml files are stored in the cluster/blockade-files directory. The number of brokers and bookies in each file is different. Before calling blockade, you need to copy the corresponding files to the cluster directory.

Call pulsar-test.py perform the following operations:

  1. If the blockade cluster is running, use the blockade `destroy` command to terminate the cluster.
  2. Copy the corresponding blockade. yml
  3. blockade up
  4. make an announcement. The message content is an increasing integer, which is used to check the message sequence.
  5. Start interference operation
  6. As soon as the message is released, the data will be read immediately
  7. Analyze and record the results
  8. Repeat "Number of Test Runs"

All output can be viewed in the terminal. You can also view the output in the [test-name]-output.txt automated/test-output You can view duplicate messages and out-of-sequence messages in files [test-name]-duplicates.txt and [test-name]-out-of-order.txt

The format of the running result is as follows:


Final send count: 2000000
Final ack count: 2000000
Final positive ack count: 2000000
Final negative ack count: 0
Messages received: 2000000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0

Tracking message:

  • After the message is sent, the integer corresponding to the message is added as a key to the messages_sent dictionary, and an empty list is added as the value.
  • After the Ack message is successfully received, add the integer value corresponding to the message to the messages_pos_acked set.
  • After the Ack message fails to be received, add the integer value corresponding to the message to the messages_neg_acked set.

When reading a message, messages_sent dictionary and add the message ID to the corresponding list. In this way, lost and duplicate messages can be detected.

  • Loss of acked messages = all the keys in the messages_sent dictionary and the messages_pos_acked set that overlap with the empty list
  • Received unacked messages = all the values in the messages_sent dictionary are empty lists, but not in the messages_pos_acked collection
  • Duplicate message = messages_sent dictionary, the value list contains keys for multiple message IDs
  • Out of order message = The integer value is less than the previous message, and the value in the dictionary is a list of message IDs that are not empty (that is, they have not been consumed)

Since the producer is only allowed to process 10,000 messages at a time, when the received message lags behind the sent message by 10,000, the producer will pause and wait for ack to catch up. In the test, we did see 10,000.

The Pulsar client will automatically retry when the message fails to be sent until the sending timeout. Therefore, when the message is successfully written to BookKeeper, but the Pulsar broker fails before sending the ack to the client, there will be message duplication. Message deduplication is turned off by default in the test.

Default E, Qw, Qa

A very strange phenomenon appeared in the test. After closing the bookie, the reader and consumer could not be used for consumption. Later we discovered that this is related to the default values of E, Qw, and Qa. The cursor is stored in the ledger created by the above default values. If the number of bookies is not enough, the subscription cannot be created.

Set the environment for all bookie in the blockade.yml

environment: {
   "clusterName": "cluster-1",
   "zookeeperServers": "zk1:2181",
   "managedLedgerDefaultEnsembleSize": 4,
   "managedLedgerDefaultWriteQuorum": 3,
   "managedLedgerDefaultAckQuorum": 2
}

To ensure the normal operation of the cluster, the above configuration requires at least 4 bookies. Therefore, when using a small number of bookies to test, it is necessary to ensure that the number of bookies is the same as the minimum number of bookies required in the usage scenario (also consider that the bookie may fail).

In all test scenarios, the default values of E, Qw, and Qa are 2, 2, 2, and the number of bookies is not less than 3. This setting ensures that when the bookie is closed, the cursor data will not be lost, and new readers and consumers can be created.

Test scenario 0-contrast experimental group

In the first test scenario, 2 million messages are sent without interference. Check whether all the messages are read, and the reading order is correct, and the messages are not repeated. For this scenario, we tested it 5 times.

The terminal output of the last test is as follows.

$ python pulsar-test.py no-fail no-fail-test1 5 2000000 50000 2-2-1 3-3 false
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #1 on topic no-fail-test1_5  ------------
... (omitted)
-----------------------------------------------
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #2 on topic no-fail-test1_5  ------------
... (omitted)
-----------------------------------------------
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #3 on topic no-fail-test1_5  ------------
... (omitted)
-----------------------------------------------
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #4 on topic no-fail-test1_5  ------------
... (omitted)
-----------------------------------------------
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #5 on topic no-fail-test1_5  ------------
2018-10-19 16:30:44.795 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.208.9:6650
2018-10-19 16:30:44.799 INFO  ClientConnection:285 | [192.168.208.1:34476 -> 192.168.208.9:6650] Connected to broker
2018-10-19 16:30:45.965 INFO  BatchMessageContainer:41 | { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://vanlightly/cluster-1/ns1/no-fail-test1_5] [producerName_ = ] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 0] [averageBatchSize = 0]} BatchMessageContainer constructed
2018-10-19 16:30:45.965 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, ] Getting connection from pool
2018-10-19 16:30:47.313 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-19 16:30:47.313 INFO  ClientConnection:287 | [192.168.208.1:34482 -> 192.168.208.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-19 16:30:47.894 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, ] Created producer on broker [192.168.208.1:34482 -> 192.168.208.9:6650] 
Send count: 56645 Ack count: 50000 Pos: 50000 Neg: 0
No chaos action to perform
Send count: 102275 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 151578 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 207610 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 255241 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 309300 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 354595 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 407730 Ack count: 400000 Pos: 400000 Neg: 0
Send count: 459060 Ack count: 450000 Pos: 450000 Neg: 0
Send count: 503590 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 557272 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 601064 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 653045 Ack count: 650000 Pos: 650000 Neg: 0
Send count: 701051 Ack count: 700000 Pos: 700000 Neg: 0
Send count: 752087 Ack count: 750000 Pos: 750000 Neg: 0
Send count: 800455 Ack count: 800000 Pos: 800000 Neg: 0
Send count: 851194 Ack count: 850000 Pos: 850000 Neg: 0
Send count: 904089 Ack count: 900000 Pos: 900000 Neg: 0
Send count: 958704 Ack count: 950000 Pos: 950000 Neg: 0
Send count: 1007863 Ack count: 1000000 Pos: 1000000 Neg: 0
Send count: 1053645 Ack count: 1050000 Pos: 1050000 Neg: 0
Send count: 1102725 Ack count: 1100000 Pos: 1100000 Neg: 0
Send count: 1150921 Ack count: 1150000 Pos: 1150000 Neg: 0
Send count: 1201317 Ack count: 1200000 Pos: 1200000 Neg: 0
Send count: 1254143 Ack count: 1250000 Pos: 1250000 Neg: 0
Send count: 1303459 Ack count: 1300000 Pos: 1300000 Neg: 0
Send count: 1358449 Ack count: 1350000 Pos: 1350000 Neg: 0
Send count: 1408474 Ack count: 1400000 Pos: 1400000 Neg: 0
Send count: 1456386 Ack count: 1450000 Pos: 1450000 Neg: 0
Send count: 1505650 Ack count: 1500000 Pos: 1500000 Neg: 0
Send count: 1550343 Ack count: 1550000 Pos: 1550000 Neg: 0
Send count: 1601060 Ack count: 1600000 Pos: 1600000 Neg: 0
Send count: 1652008 Ack count: 1650000 Pos: 1650000 Neg: 0
Send count: 1701559 Ack count: 1700000 Pos: 1700000 Neg: 0
Send count: 1751662 Ack count: 1750000 Pos: 1750000 Neg: 0
Send count: 1801550 Ack count: 1800000 Pos: 1800000 Neg: 0
Send count: 1852799 Ack count: 1850000 Pos: 1850000 Neg: 0
Send count: 1901281 Ack count: 1900000 Pos: 1900000 Neg: 0
Send count: 1951079 Ack count: 1950000 Pos: 1950000 Neg: 0
Send count: 2000000 Ack count: 2000000 Pos: 2000000 Neg: 0
2018-10-19 16:31:51.177 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, cluster-1-2-0] Closed producer
2018-10-19 16:31:51.346 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.208.9:6650
2018-10-19 16:31:51.348 INFO  ClientConnection:285 | [192.168.208.1:34574 -> 192.168.208.9:6650] Connected to broker
2018-10-19 16:31:51.357 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, reader-5a64ed, 0] Getting connection from pool
2018-10-19 16:31:51.645 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-19 16:31:51.645 INFO  ClientConnection:287 | [192.168.208.1:34578 -> 192.168.208.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-19 16:31:51.716 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, reader-5a64ed, 0] Created consumer on broker [192.168.208.1:34578 -> 192.168.208.9:6650] 
Last confirmed entry: [0, 2557]
Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 58]
Received: 100000 Curr Entry: [0, 115]
Received: 150000 Curr Entry: [0, 177]
Received: 200000 Curr Entry: [0, 236]
Received: 250000 Curr Entry: [0, 300]
Received: 300000 Curr Entry: [0, 386]
Received: 350000 Curr Entry: [0, 446]
Received: 400000 Curr Entry: [0, 506]
Received: 450000 Curr Entry: [0, 569]
Received: 500000 Curr Entry: [0, 629]
Received: 550000 Curr Entry: [0, 695]
Received: 600000 Curr Entry: [0, 754]
Received: 650000 Curr Entry: [0, 816]
Received: 700000 Curr Entry: [0, 892]
Received: 750000 Curr Entry: [0, 955]
Received: 800000 Curr Entry: [0, 1021]
Received: 850000 Curr Entry: [0, 1086]
Received: 900000 Curr Entry: [0, 1152]
Received: 950000 Curr Entry: [0, 1211]
Received: 1000000 Curr Entry: [0, 1268]
Received: 1050000 Curr Entry: [0, 1339]
Received: 1100000 Curr Entry: [0, 1410]
Received: 1150000 Curr Entry: [0, 1472]
Received: 1200000 Curr Entry: [0, 1534]
Received: 1250000 Curr Entry: [0, 1595]
Received: 1300000 Curr Entry: [0, 1654]
Received: 1350000 Curr Entry: [0, 1713]
Received: 1400000 Curr Entry: [0, 1779]
Received: 1450000 Curr Entry: [0, 1836]
Received: 1500000 Curr Entry: [0, 1896]
Received: 1550000 Curr Entry: [0, 1971]
Received: 1600000 Curr Entry: [0, 2043]
Received: 1650000 Curr Entry: [0, 2101]
Received: 1700000 Curr Entry: [0, 2164]
Received: 1750000 Curr Entry: [0, 2226]
Received: 1800000 Curr Entry: [0, 2297]
Received: 1850000 Curr Entry: [0, 2367]
Received: 1900000 Curr Entry: [0, 2433]
Received: 1950000 Curr Entry: [0, 2496]
Received: 2000000 Curr Entry: [0, 2557]
Read phase complete with message (0,2557,378,-1)
2018-10-19 16:32:35.080 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, reader-5a64ed, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 2000000
Final ack count: 2000000
Final positive ack count: 2000000
Final negative ack count: 0
Messages received: 2000000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0

You can view the summary results in the test-output/no-fail-test1-output.txt file.

16:11:45 Start test
16:13:54: 
16:13:54: Test Run #1 on topic no-fail-test1_1  ------------
16:15:41: Results --------------------------------------------
16:15:41: Final send count: 2000000
16:15:41: Final ack count: 2000000
16:15:41: Final positive ack count: 2000000
16:15:41: Final negative ack count: 0
16:15:41: Messages received: 2000000
16:15:41: Acked messages missing: 0
16:15:41: Non-acked messages received: 0
16:15:41: Out-of-order: 0
16:15:41: Duplicates: 0
16:15:41: ----------------------------------------------------
16:17:57: 
16:17:57: Test Run #2 on topic no-fail-test1_2  ------------
16:19:44: Results --------------------------------------------
16:19:44: Final send count: 2000000
16:19:44: Final ack count: 2000000
16:19:44: Final positive ack count: 2000000
16:19:44: Final negative ack count: 0
16:19:44: Messages received: 2000000
16:19:44: Acked messages missing: 0
16:19:44: Non-acked messages received: 0
16:19:44: Out-of-order: 0
16:19:44: Duplicates: 0
16:19:44: ----------------------------------------------------
16:22:01: 
16:22:01: Test Run #3 on topic no-fail-test1_3  ------------
16:23:52: Results --------------------------------------------
16:23:52: Final send count: 2000000
16:23:52: Final ack count: 2000000
16:23:52: Final positive ack count: 2000000
16:23:52: Final negative ack count: 0
16:23:52: Messages received: 2000000
16:23:52: Acked messages missing: 0
16:23:52: Non-acked messages received: 0
16:23:52: Out-of-order: 0
16:23:52: Duplicates: 0
16:23:52: ----------------------------------------------------
16:26:20: 
16:26:20: Test Run #4 on topic no-fail-test1_4  ------------
16:28:18: Results --------------------------------------------
16:28:18: Final send count: 2000000
16:28:18: Final ack count: 2000000
16:28:18: Final positive ack count: 2000000
16:28:18: Final negative ack count: 0
16:28:18: Messages received: 2000000
16:28:18: Acked messages missing: 0
16:28:18: Non-acked messages received: 0
16:28:18: Out-of-order: 0
16:28:18: Duplicates: 0
16:28:18: ----------------------------------------------------
16:30:43: 
16:30:43: Test Run #5 on topic no-fail-test1_5  ------------
16:32:36: Results --------------------------------------------
16:32:36: Final send count: 2000000
16:32:36: Final ack count: 2000000
16:32:36: Final positive ack count: 2000000
16:32:36: Final negative ack count: 0
16:32:36: Messages received: 2000000
16:32:36: Acked messages missing: 0
16:32:36: Non-acked messages received: 0
16:32:36: Out-of-order: 0
16:32:36: Duplicates: 0
16:32:36: ----------------------------------------------------

It can be seen from the terminal output that the speed is not very fast. The five tests took approximately 20 minutes. The main problem is that it is slow to create a blockade cluster. Using RabbitMQ and Kafka, the running speed is very fast; but the Pulsar cluster takes up a lot of CPU and starts slowly. The reason may be related to the fact that we are running 8 nodes (1 ZK, 1 Proxy, 3 Broker, 3 Bookie), and the number of nodes reaches 11 at most.

There were no problems with missing, repeated, or out-of-order messages in the test.

Test scenario 1-close topic broker owner

Parameter configuration 2-2-1 3-3.

In the test, pulsar-test.py uses a shell script to detect the broker owner and close it during message transmission.

The parameter configuration is E=2, Qw=2, Qa=1, 3 brokers, 3 bookie, this configuration has the least redundancy. There are two copies of each message, but after the broker receives an ack from any bookie, it will send the ack to the client.

Closing the broker during message transmission will not cause data loss. Only after Qa bookie persists the message to disk, the message is acked, and the ack is sent by the broker. If the broker is closed, the worst case is that the reader reads messages that have not been acked. Bookie may have persisted the message, but the broker failed before sending the ack to the client. After the new broker is enabled, these messages will be detected during recovery, and a copy of the messages will be generated before closing the ledger to make these messages available to the reader.

The following is the complete output of a test. In each test, 1 million messages are sent, and the owner is closed when the 50,000th message is sent. We conducted this test 5 times.

$ python pulsar-test.py kill-broker kill-broker-test1 5 1000000 50000 2-2-1 3-3 false
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #1 on topic kill-broker-test1_1  ------------
-------------------------------------------------
WRITE PHASE
-------------------------------------------------
2018-10-19 20:31:48.389 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar2:6650
2018-10-19 20:31:48.389 INFO  ClientConnection:287 | [192.168.96.1:52202 -> 192.168.96.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar2:6650
Send count: 51706 Ack count: 50000 Pos: 50000 Neg: 0
Send count: 101199 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 155831 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 208015 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 250923 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 303241 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 350970 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 400041 Ack count: 400000 Pos: 400000 Neg: 0
Send count: 450612 Ack count: 450000 Pos: 450000 Neg: 0
-------------------------------------------------
pulsar2 is the topic owner, killing pulsar2!!!!!!
-------------------------------------------------
Send count: 501553 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 551276 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 602319 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 651554 Ack count: 650000 Pos: 650000 Neg: 0
2018-10-19 20:32:07.710 INFO  ClientConnection:1237 | [192.168.96.1:52202 -> 192.168.96.9:6650] Connection closed
2018-10-19 20:32:07.710 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test1_1, cluster-1-1-0] Schedule reconnection in 0.1 s
2018-10-19 20:32:07.710 INFO  ClientConnection:195 | [192.168.96.1:52202 -> 192.168.96.9:6650] Destroyed connection
2018-10-19 20:32:07.810 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test1_1, cluster-1-1-0] Getting connection from pool
2018-10-19 20:32:07.815 ERROR ClientConnection:792 | [192.168.96.1:52196 -> 192.168.96.9:6650] Failed lookup req_id: 3 error: 6
2018-10-19 20:32:07.815 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test1_1, cluster-1-1-0] Schedule reconnection in 0.192 s
2018-10-19 20:32:08.009 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test1_1, cluster-1-1-0] Getting connection from pool
2018-10-19 20:32:08.500 INFO  ConnectionPool:53 | Deleting stale connection from pool for pulsar://pulsar2:6650 use_count: -1 @ 0
2018-10-19 20:32:08.500 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar2:6650
2018-10-19 20:32:08.501 INFO  ClientConnection:287 | [192.168.96.1:52276 -> 192.168.96.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar2:6650
-------------------------------------------------
pulsar2 KILLED!
-------------------------------------------------
Send count: 702977 Ack count: 700000 Pos: 692977 Neg: 7023
2018-10-19 20:32:38.508 INFO  ClientConnection:1237 | [192.168.96.1:52276 -> 192.168.96.9:6650] Connection closed
2018-10-19 20:32:38.509 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, cluster-1-1-0] Schedule reconnection in 0.1 s
2018-10-19 20:32:38.509 INFO  ClientConnection:195 | [192.168.96.1:52276 -> 192.168.96.9:6650] Destroyed connection
2018-10-19 20:32:38.610 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, cluster-1-1-0] Getting connection from pool
2018-10-19 20:32:38.647 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-19 20:32:38.647 INFO  ClientConnection:287 | [192.168.96.1:52306 -> 192.168.96.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-19 20:32:39.592 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, cluster-1-1-0] Created producer on broker [192.168.96.1:52306 -> 192.168.96.9:6650] 
Send count: 750954 Ack count: 750000 Pos: 740000 Neg: 10000
Send count: 805568 Ack count: 800000 Pos: 790000 Neg: 10000
Send count: 853879 Ack count: 850000 Pos: 840000 Neg: 10000
Send count: 904020 Ack count: 900000 Pos: 890000 Neg: 10000
Send count: 953552 Ack count: 950000 Pos: 940000 Neg: 10000
Send count: 1000000 Ack count: 1000000 Pos: 990000 Neg: 10000
2018-10-19 20:32:47.320 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, cluster-1-1-0] Closed producer
2018-10-19 20:32:47.466 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.96.9:6650
2018-10-19 20:32:47.468 INFO  ClientConnection:285 | [192.168.96.1:52406 -> 192.168.96.9:6650] Connected to broker
2018-10-19 20:32:47.477 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, reader-63a5b2, 0] Getting connection from pool
2018-10-19 20:32:47.496 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-19 20:32:47.496 INFO  ClientConnection:287 | [192.168.96.1:52410 -> 192.168.96.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-19 20:32:47.562 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, reader-63a5b2, 0] Created consumer on broker [192.168.96.1:52410 -> 192.168.96.9:6650]
-------------------------------------------------
READ PHASE
-------------------------------------------------
Last confirmed entry: [1, 387]
Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 72]
Received: 100000 Curr Entry: [0, 148]
Received: 150000 Curr Entry: [0, 207]
Received: 200000 Curr Entry: [0, 272]
Received: 250000 Curr Entry: [0, 335]
Received: 300000 Curr Entry: [0, 400]
Received: 350000 Curr Entry: [0, 466]
Received: 400000 Curr Entry: [0, 534]
Received: 450000 Curr Entry: [0, 611]
Received: 500000 Curr Entry: [0, 688]
Received: 550000 Curr Entry: [0, 747]
Received: 600000 Curr Entry: [0, 815]
Received: 650000 Curr Entry: [0, 913]
Received: 700000 Curr Entry: [1, 7]
Received: 750000 Curr Entry: [1, 83]
Received: 800000 Curr Entry: [1, 157]
Received: 850000 Curr Entry: [1, 227]
Received: 900000 Curr Entry: [1, 287]
Received: 950000 Curr Entry: [1, 343]
Read phase complete with message (1,387,469,-1)
2018-10-19 20:33:11.943 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, reader-63a5b2, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 1000000
Final ack count: 1000000
Final positive ack count: 990000
Final negative ack count: 10000
Messages received: 990000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0

It can be seen that in the message transmission process, after identifying pulsar2 as the owner, it will be closed soon. The client has a connection failure problem and reconnects to the new owner pulsar1 through the proxy. After successful reconnection, continue to send messages.

Due to the limitation of the producer, usually only 10,000 messages are transmitted, and the reception of all these 10,000 messages fails (timeout). In this way, there are 990,000 successfully received messages, no duplicate messages, and all are read in an orderly manner.

We conducted a total of 5 tests, and the test results are as follows:

19:59:08: 
19:59:08: Test Run #1 on topic kill-broker-test_1  ------------
20:00:41: Results --------------------------------------------
20:00:41: Final send count: 1000000
20:00:41: Final ack count: 1000000
20:00:41: Final positive ack count: 990000
20:00:41: Final negative ack count: 10000
20:00:41: Messages received: 990000
20:00:41: Acked messages missing: 0
20:00:41: Non-acked messages received: 0
20:00:41: Out-of-order: 0
20:00:41: Duplicates: 0
20:00:41: ----------------------------------------------------
20:02:54: 
20:02:54: Test Run #2 on topic kill-broker-test_2  ------------
20:03:54: Results --------------------------------------------
20:03:54: Final send count: 1000000
20:03:54: Final ack count: 1000000
20:03:54: Final positive ack count: 1000000
20:03:54: Final negative ack count: 0
20:03:54: Messages received: 1000000
20:03:54: Acked messages missing: 0
20:03:54: Non-acked messages received: 0
20:03:54: Out-of-order: 0
20:03:54: Duplicates: 0
20:03:54: ----------------------------------------------------
20:06:06: 
20:06:06: Test Run #3 on topic kill-broker-test_3  ------------
20:07:31: Results --------------------------------------------
20:07:31: Final send count: 1000000
20:07:31: Final ack count: 1000000
20:07:31: Final positive ack count: 990000
20:07:31: Final negative ack count: 10000
20:07:31: Messages received: 991065
20:07:31: Acked messages missing: 0
20:07:31: Non-acked messages received: 1065
20:07:31: Out-of-order: 0
20:07:31: Duplicates: 0
20:07:31: ----------------------------------------------------
20:09:42: 
20:09:42: Test Run #4 on topic kill-broker-test_4  ------------
20:11:18: Results --------------------------------------------
20:11:18: Final send count: 1000000
20:11:18: Final ack count: 1000000
20:11:18: Final positive ack count: 990000
20:11:18: Final negative ack count: 10000
20:11:18: Messages received: 990000
20:11:18: Acked messages missing: 0
20:11:18: Non-acked messages received: 0
20:11:18: Out-of-order: 0
20:11:18: Duplicates: 0
20:11:18: ----------------------------------------------------
20:13:36: 
20:13:36: Test Run #5 on topic kill-broker-test_5  ------------
20:15:09: Results --------------------------------------------
20:15:09: Final send count: 1000000
20:15:09: Final ack count: 1000000
20:15:09: Final positive ack count: 990000
20:15:09: Final negative ack count: 10000
20:15:09: Messages received: 990000
20:15:09: Acked messages missing: 0
20:15:09: Non-acked messages received: 0
20:15:09: Out-of-order: 0
20:15:09: Duplicates: 0
20:15:09: ----------------------------------------------------

It can be seen that no messages are lost. In the third run, 1065 unacknowledged messages were read. If retrying to send a message causes a message to be repeated, this situation is normal. This problem can be solved by enabling deduplication, and we will test it in a later scenario.

As expected, a broker failure will not result in message loss.

Test scenario 2-Close Bookie in the current Ledger Ensemble

Parameter configuration 2-2-1 3-3.

We use the least redundant configuration to test the flexibility of Pulsar, E=2, Qw=2, Qa=1, 3 brokers, and 3 bookies. During the sending of 1 million messages, identify a bookie in the current ledger ensemble and close it. Then, check whether the sequence of the 1 million messages read by the reader is correct.

Here I made a special assumption, and regard the current ledger as the first ledger. After conducting numerous tests, I found that at least the first 1 million messages would fall into the first ledger's entry. Therefore, I decided to focus only on the ensemble in the first ledger, whose path in ZooKeeper is known.

If Qw = 2, only close one bookie, there should be no data loss. If the bookie of the ack message is closed, there is still an entry copy before AutoRecovery copies the data, and the case of data loss is that both copies are lost at the same time.

The complete output of a single run is as follows.

$ python pulsar-test.py kill-bookie kill-bookie-test 5 1000000 50000 2-2-1 3-3 false
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #1 on topic kill-bookie-test_1  ------------
-------------------------------------------------
WRITE PHASE
-------------------------------------------------
2018-10-19 23:46:10.939 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.224.9:6650
2018-10-19 23:46:12.877 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-19 23:46:12.877 INFO  ClientConnection:287 | [192.168.224.1:50588 -> 192.168.224.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-19 23:46:13.397 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, ] Created producer on broker [192.168.224.1:50588 -> 192.168.224.9:6650] 
Send count: 52575 Ack count: 50000 Pos: 50000 Neg: 0
Send count: 100560 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 156455 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 203545 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 255199 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 305732 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 357709 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 407634 Ack count: 400000 Pos: 400000 Neg: 0
-------------------------------------------------
bookie1 is in the current ledger ensemble, killing bookie1!!!!!!
-------------------------------------------------
Send count: 451809 Ack count: 450000 Pos: 450000 Neg: 0
Send count: 501234 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 551348 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 601729 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 650566 Ack count: 650000 Pos: 650000 Neg: 0
-------------------------------------------------
bookie1 KILLED!
-------------------------------------------------
Send count: 701961 Ack count: 700000 Pos: 700000 Neg: 0
Send count: 751294 Ack count: 750000 Pos: 750000 Neg: 0
Send count: 802939 Ack count: 800000 Pos: 800000 Neg: 0
Send count: 850390 Ack count: 850000 Pos: 850000 Neg: 0
Send count: 903739 Ack count: 900000 Pos: 900000 Neg: 0
Send count: 953286 Ack count: 950000 Pos: 950000 Neg: 0
Send count: 1000000 Ack count: 1000000 Pos: 1000000 Neg: 0
2018-10-19 23:46:47.281 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, cluster-1-2-0] Closed producer
-------------------------------------------------
READ PHASE
-------------------------------------------------
2018-10-19 23:46:57.499 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.224.9:6650
2018-10-19 23:46:57.500 INFO  ClientConnection:285 | [192.168.224.1:50756 -> 192.168.224.9:6650] Connected to broker
2018-10-19 23:46:57.508 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, reader-4d4064, 0] Getting connection from pool
2018-10-19 23:46:57.566 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar2:6650
2018-10-19 23:46:57.566 INFO  ClientConnection:287 | [192.168.224.1:50760 -> 192.168.224.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar2:6650
2018-10-19 23:46:57.750 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, reader-4d4064, 0] Created consumer on broker [192.168.224.1:50760 -> 192.168.224.9:6650] 
LCE. broker pulsar2 lac_line "1:475",
Last confirmed entry: [1, 475]
Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 70]
Received: 100000 Curr Entry: [0, 165]
Received: 150000 Curr Entry: [0, 270]
Received: 200000 Curr Entry: [0, 339]
Received: 250000 Curr Entry: [0, 403]
Received: 300000 Curr Entry: [0, 462]
Received: 350000 Curr Entry: [0, 519]
Received: 400000 Curr Entry: [0, 579]
Received: 450000 Curr Entry: [0, 658]
Received: 500000 Curr Entry: [0, 717]
Received: 550000 Curr Entry: [0, 782]
Received: 600000 Curr Entry: [0, 848]
Received: 650000 Curr Entry: [0, 942]
Received: 700000 Curr Entry: [1, 57]
Received: 750000 Curr Entry: [1, 119]
Received: 800000 Curr Entry: [1, 195]
Received: 850000 Curr Entry: [1, 267]
Received: 900000 Curr Entry: [1, 357]
Received: 950000 Curr Entry: [1, 418]
Received: 1000000 Curr Entry: [1, 475]
Read phase complete with message (1,475,324,-1)
2018-10-19 23:47:51.236 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, reader-4d4064, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 1000000
Final ack count: 1000000
Final positive ack count: 1000000
Final negative ack count: 0
Messages received: 1000000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0
----------------------------------------------------

One bookie was closed during the message transmission, but the message transmission was not affected. The reader successfully read 1 million messages in sequence, and there was no message duplication.

The results of the 5 tests are as follows.

23:52:20 Start test
23:54:49: 
23:54:49: Test Run #1 on topic kill-bookie-test_1  ------------
23:56:38: Results --------------------------------------------
23:56:38: Final send count: 1000000
23:56:38: Final ack count: 1000000
23:56:38: Final positive ack count: 1000000
23:56:38: Final negative ack count: 0
23:56:38: Messages received: 1000000
23:56:38: Acked messages missing: 0
23:56:38: Non-acked messages received: 0
23:56:38: Out-of-order: 0
23:56:38: Duplicates: 0
23:56:38: ----------------------------------------------------
23:58:54: 
23:58:54: Test Run #2 on topic kill-bookie-test_2  ------------
00:00:50: Results --------------------------------------------
00:00:50: Final send count: 1000000
00:00:50: Final ack count: 1000000
00:00:50: Final positive ack count: 1000000
00:00:50: Final negative ack count: 0
00:00:50: Messages received: 1000000
00:00:50: Acked messages missing: 0
00:00:50: Non-acked messages received: 0
00:00:50: Out-of-order: 0
00:00:50: Duplicates: 0
00:00:50: ----------------------------------------------------
00:03:12: 
00:03:12: Test Run #3 on topic kill-bookie-test_3  ------------
00:05:01: Results --------------------------------------------
00:05:01: Final send count: 1000000
00:05:01: Final ack count: 1000000
00:05:01: Final positive ack count: 1000000
00:05:01: Final negative ack count: 0
00:05:01: Messages received: 1000000
00:05:01: Acked messages missing: 0
00:05:01: Non-acked messages received: 0
00:05:01: Out-of-order: 0
00:05:01: Duplicates: 0
00:05:01: ----------------------------------------------------
00:07:15: 
00:07:15: Test Run #4 on topic kill-bookie-test_4  ------------
00:09:04: Results --------------------------------------------
00:09:04: Final send count: 1000000
00:09:04: Final ack count: 1000000
00:09:04: Final positive ack count: 1000000
00:09:04: Final negative ack count: 0
00:09:04: Messages received: 1000000
00:09:04: Acked messages missing: 0
00:09:04: Non-acked messages received: 0
00:09:04: Out-of-order: 0
00:09:04: Duplicates: 0
00:09:04: ----------------------------------------------------
00:11:23: 
00:11:23: Test Run #5 on topic kill-bookie-test_5  ------------
00:13:23: Results --------------------------------------------
00:13:23: Final send count: 1000000
00:13:23: Final ack count: 1000000
00:13:23: Final positive ack count: 1000000
00:13:23: Final negative ack count: 0
00:13:23: Messages received: 1000000
00:13:23: Acked messages missing: 0
00:13:23: Non-acked messages received: 0
00:13:23: Out-of-order: 0
00:13:23: Duplicates: 0
00:13:23: ----------------------------------------------------

There is no message loss, duplication, or out-of-sequence.

Test scenario 3-Isolate Topic Owner Broker and ZooKeeper

Parameter configuration 2-2-1 3-3.

In this test, we only isolate topic owner broker and ZooKeeper. We predict that the broker owner will continue to receive writes before quarantining ZooKeeper; after the quarantine takes effect, the broker owner will automatically restart, and another broker will take over the role of broker owner.

The test result is expected to be similar to the result of closing bookie.

The complete output of a single run is as follows.

$ python pulsar-test.py isolate-broker-from-zk iso-broker 1 2000000 50000 2-2-1 3-3 false
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #1 on topic iso-broker_1  ------------
-------------------------------------------------
WRITE PHASE
-------------------------------------------------
2018-10-20 08:22:43.961 INFO  ConnectionPool:63 | Created connection for pulsar://172.22.0.9:6650
2018-10-20 08:22:43.963 INFO  ClientConnection:285 | [172.22.0.1:46968 -> 172.22.0.9:6650] Connected to broker
2018-10-20 08:22:44.664 INFO  BatchMessageContainer:41 | { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://vanlightly/cluster-1/ns1/iso-broker_1] [producerName_ = ] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 0] [averageBatchSize = 0]} BatchMessageContainer constructed
2018-10-20 08:22:44.664 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, ] Getting connection from pool
2018-10-20 08:22:45.254 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-20 08:22:45.254 INFO  ClientConnection:287 | [172.22.0.1:46972 -> 172.22.0.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-20 08:22:46.329 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, ] Created producer on broker [172.22.0.1:46972 -> 172.22.0.9:6650] 
Send count: 57639 Ack count: 50000 Pos: 50000 Neg: 0
pulsar1 is the topic owner, isolating pulsar1 from zookeepr!!!!!!
Send count: 100918 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 151902 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 200658 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 251287 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 306076 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 353796 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 403469 Ack count: 400000 Pos: 400000 Neg: 0
Send count: 455195 Ack count: 450000 Pos: 450000 Neg: 0
-------------------------------------------------
pulsar1 ISOLATED!
-------------------------------------------------
Send count: 506817 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 557051 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 601668 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 652490 Ack count: 650000 Pos: 650000 Neg: 0
Send count: 709134 Ack count: 700000 Pos: 700000 Neg: 0
Send count: 757860 Ack count: 750000 Pos: 750000 Neg: 0
Send count: 805959 Ack count: 800000 Pos: 800000 Neg: 0
Send count: 858283 Ack count: 850000 Pos: 850000 Neg: 0
Send count: 907911 Ack count: 900000 Pos: 900000 Neg: 0
Send count: 951597 Ack count: 950000 Pos: 950000 Neg: 0
Send count: 1002732 Ack count: 1000000 Pos: 1000000 Neg: 0
Send count: 1050689 Ack count: 1050000 Pos: 1050000 Neg: 0
Send count: 1104743 Ack count: 1100000 Pos: 1100000 Neg: 0
Send count: 1157332 Ack count: 1150000 Pos: 1150000 Neg: 0
Send count: 1202361 Ack count: 1200000 Pos: 1200000 Neg: 0
Send count: 1256378 Ack count: 1250000 Pos: 1250000 Neg: 0
Send count: 1301586 Ack count: 1300000 Pos: 1300000 Neg: 0
Send count: 1352458 Ack count: 1350000 Pos: 1350000 Neg: 0
Send count: 1407455 Ack count: 1400000 Pos: 1400000 Neg: 0
2018-10-20 08:23:40.624 INFO  ClientConnection:1237 | [172.22.0.1:46972 -> 172.22.0.9:6650] Connection closed
2018-10-20 08:23:40.624 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Schedule reconnection in 0.1 s
2018-10-20 08:23:40.624 INFO  ClientConnection:195 | [172.22.0.1:46972 -> 172.22.0.9:6650] Destroyed connection
2018-10-20 08:23:40.724 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Getting connection from pool
2018-10-20 08:23:41.103 INFO  ConnectionPool:53 | Deleting stale connection from pool for pulsar://pulsar1:6650 use_count: -1 @ 0
2018-10-20 08:23:41.103 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-20 08:23:41.104 INFO  ClientConnection:287 | [172.22.0.1:47050 -> 172.22.0.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-20 08:23:42.139 INFO  ClientConnection:1237 | [172.22.0.1:47050 -> 172.22.0.9:6650] Connection closed
2018-10-20 08:23:42.139 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Schedule reconnection in 0.186 s
2018-10-20 08:23:42.139 INFO  ClientConnection:195 | [172.22.0.1:47050 -> 172.22.0.9:6650] Destroyed connection
2018-10-20 08:23:42.325 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Getting connection from pool
2018-10-20 08:23:42.750 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-20 08:23:42.751 INFO  ClientConnection:287 | [172.22.0.1:47054 -> 172.22.0.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-20 08:23:43.632 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Created producer on broker [172.22.0.1:47054 -> 172.22.0.9:6650] 
Send count: 1459178 Ack count: 1450000 Pos: 1450000 Neg: 0
Send count: 1505703 Ack count: 1500000 Pos: 1500000 Neg: 0
Send count: 1551367 Ack count: 1550000 Pos: 1550000 Neg: 0
Send count: 1601042 Ack count: 1600000 Pos: 1600000 Neg: 0
Send count: 1651310 Ack count: 1650000 Pos: 1650000 Neg: 0
Send count: 1707836 Ack count: 1700000 Pos: 1700000 Neg: 0
Send count: 1751159 Ack count: 1750000 Pos: 1750000 Neg: 0
Send count: 1802089 Ack count: 1800000 Pos: 1800000 Neg: 0
Send count: 1851082 Ack count: 1850000 Pos: 1850000 Neg: 0
Send count: 1906930 Ack count: 1900000 Pos: 1900000 Neg: 0
Send count: 1951920 Ack count: 1950000 Pos: 1950000 Neg: 0
Send count: 2000000 Ack count: 2000000 Pos: 2000000 Neg: 0
2018-10-20 08:23:57.505 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Closed producer
-------------------------------------------------
READ PHASE
-------------------------------------------------
2018-10-20 08:24:07.620 INFO  ConnectionPool:63 | Created connection for pulsar://172.22.0.9:6650
2018-10-20 08:24:07.621 INFO  ClientConnection:285 | [172.22.0.1:47156 -> 172.22.0.9:6650] Connected to broker
2018-10-20 08:24:07.635 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, reader-c1ba0b, 0] Getting connection from pool
2018-10-20 08:24:07.665 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-20 08:24:07.665 INFO  ClientConnection:287 | [172.22.0.1:47160 -> 172.22.0.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-20 08:24:07.749 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, reader-c1ba0b, 0] Created consumer on broker [172.22.0.1:47160 -> 172.22.0.9:6650] 
Last confirmed entry: [1, 658]
Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 61]
Received: 100000 Curr Entry: [0, 142]
Received: 150000 Curr Entry: [0, 212]
Received: 200000 Curr Entry: [0, 288]
Received: 250000 Curr Entry: [0, 349]
Received: 300000 Curr Entry: [0, 420]
Received: 350000 Curr Entry: [0, 487]
Received: 400000 Curr Entry: [0, 554]
Received: 450000 Curr Entry: [0, 612]
Received: 500000 Curr Entry: [0, 675]
Received: 550000 Curr Entry: [0, 731]
Received: 600000 Curr Entry: [0, 790]
Received: 650000 Curr Entry: [0, 847]
Received: 700000 Curr Entry: [0, 904]
Received: 750000 Curr Entry: [0, 961]
Received: 800000 Curr Entry: [0, 1022]
Received: 850000 Curr Entry: [0, 1080]
Received: 900000 Curr Entry: [0, 1136]
Received: 950000 Curr Entry: [0, 1194]
Received: 1000000 Curr Entry: [0, 1252]
Received: 1050000 Curr Entry: [0, 1310]
Received: 1100000 Curr Entry: [0, 1370]
Received: 1150000 Curr Entry: [0, 1428]
Received: 1200000 Curr Entry: [0, 1487]
Received: 1250000 Curr Entry: [0, 1545]
Received: 1300000 Curr Entry: [0, 1602]
Received: 1350000 Curr Entry: [0, 1660]
Received: 1400000 Curr Entry: [0, 1721]
Received: 1450000 Curr Entry: [1, 1]
Received: 1500000 Curr Entry: [1, 60]
Received: 1550000 Curr Entry: [1, 124]
Received: 1600000 Curr Entry: [1, 186]
Received: 1650000 Curr Entry: [1, 247]
Received: 1700000 Curr Entry: [1, 303]
Received: 1750000 Curr Entry: [1, 361]
Received: 1800000 Curr Entry: [1, 420]
Received: 1850000 Curr Entry: [1, 478]
Received: 1900000 Curr Entry: [1, 539]
Received: 1950000 Curr Entry: [1, 598]
Received: 2000000 Curr Entry: [1, 658]
Read phase complete with message (1,658,138,-1)
2018-10-20 08:24:44.361 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, reader-c1ba0b, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 2000000
Final ack count: 2000000
Final positive ack count: 2000000
Final negative ack count: 0
Messages received: 2000000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0
----------------------------------------------------

In this test, we sent 2 million messages, because it will take some time to isolate the owner broker and see the effect. About 1.4 million messages went down and quickly reconnected. Reader reads all messages in an orderly manner, and no messages are repeated.

The results of the 5 tests are as follows.

08:33:38 Start test
08:35:55: 
08:35:55: Test Run #1 on topic iso-broker_1  ------------
08:38:10: Results --------------------------------------------
08:38:10: Final send count: 2000000
08:38:10: Final ack count: 2000000
08:38:10: Final positive ack count: 2000000
08:38:10: Final negative ack count: 0
08:38:10: Messages received: 2000000
08:38:10: Acked messages missing: 0
08:38:10: Non-acked messages received: 0
08:38:10: Out-of-order: 0
08:38:10: Duplicates: 0
08:38:10: ----------------------------------------------------
08:40:29: 
08:40:29: Test Run #2 on topic iso-broker_2  ------------
08:42:36: Results --------------------------------------------
08:42:36: Final send count: 2000000
08:42:36: Final ack count: 2000000
08:42:36: Final positive ack count: 2000000
08:42:36: Final negative ack count: 0
08:42:36: Messages received: 2000000
08:42:36: Acked messages missing: 0
08:42:36: Non-acked messages received: 0
08:42:36: Out-of-order: 0
08:42:36: Duplicates: 0
08:42:36: ----------------------------------------------------
08:44:55: 
08:44:55: Test Run #3 on topic iso-broker_3  ------------
08:47:08: Results --------------------------------------------
08:47:08: Final send count: 2000000
08:47:08: Final ack count: 2000000
08:47:08: Final positive ack count: 2000000
08:47:08: Final negative ack count: 0
08:47:08: Messages received: 2001000
08:47:08: Acked messages missing: 0
08:47:08: Non-acked messages received: 0
08:47:08: Out-of-order: 0
08:47:08: Duplicates: 1000
08:47:08: ----------------------------------------------------
08:49:33: 
08:49:33: Test Run #4 on topic iso-broker_4  ------------
08:51:52: Results --------------------------------------------
08:51:52: Final send count: 2000000
08:51:52: Final ack count: 2000000
08:51:52: Final positive ack count: 2000000
08:51:52: Final negative ack count: 0
08:51:52: Messages received: 2001090
08:51:52: Acked messages missing: 0
08:51:52: Non-acked messages received: 0
08:51:52: Out-of-order: 0
08:51:52: Duplicates: 1090
08:51:52: ----------------------------------------------------
08:54:18: 
08:54:18: Test Run #5 on topic iso-broker_5  ------------
08:56:41: Results --------------------------------------------
08:56:41: Final send count: 2000000
08:56:41: Final ack count: 2000000
08:56:41: Final positive ack count: 2000000
08:56:41: Final negative ack count: 0
08:56:41: Messages received: 2000000
08:56:41: Acked messages missing: 0
08:56:41: Non-acked messages received: 0
08:56:41: Out-of-order: 0
08:56:41: Duplicates: 0
08:56:41: ----------------------------------------------------

There is no message loss or out-of-sequence problem, but message duplication occurs. In the two tests, there were 1000 and 1090 duplicate messages respectively.

When the read integer value is less than or equal to the read message (that is, this is a duplicate message), the reader will record the message ID, the integer value of the previous consumption message, and the current message in [test-name]-duplicates.txt Integer value.

test-output]$ cat iso-broker_duplicates.txt
run|last_msg_id|last_value|curr_msg_id|curr_value
3|(0,1643,999,-1)|1356634|(1,0,0,-1)|1355635
4|(0,1597,792,-1)|1220380|(1,0,0,-1)|1219588
4|(1,712,296,-1)|1811705|(2,0,0,-1)|1811409

There is only one entry in the result of the third run, which means that repeated messages are located in consecutive message blocks. The integer value 1,355,635 appears for the second time in the first message of ledger 1, entry 0, and the message ID is (1,0,0,-1). In ledger 0, a total of 1000 messages are added, and these 1000 messages are also written at the beginning of the next ledger.

A similar situation occurred in the 4th test.

After the first broker restarts, the client resending the message may be the cause of the message duplication. The broker sends 1000 messages to the BK ensemble of pulsar1, but before sending an ack to the client, the broker fails due to disconnection from ZooKeeper. Then, the client reconnects to the new owner pulsar2 and sends 1000 messages again. Pulsar2 restores the previous ledger (that is, the ledger owned by pulsar1), and creates a new ledger after closing this ledger, ready to receive new messages. The Broker receives 1000 messages from the client and writes these messages to the new ledger. So there is a message duplication.

We can enable message deduplication to test the above possible causes. When message deduplication is enabled, the broker stores the last sequence number of each producer in a hash table. If the received serial number is less than all existing serial numbers, then this message is a duplicate message and no processing will be done on it. The broker stores the data (producer, serial number) about the topic in the cursor. When the broker fails over, the new broker will recreate the hash table. Since the hash table is periodically refreshed, if the broker fails, the latest serial number in the hash table may be lost. If the new broker only relies on the hash table to update the snapshot, it is likely that the message will be repeated again when the broker fails over. In order to avoid this situation, the new broker owner in Pulsar reads the latest N entries from the ledger and appends these entries to the hash table to ensure that the data in the hash table will not be incomplete due to failover.

Test scenario 4-Isolate Topic Owner Broker and ZooKeeper, and enable message deduplication

The parameter configuration is 2-2-1 3-3, and message deduplication is enabled.

Except for enabling message deduplication, test scenario 4 is exactly the same as test scenario 3. This test is used to detect whether message deduplication can prevent message duplication.

In the blockade.yml brokerDeduplicationEnabled parameter to the broker's environment to enable message deduplication.

The results of the 5 tests are as follows.

09:24:34 Start test
09:26:56: 
09:26:56: Test Run #1 on topic iso-broker_1  ------------
09:29:27: Results --------------------------------------------
09:29:27: Final send count: 2000000
09:29:27: Final ack count: 2000000
09:29:27: Final positive ack count: 2000000
09:29:27: Final negative ack count: 0
09:29:27: Messages received: 2000000
09:29:27: Acked messages missing: 0
09:29:27: Non-acked messages received: 0
09:29:27: Out-of-order: 0
09:29:27: Duplicates: 0
09:29:27: ----------------------------------------------------
09:32:48: 
09:32:48: Test Run #2 on topic iso-broker_2  ------------
09:35:06: Results --------------------------------------------
09:35:06: Final send count: 2000000
09:35:06: Final ack count: 2000000
09:35:06: Final positive ack count: 2000000
09:35:06: Final negative ack count: 0
09:35:06: Messages received: 2000000
09:35:06: Acked messages missing: 0
09:35:06: Non-acked messages received: 0
09:35:06: Out-of-order: 0
09:35:06: Duplicates: 0
09:35:06: ----------------------------------------------------
09:37:21: 
09:37:21: Test Run #3 on topic iso-broker_3  ------------
09:39:21: Results --------------------------------------------
09:39:21: Final send count: 2000000
09:39:21: Final ack count: 2000000
09:39:21: Final positive ack count: 2000000
09:39:21: Final negative ack count: 0
09:39:21: Messages received: 2000000
09:39:21: Acked messages missing: 0
09:39:21: Non-acked messages received: 0
09:39:21: Out-of-order: 0
09:39:21: Duplicates: 0
09:39:21: ----------------------------------------------------
09:41:39: 
09:41:39: Test Run #4 on topic iso-broker_4  ------------
09:43:51: Results --------------------------------------------
09:43:51: Final send count: 2000000
09:43:51: Final ack count: 2000000
09:43:51: Final positive ack count: 2000000
09:43:51: Final negative ack count: 0
09:43:51: Messages received: 2000000
09:43:51: Acked messages missing: 0
09:43:51: Non-acked messages received: 0
09:43:51: Out-of-order: 0
09:43:51: Duplicates: 0
09:43:51: ----------------------------------------------------
09:46:17: 
09:46:17: Test Run #5 on topic iso-broker_5  ------------
09:48:53: Results --------------------------------------------
09:48:53: Final send count: 2000000
09:48:53: Final ack count: 2000000
09:48:53: Final positive ack count: 2000000
09:48:53: Final negative ack count: 0
09:48:53: Messages received: 2000000
09:48:53: Acked messages missing: 0
09:48:53: Non-acked messages received: 0
09:48:53: Out-of-order: 0
09:48:53: Duplicates: 0
09:48:53: ----------------------------------------------------

It can be seen from the test results that enabling message deduplication can solve the problem of message duplication in test scenario 3.

Test scenario 5-Isolate Bookie and ZooKeeper

Parameter configuration 2-2-1 3-3.

To emphasize, bookie does not need ZooKeeper for reading and writing, only registration, garbage collection, and AutoRecovery in ZooKeeper. Therefore, isolating bookie and ZooKeeper should not have any impact.

First, identify a bookie in the current ledger's ensemble, and isolate it from ZooKeeper during message transmission. The results of the 5 tests are as follows.

$ cat iso-bookie_output.txt
12:03:52 Start test
12:06:10: 
12:06:10: Test Run #1 on topic iso-bookie_1  ------------
12:08:18: Results --------------------------------------------
12:08:18: Final send count: 2000000
12:08:18: Final ack count: 2000000
12:08:18: Final positive ack count: 2000000
12:08:18: Final negative ack count: 0
12:08:18: Messages received: 2000000
12:08:18: Acked messages missing: 0
12:08:18: Non-acked messages received: 0
12:08:18: Out-of-order: 0
12:08:18: Duplicates: 0
12:08:18: ----------------------------------------------------
12:10:35: 
12:10:35: Test Run #2 on topic iso-bookie_2  ------------
12:12:40: Results --------------------------------------------
12:12:40: Final send count: 2000000
12:12:40: Final ack count: 2000000
12:12:40: Final positive ack count: 2000000
12:12:40: Final negative ack count: 0
12:12:40: Messages received: 2000000
12:12:40: Acked messages missing: 0
12:12:40: Non-acked messages received: 0
12:12:40: Out-of-order: 0
12:12:40: Duplicates: 0
12:12:40: ----------------------------------------------------
12:14:55: 
12:14:55: Test Run #3 on topic iso-bookie_3  ------------
12:16:54: Results --------------------------------------------
12:16:54: Final send count: 2000000
12:16:54: Final ack count: 2000000
12:16:54: Final positive ack count: 2000000
12:16:54: Final negative ack count: 0
12:16:54: Messages received: 2000000
12:16:54: Acked messages missing: 0
12:16:54: Non-acked messages received: 0
12:16:54: Out-of-order: 0
12:16:54: Duplicates: 0
12:16:54: ----------------------------------------------------
12:19:10: 
12:19:10: Test Run #4 on topic iso-bookie_4  ------------
12:21:16: Results --------------------------------------------
12:21:16: Final send count: 2000000
12:21:16: Final ack count: 2000000
12:21:16: Final positive ack count: 2000000
12:21:16: Final negative ack count: 0
12:21:16: Messages received: 2000000
12:21:16: Acked messages missing: 0
12:21:16: Non-acked messages received: 0
12:21:16: Out-of-order: 0
12:21:16: Duplicates: 0
12:21:16: ----------------------------------------------------
12:23:36: 
12:23:36: Test Run #5 on topic iso-bookie_5  ------------
12:25:39: Results --------------------------------------------
12:25:39: Final send count: 2000000
12:25:39: Final ack count: 2000000
12:25:39: Final positive ack count: 2000000
12:25:39: Final negative ack count: 0
12:25:39: Messages received: 2000000
12:25:39: Acked messages missing: 0
12:25:39: Non-acked messages received: 0
12:25:39: Out-of-order: 0
12:25:39: Duplicates: 0
12:25:39: ----------------------------------------------------

There is no message loss, duplication, or out-of-sequence.

Test scenario 6-close multiple bookies

Parameter configuration 3-3-1 3-5.

Test scenario 6 is the same as test scenario 2 except that the redundancy number is 3 and 2 bookies are turned off. When using Pulsar, if the number of bookie failures does not exceed Qw-1, there will be no data loss.

You can use the 061a778e0aaa77 command to close any number of kill-bookies[n]

The complete output of a single run is as follows.

$ python pulsar-test.py kill-bookies[2] kill-m-bookies 1 2000000 50000 3-3-1 3-5 false
Creating blockade cluster
Running test with config: E 3 Qw 3 Qa 1 with nodes bookie1 bookie2 bookie3 bookie4 bookie5 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 3-3-1
Test Run #1 on topic kill-m-bookies_1  ------------
-------------------------------------------------
WRITE PHASE
-------------------------------------------------
2018-10-20 22:41:21.932 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.176.11:6650
2018-10-20 22:41:21.932 INFO  ClientConnection:285 | [192.168.176.1:47178 -> 192.168.176.11:6650] Connected to broker
2018-10-20 22:41:22.599 INFO  BatchMessageContainer:41 | { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1] [producerName_ = ] [batchSizeInBytes_ = 0

ApachePulsar
192 声望939 粉丝

Apache软件基金会顶级项目,下一代云原生分布式消息系统