Pulsar Tiered Storage-S3 Demo

This article first introduces Pulsar's tiered storage, then walks through an example that uses Amazon S3 as the secondary storage tier.

Introduction

Each topic in Pulsar consists of an ordered list of segments. Pulsar only ever writes to the last segment. All earlier segments are sealed, and no more data can be appended to them. This is Pulsar's segment-based storage structure.
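The segment structure can be pictured with a small model. This is only an illustrative sketch, not Pulsar's implementation: the `Segment` and `Topic` classes and the `max_entries` rollover limit are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """One segment of a topic (hypothetical model, not a Pulsar class)."""
    segment_id: int
    entries: List[bytes] = field(default_factory=list)
    sealed: bool = False


class Topic:
    """A topic as an ordered list of segments; only the last one is writable."""

    def __init__(self, max_entries: int = 3) -> None:
        # max_entries is an assumed rollover limit for this sketch.
        self.max_entries = max_entries
        self.segments: List[Segment] = [Segment(segment_id=0)]

    def append(self, message: bytes) -> None:
        current = self.segments[-1]
        if len(current.entries) >= self.max_entries:
            # Seal the full segment and open a new one at the end of the list.
            current.sealed = True
            current = Segment(segment_id=current.segment_id + 1)
            self.segments.append(current)
        current.entries.append(message)

    def sealed_segments(self) -> List[Segment]:
        """Segments that can no longer be written to."""
        return [s for s in self.segments if s.sealed]
```

In this model, only sealed segments would ever be candidates for moving to cheaper storage, because the last segment is still receiving writes.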

Because of this segment-based architecture, a single topic can grow very large, up to the capacity of the whole storage cluster. However, as the amount of data retained in a topic grows, storing it becomes expensive. One way to address this is tiered storage.

A typical use case for tiered storage is keeping a large topic for a long time. For example, suppose a topic holds the click records of all customers, and this data is used to train a recommendation system. If the topic's data is retained, the historical data can be reused to train a new recommendation system whenever the training algorithm changes.

Pulsar's tiered storage feature offloads older backlog data to secondary storage for long-term retention, which frees up space in BookKeeper and reduces storage costs. With tiered storage, old segments of a topic are moved automatically from BookKeeper to cheaper secondary storage. The whole process is transparent to clients.

When an offload is triggered, either manually or automatically, the broker moves segments from primary storage to secondary storage one by one.

Pulsar currently supports S3 and Google Cloud Storage as secondary storage. Users can configure how much of a topic's data to keep in BookKeeper (primary storage), and how long after a segment has been moved to secondary storage it is deleted from BookKeeper (4 hours by default).
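To make the size threshold concrete, here is a simplified model of choosing segments to offload. It assumes the newest data stays in BookKeeper and that older sealed segments are offloaded once the retained size exceeds the threshold. The function name, the tuple layout, and the sizes in the usage note are assumptions for illustration; this is not the broker's actual code.

```python
from typing import List, Tuple

# (ledger_id, size_bytes, sealed), listed oldest first. Hypothetical layout.
SegmentInfo = Tuple[int, int, bool]


def select_for_offload(segments: List[SegmentInfo], threshold_bytes: int) -> List[int]:
    """Return the ids of sealed segments that push the topic over the threshold.

    Walks from newest to oldest, summing sizes. Once the running total
    exceeds threshold_bytes, every older sealed segment is selected.
    The open (unsealed) segment is never selected.
    """
    running_total = 0
    selected: List[int] = []
    for ledger_id, size_bytes, sealed in reversed(segments):
        running_total += size_bytes
        if running_total > threshold_bytes and sealed:
            selected.append(ledger_id)
    selected.reverse()  # report oldest first
    return selected
```

For instance, with a sealed segment of 60,000 bytes followed by an open segment of 50,000 bytes and a threshold of 10,240 bytes, only the sealed segment is selected.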

This article uses S3 as an example to demonstrate the feature.

Operations With S3

This example uses Pulsar 2.1.1 and consists of three steps:

Create a bucket in Amazon S3;
Download the Pulsar package, configure offloading for the Pulsar broker, and start Pulsar in standalone mode;
Use a Pulsar producer to generate data, trigger the offload, and verify the result.

First, create a bucket in S3

To create a bucket, click "Create bucket" in the S3 console, enter a bucket name, then click "Next" through the remaining screens and confirm.

Once this is done, the new bucket appears in the bucket list.

Confirm that AWS credentials are configured on this machine, so that the broker can access the bucket.
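One quick way to check is to look in the two most common places where AWS credentials live: the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables, and the `~/.aws/credentials` file. The sketch below only checks that these exist; it does not validate them against AWS, and it does not cover other credential sources such as instance roles. The function name is hypothetical.

```python
from pathlib import Path
from typing import Mapping, Optional


def find_aws_credentials(env: Mapping[str, str], home: Path) -> Optional[str]:
    """Return a short description of where credentials were found, or None.

    Only checks presence, not validity.
    """
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment variables"
    credentials_file = home / ".aws" / "credentials"
    if credentials_file.is_file():
        return str(credentials_file)
    return None
```

To check the current machine, call it with `os.environ` and `Path.home()`. A `None` result means neither location is set up.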

Second, download the Pulsar package and configure it

Download the latest Pulsar binary package (apache-pulsar-xxx-bin.tar.gz) from Pulsar's official website (http://pulsar.apache.org/en/download/), unpack it, and get ready to edit the configuration file conf/standalone.conf.

Set the offload options in conf/standalone.conf, using the bucket created in the first step:

managedLedgerOffloadDriver=S3
s3ManagedLedgerOffloadBucket=offload-test-aws
s3ManagedLedgerOffloadRegion=us-east-1

Also reduce the segment size settings in conf/standalone.conf, so that each segment is smaller and new segments are created sooner.
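The configuration changes can also be applied with a small script. The sketch below rewrites `key=value` lines in the text of a properties-style file. The three offload values are the ones given in this article. The two rollover settings, `managedLedgerMaxEntriesPerLedger` and `managedLedgerMinLedgerRolloverTimeMinutes`, are real broker options, but the values shown are assumptions for this demo, since the article does not state which values it used. The helper name `set_properties` is hypothetical.

```python
from typing import Dict, List


def set_properties(conf_text: str, updates: Dict[str, str]) -> str:
    """Set key=value pairs in properties-style text.

    Existing uncommented keys are overwritten in place; keys that are
    not present are appended at the end. Comments are left untouched.
    """
    remaining = dict(updates)
    out: List[str] = []
    for line in conf_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_setting = "=" in line and not line.lstrip().startswith("#")
        if is_setting and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


OFFLOAD_SETTINGS = {
    # Values from the article.
    "managedLedgerOffloadDriver": "S3",
    "s3ManagedLedgerOffloadBucket": "offload-test-aws",
    "s3ManagedLedgerOffloadRegion": "us-east-1",
    # Assumed values for small, quickly rolling segments (not from the article).
    "managedLedgerMaxEntriesPerLedger": "1000",
    "managedLedgerMinLedgerRolloverTimeMinutes": "1",
}
```

To apply it, read conf/standalone.conf, pass its text and `OFFLOAD_SETTINGS` to `set_properties`, and write the result back. Keep a backup of the original file first.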

Start Pulsar in standalone mode from a terminal:

bin/pulsar standalone

Third, generate data in Pulsar, trigger the offload, and verify

Start a consumer in a terminal first, so that a subscription exists on the topic. Without one, the messages produced next would be discarded because nothing is subscribed to them:

bin/pulsar-client consume -s "my-sub-name" my-topic-for-offload

Open a new terminal window and run the following command twice to produce 2000 messages. This ensures that the topic has two segments, so that the first (sealed) segment can be moved to S3.

bin/pulsar-client produce my-topic-for-offload --messages "hello pulsar this is the content for each message" -n 1000

Trigger the offload manually from the command line:

bin/pulsar-admin topics offload --size-threshold 10K public/default/my-topic-for-offload
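A quick back-of-the-envelope check shows why a 10K threshold is enough to trigger an offload here. The sketch below computes the raw payload size of one run of the produce command. It ignores per-message metadata overhead and assumes that `10K` means 10 × 1024 bytes.

```python
MESSAGE = "hello pulsar this is the content for each message"

# Assumption: the K suffix in --size-threshold means 1024 bytes.
THRESHOLD_BYTES = 10 * 1024


def payload_bytes(message: str, count: int) -> int:
    """Raw UTF-8 payload size of `count` copies of `message`.

    Ignores the metadata that Pulsar stores alongside each entry,
    so the real segment size is somewhat larger.
    """
    return len(message.encode("utf-8")) * count


one_run = payload_bytes(MESSAGE, 1000)
print(one_run, one_run > THRESHOLD_BYTES)
```

Each run writes 49,000 bytes of payload, several times the threshold, so the topic holds far more data than the threshold allows to remain in BookKeeper.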

Check the offload status with the following command, repeating it until the offload is reported as successful:

bin/pulsar-admin topics offload-status public/default/my-topic-for-offload

Verify that the first segment has been moved to S3:

In the S3 console, you can see that ledger-31 has been stored in the bucket.

From the command line, you can also see that the topic has two segments, and that the offloaded status of the first segment, ledger-31, is true.
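If the topic's internal stats are available as JSON (for example from `bin/pulsar-admin topics stats-internal public/default/my-topic-for-offload`), the offloaded segments can be picked out with a few lines of code. The sketch below assumes the output contains a `ledgers` list whose items carry `ledgerId` and `offloaded` fields. The sample document is hypothetical: only ledger 31 comes from this article, and the second ledger id and all sizes are made up for illustration.

```python
import json
from typing import List


def offloaded_ledgers(stats_json: str) -> List[int]:
    """Return the ids of ledgers marked as offloaded in the stats JSON."""
    stats = json.loads(stats_json)
    return [
        ledger["ledgerId"]
        for ledger in stats.get("ledgers", [])
        if ledger.get("offloaded")
    ]


# Hypothetical sample, shaped like the assumed stats output.
SAMPLE_STATS = """
{
  "ledgers": [
    {"ledgerId": 31, "entries": 1000, "size": 100000, "offloaded": true},
    {"ledgerId": 32, "entries": 0, "size": 0, "offloaded": false}
  ]
}
"""

print(offloaded_ledgers(SAMPLE_STATS))
```

The last segment is still open for writes, so it is expected to show as not offloaded.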

These steps cover the whole process of using S3 as secondary storage.

To learn more about tiered storage, see Pulsar's official website:

http://pulsar.apache.org/docs/en/concepts-tiered-storage/

http://pulsar.apache.org/docs/en/cookbooks-tiered-storage/
