Pulsar Tiered Storage-S3 Demo

ApachePulsar

This article first introduces Pulsar's tiered storage, then walks through an example that uses Amazon S3 as the secondary storage tier.

Introduction

Each topic in Pulsar consists of an ordered list of segments. Pulsar writes only to the last segment; all earlier segments are sealed, and no more data can be appended to them. This is Pulsar's segment-based storage structure.
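
To make the segment model concrete, here is a minimal Python sketch. It is an illustration only, not Pulsar's implementation: the Topic and Segment classes and the max_entries_per_segment rollover rule are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """One segment of a topic (hypothetical model, not a Pulsar class)."""
    segment_id: int
    entries: List[bytes] = field(default_factory=list)
    sealed: bool = False


class Topic:
    """A topic as an ordered list of segments; only the last one accepts writes."""

    def __init__(self, max_entries_per_segment: int) -> None:
        # Assumed rollover rule: seal a segment once it holds this many entries.
        self.max_entries_per_segment = max_entries_per_segment
        self.segments: List[Segment] = [Segment(segment_id=0)]

    def append(self, message: bytes) -> None:
        current = self.segments[-1]
        if len(current.entries) >= self.max_entries_per_segment:
            # Seal the full segment and open a new one at the end of the list.
            current.sealed = True
            current = Segment(segment_id=current.segment_id + 1)
            self.segments.append(current)
        current.entries.append(message)


topic = Topic(max_entries_per_segment=1000)
for _ in range(2000):
    topic.append(b"hello pulsar")

# Each tuple is (segment id, number of entries, sealed?).
print([(s.segment_id, len(s.entries), s.sealed) for s in topic.segments])
```

In this model, sealed segments never change again, which is what makes them safe candidates for moving to another storage tier.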

Pulsar's segment-based storage architecture allows a single topic to grow as large as the capacity of the entire storage cluster. But as the amount of retained topic data grows, storing all of it becomes expensive. One way to address this is Tiered Storage.

One use case for tiered storage is keeping a large amount of topic data for a long time. For example, a topic holds the click records of all customers, and the data in this topic is used to train a recommendation system. If the topic's data is retained, the historical data can be reused to train a new recommendation system whenever the training algorithm changes.

Pulsar's tiered storage feature allows older backlog data to be offloaded to secondary storage for long-term retention, which frees up space in BookKeeper and reduces storage costs. With tiered storage, old data segments in a topic can be moved automatically from BookKeeper to cheaper secondary storage. For clients, the whole process is transparent and seamless.

When an offload is triggered, manually or automatically, the broker moves the segments from primary storage to secondary storage one by one.

Pulsar currently supports S3 and Google Cloud Storage as secondary storage. Users can configure how much of a topic's data to keep in BookKeeper (primary storage), and how long after being moved to secondary storage the data is deleted from BookKeeper (4 hours by default).
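
The two settings described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the broker's actual code: SegmentInfo, select_for_offload, and deletable_from_primary are hypothetical names, and the selection rule (walk from newest to oldest and offload sealed segments once the running size exceeds the threshold) is an approximation of the behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentInfo:
    """Hypothetical summary of one segment in primary storage."""
    segment_id: int
    size_bytes: int
    sealed: bool
    offloaded: bool = False


def select_for_offload(segments: List[SegmentInfo], threshold_bytes: int) -> List[int]:
    """Return ids of segments to offload, oldest first.

    Walks from newest to oldest while summing sizes. Once the running
    total exceeds the threshold, each sealed segment that has not been
    offloaded yet is selected. Open segments are never selected.
    """
    selected: List[int] = []
    running_total = 0
    for seg in reversed(segments):
        running_total += seg.size_bytes
        if running_total > threshold_bytes and seg.sealed and not seg.offloaded:
            selected.insert(0, seg.segment_id)
    return selected


def deletable_from_primary(offloaded_at_ms: int, now_ms: int,
                           deletion_lag_ms: int = 4 * 60 * 60 * 1000) -> bool:
    """True once the deletion lag (4 hours by default, per the text) has passed."""
    return now_ms - offloaded_at_ms >= deletion_lag_ms


# Hypothetical sizes; segment 32 is the open segment still receiving writes.
segments = [
    SegmentInfo(31, 60_000, sealed=True),
    SegmentInfo(32, 60_000, sealed=False),
]
print(select_for_offload(segments, threshold_bytes=10 * 1024))
```

With a large enough threshold nothing is selected, which corresponds to keeping all of the topic's data in BookKeeper.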

This article uses S3 as an example to demonstrate this feature.

Operations With S3

This example uses Pulsar 2.1.1 and consists of three main steps:

  • Create a bucket in Amazon S3;
  • Download the Pulsar package, configure Offload for Pulsar's Broker, and start Pulsar in Standalone mode;
  • Use Pulsar's producer to generate data, trigger Offload, and verify.

First, create a bucket in S3

  1. Log in to the AWS console and select the S3 service.

  2. Click "Create bucket", fill in the bucket name, then click Next through the remaining pages and confirm.

  3. After these steps, the new bucket appears in the bucket list.

  4. Confirm that AWS access is configured on this machine.

Second, download the Pulsar package and configure it

  1. Download the latest Pulsar binary package (apache-pulsar-xxx-bin.tar.gz) from Pulsar's official website (http://pulsar.apache.org/en/download/), extract it, and get ready to modify the configuration file conf/standalone.conf.

  2. Modify the Offload options in the configuration file conf/standalone.conf, setting the bucket created in the first step:

 managedLedgerOffloadDriver=S3
 s3ManagedLedgerOffloadBucket=offload-test-aws
 s3ManagedLedgerOffloadRegion=us-east-1

Also reduce the segment size settings in conf/standalone.conf, so that each segment is smaller and new segments are created more readily.
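
As a rough illustration of these configuration changes, the Python sketch below rewrites key=value lines in properties-style text. The three offload values come from this article. The two segment-size keys, managedLedgerMaxEntriesPerLedger and managedLedgerMinLedgerRolloverTimeMinutes, and their values are assumptions for this example; the article does not state which values were actually used. The helper apply_settings is a hypothetical name.

```python
from typing import Dict


def apply_settings(conf_text: str, settings: Dict[str, str]) -> str:
    """Set key=value pairs in properties-style text, replacing or appending."""
    remaining = dict(settings)
    out = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        key = stripped.split("=", 1)[0].strip() if "=" in stripped else None
        if key in remaining and not stripped.startswith("#"):
            # Replace the existing, uncommented setting with the new value.
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    # Append any keys that were not already present in the file.
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"


offload_settings = {
    # Values taken from the article.
    "managedLedgerOffloadDriver": "S3",
    "s3ManagedLedgerOffloadBucket": "offload-test-aws",
    "s3ManagedLedgerOffloadRegion": "us-east-1",
    # Assumed keys and values for smaller segments; adjust to your setup.
    "managedLedgerMaxEntriesPerLedger": "1000",
    "managedLedgerMinLedgerRolloverTimeMinutes": "1",
}

# Hypothetical excerpt standing in for the real conf/standalone.conf.
sample_conf = "managedLedgerOffloadDriver=\nmanagedLedgerMaxEntriesPerLedger=50000\n"
updated_conf = apply_settings(sample_conf, offload_settings)
print(updated_conf)
```

To apply this to a real file, read conf/standalone.conf into a string, pass it through apply_settings, and write the result back after keeping a backup.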

  3. Start Pulsar in standalone mode in a terminal:

Run the command bin/pulsar standalone

Third, generate data in Pulsar, trigger Offload, and verify

  1. Start a consumer in a terminal, so that the data about to be produced is not discarded for lack of a consumer:

bin/pulsar-client consume -s "my-sub-name" my-topic-for-offload

  2. Open a new terminal window and run the following command twice to produce 2000 messages. This ensures that the topic has two segments, so that the first segment can be moved to S3.

bin/pulsar-client produce my-topic-for-offload --messages "hello pulsar this is the content for each message" -n 1000

  3. Trigger Offload manually from the command line:

bin/pulsar-admin topics offload --size-threshold 10K public/default/my-topic-for-offload

  4. Check the offload status with the following command until it reports success:

bin/pulsar-admin topics offload-status public/default/my-topic-for-offload
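
Rather than rerunning the status command by hand, the check can be wrapped in a small polling loop. The Python sketch below assumes that it runs from the Pulsar installation directory and that the status text contains the word "success" once the offload has finished; both are assumptions to verify against your own output. The names cli_status and wait_for_offload are hypothetical.

```python
import subprocess
import time
from typing import Callable

TOPIC = "public/default/my-topic-for-offload"


def cli_status() -> str:
    """Run the offload-status command from the article and return its output."""
    result = subprocess.run(
        ["bin/pulsar-admin", "topics", "offload-status", TOPIC],
        capture_output=True, text=True, check=False,
    )
    return result.stdout + result.stderr


def wait_for_offload(get_status: Callable[[], str] = cli_status,
                     attempts: int = 30, delay_s: float = 2.0) -> bool:
    """Poll until the status text mentions success; give up after `attempts`."""
    for _ in range(attempts):
        # Assumed wording: the status text contains "success" when done.
        if "success" in get_status().lower():
            return True
        time.sleep(delay_s)
    return False
```

Calling wait_for_offload() with no arguments polls the real CLI. Passing a different get_status function makes the loop easy to try without a running broker.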

  5. Verify that the first segment has been moved to S3:

In the S3 console, you can see that ledger-31 has been stored.

From the command line, you can also see that the topic has two segments, and that the offloaded status of the first segment, ledger-31, is true.
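
The same check can be scripted. The Python sketch below assumes the per-segment information is available as JSON with a "ledgers" list whose items carry "ledgerId" and "offloaded" fields, as printed by a command such as bin/pulsar-admin topics stats-internal. The article does not name the command it used, so treat the command and the sample data below as assumptions; offloaded_ledger_ids is a hypothetical helper.

```python
import json
from typing import List

# Hypothetical sample shaped like the per-ledger list described above.
sample_stats = """
{
  "ledgers": [
    {"ledgerId": 31, "entries": 1000, "size": 60000, "offloaded": true},
    {"ledgerId": 32, "entries": 1000, "size": 60000, "offloaded": false}
  ]
}
"""


def offloaded_ledger_ids(stats_json: str) -> List[int]:
    """Return the ids of ledgers whose offloaded flag is true."""
    stats = json.loads(stats_json)
    return [
        ledger["ledgerId"]
        for ledger in stats.get("ledgers", [])
        if ledger.get("offloaded")
    ]


print(offloaded_ledger_ids(sample_stats))
```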

The steps above cover the whole process of using S3 as secondary storage.

To learn more about tiered storage, refer to Pulsar's official website:

http://pulsar.apache.org/docs/en/concepts-tiered-storage/

http://pulsar.apache.org/docs/en/cookbooks-tiered-storage/
