Best Practice | Abandoning Ceph: Salesforce Uses Apache BookKeeper to Build the Strongest Storage in the Cloud

About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation and a next-generation cloud-native distributed messaging and streaming platform. It integrates messaging, storage, and lightweight function computing, and separates compute from storage in its architecture. It supports multi-tenancy, persistent storage, and multi-datacenter, cross-region data replication, and offers the storage characteristics that streaming data needs: strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/

Highlights of this article

The traditional way to make a storage system cloud-aware is a lift-and-shift migration. It performs well, but in our experience it is better to re-architect the cloud-aware parts.
Today, deploying Apache BookKeeper across zones requires manually mapping storage nodes to specific zones or availability zones, and durability and availability suffer when a zone goes down.
Salesforce's approach to making Apache BookKeeper cloud-aware is to make the storage nodes intelligent enough to operate efficiently in cloud deployments while still guaranteeing durability and availability.
These methods simplify cluster patching, upgrades, and restarts, with minimal impact on the consuming services.

At Salesforce, we need a storage system that can handle two kinds of streams at once: one for write-ahead logs and one for data. The requirements for the two streams conflict. The write-ahead log stream needs low write latency and high read throughput, while the data stream needs high write throughput and low random-read latency. As a leader in cloud computing, we also need the storage system to be cloud-aware, which raises the availability and durability requirements. And to run on commodity hardware and scale out easily, we could not change the design of our deployment model.

Open source solution

After some preliminary research on storage systems, we considered whether to build one or buy one.

Weighing our overall plans against the main business drivers, such as time to market, resources, and cost, we decided to use an open source storage system.

After reviewing the open source options, we had two candidates: Ceph and Apache BookKeeper. Because the system would be exposed to customers, scalability and consistency were both important. The system had to fully meet the CAP (Consistency, Availability, Partition tolerance) requirements of our use cases as well as our own special needs. First, let's look at how BookKeeper and Ceph compare on CAP and other aspects.

Ceph guarantees consistency and partition tolerance. Its read path can provide availability and partition tolerance by allowing unreliable reads, but it is not easy to make the write path provide availability and partition tolerance. We also had to keep in mind that the data in our deployments cannot be changed.

We chose Apache BookKeeper. BookKeeper is an append-only, immutable data store built on a highly replicated distributed log, and it meets our CAP requirements. BookKeeper also has the following properties:

Once a write is acknowledged, it is always readable.
Once an entry has been read, it is always readable.
There is no central master; clients use Apache ZooKeeper for consensus and to obtain metadata.
Data placement does not require complicated hashing or computation.

Salesforce has always supported open source, and the Apache BookKeeper community is active and vibrant.

Apache BookKeeper: almost perfect, but with room for improvement

Apache BookKeeper met almost all the requirements for our storage system, but some work remained. First, let's look at what Apache BookKeeper provides.

A storage node is called a Bookie; a group of Bookies is called an Ensemble.
The smallest unit of writing is an Entry, and an Entry cannot be changed.
A group of Entries is called a Ledger. A Ledger is append-only and cannot be changed.
The number of Bookies an Entry is written (replicated) to is called the Write Quorum: the maximum number of copies of an Entry.
The number of Bookies that must acknowledge a write before it is confirmed is called the Ack Quorum: the minimum number of copies of an Entry.

For durability, a Ledger is replicated across an Ensemble of Bookies, and the Entries within a Ledger can be striped across the Ensemble.

Example: Ensemble Size: 5, Write Quorum Size: 3, Ack Quorum Size: 2

Writes are confirmed according to the (configurable) Write Quorum and Ack Quorum, which keeps write latency low and scalability high.
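To make the quorum terms concrete, here is a minimal sketch of how Entries could be striped across an Ensemble and when a write counts as confirmed. It assumes simple round-robin striping and uses hypothetical Bookie names; it illustrates the concepts above and is not BookKeeper's client code.

```python
def write_set(entry_id: int, ensemble: list[str], write_quorum: int) -> list[str]:
    """Bookies that receive a copy of the entry.

    ASSUMPTION: round-robin striping, starting at entry_id modulo ensemble size.
    """
    size = len(ensemble)
    return [ensemble[(entry_id + i) % size] for i in range(write_quorum)]


def is_confirmed(acks: set[str], targets: list[str], ack_quorum: int) -> bool:
    """A write is confirmed once ack_quorum of its target Bookies have acked."""
    return len(acks & set(targets)) >= ack_quorum


# Hypothetical ensemble matching the example: Ensemble 5, Write Quorum 3, Ack Quorum 2.
ensemble = ["bookie-0", "bookie-1", "bookie-2", "bookie-3", "bookie-4"]
for entry_id in range(3):
    print(entry_id, write_set(entry_id, ensemble, write_quorum=3))
```

With these settings each Entry lands on three of the five Bookies, and the client can report success as soon as any two of those three have acknowledged it.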

But in fact, it is not easy to run BookKeeper on commodity hardware in the cloud.

The data placement policy is not cloud-aware and does not take the underlying cloud provider (the cloud infrastructure) into account. At present, some users work around this by manually identifying the nodes in each availability zone, grouping them logically, and then applying the placement policy per group. This works, but it does not handle zone failures, and it makes large clusters harder to maintain and upgrade.

In addition, every cloud infrastructure has had availability zone outages, and the general understanding is that applications must be designed with these failures in mind. A good example is Christmas 2012, when an Amazon Web Services availability zone failure took down the public cloud infrastructure that Netflix depended on, yet the Netflix service kept running at limited capacity.

Problems in the public cloud

Public cloud infrastructure is easy to scale and reduces usage and maintenance costs to some extent. As a result, almost everything from websites to applications to enterprise software now runs on infrastructure provided by public cloud providers. But the public cloud has shortcomings too: it can become unavailable at the node, zone, or region level, and when the underlying infrastructure is unavailable there is nothing users can do. The cause may be the failure of individual machines, zones, or regions, or increased network latency caused by hardware faults. So developers who run applications on public cloud infrastructure ultimately need to design for the problems these failures cause.

Apache BookKeeper cannot solve this problem on its own, so we had to design a fix ourselves.

Salesforce's re-architecture

Once we understood the problem, we started looking for ways to make BookKeeper cloud-aware and meet the following requirements:

Every Bookie in a public cloud cluster needs an identity.
The data placement policy is designed around how the Ensemble is distributed across availability zones, for better high availability and simpler maintenance and deployment.
Existing Bookie functions such as reads, writes, and data replication are improved so that Bookies can take full advantage of a multi-zone layout and account for the cost of transferring data across zones.
All of the above must be independent of the specific cloud infrastructure.

Our solution is as follows:

Cloud awareness: Cookies and Kubernetes

The existing BookKeeper architecture gives every Bookie a unique identifier, assigned at first startup. The identifier is stored in the metadata store (ZooKeeper) and can be accessed by other Bookies and by clients.

The first step in making Apache BookKeeper cloud-aware is to let every Bookie know where it is deployed in the Kubernetes cluster. We believe the Cookie data is the best place to hold this location information.

So we added a networkLocation field to the Cookie. It has two parts, the availability zone and the upgrade domain, which together locate a Bookie. Kubernetes is agnostic to the cloud infrastructure, so we can use the Kubernetes API to query the underlying availability zone. We generate the upgradeDomain field from a formula based on the sequential index in the host name. It can be used for rolling upgrades without affecting the availability of the cluster.

These fields and their values are generated when the machine starts and are saved in the metadata store, where they can be read. The information is used to build Ensembles, to assign Bookies to Ensembles, and to decide which Bookie to replicate data from and which Bookie to store the replica on.
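As a rough illustration of the two-part location described above, the sketch below derives an upgrade domain from the trailing index of a host name and combines it with an availability zone. The article does not give the actual formula, so the modulo rule, the `ud-N` naming, the `NetworkLocation` class, and the host and zone names are all assumptions; in a real deployment the zone value would come from a Kubernetes API lookup rather than a literal.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkLocation:
    """Hypothetical stand-in for the networkLocation data kept in the Cookie."""
    availability_zone: str
    upgrade_domain: str

    def path(self) -> str:
        return f"/{self.availability_zone}/{self.upgrade_domain}"


def upgrade_domain_for(hostname: str, num_upgrade_domains: int) -> str:
    """Derive an upgrade domain from the trailing ordinal of a host name.

    ASSUMPTION: ordinal modulo the number of upgrade domains. The real
    formula used by Salesforce is not given in the article.
    """
    match = re.search(r"(\d+)$", hostname)
    if match is None:
        raise ValueError(f"hostname has no trailing ordinal: {hostname!r}")
    return f"ud-{int(match.group(1)) % num_upgrade_domains}"


def build_location(hostname: str, zone: str, num_upgrade_domains: int = 3) -> NetworkLocation:
    # `zone` stands in for a value queried from the Kubernetes API at startup.
    return NetworkLocation(zone, upgrade_domain_for(hostname, num_upgrade_domains))


print(build_location("bookie-4", "zone-a").path())
```

Deriving the upgrade domain from the host name index means consecutive hosts fall into different domains, so a rolling upgrade can take one domain at a time.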

Public cloud layout strategy

Now that the client is smart enough to communicate with Bookies in specific zones, the next step is a data placement policy that can use this information. We developed ZoneAwareEnsemblePlacementPolicy (ZEPP), a two-level hierarchical placement policy designed for cloud-based deployments. ZEPP is aware of availability zones (AZ) and upgrade domains (UD).

An AZ is a logically isolated data center within a region; a UD is a group of nodes within an AZ that can be shut down without affecting the service. UDs are also aware of zone shutdowns and restarts.

The figure below shows a deployment that ZEPP can work with. The deployment takes the AZ and UD information in the Cookie into account and groups Bookie nodes accordingly.
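The two-level idea can be sketched as follows: group Bookies by AZ, then by UD, and pick Ensemble members so that consecutive picks fall in different zones. This is a simplified illustration with hypothetical Bookie, AZ, and UD names. It is not the actual ZEPP algorithm, which the article does not describe in detail.

```python
from collections import defaultdict
from itertools import zip_longest

# Each Bookie is (name, availability_zone, upgrade_domain).
Bookie = tuple[str, str, str]


def interleave(groups: list[list[str]]) -> list[str]:
    """Take one item from each group in turn until all groups are exhausted."""
    return [item for row in zip_longest(*groups) for item in row if item is not None]


def pick_ensemble(bookies: list[Bookie], ensemble_size: int) -> list[str]:
    """Pick Bookies spread first across zones, then across upgrade domains."""
    by_zone: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for name, zone, ud in bookies:
        by_zone[zone][ud].append(name)

    # Level 2: within each zone, alternate between upgrade domains.
    zone_queues = [
        interleave([sorted(by_zone[zone][ud]) for ud in sorted(by_zone[zone])])
        for zone in sorted(by_zone)
    ]
    # Level 1: across zones, alternate so neighbors are in different AZs.
    ordered = interleave(zone_queues)
    if len(ordered) < ensemble_size:
        raise ValueError("not enough Bookies for the requested ensemble size")
    return ordered[:ensemble_size]


bookies = [
    ("b0", "az-1", "ud-0"), ("b1", "az-1", "ud-1"),
    ("b2", "az-2", "ud-0"), ("b3", "az-2", "ud-1"),
    ("b4", "az-3", "ud-0"), ("b5", "az-3", "ud-1"),
]
print(pick_ensemble(bookies, 3))
```

With three zones and an Ensemble of three, each member lands in a different AZ, so losing one zone leaves two copies reachable.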

Availability, latency, and cost

With these changes, Apache BookKeeper can be cloud-aware. But cost is also a factor that must be considered when designing the architecture. Most cloud infrastructures charge in one direction, for data leaving a service, and the cost of transfers across availability zones varies. The BookKeeper client needs to take this into account, because today it picks a Bookie from the Ensemble at random to read from.

If the Bookie and the client are in different availability zones, this adds unnecessary cost. Data replication may also take place between Bookies in different availability zones, and when an availability zone fails, the cost goes up.

We deal with these special situations in the following ways:

Reorder reads

Currently, the BookKeeper client picks a Bookie from the Ensemble at random to read from. With the reorder-reads feature, the client chooses the Bookie deliberately, which reduces read latency and cost.

With reorder reads enabled, the client selects a Bookie in the following order:

Bookies in the local zone that meet the requirements and have few pending requests;
Bookies in a remote zone that meet the requirements and have few pending requests;
Next, the Bookie in the local zone with the fewest failures, or with pending requests above the configured threshold;
Next, the Bookie in a remote zone with the fewest failures, or with pending requests above the configured threshold.

With this ordering, even a system that has been running for a long time and has experienced failures can still meet our latency and cost requirements.
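The four-tier ordering above can be sketched as a sort key. The sketch assumes that "meets the requirements" means no recorded failures and pending requests at or below a threshold; the `BookieState` class, its fields, and the sample values are hypothetical and are not part of the BookKeeper client API.

```python
from dataclasses import dataclass


@dataclass
class BookieState:
    """Hypothetical per-Bookie view held by a client."""
    name: str
    zone: str
    pending_requests: int
    failures: int = 0


def reorder_for_read(ensemble: list[BookieState], client_zone: str,
                     pending_threshold: int) -> list[str]:
    """Return Bookie names in the order the client would try them."""
    def rank(b: BookieState) -> tuple[int, int, int]:
        # ASSUMPTION: "healthy" = no failures and pending requests within threshold.
        healthy = b.failures == 0 and b.pending_requests <= pending_threshold
        local = b.zone == client_zone
        if healthy and local:
            tier = 0          # healthy, local zone
        elif healthy:
            tier = 1          # healthy, remote zone
        elif local:
            tier = 2          # degraded, local zone
        else:
            tier = 3          # degraded, remote zone
        # Within a tier, prefer fewer failures, then fewer pending requests.
        return (tier, b.failures, b.pending_requests)

    return [b.name for b in sorted(ensemble, key=rank)]


ensemble = [
    BookieState("b0", "az-2", pending_requests=1),
    BookieState("b1", "az-1", pending_requests=50),
    BookieState("b2", "az-1", pending_requests=2),
    BookieState("b3", "az-2", pending_requests=0, failures=3),
]
print(reorder_for_read(ensemble, client_zone="az-1", pending_threshold=10))
```

A client in az-1 tries the healthy local Bookie first, falls back to a healthy remote one, and only then considers overloaded or failing Bookies.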

Handling zone failures

When a zone goes down, all the Bookies in different Ensembles start replicating data to Bookies in the zones that are still available, in order to satisfy the Ensemble Size and quorum requirements. This causes a "thundering herd" problem.

To solve this, we first have to determine when a zone is actually down. The failure may be a transient operational error, such as a network fault that makes a zone unreachable, and in that case we do not want the system to replicate terabytes of data. At the same time, we must be ready to handle real failures. Our solution has two steps:

Identify whether the zone has really failed or the failure is transient;
Turn the large-scale automatic replication of an entire zone into a manual operation.

The figure below shows how we respond when a zone goes down and comes back up.

The HighWaterMark and LowWaterMark values are calculated from the number of Bookies available in the zone and the total number of Bookies in the zone. Users can set thresholds for both values, and the system uses them to detect a failure and determine its type.

When a zone is marked as down, we disable auto-replication to avoid automatically copying terabytes of data across zones. We also added an alert at the point where data would be replicated, to notify users of a possible zone failure. We believe operations experts can tell noise from a real failure and decide whether to start auto-replicating the entire zone's data.

The disabled auto-replication can also be re-enabled on Bookies through a shell command.
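One way to picture the watermark logic is as hysteresis between two thresholds on the fraction of available Bookies in a zone. The article does not spell out the exact rule, so the comparison logic, the function names, and the sample thresholds below are assumptions made for illustration only.

```python
from enum import Enum


class ZoneState(Enum):
    UP = "up"
    DOWN = "down"


def next_zone_state(current: ZoneState, available: int, total: int,
                    low_watermark: float, high_watermark: float) -> ZoneState:
    """Classify a zone from the fraction of its Bookies that are available.

    ASSUMPTION: below the low watermark the zone is marked down, at or above
    the high watermark it is marked up, and in between it keeps its state.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 <= low_watermark <= high_watermark <= 1.0:
        raise ValueError("need 0 <= low_watermark <= high_watermark <= 1")
    ratio = available / total
    if ratio < low_watermark:
        return ZoneState.DOWN
    if ratio >= high_watermark:
        return ZoneState.UP
    return current


def auto_replication_enabled(zone_states: dict[str, ZoneState]) -> bool:
    """Auto-replication stays on only while no zone is marked down."""
    return all(state is ZoneState.UP for state in zone_states.values())


state = next_zone_state(ZoneState.UP, available=2, total=10,
                        low_watermark=0.3, high_watermark=0.8)
print(state, auto_replication_enabled({"az-1": ZoneState.UP, "az-2": state}))
```

The gap between the two thresholds keeps a zone that is flapping around a single cutoff from repeatedly toggling replication on and off.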

Our takeaways

Apache BookKeeper is an open source project with a very active community that has been discussing a range of these challenges. Because BookKeeper is a component that stores data, cloud awareness matters a great deal to many of its users.

The changes described in this article have been battle-tested at Salesforce. With Apache BookKeeper we can now withstand AZ and AZ + 1 failures. Architectural changes like these inevitably affect availability, latency, cost, and ease of deployment and maintenance. The community has accepted some of the changes we submitted, and we will keep contributing. We hope these changes make cluster patching, upgrades, and restarts simpler while minimizing the impact on consuming services.

About the author
Anup Ghatage works at Salesforce, where he focuses on cloud infrastructure and data engineering. He previously worked at SAP and Cisco Systems and has a keen interest in maintaining and developing highly scalable systems. He holds a bachelor's degree in computer science from the University of Pune and a master's degree from Carnegie Mellon University. He is a committer on Apache BookKeeper and actively participates in its development. You can follow Anup on Twitter (@ghatageanup).
