
1 Background

Curve ( https://github.com/opencurve/curve ) is a high-performance, easy-to-operate, cloud-native software-defined storage system for a full range of scenarios, designed and developed independently by NetEase Shufan. It aims to meet requirements in scenarios that Ceph's architecture cannot support well, and was officially open sourced in July 2020. It currently consists of two sub-projects, CurveBS and CurveFS, which provide distributed block storage and distributed file storage respectively. CurveBS has already become the distributed shared storage base of the open source cloud-native database PolarDB for PostgreSQL, supporting its storage-compute separation architecture.


In the design of CurveBS, data consistency on the data node (ChunkServer) is implemented with a raft-based distributed consensus protocol.

A typical raft-based implementation of a write Op is shown in the following figure:

Taking the common three-replica configuration as an example, the general process is as follows:

  1. The client sends the write Op (step 1). When the write Op reaches the leader (if there is no leader, a leader election is performed first; the write Op is always sent to the leader), the leader accepts the write Op, generates a WAL (write-ahead log) entry, persists the WAL to its local storage engine (step 2), and in parallel sends the WAL to the two followers via the log-replication RPC (step 3).
  2. After receiving the log request from the leader, the two followers persist the received log to their local storage engines (step 4) and reply to the leader that the log write succeeded (step 5).
  3. Generally, the leader's own log hits the disk first, so once the leader receives a success reply from either follower, a majority of replicas hold the log; the write Op is then committed to the state machine and the write is applied to the local storage engine (step 6).
  4. Once the above steps are complete, the write Op is done and success can be returned to the client (step 7). At a later time, the two followers also receive the leader's commit notification and apply the write Op to their local storage engines (step 9). (A simplified sketch of this ordering follows the list.)
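The following minimal sketch is our own simplification, not braft's actual code or API: the helper names are made up, and in the real flow the leader's local WAL write and the replication RPCs run in parallel rather than sequentially. It only illustrates the ordering that matters for this article: the leader replies to the client after a majority of replicas have persisted the log and the write Op has been applied locally.

#include <cstdio>
#include <string>

struct WriteOp { std::string data; };

// Stand-in helpers: in CurveBS these correspond to braft log persistence,
// log-replication RPCs, and the datastore apply path.
static bool PersistWalLocally(const WriteOp&)       { return true; }  // step 2
static int  ReplicateWalToFollowers(const WriteOp&) { return 2; }     // steps 3-5, returns acks
static void ApplyToDatastore(const WriteOp&)        {}                // step 6
static void ReplyToClient(bool ok) { std::printf("write %s\n", ok ? "ok" : "failed"); }

// The local WAL write and the replication are shown sequentially only to
// keep the sketch short; in braft they proceed in parallel.
void LeaderHandleWrite(const WriteOp& op, int replicas = 3) {
    bool local_ok = PersistWalLocally(op);            // leader WAL on disk
    int acks = ReplicateWalToFollowers(op);           // followers persist WAL and ack

    int persisted = (local_ok ? 1 : 0) + acks;
    if (persisted >= replicas / 2 + 1) {              // a majority holds the log
        ApplyToDatastore(op);                         // apply the write Op (step 6)
        ReplyToClient(true);                          // return success (step 7)
    } else {
        ReplyToClient(false);
    }
}

int main() { LeaderHandleWrite(WriteOp{"hello"}); return 0; }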

In the current CurveBS implementation, when raft applies the write Op to the local storage engine (the datastore), it uses a synchronous write based on O_DSYNC. In fact, once the raft log has been persisted, the write Op can safely return to the client without syncing the applied data, which reduces write Op latency. This is the principle behind the write-latency optimization described in this article.

The code is shown below; the O_DSYNC flag is used in the Open function of the chunk file. (A standalone POSIX illustration of what dropping the flag changes follows the code.)

 CSErrorCode CSChunkFile::Open(bool createFile) {
    WriteLockGuard writeGuard(rwLock_);
    string chunkFilePath = path();
    // Create a new file, if the chunk file already exists, no need to create
    // The existence of chunk files may be caused by two situations:
    // 1. getchunk succeeded, but failed in stat or load metapage last time;
    // 2. Two write requests concurrently create new chunk files
    if (createFile
        && !lfs_->FileExists(chunkFilePath)
        && metaPage_.sn > 0) {
        std::unique_ptr<char[]> buf(new char[pageSize_]);
        memset(buf.get(), 0, pageSize_);
        metaPage_.version = FORMAT_VERSION_V2;
        metaPage_.encode(buf.get());

        int rc = chunkFilePool_->GetFile(chunkFilePath, buf.get(), true);
        // When creating files concurrently, the previous thread may have been
        // created successfully, then -EEXIST will be returned here. At this
        // point, you can continue to open the generated file
        // But the current operation of the same chunk is serial, this problem
        // will not occur
        if (rc != 0  && rc != -EEXIST) {
            LOG(ERROR) << "Error occured when create file."
                       << " filepath = " << chunkFilePath;
            return CSErrorCode::InternalError;
        }
    }
    int rc = lfs_->Open(chunkFilePath, O_RDWR|O_NOATIME|O_DSYNC);
    if (rc < 0) {
        LOG(ERROR) << "Error occured when opening file."
                   << " filepath = " << chunkFilePath;
        return CSErrorCode::InternalError;
    }
...
}
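The effect of dropping O_DSYNC can be illustrated with plain POSIX calls; this is a standalone demonstration, not the Curve patch itself. Without the flag, write() returns once the data is in the page cache, and durability has to be provided later by an explicit fdatasync()/fsync(), which is exactly what the snapshot path must take care of (Section 3).

#include <fcntl.h>
#include <unistd.h>

int main() {
    // Before the optimization: O_DSYNC makes every write() wait until the
    // data has reached stable storage.
    int fd_dsync = open("/tmp/chunk_demo", O_RDWR | O_CREAT | O_DSYNC, 0644);

    // After the optimization: writes only go to the page cache; durability is
    // deferred to an explicit sync, which in CurveBS must happen before the
    // raft snapshot truncates the log.
    int fd_buffered = open("/tmp/chunk_demo", O_RDWR | O_CREAT, 0644);

    char buf[4096] = {0};
    if (fd_buffered >= 0) {
        (void)write(fd_buffered, buf, sizeof(buf));  // fast path, no implicit flush
        fdatasync(fd_buffered);                      // the deferred durability point
        close(fd_buffered);
    }
    if (fd_dsync >= 0) close(fd_dsync);
    return 0;
}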

2 Problem Analysis

The reason O_DSYNC was used previously is the raft snapshot scenario: if the applied data has not been flushed to disk when a snapshot is taken and the log is then truncated, data may be lost. So before changing Apply to write without sync, this problem must be solved.
First, we need to analyze the snapshot process on the Curve ChunkServer, shown in the following figure:

Several key points of the snapshot process:

  1. The snapshot operation is placed into the StateMachine's queue, the same queue used for applying read and write Ops, so it is serialized with them;
  2. The last_applied_index included in the snapshot is saved before the StateMachine's snapshot-save callback is invoked; that is, when the snapshot is executed, everything up to the saved last_applied_index is guaranteed to have already been applied by the StateMachine;
  3. However, if the StateMachine's write-Op Apply is changed to remove O_DSYNC, i.e. to not sync, then the log may be truncated up to last_applied_index while some applied write Ops have not actually been synced to disk. This is the problem we need to solve.

3 Solutions

There are two solutions:

3.1 Option 1

  1. Since the snapshot must guarantee that all write Ops applied up to last_applied_index have been synced, the simplest approach is to perform a sync when the snapshot is taken. There are three ways to do this. The first is to perform an FsSync over the entire disk. The second: since the snapshot process already records every chunk file of the current replica in the snapshot metadata, we naturally have the list of all file names in the snapshot and can sync those files one by one. The third: since a replication group may contain many chunks while only a few of them have actually been written, the datastore can record the list of chunk ids that need to be synced as it executes write Ops, and at snapshot time only the chunks in that list need to be synced. (A minimal sketch of the second method follows this list.)
  2. Given that all three sync methods may be time-consuming, and the snapshot process is currently executed "synchronously" in the state machine, i.e. the snapshot blocks IO, we can consider making snapshot execution asynchronous. This change also reduces IO jitter while taking snapshots.
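A minimal sketch of the second method above, under the assumption that the snapshot path can hand us the list of chunk file paths it is about to record; the helper name SyncSnapshotFiles is made up for illustration and the Curve implementation differs:

#include <fcntl.h>
#include <string>
#include <unistd.h>
#include <vector>

// Hypothetical helper: before the snapshot metadata is saved and the raft log
// may be truncated, flush every chunk file that the snapshot will reference.
void SyncSnapshotFiles(const std::vector<std::string>& snapshot_files) {
    for (const std::string& path : snapshot_files) {
        int fd = open(path.c_str(), O_RDONLY);
        if (fd < 0) {
            continue;  // a real implementation would log and handle this
        }
        fsync(fd);     // make any buffered writes to this chunk durable
        close(fd);
    }
}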

3.2 Option 2

The second solution is more complex. Since the O_DSYNC write is removed, we can no longer guarantee that all write Ops up to last_applied_index have been synced, so we consider splitting the apply index into two values, last_applied_index and last_synced_index. The specific approach is as follows:

  1. Split last_applied_index into last_applied_index and last_synced_index, where last_applied_index keeps its current meaning; last_synced_index is new, and is advanced to last_applied_index only after a full FsSync has been performed;
  2. In the snapshot step described above, save last_synced_index instead of last_applied_index into the snapshot metadata, which guarantees that all data covered by the snapshot has been synced by the time the snapshot is taken;
  3. A background thread executes FsSync periodically via a timer. The sync task might run as follows: the background sync thread traverses all state machines and records their current last_applied_index, executes FsSync, and then assigns the recorded last_applied_index to each state machine's last_synced_index (sketched below).
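A simplified sketch of this background sync, with assumed names (CopysetStateMachine, BackgroundSyncLoop) and plain std::thread/std::atomic in place of the real braft and Curve machinery. The key point is that last_synced_index is advanced only after the FsSync that covers the recorded last_applied_index has completed:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>
#include <unistd.h>
#include <vector>

struct CopysetStateMachine {
    std::atomic<int64_t> last_applied_index{0};  // same meaning as today
    std::atomic<int64_t> last_synced_index{0};   // new: applied AND durable
};

// Stand-in for a full sync of the data disk (e.g. sync()/syncfs()).
static void FsSync() { ::sync(); }

// Periodically: record applied indexes, sync once, then publish them as synced.
void BackgroundSyncLoop(std::vector<CopysetStateMachine*> machines,
                        std::atomic<bool>* running,
                        std::chrono::milliseconds interval) {
    while (running->load()) {
        // 1. Remember each copyset's applied index before syncing.
        std::vector<int64_t> applied(machines.size());
        for (size_t i = 0; i < machines.size(); ++i) {
            applied[i] = machines[i]->last_applied_index.load();
        }

        // 2. One FsSync makes everything applied so far durable.
        FsSync();

        // 3. Only now is it safe to advance last_synced_index, which is the
        //    value a snapshot would record instead of last_applied_index.
        for (size_t i = 0; i < machines.size(); ++i) {
            machines[i]->last_synced_index.store(applied[i]);
        }

        std::this_thread::sleep_for(interval);
    }
}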

3.3 Advantages and disadvantages of the two schemes:

  1. Scheme 1 is a relatively simple change: only the Curve code needs to be modified and braft is untouched, so it is non-intrusive to the braft framework. Scheme 2 is more complex and requires modifying the braft code;
  2. From the perspective of snapshot performance, scheme 1 slows down the original snapshot; since the original snapshot is synchronous, it is best to also make snapshot execution asynchronous as part of this change. Scheme 2 could of course also make the snapshot asynchronous to reduce the impact on IO.

3.4 The plan adopted:

  1. The first scheme was adopted, because not modifying braft is beneficial to code stability and future compatibility.
  2. For the chunk sync method, the third approach of scheme 1 is used: the datastore records the list of chunk ids that need to be synced as it executes write Ops, and at snapshot time the chunks in that list are synced, ensuring that all chunk data is on disk. This avoids the IO impact of frequently running FsSync over the whole chunkserver. In addition, the sync is performed in batches and the chunk ids are deduplicated, which reduces the number of actual sync calls and therefore the impact on foreground IO. (See the sketch after this list.)
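A minimal sketch of this dedup-and-batch idea, with made-up names (DirtyChunkTracker, SyncDirtyChunksBeforeSnapshot) rather than the actual Curve classes:

#include <cstdint>
#include <mutex>
#include <unistd.h>
#include <unordered_set>

using ChunkID = uint64_t;

// Collects the ids of chunks written since the last snapshot. Using a set
// deduplicates repeated writes to the same chunk, so each chunk is synced at
// most once per snapshot.
class DirtyChunkTracker {
 public:
    // Called on every write-Op apply in the datastore.
    void MarkDirty(ChunkID id) {
        std::lock_guard<std::mutex> lk(mu_);
        dirty_.insert(id);
    }

    // Called at snapshot time: hand over the current dirty set and reset it,
    // so writes arriving afterwards are tracked for the next snapshot.
    std::unordered_set<ChunkID> TakeDirtySet() {
        std::lock_guard<std::mutex> lk(mu_);
        std::unordered_set<ChunkID> out;
        out.swap(dirty_);
        return out;
    }

 private:
    std::mutex mu_;
    std::unordered_set<ChunkID> dirty_;
};

// Hypothetical snapshot hook: sync the (deduplicated) batch of dirty chunks
// before the snapshot metadata is saved and the raft log is truncated.
// fd_of_chunk is a stand-in for looking up the open fd of a chunk file.
inline void SyncDirtyChunksBeforeSnapshot(DirtyChunkTracker* tracker,
                                          int (*fd_of_chunk)(ChunkID)) {
    for (ChunkID id : tracker->TakeDirtySet()) {
        int fd = fd_of_chunk(id);
        if (fd >= 0) {
            fdatasync(fd);  // make the applied writes for this chunk durable
        }
    }
}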

4 POCs

The following PoC tests check whether IOPS, latency, etc. improve in various scenarios when O_DSYNC is simply removed. Each group of tests was run at least twice, and one run is selected for the results below.

The fio test parameters used in the test are as follows:

  • 4K random write test single volume IOPS:

     [global]
    rw=randwrite
    direct=1
    iodepth=128
    ioengine=libaio
    bsrange=4k-4k
    runtime=300
    group_reporting
    size=100G
    
    [disk01]
    filename=/dev/nbd0
  • 512K sequential write test single volume bandwidth:

     [global]
    rw=write
    direct=1
    iodepth=128
    ioengine=libaio
    bsrange=512k-512k
    runtime=300
    group_reporting
    size=100G
     
     
    [disk01]
    filename=/dev/nbd0
  • 4K single-depth random write test latency:

     [global]
    rw=randwrite
    direct=1
    iodepth=1
    ioengine=libaio
    bsrange=4k-4k
    runtime=300
    group_reporting
    size=100G

    [disk01]
    filename=/dev/nbd0

Cluster configuration:

machine | roles | disks
server1 | client, mds, chunkserver | ssd/hdd * 18
server2 | mds, chunkserver | ssd/hdd * 18
server3 | mds, chunkserver | ssd/hdd * 18

4.1 HDD comparison test results

Scenario | Before optimization | After optimization
Single-volume 4K random write | IOPS=5928, BW=23.2MiB/s, lat=21587.15usec | IOPS=6253, BW=24.4MiB/s, lat=20465.94usec
Single-volume 512K sequential write | IOPS=550, BW=275MiB/s, lat=232.30msec | IOPS=472, BW=236MiB/s, lat=271.14msec
Single-volume 4K single-depth random write | IOPS=928, BW=3713KiB/s, lat=1074.32usec | IOPS=936, BW=3745KiB/s, lat=1065.45usec

In the above tests, with the RAID card cache policy set to writeback, performance improves slightly, but the improvement is not obvious; in the 512K sequential write scenario performance even drops slightly, and we also observed severe IO jitter after removing O_DSYNC.

We suspected that the RAID card cache masked the improvement, so we set the RAID card cache policy to writethrough mode and continued testing:

Scenario | Before optimization | After optimization
Single-volume 4K random write | IOPS=993, BW=3974KiB/s, lat=128827.93usec | IOPS=1202, BW=4811KiB/s, lat=106426.74usec
Single-volume 4K single-depth random write | IOPS=21, BW=85.8KiB/s, lat=46.63msec | IOPS=38, BW=154KiB/s, lat=26021.48usec

With the RAID card cache policy set to writethrough, the improvement is more obvious: single-volume 4K random write improves by about 20%.

4.2 SSD comparison test results

The SSD tests were run with the RAID card in pass-through mode (JBOD); the performance comparison is as follows:

Scenario | Before optimization | After optimization
Single-volume 4K random write | bw=83571KB/s, iops=20892, lat=6124.95usec | bw=178920KB/s, iops=44729, lat=2860.37usec
Single-volume 512K sequential write | bw=140847KB/s, iops=275, lat=465.08msec | bw=193975KB/s, iops=378, lat=337.72msec
Single-volume 4K single-depth random write | bw=3247.3KB/s, iops=811, lat=1228.62usec | bw=4308.8KB/s, iops=1077, lat=925.48usec

As can be seen, the improvement in these scenarios is significant: in the 4K random write scenario, IOPS increases by almost 100%, 512K sequential write also improves substantially, and latency drops considerably.

5 Summary

The above optimization applies to Curve block storage. Under the raft distributed consensus protocol, the write applied by the raft state machine to the local storage engine no longer needs to be flushed to disk immediately, which reduces the write latency of Curve block storage and improves its write performance. In SSD tests the performance improvement is substantial. For HDD scenarios, since the RAID card cache is usually enabled, the effect is not obvious, so we provide a switch: in HDD scenarios you can choose not to enable this optimization.

About the author: Xu Chaojie, senior system development engineer at NetEase Shufan.

