
About the author: Qi Tao, from the Real-time Computing Group of the Data Intelligence Department, is mainly responsible for feature storage for Cloud Music's algorithms.

Business background

The Cloud Music recommendation and search businesses have a large amount of algorithm feature data that needs to be stored as key-value pairs and served for online reads and writes. These features are mostly produced by Spark or Flink jobs on the big data platform, for example song features and user features. They are characterized by large data volumes, regular daily full updates or real-time incremental updates, and high query-performance requirements. Some of this feature data is stored in the redis/tair in-memory storage systems, and some in the myrocks/hbase disk-based storage systems.

To reduce the cost of accessing several different storage systems, and to allow customized development around the storage characteristics of algorithm feature kv data, we introduced the rocksdb engine under the tair distributed storage framework, providing low-cost online storage for large-volume algorithm feature kv data.

The following sections briefly introduce how rocksdb was brought into tair, and then describe our practice in algorithm feature kv storage. To distinguish the two flavors under the tair framework, we call the memory storage with memcache as its engine MDB, and the disk storage with rocksdb as its engine RDB.

Introduction to RDB

As a distributed storage framework, tair consists of two parts: ConfigServer and DataServer. The DataServer layer consists of multiple nodes responsible for the actual storage of data. All kv data is divided into a number of buckets according to a hash computed over the key, and the data in each bucket can be stored in multiple copies on different DataServer nodes. The specific mapping rules are determined by the routing table built by ConfigServer.

Figure: the tair distributed storage framework

ConfigServer maintains the status of all DataServer nodes. When nodes are added or removed, it initiates data migration and builds a new routing table. DataServer supports different underlying storage engines; an engine needs to implement the basic kv operations put/get/delete, as well as a scan interface for scanning the full data set. Using the routing table provided by ConfigServer, the Client sends reads and writes to the master node of the bucket the key belongs to; for writes, master-slave replication is completed inside the DataServer layer.
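As a rough illustration of this read/write path, the sketch below shows how a client might map a key to a bucket and pick the master node from the routing table. The hash function, routing-table layout, and names are assumptions made for the example, not tair's actual implementation.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical client-side routing sketch for a tair-like framework.
struct Node { std::string host; int port; };

struct RoutingTable {
    // bucket_id -> replica nodes; by convention index 0 is the bucket's master.
    std::vector<std::vector<Node>> buckets;
};

uint32_t BucketOf(const std::string& key, size_t bucket_count) {
    // Any stable hash works for the sketch; tair's real hash differs.
    return static_cast<uint32_t>(std::hash<std::string>{}(key) % bucket_count);
}

const Node& MasterFor(const RoutingTable& table, const std::string& key) {
    uint32_t bucket_id = BucketOf(key, table.buckets.size());
    // Reads and writes both go to the master; the DataServer replicates
    // writes to the bucket's other replicas internally.
    return table.buckets[bucket_id][0];
}
```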

How the rocksdb storage engine works

Rocksdb is an open-source kv storage engine based on the LSM tree (log-structured merge tree), a hierarchical structure composed of many sst files. Each sst file contains a batch of kv data plus the corresponding metadata, and the kvs inside an sst file are sorted by key. Data at each level is periodically merged (compaction) to delete invalid data. Newly written data lands at level 0; once level 0 reaches a size threshold it is compacted into level 1, and so on. Within each level (except level 0) the sst files are kept sorted and non-overlapping, and a query searches the levels from top to bottom.
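For readers unfamiliar with rocksdb, here is a minimal sketch of the put/get/delete and scan primitives that the tair engine layer wraps; the database path and options are placeholders.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include "rocksdb/db.h"

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rdb_demo", &db);
    assert(s.ok());

    // put / get / delete: the basic operations an engine must provide.
    s = db->Put(rocksdb::WriteOptions(), "song:42", "feature-bytes");
    assert(s.ok());

    std::string value;
    s = db->Get(rocksdb::ReadOptions(), "song:42", &value);
    assert(s.ok());

    s = db->Delete(rocksdb::WriteOptions(), "song:42");
    assert(s.ok());

    // scan: iterate over all data, as needed for bucket-level migration.
    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
    for (it->SeekToFirst(); it->Valid(); it->Next()) {
        // it->key() / it->value() hold the current record.
    }

    delete db;
    return 0;
}
```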

When introducing rocksdb into tair, we designed the storage format of each kv record as follows.

Figure: the kv format in RDB

The key stored in rocksdb is composed of bucket_id + area_id + original key. The area_id identifies the business table; different data tables have different areas. The bucket_id makes it easy to migrate data bucket by bucket: since data in rocksdb is stored in sorted order, records in the same bucket are clustered together and can be retrieved efficiently with a prefix scan. The area_id distinguishes different business tables and avoids key collisions. In practice, tables with a large amount of data are stored in their own rocksdb column family, which also avoids key collisions.

The value stored in rocksdb is the concatenation of meta + original value, where meta holds the record's modification time and expiration time. Because rocksdb can decide whether to discard a record during compaction, expired data can be deleted with a custom CompactionFilter.
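The sketch below illustrates this key/value layout. The field widths and byte order are assumptions for the example; RDB's actual on-disk encoding may differ.

```cpp
#include <cstdint>
#include <string>

// Illustrative encoding of the RDB key/value layout described above.
std::string EncodeKey(uint32_t bucket_id, uint32_t area_id, const std::string& user_key) {
    std::string out;
    out.append(reinterpret_cast<const char*>(&bucket_id), sizeof(bucket_id));
    out.append(reinterpret_cast<const char*>(&area_id), sizeof(area_id));
    out.append(user_key);  // keys in the same bucket/area share a prefix, enabling prefix scans
    return out;
}

std::string EncodeValue(uint32_t modify_time, uint32_t expire_time, const std::string& user_value) {
    std::string out;
    out.append(reinterpret_cast<const char*>(&modify_time), sizeof(modify_time));
    out.append(reinterpret_cast<const char*>(&expire_time), sizeof(expire_time));
    out.append(user_value);  // a custom CompactionFilter can read the meta and drop expired records
    return out;
}
```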

Bulk data import with bulkload

The bulkload scheme

In many algorithm feature scenarios, the latest full data set is computed offline on the big data platform every day and then imported into the kv storage. These tables are often larger than 100GB with more than 100 million records. The basic version of RDB can only write records one by one through the put interface, so importing the full data set requires many concurrent tasks and occupies a lot of the big data platform's computing resources.

Moreover, because records are put into RDB in no particular order, rocksdb comes under heavy IO pressure during compaction: a large number of kvs must be re-sorted to regenerate globally ordered sst files. This is rocksdb's write amplification problem; the amount of data actually written can be amplified by dozens of times, and the resulting disk IO pressure makes the response time of read requests fluctuate.

To address this problem, we borrowed hbase's bulkload mechanism to improve import efficiency. When importing large-scale offline features, the raw data is first sorted and converted by Spark into rocksdb's sst file format, and then a scheduler loads the sst files one after another (using rocksdb's ingest mechanism) onto the corresponding data nodes of the RDB cluster.

Figure: the bulkload scheme

Two aspects of this process improve import efficiency. First, data is loaded in bulk through files instead of writing single or multiple records through the put interface. Second, the data is already sorted when Spark converts it, which reduces compaction inside rocksdb.
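On the rocksdb side, the two steps map to SstFileWriter and DB::IngestExternalFile. The sketch below shows both in-process for brevity; in the real pipeline the sorted sst files are produced by the Spark job and shipped to the DataServer nodes by the scheduler.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>
#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/sst_file_writer.h"

// Step 1: write already-sorted kvs into an sst file (done by Spark in practice).
void BuildSstFile(const std::vector<std::pair<std::string, std::string>>& sorted_kvs,
                  const std::string& sst_path) {
    rocksdb::Options options;
    rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
    rocksdb::Status s = writer.Open(sst_path);
    assert(s.ok());
    for (const auto& kv : sorted_kvs) {
        // Keys must be added in the comparator's order.
        s = writer.Put(kv.first, kv.second);
        assert(s.ok());
    }
    s = writer.Finish();
    assert(s.ok());
}

// Step 2: ingest the file directly into the LSM tree instead of replaying puts.
void IngestSstFile(rocksdb::DB* db, const std::string& sst_path) {
    rocksdb::IngestExternalFileOptions ifo;
    rocksdb::Status s = db->IngestExternalFile({sst_path}, ifo);
    assert(s.ok());
}
```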

Using a real online algorithm feature data set, we compared the performance of the bulkload approach with the record-by-record put approach. Bulkload is clearly better than put in terms of IO pressure, read rt, and compaction volume, by roughly a factor of 3. Scenario: the full data set is 3.8TB (7.6TB with 2 replicas), an incremental import of 210 million records totals 300GB (600GB with 2 replicas), the import time is kept at about 100 minutes, and the read qps is 12,000 per second.

Figure: io-util comparison (bulkload vs. put)

Figure: average read rt comparison (bulkload vs. put)

Comparing the two compaction situations through rocksdb's internal logs: bulkload generated 85GB of compaction (10:00 to 13:00) versus 273GB for put (13:00 to 16:00), a ratio of about 1:3.2.

10:00 Cumulative compaction: 1375.15 GB write, 6.43 MB/s write, 1374.81 GB read, 6.43 MB/s read, 23267.8 seconds 
13:00 Cumulative compaction: 1460.62 GB write, 6.29 MB/s write, 1460.29 GB read, 6.29 MB/s read, 24320.8 seconds 
16:00 Cumulative compaction: 1733.60 GB write, 7.16 MB/s write, 1733.31 GB read, 7.16 MB/s read, 27675.0 seconds

Dual-version data import

On top of bulkload, for scenarios where each full import overwrites the previous data, we further reduce disk IO during bulkload imports with a dual-version mechanism. Each data set corresponds to 2 versions (area_ids), which map to 2 column families in rocksdb. The versions used for importing and for reading are staggered and switched in turn. Before importing data, the data of the stale version is cleared, which completely avoids compaction inside rocksdb.
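A simplified sketch of the dual-version idea is shown below: each table owns two column families, one serving reads while the other receives the next full import, and the stale one is dropped and recreated before importing. The structure and names are assumptions for illustration; in RDB the active version is tracked by the storage proxy layer.

```cpp
#include <string>
#include "rocksdb/db.h"

// Hypothetical dual-version wrapper: cf[read_version] serves reads while the
// other column family is wiped and refilled by the next bulkload import.
struct DualVersionTable {
    rocksdb::DB* db = nullptr;
    rocksdb::ColumnFamilyHandle* cf[2] = {nullptr, nullptr};
    int read_version = 0;

    // Called before a new full import: clear the stale version so the freshly
    // ingested sst files never overlap old data and trigger no compaction.
    rocksdb::Status PrepareImport() {
        int import_version = 1 - read_version;
        rocksdb::Status s = db->DropColumnFamily(cf[import_version]);
        if (!s.ok()) return s;
        s = db->DestroyColumnFamilyHandle(cf[import_version]);
        if (!s.ok()) return s;
        return db->CreateColumnFamily(rocksdb::ColumnFamilyOptions(),
                                      "v" + std::to_string(import_version),
                                      &cf[import_version]);
    }

    // Called after the import is finished and verified: flip reads to the new version.
    void SwitchRead() { read_version = 1 - read_version; }
};
```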

The dual-version mechanism relies on the multi-version feature of the storage proxy layer; the specific scheme and details are not covered here. With this approach, query rt fluctuates less during data imports. The following figure compares read rt monitoring for the same data in the RDB cluster and in the hot/cold cluster (hbase + redis).

Figure: dual-version bulkload comparison

Key-value separated storage

The kv separation scheme

rocksdb merges out invalid data through compaction and keeps the data at each level sorted. The compaction process causes write amplification, and for long values the problem is more severe, because the values are read and rewritten over and over. For write amplification on long values, the industry already has a kv separation scheme for SSD storage, "WiscKey: Separating Keys from Values in SSD-conscious Storage" [1]: the value is stored separately in a blob file, and only the value's position index (fileno + offset + size) is stored in the LSM tree.

In RDB, we introduced the open-source kv separation plugin from tidb, which is minimally invasive to the rocksdb code and comes with a GC mechanism for reclaiming invalid data. GC works as follows: each compaction updates the amount of data in each blob file that is still effectively referenced; if the proportion of valid data in a blob file drops below a threshold (0.5 by default), the valid data is rewritten into a new file and the original file is deleted.
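Conceptually, the LSM tree then stores a small position index instead of the long value, and GC is driven by each blob file's proportion of still-valid data. The structs below are illustrative, not Titan's or RDB's actual data structures.

```cpp
#include <cstdint>

// What the LSM tree stores for a separated value: a pointer into a blob file.
struct BlobIndex {
    uint64_t file_number;  // which blob file holds the value
    uint64_t offset;       // byte offset of the value inside that file
    uint64_t size;         // length of the stored value
};

// Per-blob-file statistics maintained as compactions discover dead values.
struct BlobFileStats {
    uint64_t total_bytes;        // all value bytes written to this blob file
    uint64_t discardable_bytes;  // bytes whose keys were deleted or overwritten
};

// GC decision described above: once less than half of a blob file is still live
// (default ratio 0.5), its live values are rewritten into a new file and the
// original file is deleted.
bool NeedsGc(const BlobFileStats& s, double discardable_ratio = 0.5) {
    if (s.total_bytes == 0) return false;
    return static_cast<double>(s.discardable_bytes) / s.total_bytes >= discardable_ratio;
}
```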

Figure: how kv separation works

Our comparisons show that for long values, kv separation improves performance to varying degrees in both the random-write and bulkload-import scenarios, at the cost of extra disk space. Because write amplification is severe with random writes, read rt drops by 90% after kv separation in that scenario; with bulkload imports it still drops by more than 50%. In addition, we measured that the value length at which kv separation starts to pay off is between 0.5KB and 0.7KB. Online we deploy with the default threshold of 1KB, and values longer than that are separated into blob files.
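For reference, here is a configuration sketch using option names from the open-source Titan plugin (the tidb/TiKV kv separation engine). The exact option names, defaults, and header path are assumptions to verify against the version actually in use, since RDB carries its own modifications.

```cpp
#include <cassert>
#include "titan/db.h"  // header path assumed from the open-source Titan plugin

int main() {
    rocksdb::titandb::TitanOptions options;
    options.create_if_missing = true;
    // Values at or above this size are separated into blob files
    // (1KB is the online default mentioned in the article).
    options.min_blob_size = 1024;
    // GC rewrites a blob file once its valid-data ratio drops below 50%.
    options.blob_file_discardable_ratio = 0.5;

    rocksdb::titandb::TitanDB* db = nullptr;
    rocksdb::Status s = rocksdb::titandb::TitanDB::Open(options, "/tmp/rdb_titan_demo", &db);
    assert(s.ok());
    delete db;
    return 0;
}
```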

The figure below shows one of our test scenarios: the average value length is 5.3KB and the full data set is 800GB (160 million records); bulkload imports the updated data while data is read randomly. Without kv separation the average read rt is 1.02ms; with kv separation it is 0.44ms, a decrease of 57%.

Figure: read rt comparison with kv separation

Sequence append

Building on the kv separation mechanism, we are also exploring a further innovation: updating values in place inside blob files.

The idea comes from this observation: some algorithm features are stored as sequences in the value, such as a user's historical behavior, and they are updated by appending a short sequence to a long one. With the original kv approach, we have to fetch the original sequence, append the new data to form a new sequence, and finally write it back to RDB. This process involves a lot of redundant reads and writes. To address it, we developed an interface for sequence append updates.

If we simply performed read -> append -> write inside RDB, there would still be a large number of disk reads and writes. So we made a modification: when a value is written to a blob file under kv separation, some extra space is reserved for it in advance (similar to how STL's vector allocates memory), and an appended sequence is written directly at the tail of the value in the blob file. If that is not possible (for example, the reserved space is insufficient), we fall back to the read -> append -> write path.
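The fast path can be sketched as follows; the record layout and names are hypothetical simplifications of RDB's actual design.

```cpp
#include <cstdint>
#include <string>

// Hypothetical layout of a separated value with reserved space for appends.
struct BlobRecordMeta {
    uint64_t capacity;  // bytes reserved when the value was first written
    uint64_t used;      // bytes currently occupied by the sequence
};

// Try to append `delta` in place. `blob_region` stands in for the value's
// reserved region inside the blob file (capacity bytes long). Returns false if
// the reserved space is insufficient, in which case the caller falls back to
// the read -> append -> write path.
bool TryAppendInPlace(BlobRecordMeta& meta, const std::string& delta, std::string& blob_region) {
    if (meta.used + delta.size() > meta.capacity) {
        return false;  // not enough reserved space: slow path
    }
    blob_region.replace(meta.used, delta.size(), delta);  // overwrite the reserved tail
    meta.used += delta.size();                            // persist the new length alongside the value
    return true;
}
```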

The storage format of the sequence append update is as follows:

Figure: storage layout for sequence append updates

The detailed process of sequence append update is as follows:

Figure: sequence append update flow

The sequence append feature of RDB is now live, and the effect is obvious: in one real algorithm feature storage scenario, each update used to involve several terabytes of data and take 10 hours; now each update involves only a few gigabytes of data and takes 1 hour.

ProtoBuf field update

The sequence append scheme proved feasible, so we explored a further extension: supporting more general "partial update" interfaces such as add/incr.

Cloud Music's algorithm feature kv values are basically stored in ProtoBuf (PB) format, and we have an engine called PDB that supports PB field updates (a follow-up article will introduce PDB in detail, so it is not covered here). The principle of PDB is to encode and decode PB at the engine layer, supporting update operations on fields identified by their numbers, such as incr/update/add, as well as deduplication and sorting for more complex repeated fields. In this way, the read -> decode -> update -> encode -> write process originally implemented at the application layer can be completed by a single call to the pb_update interface. PDB is already widely used online, so we wanted to extend this set of PB update capabilities to the disk-based feature storage engine RDB.
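To make "updating a field by its number" concrete, here is a sketch of an engine-side incr implemented with protobuf reflection. The function name and semantics are illustrative only; the actual pb_update interface of PDB/RDB supports a richer set of operations (incr/update/add, repeated-field deduplication and sorting).

```cpp
#include <cstdint>
#include <google/protobuf/descriptor.h>
#include <google/protobuf/message.h>

// Illustrative engine-side "incr" on an int64 field identified by its number.
bool IncrField(google::protobuf::Message* msg, int field_number, int64_t delta) {
    const google::protobuf::FieldDescriptor* field =
        msg->GetDescriptor()->FindFieldByNumber(field_number);
    if (field == nullptr || field->is_repeated() ||
        field->cpp_type() != google::protobuf::FieldDescriptor::CPPTYPE_INT64) {
        return false;  // unknown field or unsupported type for this sketch
    }
    const google::protobuf::Reflection* refl = msg->GetReflection();
    refl->SetInt64(msg, field, refl->GetInt64(*msg, field) + delta);
    return true;
}

// Inside the engine the stored value is parsed, updated, and re-serialized,
// replacing the client-side read -> decode -> update -> encode -> write loop:
//   msg->ParseFromString(stored_value);
//   IncrField(msg, /*field_number=*/3, /*delta=*/1);
//   msg->SerializeToString(&stored_value);
```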

We have finished developing this feature and are doing more testing. The solution reuses PDB's PB update logic and modifies the rocksdb code to update values in place after kv separation, avoiding the redundant disk reads and writes caused by frequent compaction. We will share the results after it goes live.

The modified rocksdb storage format is as follows:

Figure: storage layout in the modified rocksdb

The detailed process of PB update in RDB is as follows:

Figure: PB update flow in RDB

Summary and reflections

Over the course of more than a year, building on the basic version of RDB, we have custom-developed the new features described above based on the characteristics of algorithm feature storage. The RDB online clusters have now reached a considerable scale, storing tens of billions of records and 10 terabytes of data, with a peak of one million QPS.

Our thinking on RDB's self-developed features is as follows: the underlying kernel is a modified rocksdb (with kv separation), on top of which we customize development for new application scenarios, including bulkload for offline features, snapshots for real-time features, PB field updates, and so on.

Of course, RDB also has shortcomings. For example, the tair framework it is based on partitions data by hashing the key; compared with range partitioning, it does not support scanning a range of data well. In addition, the data structures and operation interfaces supported by RDB are relatively simple. Going forward we will develop more features according to the business needs of feature storage, such as computing and querying statistics over time-series windows (sum/avg/max, etc.). We will also build a complete set of machine-learning-oriented feature storage services in step with the evolution of our internal feature platform, Feature Store.

References

[1] Lu L, Pillai T S, Arpaci-Dusseau A C, Arpaci-Dusseau R H, et al. WiscKey: Separating Keys from Values in SSD-Conscious Storage[J]. ACM Transactions on Storage, 2017, 13(1): 5.

This article is published by the NetEase Cloud Music technical team. Any form of reproduction without authorization is prohibited. We recruit for all kinds of technical positions year-round; if you are planning a job change and happen to like Cloud Music, join us: staff.musicrecruit@service.netease.com.
