OPPO is a smart device manufacturer with hundreds of millions of users, and its devices generate a large volume of unstructured data (text, images, audio, and video) every day. For companies holding this much data, the hard problem is how to extract its full value cheaply and efficiently while still meeting requirements for data connectivity, real-time access, and data security governance. The popular solution in the industry today is the data lake. CBFS, the data lake storage system developed in-house at OPPO and introduced in this article, addresses most of these pain points.
▌A brief description of the data lake
Data lake definition: a centralized repository that stores data in its original format, usually as binary blobs or files. A data lake is usually a single store of data that holds both raw data and transformed data (used for reports, visualization, advanced analytics, machine learning, etc.).
1. The value of data lake storage
Compared with the traditional Hadoop architecture, the data lake has the following advantages:
- Highly flexible: data is easy to read, write, and process, and all raw data can be retained
- Multiple analysis types: supports batch processing, stream computing, interactive queries, machine learning, and other workloads
- Low cost: storage and compute resources scale independently; object storage with hot/cold tiering lowers cost further
- Easy to manage: complete user management and authentication, compliance, and auditing; the storage, management, and use of data can be traced end to end
2. OPPO Data Lake Overall Solution
OPPO builds its data lake along three dimensions. The bottom layer is lake storage, for which we use CBFS, a low-cost storage system that supports the S3, HDFS, and POSIX file access protocols simultaneously. The middle layer is the real-time data storage format, for which we use Iceberg. The top layer supports a variety of compute engines.
3. OPPO Data Lake Architecture Features
Early big data storage kept the data for stream computing and batch computing in separate systems. The upgraded architecture unifies metadata management and integrates batch and stream computing. It also provides unified interactive queries with a friendlier interface, second-level response, high concurrency, and support for Upsert change operations on data sources. At the bottom, large-scale low-cost object storage serves as the unified data foundation, supports data sharing across engines, and improves data reuse.
4. Data Lake Storage CBFS Architecture
Our goal is to build a data lake storage that can support EB-level data and solve the cost, performance and experience challenges of data analysis. The entire data lake storage is divided into six subsystems:
- Protocol access layer: supports several protocols (S3, HDFS, POSIX file); data written through one protocol can be read directly through the other two
- Metadata layer: presents both the hierarchical namespace of a file system and the flat namespace of object storage. All metadata is distributed, sharded, and linearly scalable
- Metadata cache layer: manages the metadata cache and accelerates metadata access
- Resource management layer: the Master node in the figure manages physical resources (data nodes, metadata nodes) and logical resources (volumes/buckets, data shards, metadata shards)
- Multi-replica layer: supports append writes and random writes, and works well for both large and small objects. This subsystem serves two purposes: persistent multi-replica storage, and a data caching layer that supports elastic replicas and accelerates data lake access (discussed later in this article)
- Erasure code storage layer: significantly reduces storage cost, supports multi-AZ deployment and different erasure code models, and scales readily to EB-level storage
Next, we will focus on sharing the key technologies used in CBFS, including high-performance metadata management, erasure code storage, and lake acceleration
▌CBFS key technology
1. Metadata Management
The file system provides a hierarchical namespace view, and the logical directory tree of the whole file system is divided into multiple layers. As shown on the right, each metadata node (MetaNode) contains hundreds of metadata shards (MetaPartition), and each shard consists of an InodeTree (BTree) and a DentryTree (BTree). Each dentry represents one directory entry and is made up of a parentId and a name. The DentryTree is indexed by (parentId, name) for storage and retrieval; the InodeTree is indexed by inode id.
The MultiRaft protocol provides high availability and consistent data replication. Each node set contains a large number of shard groups, and each shard group corresponds to one Raft group. Each shard group belongs to a particular volume and covers one metadata range (a range of inode ids) of that volume.
The metadata subsystem expands dynamically by splitting. When the resources (performance, capacity) of a shard group approach a threshold, the resource manager service estimates an end point and notifies that group of nodes to serve only the data before that point. At the same time, it selects a new group of nodes and adds it dynamically to the running system.
A single directory can hold millions of entries. All metadata is kept in memory to ensure excellent read and write performance, and the in-memory metadata shards are persisted to disk through snapshots for backup and recovery.
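The shard layout and range split described above can be sketched roughly as follows. This is a simplified illustration, not the CBFS implementation: plain dictionaries stand in for the BTrees, Raft replication is omitted, and the class names, the `headroom` parameter, and the choice to store a dentry in the shard that owns its parent inode are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MetaPartition:
    """One metadata shard that owns the inode ids in [start, end)."""
    start: int
    end: float  # float("inf") means the range is still open-ended
    inodes: dict = field(default_factory=dict)    # inode id -> attributes
    dentries: dict = field(default_factory=dict)  # (parent_id, name) -> inode id
    next_ino: int = 0

    def __post_init__(self):
        self.next_ino = max(self.next_ino, self.start)

    def owns(self, ino):
        return self.start <= ino < self.end

    def alloc_inode(self, attrs):
        if self.next_ino >= self.end:
            raise RuntimeError("inode range of this partition is exhausted")
        ino = self.next_ino
        self.next_ino += 1
        self.inodes[ino] = attrs
        return ino


class Volume:
    def __init__(self):
        self.partitions = [MetaPartition(start=1, end=float("inf"))]
        self.root = self.partitions[0].alloc_inode({"type": "dir"})

    def partition_for(self, ino):
        return next(p for p in self.partitions if p.owns(ino))

    def create(self, parent_id, name, attrs):
        # New inodes come from the newest, still open-ended partition.
        ino = self.partitions[-1].alloc_inode(attrs)
        # Assumption: the dentry is stored with the parent directory's inode.
        self.partition_for(parent_id).dentries[(parent_id, name)] = ino
        return ino

    def lookup(self, parent_id, name):
        return self.partition_for(parent_id).dentries.get((parent_id, name))

    def split(self, headroom=1000):
        """Seal the last partition at an estimated end point, open a new one."""
        last = self.partitions[-1]
        cut = last.next_ino + headroom  # estimated end point
        last.end = cut                  # old group serves only ids below cut
        self.partitions.append(MetaPartition(start=cut, end=float("inf")))


vol = Volume()
dir_a = vol.create(vol.root, "a", {"type": "dir"})
vol.split(headroom=10)
file_b = vol.create(dir_a, "b", {"type": "file"})
print(dir_a, file_b, vol.lookup(dir_a, "b"))
```

After the split, existing inodes stay where they are and only new inode ids come from the new range, which is why no metadata has to be moved.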
Object storage provides a flat namespace. For example, to access the object whose ObjectKey is /bucket/a/b/c, the system starts from the root directory and parses the key level by level at each "/" separator, finds the dentry of the last directory (/bucket/a/b), and finally finds the inode of /bucket/a/b/c. This process involves multiple round trips between nodes, and the deeper the hierarchy, the worse the performance. We therefore introduce the PathCache module to accelerate ObjectKey resolution. The simple approach is to cache the dentry of the ObjectKey's parent directory (/bucket/a/b) in PathCache. Analyzing our online clusters, we found that a directory holds about 100 entries on average, so a storage cluster with on the order of 100 billion objects has only about 1 billion directories. Caching on a single machine is therefore very efficient, and read performance can be scaled further across multiple nodes.
Because the design supports both "flat" and "hierarchical" namespace management, the CBFS implementation is simpler and more efficient than other systems in the industry: a single copy of data can be accessed through multiple protocols interchangeably without any conversion, and there is no data consistency problem.
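A minimal sketch of the parent-directory caching idea is shown below. The `Namespace` and `PathCache` classes are hypothetical stand-ins written for this example (the real module's interface is not described in the article), and the `lookups` counter simply simulates round trips to metadata nodes.

```python
class Namespace:
    """Toy hierarchical namespace: dentries map (parent_id, name) -> inode id."""
    ROOT = 1

    def __init__(self):
        self.dentries = {}
        self.next_ino = 2
        self.lookups = 0  # simulated round trips to metadata nodes

    def lookup(self, parent_id, name):
        self.lookups += 1
        return self.dentries.get((parent_id, name))

    def put(self, key):
        """Create every path component of key and return the final inode id."""
        parent = self.ROOT
        for name in key.strip("/").split("/"):
            child = self.dentries.get((parent, name))
            if child is None:
                child = self.next_ino
                self.next_ino += 1
                self.dentries[(parent, name)] = child
            parent = child
        return parent


class PathCache:
    """Caches the inode id of an object's parent directory path."""

    def __init__(self, ns):
        self.ns = ns
        self.dirs = {}  # parent directory path -> inode id

    def resolve(self, key):
        parent_path, _, name = key.rstrip("/").rpartition("/")
        parent = self.dirs.get(parent_path)
        if parent is None:
            # Cache miss: walk the directory tree one level at a time.
            parent = self.ns.ROOT
            for part in (p for p in parent_path.split("/") if p):
                parent = self.ns.lookup(parent, part)
                if parent is None:
                    return None
            self.dirs[parent_path] = parent
        # Cache hit (or freshly filled): one lookup for the object itself.
        return self.ns.lookup(parent, name)


ns = Namespace()
ns.put("/bucket/a/b/c")
ns.put("/bucket/a/b/d")
cache = PathCache(ns)
cache.resolve("/bucket/a/b/c")
print(ns.lookups)  # 4: three directory levels plus the object
cache.resolve("/bucket/a/b/d")
print(ns.lookups)  # 5: only one more lookup thanks to the cached parent
```

A production cache would also need invalidation when a directory is renamed or removed, which this sketch leaves out.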
2. Erasure code storage
One of the key technologies for reducing storage cost is erasure coding (EC). Briefly, the principle is as follows: k pieces of original data are encoded to produce m new pieces; as long as no more than m of the k+m pieces are lost, the original data can be restored by decoding (the principle is somewhat like disk RAID). Compared with traditional multi-replica storage, EC has lower data redundancy but higher data durability. There are many ways to implement it, most of them based on XOR operations or Reed-Solomon (RS) coding. CBFS also uses RS coding.
1. Encoding matrix B: the upper k rows are the identity matrix I and the lower m rows are the coding matrix. Multiplying B by the k original data blocks yields a vector of k+m blocks: the original data blocks plus m check blocks
2. When a block is lost: delete the row corresponding to that block from the matrix B to obtain a new matrix B', then left-multiply the surviving blocks by the inverse of B' to recover the lost block. For the detailed calculation, readers can consult the relevant literature
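The two steps above can be illustrated with a small worked example. Note the simplifying assumptions: production RS codes usually work over GF(2^8), whereas this sketch uses integers modulo the prime 257 to keep the arithmetic readable, and it uses a Cauchy matrix for the lower m rows so that any k rows of B are invertible. It shows the matrix principle only and is not the CBFS codec.

```python
P = 257  # small prime field, an illustrative stand-in for GF(2^8)


def inv(a):
    """Multiplicative inverse modulo P (Fermat's little theorem)."""
    return pow(a, P - 2, P)


def build_matrix(k, m):
    """(k+m) x k matrix B: identity on top, Cauchy coding rows below."""
    identity = [[1 if i == j else 0 for j in range(k)] for i in range(k)]
    # Cauchy entries 1 / (x_i - y_j) with x_i = k + i and y_j = j.
    coding = [[inv(k + i - j) for j in range(k)] for i in range(m)]
    return identity + coding


def encode(data, m):
    """Multiply B by the k data symbols to get k data + m check symbols."""
    k = len(data)
    return [sum(row[j] * data[j] for j in range(k)) % P
            for row in build_matrix(k, m)]


def solve(a, b):
    """Solve a * x = b modulo P by Gauss-Jordan elimination."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = inv(aug[col][col])
        aug[col] = [v * scale % P for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(v - factor * w) % P
                          for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def decode(survivors, k, m):
    """survivors: {block index: value} with at least k entries."""
    b = build_matrix(k, m)
    rows = sorted(survivors)[:k]
    # B' keeps only the rows of B that belong to surviving blocks.
    return solve([b[i] for i in rows], [survivors[i] for i in rows])


data = [10, 20, 30, 40]                 # k = 4 symbols, each below P
blocks = encode(data, m=2)              # 4 data + 2 check symbols
survivors = {i: blocks[i] for i in (0, 2, 4, 5)}  # blocks 1 and 3 lost
print(decode(survivors, k=4, m=2))      # [10, 20, 30, 40]
```

Solving the linear system is equivalent to multiplying by the inverse of B'; it simply avoids materializing the inverse matrix.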
Plain RS coding has some problems. Take the figure above as an example: suppose X1～X6 and Y1～Y6 are data blocks, and P1 and P2 are check blocks. If any one block is lost, 12 of the remaining blocks must be read to repair it. The I/O cost is high and repair consumes a lot of bandwidth, which is especially noticeable in multi-AZ deployments.
The LRC code proposed by Microsoft solves this problem by introducing local check blocks. As shown in the figure, two local check blocks PX and PY are added on top of the original global check blocks P1 and P2. If X1 is damaged, only the 6 blocks associated with its local group X1～X6 need to be read to repair it. Statistics show that in a data center, the probability that a stripe has a single failed disk within a given period is 98%, and the probability that two disks are damaged at the same time is 1%. LRC can therefore greatly improve repair efficiency in most scenarios. Its drawback is that it is not a maximum distance separable (MDS) code: unlike global RS coding, it cannot repair every possible loss of m blocks.
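The bandwidth saving from a local check block can be sketched in a few lines. For simplicity this example assumes the local check block is the byte-wise XOR of its group; that is an illustrative choice and not necessarily the coefficients used by Microsoft's LRC or by CBFS. The helper names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_local_parity(group):
    """Local check block for one group, e.g. PX for X1..X6."""
    return xor_blocks(group)


def repair_local(group, parity, lost):
    """Rebuild group[lost] from the rest of the group plus the local parity.

    Returns the repaired block and the number of blocks that were read.
    """
    survivors = [blk for i, blk in enumerate(group) if i != lost]
    return xor_blocks(survivors + [parity]), len(survivors) + 1


x_group = [bytes([i] * 4) for i in range(1, 7)]   # stand-ins for X1..X6
px = make_local_parity(x_group)

repaired, reads = repair_local(x_group, px, lost=0)
print(repaired == x_group[0], reads)  # True 6
```

Six reads (X2 to X6 plus PX) replace the 12 reads that a global RS repair of the same stripe would need.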
1. Offline EC: after all k data units of a stripe have been filled, the m check blocks are computed for the whole stripe
2. Online EC: incoming data is split synchronously and the check blocks are computed in real time, then the k data blocks and m check blocks are written at the same time
CBFS cross-AZ multi-mode online EC
CBFS supports online EC storage with multi-mode stripes across AZs. The system can be configured flexibly with different encoding modes for different data center layouts (1/2/3 AZ), object sizes, and service availability and data durability requirements.
Take the modes in the figure as examples. In 1AZ-RS mode, 6 data blocks plus 3 check blocks are deployed in a single AZ. In 2AZ-RS mode, 6 data blocks plus 10 check blocks are deployed across 2 AZs, giving a data redundancy of 16/6 = 2.67. 3AZ-LRC mode uses 6 data blocks, 6 global check blocks, and 3 local check blocks. A single cluster supports different encoding modes at the same time.
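The redundancy of each mode follows directly from the block counts given above: total blocks divided by data blocks. The short calculation below uses only those counts; the dictionary layout and function name are illustrative.

```python
# (data blocks, global check blocks, local check blocks) per mode,
# taken from the block counts quoted in the text.
MODES = {
    "1AZ-RS": (6, 3, 0),
    "2AZ-RS": (6, 10, 0),
    "3AZ-LRC": (6, 6, 3),
}


def redundancy(data, global_checks, local_checks=0):
    """Stored bytes per byte of user data."""
    return (data + global_checks + local_checks) / data


for name, counts in MODES.items():
    print(f"{name}: {redundancy(*counts):.2f}")
# 1AZ-RS: 1.50
# 2AZ-RS: 2.67
# 3AZ-LRC: 2.50
```

For comparison, plain three-replica storage has a redundancy of 3.0.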
Online EC storage architecture
The architecture contains several modules:
Access: the data access layer, which also provides EC encoding and decoding
CM: the cluster management layer, which manages nodes, disks, volumes, and other resources, and is also responsible for migration, repair, balancing, and inspection tasks. A single cluster supports the coexistence of different EC encoding modes
Allocator: responsible for volume space allocation
EC-Node: the single-node storage engine, responsible for actually storing the data
Erasure code writing
1. Receive the data as a stream
2. Slice the data into multiple data blocks and compute the check blocks at the same time
3. Apply for a storage volume
4. Distribute the data blocks and check blocks concurrently to the storage nodes
Data writing uses a simple NRW protocol that guarantees a minimum number of successfully written copies, so that routine node and network failures do not block requests and availability is preserved. Data receiving, slicing, and check block encoding run as an asynchronous pipeline, which also ensures high throughput and low latency.
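The concurrent distribution step with a minimum write count can be sketched as follows. The article does not state the value of W or the node interface, so `FakeNode`, `quorum_write`, and the choice of W = 7 for a 6+3 stripe are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class FakeNode:
    """Stand-in for a storage node; an unhealthy node rejects writes."""

    def __init__(self, healthy=True):
        self.healthy = healthy
        self.blocks = []

    def write(self, block):
        if not self.healthy:
            raise ConnectionError("node unreachable")
        self.blocks.append(block)


def quorum_write(blocks, nodes, w):
    """Send block i to node i concurrently; succeed if at least w acks arrive.

    For simplicity this sketch waits for every write to finish. A real
    implementation could return as soon as w acknowledgements are in.
    """
    acked = 0
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(node.write, block)
                   for node, block in zip(nodes, blocks)]
        for future in as_completed(futures):
            if future.exception() is None:
                acked += 1
    return acked >= w, acked


# 6 data blocks + 3 check blocks, with two nodes down.
stripe = [f"block-{i}".encode() for i in range(9)]
nodes = [FakeNode(healthy=i not in (2, 7)) for i in range(9)]
print(quorum_write(stripe, nodes, w=7))  # (True, 7)
```

With W below N, the two failed nodes do not fail the request; the missing blocks would be rebuilt later by a background repair task.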
Erasure code reading
Data reading also uses the NRW model. Taking the k=m=2 encoding mode as an example, as long as any 2 blocks (data blocks or check blocks) are read correctly, the original data can be obtained by fast RS decoding. In addition, to improve availability and reduce latency, Access preferentially reads from nearby or lightly loaded EC-Node storage nodes.
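One way to picture the node preference on reads is a simple ranking, sketched below. The node fields (`az`, `load`, `healthy`) and the ordering rule (same AZ first, then lowest load) are assumptions for this example; the article does not describe the actual selection policy.

```python
def pick_read_nodes(nodes, k, local_az):
    """Choose k nodes to read from, preferring the local AZ and low load.

    nodes: list of dicts with the hypothetical keys
           'id', 'az', 'load', and 'healthy'.
    """
    candidates = [n for n in nodes if n["healthy"]]
    if len(candidates) < k:
        raise RuntimeError("fewer than k blocks are readable")
    # False sorts before True, so same-AZ nodes come first.
    candidates.sort(key=lambda n: (n["az"] != local_az, n["load"]))
    return [n["id"] for n in candidates[:k]]


# k = m = 2: four blocks on four nodes, any two are enough to decode.
nodes = [
    {"id": "ec-1", "az": "az1", "load": 0.9, "healthy": True},
    {"id": "ec-2", "az": "az2", "load": 0.1, "healthy": True},
    {"id": "ec-3", "az": "az1", "load": 0.2, "healthy": False},
    {"id": "ec-4", "az": "az1", "load": 0.4, "healthy": True},
]
print(pick_read_nodes(nodes, k=2, local_az="az1"))  # ['ec-4', 'ec-1']
```

If one of the chosen blocks is a check block, the reader decodes; if both are data blocks, no decoding is needed.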
As can be seen, online EC combined with the NRW protocol ensures strong data consistency while providing high throughput and low latency, which suits big data workloads well.
3. Data Lake Access Acceleration
One significant benefit of the data lake architecture is cost savings, but an architecture that separates storage from compute also runs into bandwidth bottlenecks and performance challenges. We therefore provide a series of access acceleration techniques:
The first is multi-level caching:
1. First-level cache: a local cache deployed on the same machine as the compute node. It caches both metadata and data, supports different media such as memory, PMem, NVMe, and HDD, and offers low access latency but limited capacity
2. Second-level cache: a distributed cache with a flexible, variable number of replicas. It provides location awareness, supports active warm-up and passive caching at the user/bucket/object level, and has a configurable data eviction policy
The multi-level caching strategy delivers good acceleration in our machine learning training scenarios
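The read path through the two cache levels can be sketched as below. LRU is used here as one possible eviction policy (the article only says the policy is configurable), and `LRUCache`, `TieredReader`, and the dictionary backend are hypothetical names for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)


class TieredReader:
    """Read through local cache, then distributed cache, then lake storage."""

    def __init__(self, local, distributed, backend):
        self.local = local
        self.distributed = distributed
        self.backend = backend  # dict standing in for the storage layer
        self.hits = {"local": 0, "distributed": 0, "backend": 0}

    def read(self, key):
        value = self.local.get(key)
        if value is not None:
            self.hits["local"] += 1
            return value
        value = self.distributed.get(key)
        if value is not None:
            self.hits["distributed"] += 1
            self.local.put(key, value)  # promote into the local cache
            return value
        value = self.backend[key]
        self.hits["backend"] += 1
        self.distributed.put(key, value)  # passive caching on first read
        self.local.put(key, value)
        return value


backend = {"a": b"1", "b": b"2", "c": b"3"}
reader = TieredReader(LRUCache(1), LRUCache(2), backend)
for key in ["a", "a", "b", "a"]:
    reader.read(key)
print(reader.hits)  # {'local': 1, 'distributed': 1, 'backend': 2}
```

The small local cache absorbs repeated reads, while the larger distributed cache catches keys that were evicted locally.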
In addition, the storage data layer supports predicate pushdown, which can significantly reduce the volume of data flowing between storage and compute nodes, lowering resource overhead and improving compute performance.
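The effect of predicate pushdown on data movement can be shown with a toy comparison. Both functions and the row layout are hypothetical; they only count how many rows would cross the network in each case and do not reflect the CBFS interface.

```python
def scan_without_pushdown(rows, predicate):
    """Ship every row to the compute node, then filter there."""
    transferred = list(rows)
    return [r for r in transferred if predicate(r)], len(transferred)


def scan_with_pushdown(rows, predicate):
    """Filter on the storage side and ship only the matching rows."""
    transferred = [r for r in rows if predicate(r)]
    return transferred, len(transferred)


rows = [{"id": i, "region": "cn" if i % 10 == 0 else "other"}
        for i in range(100)]


def is_cn(row):
    return row["region"] == "cn"


plain, moved_plain = scan_without_pushdown(rows, is_cn)
pushed, moved_pushed = scan_with_pushdown(rows, is_cn)
print(plain == pushed, moved_plain, moved_pushed)  # True 100 10
```

The query result is identical in both cases; only the amount of data moved between storage and compute changes.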
There is still a lot of detailed work to do on data lake acceleration, and we are continuing to improve it.
CBFS 2.x is already open source. Version 3.0, which supports key features such as online EC, lake acceleration, and multi-protocol access, is expected to be open-sourced in October 2021.
Later releases of CBFS will add features such as direct mounting of existing HDFS clusters (no data migration) and intelligent tiering of hot and cold data, so that existing data under the original big data and AI architectures can move into the lake smoothly.
About the Author:
Xiaochun, OPPO Storage Architect