Author
Lv Yalin joined Zuoyebang in 2019 as head of the Infrastructure-Architecture R&D team. At Zuoyebang he has led the evolution of the cloud-native architecture, driving containerization, service governance, the Go microservice framework, and DevOps practices.
Zhang Haoran joined Zuoyebang in 2019 and is a senior architect on the Infrastructure team. At Zuoyebang he has driven the evolution of the cloud-native architecture and is responsible for building multi-cloud Kubernetes clusters, Kubernetes component development, Linux kernel optimization and tuning, and containerization of low-level services.
Background
Large-scale retrieval systems are a cornerstone of many companies' platform businesses. They typically run on clusters of thousands of bare-metal servers, hold huge volumes of data, and face extremely demanding requirements for performance, throughput, and stability, with very little tolerance for faults. Beyond day-to-day operations, data iteration and service governance at this scale also pose major challenges: efficient incremental and full data distribution, tracking of short-term and long-term hot data, and similar problems all require in-depth study. This article introduces the Fluid-based compute-storage separation architecture designed and implemented at Zuoyebang, which significantly reduces the operational complexity of a large-scale retrieval system and lets it be managed as smoothly as an ordinary online service.
Problems faced by large-scale retrieval systems
The intelligent analysis and search features for Zuoyebang's many learning materials rely on a large-scale data retrieval system. The cluster comprises more than 1,000 machines, and the total data volume exceeds 100 TB. The system is divided into several shards, each of which is served by a group of servers loading an identical data set. Operationally, we require P99 latency of about 1.x ms, peak throughput at the 100 GB level, and availability above 99.999%.
In the past, to improve the efficiency and stability of data reads, the data was stored locally on each node. The retrieval system produces new index items every day and requires TB-scale data updates. These data sets are produced by an offline index-building service and must then be pushed to the corresponding shards. This model brings many challenges, the most critical of which concern data iteration and scalability:
- Difficult data set distribution: each node in a shard must hold a full copy of that shard's data set, which makes synchronized delivery difficult. In practice, pushing data to all server nodes requires hierarchical distribution: from the source to a first tier (tens of nodes), from the first tier to a second tier (hundreds), and then to a third tier (thousands). The distribution cycle is long and requires layer-by-layer verification to guarantee data correctness.
- Weak elasticity of business resources: the original architecture tightly couples compute and storage; data storage and compute resources are bound together, so resources cannot be scaled flexibly. Scaling out often takes hours, leaving the system unable to respond to sudden traffic peaks.
- Limited scalability of a single shard: the data volume of a single shard is capped by the storage limit of a single machine in the sharded cluster. When that limit is reached, the data set must be split, even though the split is not driven by any business requirement.
Together, the problems of data iteration and scalability bring cost pressure and weaken process automation.
Analyzing the retrieval system's operations and its data update process shows that the key problems stem from the coupling of compute and storage. Only by introducing a compute-storage-separated architecture can we fundamentally resolve this complexity. The essence of separating compute and storage is to stop storing a full copy of the shard's data on every node and instead store the shard's data on logically remote machines. Separation does raise other questions: stability, how to read large volumes of data and at what speed, and how intrusive the change is to the business. These problems are real but solvable, so we concluded that compute-storage separation is the right remedy in this scenario and can fundamentally resolve the system's complexity.
Solving the complexity problem with a compute-storage separation architecture
To address the problems above, the new compute-storage-separated architecture must achieve the following goals:
- Read stability: compute-storage separation replaces the original local file reads with a chain of components. The way data is loaded may change, but the stability of data reads must stay at the same level as before.
- Controlled read speed and network pressure: when thousands of nodes across the shards update their data at the same time, read speed must be maximized while the pressure on the network is kept within bounds.
- Data access through the POSIX interface: POSIX is the access method that adapts to the widest range of business scenarios; supporting it avoids intruding on business code and shields downstream services from upstream changes.
- Controllability of the data iteration process: for an online business, data iteration should be treated as a CD process on a par with service iteration, so its controllability is extremely important.
- Scalability of the data sets: the new architecture must be a replicable, easily scalable model that can keep pace with the growth of both data sets and cluster size.
To achieve these goals, we ultimately chose the open-source Fluid project as the key link in the new architecture.
Component introduction
Fluid is an open-source, Kubernetes-native distributed data set orchestration and acceleration engine that primarily serves data-intensive applications in cloud-native scenarios, such as big data and AI workloads. Through the data-layer abstraction it provides on Kubernetes, data can be moved, replicated, evicted, transformed, and managed flexibly and efficiently between storage sources such as HDFS, OSS, and Ceph and the cloud-native application compute on Kubernetes. The concrete data operations are transparent to users, who no longer need to worry about the efficiency of accessing remote data or the convenience of managing data sources, nor about how to help Kubernetes make scheduling and operations decisions.
Users simply access the abstracted data through the most natural Kubernetes-native data volume interface; all remaining tasks and low-level details are handed over to Fluid. The Fluid project currently focuses on two important scenarios: data set orchestration and application orchestration.
Data set orchestration caches the data of a specified data set on Kubernetes nodes with specified characteristics, while application orchestration schedules applications onto nodes that hold, or can hold, the specified data set. The two can also be combined into collaborative orchestration, in which node resources are scheduled according to both data set and application requirements.
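As a concrete illustration, a data set and its cache runtime are declared with two Fluid CRDs that share a name. The manifest below is a minimal sketch: the bucket path, object names, and cache sizing are hypothetical, and exact fields may vary across Fluid versions.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: shard-0                                 # one Dataset per retrieval shard (hypothetical name)
spec:
  mounts:
    - mountPoint: oss://retrieval-index/shard-0 # hypothetical underlying storage path
      name: index
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: shard-0                                 # must share the Dataset's name to bind to it
spec:
  replicas: 3                                   # number of cache worker nodes for this shard
  tieredstore:
    levels:
      - mediumtype: MEM                         # cache in memory; SSD/HDD tiers are also possible
        path: /dev/shm
        quota: 20Gi
```

Once both objects are created and bound, Fluid exposes the data set to applications as a PersistentVolumeClaim with the same name.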
Why we chose Fluid
- The retrieval service has already been containerized, which makes it a natural fit for Fluid.
- As a data orchestration system, Fluid lets the upper layer consume data without knowing its concrete distribution; at the same time, its data-aware scheduling capability can place workloads close to the data and so accelerate data access.
- Fluid implements the PVC interface, so a business Pod can mount the data set without modification and use it as transparently as a local disk.
- Fluid provides distributed, tiered caching of both metadata and data, together with efficient file retrieval.
- Fluid with Alluxio offers several built-in cache modes (back-to-origin mode, full cache mode), different cache policies (such as optimizations for small-file scenarios), and storage media (disk, memory). It adapts well to different scenarios and meets a variety of business needs without major modification.
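Because Fluid exposes the data set as an ordinary PVC, mounting it requires no changes on the business side. A minimal sketch (the Pod name, image, and mount path are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: retrieval-server-0
spec:
  containers:
    - name: server
      image: retrieval-server:latest   # hypothetical business image
      volumeMounts:
        - name: index
          mountPath: /data/index       # the service reads the shard here via plain POSIX calls
  volumes:
    - name: index
      persistentVolumeClaim:
        claimName: shard-0             # PVC created by Fluid, named after the Dataset
```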
Landing practice
- Separating cache nodes from compute nodes: although co-deploying the FUSE client with the cache workers gives better data locality, in online scenarios we ultimately chose to separate cache nodes from compute nodes. Trading a somewhat longer startup time for better elasticity is worthwhile, and we did not want business-node stability issues and cache-node stability issues to become entangled. Fluid supports schedulability of data sets, in other words schedulability of cache nodes: by specifying a data set's nodeAffinity we control where its cache nodes are scheduled, ensuring that they can provide cache services efficiently and elastically.
- High requirements of online scenarios: online business scenarios place high demands on data access speed, completeness, and consistency; partial data updates and unexpected back-to-origin requests must not occur. The choice of data caching and update strategy is therefore critical.
- An appropriate data caching strategy: based on the requirements above, we chose Fluid's full cache mode. In full cache mode, all requests are served from the cache and never fall back to the data source, which avoids unexpectedly slow requests. In addition, the dataload process is driven by the data update workflow, which is safer and more standardized.
- Update process combined with an approval flow: data updates for an online business are a form of CD and likewise need to be governed by an update process. Combining the dataload mechanism with an approval flow makes online data publishing safer and more standardized.
- Atomicity of data updates: a model consists of many files and becomes usable only once all of them are cached. Given that full cache mode never falls back to the origin, the dataload process must be atomic: the new version of the data must not be visible while it is being loaded, and it becomes readable only after loading completes.
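Warming the cache can itself be declared as a Fluid DataLoad object, which the update workflow creates and then watches for completion before switching traffic to the new data version. A minimal sketch with hypothetical names; field support may vary across Fluid versions:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: shard-0-load-v2       # hypothetical name, one object per data version
spec:
  dataset:
    name: shard-0             # the Dataset whose cache should be warmed
    namespace: default
  loadMetadata: true          # sync file metadata from the source before loading data
  target:
    - path: /                 # preload the whole shard so no request falls back to origin
      replicas: 2             # number of cache replicas per file
```

The update workflow treats the DataLoad's completion as the gate: only when the object reports a completed phase is the new version exposed to the retrieval service.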
These schemes and strategies, combined with our automated index-building and data version management functions, greatly improve the safety and stability of the overall system, while making the whole pipeline more intelligent and automated.
Summary
With the Fluid-based compute-storage separation architecture, we have achieved:
- Minute-level distribution of data at the 100 TB scale.
- Atomic data version management and data updates, which turn data distribution and updates into a manageable, smarter automated process.
- A retrieval service that behaves like an ordinary stateless service, so it can easily scale horizontally through TKE HPA; faster scale-out and scale-in bring higher stability and availability.
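With state moved out of the service, horizontal scaling becomes standard Kubernetes HPA. A sketch of what such a policy might look like (the Deployment name and thresholds are hypothetical; the article's setup runs on TKE):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: retrieval-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: retrieval-server        # the now-stateless retrieval Deployment
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # scale out when average CPU utilization exceeds 60%
```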
Outlook
The compute-storage separation model convinces us that even highly specialized services can be made stateless and brought into the DevOps system like ordinary services, and Fluid-based data orchestration and acceleration is one practical realization of that separation. Beyond the retrieval system, we are also exploring Fluid-based model training and model distribution for our OCR system.
As for future work, we plan to continue optimizing the scheduling strategy and execution mode of upper-layer jobs based on Fluid, and to expand further into model training and distribution to improve overall training speed and resource utilization. We will also help the community continue to evolve observability, high availability, and more, to benefit more developers.
About us
For more cases and knowledge about cloud native, follow the WeChat public account of the same name, [Tencent Cloud Native]~
Benefits:
① Reply [Manual] in the official account backstage to get the "Tencent Cloud Native Roadmap Manual" & "Tencent Cloud Native Best Practices"~
② Reply [Series] in the official account backstage to get the "15-series collection of 100+ super-practical cloud native originals", covering Kubernetes cost reduction and efficiency, K8s performance optimization practices, best practices, and more.
③ Reply [White Paper] in the official account backstage to get the "Tencent Cloud Container Security White Paper" & "The Source of Cost Reduction - Cloud Native Cost Management White Paper v1.0".