Author
Zhu Jianping, Deputy Director of the Block and Table Storage Center, Cloud Architecture Platform Department, TEG. Since joining Tencent in 2008, he has worked on object storage and key-value storage, and has successively been responsible for storage platforms including the KV store TSSD and the object store TFS.
NoSQL technology and industry background
NoSQL is a general term for databases that differ from traditional relational databases. It was originally proposed to simplify relational database design for certain scenarios, to make storage and computing easier to scale horizontally, and to focus on high concurrency, high availability, and high scalability.
NoSQL vs relational databases
In the early years the difference between the two was clear. A relational database stores two-dimensional tables with rows, columns, and a predefined schema, and is operated through SQL statements; a NoSQL database is a key-value store, essentially a distributed hash map. In recent years the line has blurred, mainly because some NoSQL products have begun to strengthen their SQL interfaces and transaction support. For example, Cassandra supports CQL, DynamoDB supports PartiQL, and InfluxDB supports InfluxQL. In my view, the key difference is this: relational databases offer powerful ACID transactions, complex SQL queries, and data integrity constraints, which make them very easy to use but also limit how far they can go on high concurrency, high availability, and high scalability. NoSQL makes a trade-off in its engineering implementation: it weakens or even gives up capabilities such as cross-partition transactions and distributed JOINs in exchange for stronger concurrency, availability, and scalability.
Data Model for Multi-Model NoSQL
"Multi-model" in multi-model NoSQL means that one system covers multiple data models: key-value, wide-column, document, time-series, graph, in-memory, and so on. Put simply, key-value is a hash table; wide-column is a multi-dimensional hash table with a key-key-value structure; document is a nested tree structure similar to JSON; graph is a complex structure composed of vertices and edges; and time-series is a table ordered by time for retrieval.
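To make these shapes concrete, here is a minimal Python sketch of each model as a plain in-memory structure. All keys, values, and the range_query helper are made up for illustration; they are not X-Stor APIs, and real engines store these models very differently on disk.

```python
from bisect import bisect_left, bisect_right

# Key-value: a flat hash table from key to opaque value.
key_value = {"user:1001": b"serialized-profile"}

# Wide-column: a two-level map, row key -> column key -> value.
wide_column = {
    "user:1001": {"name": "alice", "city": "Shenzhen"},
    "user:1002": {"name": "bob"},  # rows need not share the same columns
}

# Document: a nested, JSON-like tree.
document = {
    "_id": "order:42",
    "customer": {"id": 1001, "name": "alice"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# Graph: vertices and edges, stored here as an adjacency list.
graph = {"alice": ["bob", "carol"], "bob": ["carol"], "carol": []}

# Time-series: (timestamp, value) points kept sorted by timestamp.
time_series = [
    (1_700_000_000, 0.42),
    (1_700_000_060, 0.47),
    (1_700_000_120, 0.45),
]


def range_query(points, start, end):
    """Return the points with start <= timestamp <= end.

    Because the points are sorted by time, binary search finds the
    slice boundaries without scanning every point.
    """
    timestamps = [ts for ts, _ in points]
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    return points[lo:hi]
```

The time ordering is what distinguishes the time-series shape from a plain hash table: a range lookup becomes a slice rather than a full scan.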
The usage and development of each data model can be viewed in terms of popularity and growth rate. Popularity reflects the cumulative effect of adoption, and the top three are document > key-value > wide-column. Growth rate reflects future demand, and the top three are time-series > key-value > graph. Time-series and graph benefit from rapidly growing demand in IoT and real-time computing, and some overseas NoSQL startups have recently focused on time-series and graph storage.
Industry Players in NoSQL Storage
The players fall into three main categories: open source communities in vertical fields, multi-model NoSQL companies, and public cloud vendors.
Open source communities in vertical fields include Redis for key-value storage, MongoDB for document storage, InfluxDB for time-series storage, and Neo4j for graph storage. The companies behind them are winners that emerged from years of competition among vertical open source communities. They control the ecosystem and interface standards of their vertical field and offer multi-cloud enterprise services on top of public clouds.
Multi-model NoSQL companies, such as YugabyteDB and Aerospike, are also open source and also offer multi-cloud enterprise services on top of public clouds. However, they do not control the ecosystem or interface standards of a vertical field, so they mostly adopt interface standards compatible with Redis, Cassandra, PostgreSQL (PG), and others in order to fit into the existing ecosystem.
Public cloud vendors, such as Microsoft Azure with CosmosDB and Amazon AWS with DynamoDB, provide cloud-native managed storage services. They either define custom interfaces or are directly compatible with the interfaces of vertical-field open source software such as Redis and Cassandra.
Our NoSQL falls into this third category.
According to public market data, all three types of vendors have seen fairly good market growth in recent years, although there is also some tension and competition between the vertical open source communities and the public cloud vendors.
The development direction and trend of NoSQL storage
The company's self-developed NoSQL systems originated from custom development for specific business scenarios in the early years, for example CKV+, TSSD, PCG's BDB, Grocery, and others in our oTeam. However, cloud-native scenarios, upgrades of software and hardware infrastructure, and support for new scenarios all bring new challenges, and these systems could not serve the needs of internal users and external cloud customers at the same time.
First, in cloud-native scenarios, customers place higher requirements on self-developed systems. Examples include avoiding lock-in to a cloud vendor by adopting industry-standard API interfaces, support for multiple availability zones and geographic distribution, elastic scaling, pay-as-you-go billing, containerization, and distributed cloud deployment.
Second, continuously improving infrastructure capabilities place higher requirements on the underlying storage. Over the past few years, the company's data centers, network environment, microservice framework, systems, software, and other infrastructure have improved greatly. Examples include growth in the capacity of a single SSD and in the number of disks per storage server, the new TRPC framework, and new network and disk I/O paths such as RDMA/DPDK and SPDK/io_uring. All of these require the underlying storage architecture to keep adapting in order to achieve better cost performance.
Finally, new scenarios such as personalized content recommendation and IoT monitoring have emerged. Compared with the earlier key-value storage scenarios in social networks, recent years have brought new scenarios such as feature storage for personalized content recommendation and time-series storage for IoT and monitoring. Their requirements on the storage engine and other components differ from those of earlier key-value usage, so the platform must let them reuse most of its capabilities while still allowing some components to be customized.
To meet these opportunities and challenges, in 2021 we brought together the relevant teams from PCG, CSIG, WXG, and IEG to form a multi-model NoSQL oTeam that supports new business scenarios. Through the joint efforts of all parties in the oTeam, a multi-model NoSQL platform (X-Stor) was developed from scratch, and the initial build-out of its technical capabilities and large-scale operation capabilities has been completed.
Re-engineering--multi-model NoSQL system architecture
Multi-model NoSQL architecture and goals
Multi-model NoSQL has two core goals: first, to provide a stable and powerful platform base that different extensions can reuse; second, to make it possible to adapt quickly to customized business development or to new scenarios.
The platform base includes an online access part and a management and control part. The online access part provides a highly scalable data processing framework, including support for multiple data consistency levels, data partitioning, replica replication across multiple AZs and Regions, data tiering, indexing, and transactions. The management and control part provides a workflow engine (Workflow) and, on top of it, implements operational capabilities such as resource management, data migration, data backup, point-in-time rollback, and data inspection.
Quick adaptation is delivered through an extensible multi-model API and a storage engine framework. The extensible multi-model API lets collaborating teams customize the access protocol to suit their business scenarios. The API already supports the interfaces and functions of existing key-value storage platforms such as TSSD, BDB, and Grocery, as well as some Redis interfaces. The storage engine framework makes it easy to build a custom storage engine for a business scenario and to trade off memory usage against disk I/O. Currently supported engines are the LSM-Tree-based RocksDB engine, the hash-based FasterKV engine, and a TSM-Tree-based time-series (TSDB) engine.
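A storage engine framework of this kind usually comes down to a small common interface plus a registry of implementations. The sketch below illustrates that idea only; the StorageEngine interface, the register_engine decorator, and the toy in-memory engine are all hypothetical names, not X-Stor's actual framework, and a real engine would wrap RocksDB, FasterKV, or a TSM-Tree store.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Optional, Type


class StorageEngine(ABC):
    """Hypothetical minimal contract every pluggable engine implements."""

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: bytes) -> Optional[bytes]: ...

    @abstractmethod
    def delete(self, key: bytes) -> None: ...


# Registry mapping an engine name to its class (assumed design).
ENGINES: Dict[str, Type[StorageEngine]] = {}


def register_engine(name: str) -> Callable[[Type[StorageEngine]], Type[StorageEngine]]:
    """Class decorator that adds an engine to the registry under `name`."""
    def decorator(cls: Type[StorageEngine]) -> Type[StorageEngine]:
        ENGINES[name] = cls
        return cls
    return decorator


@register_engine("in-memory-hash")
class InMemoryHashEngine(StorageEngine):
    """Toy hash-based engine: everything lives in a Python dict."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)

    def delete(self, key: bytes) -> None:
        self._data.pop(key, None)  # deleting a missing key is a no-op


def open_engine(name: str) -> StorageEngine:
    """Instantiate the engine registered under `name`."""
    return ENGINES[name]()
```

With this shape, the layers above the engine only depend on put, get, and delete, so a business team can register a new engine without touching the request path.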
Multi-Model NoSQL Resource Concepts
We divide multi-model NoSQL resources into user resources and physical resources.
User resources are logical resources created by users, including Account, Keyspace, Collection, Partition, and Replica. The least familiar concept here is probably Account. The multi-model NoSQL Account is not designed primarily for billing, and it differs from a Tencent Cloud account or an account in the company's internal OBS billing system. Its main purpose is to let customers configure the common properties of Collections and to share underlying resources among the Collections under it, such as the North Star entry associated with the access machines, and even to schedule the Replicas under the account together. This makes multi-tenant isolation and reuse at the resource level easier.
Physical resources are the server resources we manage. Currently we request storage servers and TKE containers for the access/logic layer. Storage servers are containerized as well: for a storage server with 12 SSDs, for example, we create 12 TKE containers and bind each container to one SSD. We call each such container a Pod and the storage server a Node. Based on the physical distribution of the hardware, we give each Pod a region attribute (Region), a cluster attribute (Cluster), a subcluster group attribute (Subcluster Group, or SCG), and a subcluster attribute (Subcluster); these levels contain one another layer by layer. An SCG is a node group made up of a specified number of Nodes, and a Subcluster is a disk group made up of some of the disks in an SCG. Distributing the replicas of a Partition within these groups makes it easy to manage the risk of several nodes or disks failing at the same time and to limit the blast radius when a failure occurs. Interested readers can search for the CopySet paper to learn more about the principle.
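The grouping idea can be sketched as follows: restrict all replicas of a Partition to one Subcluster, and within it pick Pods on distinct Nodes. This is a simplified illustration under assumptions, not X-Stor's scheduler; the Pod fields, the hash-based subcluster choice, and the function name place_replicas are all hypothetical.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Pod:
    pod_id: str      # one container bound to one SSD
    node: str        # the storage server hosting the Pod
    scg: str         # subcluster group (a group of Nodes)
    subcluster: str  # disk group inside the SCG


def place_replicas(partition_id: str, pods: Sequence[Pod],
                   replica_count: int = 3) -> List[Pod]:
    """Pick `replica_count` Pods on distinct Nodes within one Subcluster."""
    by_subcluster: Dict[str, List[Pod]] = defaultdict(list)
    for pod in pods:
        by_subcluster[pod.subcluster].append(pod)
    names = sorted(by_subcluster)
    if not names:
        raise ValueError("no pods available")

    # Deterministically map the partition to a starting subcluster.
    digest = hashlib.sha256(partition_id.encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(names)

    # Try each subcluster in turn until one has enough distinct Nodes.
    for offset in range(len(names)):
        candidates = by_subcluster[names[(start + offset) % len(names)]]
        chosen: List[Pod] = []
        used_nodes = set()
        for pod in sorted(candidates, key=lambda p: p.pod_id):
            if pod.node in used_nodes:
                continue  # never put two replicas on the same Node
            chosen.append(pod)
            used_nodes.add(pod.node)
            if len(chosen) == replica_count:
                return chosen
    raise ValueError("no subcluster has enough distinct nodes")


# Example topology: 2 SCGs x 3 Nodes x 2 disks; disk index d of every
# Node in an SCG forms one Subcluster.
example_pods = [
    Pod(f"{scg}-n{n}-d{d}", f"{scg}-n{n}", scg, f"{scg}-sc{d}")
    for scg in ("scg0", "scg1")
    for n in range(3)
    for d in range(2)
]
```

Because every replica set is confined to one Subcluster, a simultaneous failure of several disks can only lose a Partition if the failed disks all fall inside the same small group, which is the effect the grouping is meant to achieve.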
Multi-model NoSQL module structure
The modular architecture of multi-model NoSQL is divided into a data plane and a control plane. The data plane consists of the modules on the path of online business requests (create, read, update, delete). The control plane consists of the modules involved in the business console, the operation and maintenance system, and internal scheduled maintenance tasks.
The data plane adopts a two-layer design to keep long-tail latency and cost under control: serving a request takes at most two hops in the backend. Three request paths are provided to match different business scenarios.
Conventional path (mark ①): the request arrives at the access-layer gateway, which looks up the route in its locally cached metadata and forwards the request to the underlying storage module (cell). Deploying an independent gateway helps consolidate the number of front-end network connections and suits scenarios where the business front end cannot use a customized SDK.
Cache path (mark ②): the request arrives at the access-layer cache, which checks its local in-memory cache and returns directly on a hit. On a miss, it queries the underlying storage module (cell) and notifies the cache master node, which updates the cache according to the configured cache policy and synchronizes the update to all cache slave nodes through the consistency protocol. This path suits business requests with pronounced hotspots and lowers access cost.
Customized path (mark ③): the client connects directly to the storage node through a customized SDK, giving one-hop access. Some computation can also be offloaded to the client, which reduces both access latency and computing cost.
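The read side of the cache path (mark ②) can be illustrated with a small read-through cache. This is a single-process sketch under assumptions: StorageCell and CacheNode are hypothetical stand-ins, the eviction policy is plain LRU chosen for illustration, and the master/slave synchronization described above is deliberately left out.

```python
from collections import OrderedDict
from typing import Dict, Optional


class StorageCell:
    """Stand-in for the underlying storage module (cell)."""

    def __init__(self, data: Dict[str, str]) -> None:
        self._data = dict(data)
        self.reads = 0  # counts how often the cell is actually hit

    def get(self, key: str) -> Optional[str]:
        self.reads += 1
        return self._data.get(key)


class CacheNode:
    """Access-layer cache: serve hits locally, fall through on a miss."""

    def __init__(self, cell: StorageCell, capacity: int = 1024) -> None:
        self._cell = cell
        self._capacity = capacity
        self._cache: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        value = self._cell.get(key)       # miss: second hop to storage
        if value is not None:
            self._cache[key] = value
            if len(self._cache) > self._capacity:
                self._cache.popitem(last=False)  # evict least recently used
        return value
```

For a hot key, only the first read reaches the storage cell; later reads are answered from memory, which is where the cost saving for hotspot-heavy workloads comes from.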
Control plane module structure: the control plane exposes two access gateways externally and has three parts internally. The gateways are userAdmin, for customer console API access, and sysAdmin, for operation and maintenance system access. The three internal parts are metadata storage and distribution, workflow, and monitoring.
Metadata storage and distribution is handled by the resource management services: the resource management service (RM) and the resource management cache service (RMC). Metadata is kept in distributed, strongly consistent storage; currently five replicas are stored in CMongo, and in the future we will consider storing it in our own system to close the loop. RMC exists to distribute metadata. When the data plane's gateway and cache services start, they register with RMC so that RMC can push incremental metadata updates to them, distribute metadata, and run one-time verification. Metadata is updated by calling RM through the userAdmin and sysAdmin gateways. RMC detects these changes through the update stream and pushes the metadata updates to the gateway and cache containers registered with it.
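The register-then-push pattern can be sketched with a versioned update log. This is an in-process illustration under assumptions, not the real RM/RMC protocol: the class names, the integer version scheme, and the route-table payload are all hypothetical.

```python
from typing import Dict, List, Tuple


class Subscriber:
    """Stand-in for a data-plane gateway or cache that holds route metadata."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.version = 0                  # last metadata version applied
        self.routes: Dict[str, str] = {}  # e.g. partition -> storage cell

    def apply(self, version: int, key: str, value: str) -> None:
        if version != self.version + 1:
            raise ValueError("metadata update applied out of order")
        self.routes[key] = value
        self.version = version


class MetadataCache:
    """Stand-in for RMC: keeps an update log and pushes deltas."""

    def __init__(self) -> None:
        self._log: List[Tuple[int, str, str]] = []  # (version, key, value)
        self._subscribers: Dict[str, Subscriber] = {}

    def register(self, subscriber: Subscriber) -> None:
        """Called when a gateway or cache starts; brings it up to date."""
        self._subscribers[subscriber.name] = subscriber
        self._push(subscriber)

    def on_update(self, key: str, value: str) -> None:
        """Called for each change observed on the update stream."""
        self._log.append((len(self._log) + 1, key, value))
        for subscriber in self._subscribers.values():
            self._push(subscriber)

    def _push(self, subscriber: Subscriber) -> None:
        # Send only the entries the subscriber has not applied yet.
        for version, key, value in self._log[subscriber.version:]:
            subscriber.apply(version, key, value)
```

Tracking a version per subscriber is what makes the push incremental: a long-running gateway receives one entry per change, while a newly started one replays whatever it has missed.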
Workflow: in past storage management practice, management and control was usually built on a microservice architecture with many independent modules, such as a data migration service, data inspection service, data scheduling service, capacity acquisition service, data cold backup service, and resource storage service. This gives good scalability, but the large number of modules raises the cost of development, maintenance, operation, and management. In X-Stor we designed a Workflow framework that assembles processes from configurable building blocks and supports re-entrant execution. Combined with containerized deployment, a single Workflow service implements all of the functions above, and automatic scaling, fault tolerance, log archiving, and auditing become easy to provide for every Workflow execution.
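Re-entrant execution of a block-assembled process can be illustrated with a few lines of Python. This is a toy sketch, not the X-Stor Workflow framework: the Workflow class, the state dictionary standing in for persisted progress, and the migration step names are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Step = Callable[[Dict], None]


class Workflow:
    """A named sequence of steps that can be re-run after a failure."""

    def __init__(self, name: str, steps: List[Tuple[str, Step]]) -> None:
        self.name = name
        self.steps = steps

    def run(self, state: Dict) -> Dict:
        """Execute steps in order, skipping those already recorded as done.

        `state` stands in for durable storage: it survives between runs,
        so calling run() again resumes at the first unfinished step.
        """
        done = state.setdefault("done", [])
        for step_name, step in self.steps:
            if step_name in done:
                continue          # finished in an earlier attempt
            step(state)           # may raise; progress so far is kept
            done.append(step_name)
        return state


# Building blocks for a hypothetical data migration.
def snapshot_source(state: Dict) -> None:
    state["snapshot"] = "snap-001"


def copy_data(state: Dict) -> None:
    state["copied_from"] = state["snapshot"]


def switch_route(state: Dict) -> None:
    state["route"] = "target"


migration = Workflow("migrate-partition", [
    ("snapshot_source", snapshot_source),
    ("copy_data", copy_data),
    ("switch_route", switch_route),
])
```

If copy_data raises, the recorded progress still lists snapshot_source, so a second call to migration.run(state) resumes at copy_data instead of taking another snapshot. For this to be safe, each step should tolerate being retried.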
Monitoring: a NodeAgent deployed locally on each server Node collects the status of every container on that Node in real time and pushes it to the cluster's Monitor service. The Monitor service feeds Prometheus storage, the cluster scheduling service, and monitoring and alerting components such as TEG's Zhiyan monitoring product. This makes it easy to schedule in real time based on cluster state and to build custom visual dashboards with Grafana.
Cloud native capability design and thinking
Scalability and cloud native are the two goals we had in mind when designing multi-model NoSQL. The architecture described above addresses scalability: it supports multiple data access paths, APIs, and storage engines to achieve multi-model storage. Next, I want to share my design thinking on cloud native.
"Cloud native" is a term everyone has heard often in recent years, yet searching for it online rarely yields a clear explanation. In my understanding, cloud native has two core concepts: cloud-native products and cloud-native technologies. Cloud-native products describes, from the customer's perspective and against the backdrop of widespread public cloud adoption, the capabilities expected of products that deliver services in the cloud, such as elastic scaling and observability. Cloud-native technologies are the technical means that help implement cloud-native products, such as containers, service meshes, microservices, immutable infrastructure, and declarative APIs. From the very beginning of the multi-model NoSQL design, we have integrated closely with cloud-native technologies and planned for cloud-native capabilities. Our cloud-native features focus on four aspects: openness, elastic scaling, pay-as-you-go, and multi-AZ and multi-Region data distribution.
01 Openness
The openness of multi-model NoSQL is mainly reflected in the following three dimensions.
First, open interfaces and functions. Customers want multi-cloud for reasons such as cost and fault tolerance. They ask cloud products to break vendor lock-in and to be migratable between cloud vendors, and cloud-native products need to respect this. We gave up lock-in through custom private protocols and interfaces and turned to full compatibility with the interfaces and functions of vertical community software such as Redis and InfluxDB. In the future we will further add data migration (DTS) capability.
Second, extensible and open connectors. We keep enriching the interconnection with other cloud-native products on the public cloud. For example, data mirrors, backups, and update streams can already be stored in the object storage product COS or in other S3-compatible products, and update streams can be imported into a Kafka queue. More connectors to related cloud products may be launched in the future.
Finally, at the resource level, deployment is not tied to specific hardware. We took the lead in running the complete architecture, from access to storage, in a containerized K8s environment, so the product is capable of multi-cloud and distributed cloud deployment.
02 Elastic scaling
Elastic scaling is a very important capability of cloud-native products; it removes some of the architecture and resource bottlenecks that self-developed software faced in the past. Multi-model NoSQL achieves elastic scaling for both customer resources and server or container resources.
First, the distributed, strongly consistent metadata storage and distribution architecture provides powerful metadata storage and access, so both the number of user tables and the capacity of a single table can scale. Through the horizontally scalable architecture and underlying scheduling, a single table can keep scaling horizontally in storage and access capacity.
Second, because resources are containerized and standardized, container resources can be requested from and released to the company's large resource pool in real time, so we can quickly meet business requirements for resource specifications, resource quantity, and distribution across data centers.
Finally, in terms of scaling speed and efficiency, the replica distribution strategy described earlier and scheduling driven by real-time data collection enable very fast expansion and automatic scaling: vertical scaling completes in under 10 seconds, and horizontal scaling of 4 TB completes in under 5 minutes.
03 Pay as you go
Pay-as-you-go is the key capability through which cloud-native products help customers operate at low cost. Here we have mainly delivered two capabilities.
The first is separate billing for storage and computing. Customers do not need to choose among a few containers of predetermined specifications; they only need to care about storage capacity and computing capacity. Underneath, we centrally manage the buffer pool reserved for each table and use multi-tenancy and bin-packing scheduling to raise overall resource utilization, and this resource pool management and higher utilization help customers save on operating costs.
The second is flexibility. Options are selected freely in the customer console or API rather than being rigidly bundled, so each business scenario can reach its best cost performance. For example, data consistency, number of replicas, multi-Region distribution, data life cycle, and even storage media can all be configured. Resource exclusivity balances cost against performance, and exclusive or shared use can be configured independently for storage machines and access machines.
04 Multiple AZ and Region distribution
Multi-AZ and multi-Region distribution is a basic requirement for cloud-native products to achieve high availability and data reliability.
AZs and Regions in the public cloud are not exactly the same as the familiar concepts of data centers and cities. The AZs within a Region are required to be 30 to 100 kilometers apart, with an RTT generally between 0.5 and 2 milliseconds, while different Regions are generally more than 100 kilometers apart. X-Stor distributes data across multiple AZs and Regions by tagging resources with AZ and Region attributes and combining them with cluster scheduling and data synchronization, and it can serve requests from a nearby location according to the business's own consistency requirements. Multi-Master support on top of multi-Region distribution is also planned: a Collection distributed across multiple Regions can be written in any nearby Region, while data is replicated between Regions internally and concurrent write conflicts are resolved, further reducing write latency.
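The article does not say how concurrent write conflicts will be resolved under Multi-Master. Purely as an illustration of the problem, the sketch below uses last-writer-wins, one common technique, with the Region name as a tie-breaker. The choice of technique, the class names, and the integer timestamps are all assumptions and should not be read as X-Stor's design.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int  # assumed to come from a reasonably synchronized clock
    region: str     # tie-breaker when two timestamps are equal


def merge(a: Versioned, b: Versioned) -> Versioned:
    """Last-writer-wins: keep the record with the larger (timestamp, region)."""
    return a if (a.timestamp, a.region) >= (b.timestamp, b.region) else b


class RegionReplica:
    """One Region's copy of a Collection that accepts local writes."""

    def __init__(self, region: str) -> None:
        self.region = region
        self.data: Dict[str, Versioned] = {}

    def write(self, key: str, value: str, timestamp: int) -> Versioned:
        """Accept a write in this Region and return the record to replicate."""
        record = Versioned(value, timestamp, self.region)
        self.apply(key, record)
        return record

    def apply(self, key: str, record: Versioned) -> None:
        """Apply a local or replicated record, resolving any conflict."""
        current = self.data.get(key)
        self.data[key] = record if current is None else merge(current, record)
```

Because merge is deterministic and does not depend on arrival order, two Regions that exchange the same set of records end up with the same value. The price is that the losing concurrent write is silently discarded, which is why real systems often offer other strategies as well.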