Abstract: makes the database performance UP! UP! UP! Huawei Cloud GaussDB (for MySQL) does this
This article is shared from the Huawei Cloud Community " Huawei Overseas Female Scientists Revealed for You: How powerful is the vertical integration of GaussDB (for MySQL) cloud stack? ", author: Lu Manyi, chief scientist of the database Lab of the Swedish Research Institute, Huawei.
Nowadays, cloud computing is becoming more and more popular. As an important service model of cloud computing, one of its key components is transaction database service. In actual business scenarios, applications rely on scalable, high-performance managed database services to fully benefit from the cloud platform. The cloud database also needs to efficiently use the underlying cloud infrastructure to release the potential of cloud-scale operations.
Huawei Cloud GaussDB (for MySQL) is Huawei's new-generation DFV (full name Data Function Virtualisation) distributed storage based on Huawei. It adopts a separate computing and storage architecture and is fully compatible with MySQL's high-performance enterprise-class cloud-native distributed database. The form of hosting service provides professional services for Internet and corporate customers. In this article, I will explain common customer workloads and how we can use the unique capabilities of the Huawei cloud computing stack to handle such workloads.
Who are the customers on the cloud? What is their workload?
In China, people usually think that only Internet start-ups will use cloud platforms, and MySQL is favored by many domestic companies because of its popularity among Internet companies. But in fact, companies have begun to embrace the cloud concept many years ago, and are constantly developing in depth, which is also the current trend in China. As the most popular open source database in the world, MySQL is widely used in all industries and Internet companies.
So, what are the typical workloads of cloud database customers? The two characteristics we have observed are: 1) The amount of data is getting larger and larger. From the beginning, there are several terabytes to dozens of terabytes of data, and as time goes by, the amount of data will become larger and larger. 2) A mixture of simple insert/delete/update/point-check and complex analysis query. In addition, there are occasional DDL operations.
At present, customers are facing a big challenge, that is, how to improve database performance under the condition of large amount of data. Customers want to maintain the throughput of core transaction workloads while performing complex queries. Because of the business logic nature of the enterprise, queries are usually more complicated. Fortunately, MySQL 8.0 adds long-awaited SQL support for analysis rows, such as windowing function and recursive CTE. For unstructured data, MySQL's JSON support has become very popular.
GaussDB (for MySQL) architecture overview
The GaussDB (for MySQL) architecture is built on a distributed storage system shared by multiple tenants. Currently, the maximum data volume of a database is 128TB, one master node is used for read and write loads, and a maximum of 15 read-only nodes are used for read loads. The SQL engine is a deeply modified MySQL 8.0, so it is 100% compatible with MySQL in terms of syntax and semantics. RDMA network is used between computing node and storage.
The storage system used by the GaussDB (for MySQL) service is a highly reliable cross-AZ cloud storage. On the public cloud, the storage system can be a large cluster with dozens or hundreds of nodes, and its horizontal scalability is many times higher than that of a single-tenant offline solution. The SQL node writes the redo log to the storage layer, and the page materializes at the storage layer. This design significantly reduces network communication for update-intensive workloads. Pages belonging to a single database are organized in slices, and slices are distributed on multiple storage nodes. This data distribution is the basis of distributed queries.
HUAWEI CLOUD GaussDB (for MySQL) architecture diagram
Huawei's unique advantage: vertical integration
Unlike traditional offline databases, cloud databases have the ability to vertically integrate all layers in the cloud stack. As a leading provider in all layers of the cloud stack, Huawei has a unique position in the cloud field and has the ability to become a leader in the industry.
The closest thing to the database in the cloud stack is storage. Offline pure soft databases need to be used in conjunction with general-purpose storage and standard file systems. In terms of vertical integration, there is almost no room for optimization. But on the cloud, the integration of storage and database can play a greater role, because cloud storage is highly scalable in storage nodes and allows customers to dynamically expand according to data volume and load. Since cloud storage is shared among multiple tenants, and not all tenants will have large-scale scans at all times, we can offload part of the query processing to the storage layer to achieve higher resource utilization.
- Improve performance through parallelism (Parallel Query: PQ)
A common way to improve performance is parallelism, which can be implemented on multiple layers. The community version of MySQL 8.0 only supports single-threaded query execution, and cannot make full use of all the cores provided by the hardware to execute complex queries. We modified the MySQL executor to allow multiple threads to execute a single query in parallel. Unlike offline solutions, cloud infrastructure allows us to leverage its vertical expansion capabilities on computing nodes. The largest computing node currently has 64 cores, which also represents the maximum parallelism we can achieve through parallel queries. This optimization works best when most of the hot data can be placed in the buffer pool. Parallel query will be explained in detail in another article.
Customer workload includes not only DML, but also DDL, such as index creation and changing column data types. Although most DDL is processed online in MySQL, some operations may be blocked, and the use of logical replication will expand the blockage. GaussDB (for MySQL) uses physical replication to avoid this problem. When the table is large, the DDL operation may take several hours to complete. In order to support our common data volume on the cloud, the need to optimize DDL is obvious. We already have an innovative way to deal with DDL, this innovation will be discussed in a later article.
Another layer that allows greater parallelism is the storage layer, because the storage system may have hundreds of nodes and thousands of cores. The cloud-scale distributed storage used by GaussDB (for MySQL) is a key foundation for us to improve query performance. Combined with parallel query, it is possible to achieve a query performance improvement of more than 100 times.
Single-threaded query execution at the computing layer
Parallel execution of computing layer and storage layer
- Use cloud storage to improve query performance (operator push-down: NDP)
The data in GaussDB (for MySQL) is organized in slices and distributed on multiple storage nodes. We use this data distribution to offload the operator to the storage node where the data is located, and execute it using locally available computing resources without reading the data to the computing node. In database terminology, we call it Near Data Processing (NDP) or operator push-down . The basic principle is: push down part of the query processing work to the storage node where the data is located. The pushed down query is a data-intensive query, such as full table scan and index scan, projection and filtering of certain WHERE conditions, and Aggregation is performed at the storage layer, and only matching rows and columns are returned to the compute node instead of the complete page. In addition to parallel execution, this method also reduces network IO because the amount of data extracted to the computing node is significantly reduced.
In addition, the NDP operator push-down also allows full use of the local bandwidth of the cache and storage media. When the query needs to scan a large amount of data, and the data is not in the Innodb buffer pool, the effect of offloading to storage is the best. For example, the following figure shows that NDP operator push-down and parallel query optimize the execution time of TCP-H Q12 by 34 times. Another article will separately introduce the technical details of NDP and provide a comprehensive performance analysis.
Future direction
GaussDB (for MySQL) is designed for the cloud. This architecture has extremely powerful and flexible vertical integration capabilities. Computing and storage resources are decoupled and can be expanded independently. At the same time, they are tightly integrated in function. Database operations can be in multiple layers. implement. In the future, database functions can also be offloaded to network cards and other cloud components, not limited to computing nodes and storage.
We believe that the deep integration of the cloud stack is the key to unlocking the power of cloud databases. Huawei is in a unique position to achieve this goal. As shown by GaussDB (for MySQL), it will lead the cloud field in the future.
In summary, HUAWEI CLOUD GaussDB (for MySQL) is based on the storage-computing separation architecture, and through advanced technologies such as parallel query PQ and operator push-down NDP, it greatly improves database performance and maximizes the vertical integration power of the cloud stack. Make computing power faster and stronger. The functions described in the article are all online, welcome to official website of Huawei Cloud to experience: 1612703b841788 https://www.huaweicloud.com/product/gaussdb_mysql.html, please continue to follow us, and there will be more technical information to share with you in the future !
Lu Manyi is currently the chief scientist of the Database Lab of Huawei Sweden Research Institute and the head of the European R&D team of cloud database. He has more than 20 years of experience in the database field, has participated in the development of distributed high-availability databases in the telecommunications industry, and has cultivated MySQL technology for ten years in an internationally renowned software company. Joined Huawei in 2020, determined to build the world's top enterprise-level cloud database.
Huawei will host Huawei Full Connect 2021 at the Shanghai World Expo Center & Expo Exhibition Hall on September 23-25, 2021. With the theme of "Deepening Digitalization", it will bring together industry thought leaders, business elites, technology giants, pioneers, and ecological partners. , Application service providers, developers and other parties to discuss how to in-depth industry scenarios, integrate digital technology with industry knowledge, truly integrate into the main business process of government and enterprises, solve core business problems, and promote experience improvement, efficiency improvement and model innovation; And release scenario-based products and solutions, share the latest achievements and practices of customer partners, and build an open and win-win healthy ecology. For more information, please visit the official website www.huawei.com/hc2021
Click to follow and learn about Huawei Cloud's fresh technology for the first time~
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。