Introduction: this article is based on a presentation from the Cloud Native Data Warehouse AnalyticDB Technology and Practice Summit sub-forum at the 2021 Cloud Home Conference, where Yao Yiwei, senior technical expert at Alibaba Cloud Database, spoke on "AnalyticDB MySQL Offline-Online Unified Technology Revealed".
This article introduces AnalyticDB MySQL offline-online unified technology in three parts:
1. Problems and challenges of the traditional big data architecture
2. The architecture and elasticity of the cloud-native data warehouse
3. Cloud-native data warehouse diagnostics and operations
1. Problems and challenges of the traditional big data architecture
The traditional big data architecture faces four main challenges. First, data is scattered and inconsistent, with no unified system for analyzing it. Second, analysis is not real-time: data is typically cleaned and transformed by ETL jobs after midnight and cannot be queried until the next morning, so data freshness is poor. Third, the system is complex: to improve freshness, a stream computing engine is usually bolted onto the batch pipeline, forming the well-known Lambda architecture and making the overall system ever more complex. Fourth, the learning cost is high: engineers with the required expertise are scarce and command high salaries, so maintaining such a system is expensive.
2. The architecture and elasticity of the cloud-native data warehouse
To solve these problems, we built an offline-online unified architecture. Our vision is: if you know how to use a database, you know how to use big data. First, we are highly compatible with MySQL, with compatibility above 99%. AnalyticDB MySQL is a cloud-native architecture that separates storage and compute, so each can be scaled out and in independently. A single storage system supports both real-time writes and multi-dimensional analysis, with smart indexing enabling analysis along arbitrary dimensions. We also provide full enterprise-grade capabilities such as auditing, self-managed accounts, and a complete backup-and-restore suite: if you delete data by mistake, AnalyticDB can restore it to any point in time you choose. Finally, our fused compute engine supports offline and online workloads, and queries over both structured and unstructured data, within a single architecture.
The cloud-native data warehouse AnalyticDB is divided into three layers. The top layer is the access layer, which is responsible for generating an execution plan; the optimizer refines it into the final optimal physical plan, which is then split and dispatched to the compute layer for execution. Data storage uses two-level partitioning. The first level hashes data across shards, guaranteeing the horizontal scalability of the whole system. The second level is a user-defined partition, for example by day or by hour, and the compute engine automatically prunes partitions based on it. The storage engine supports strongly consistent real-time inserts, deletes, and updates: data can be written with high concurrency and is visible immediately after being written. The compute engine also supports mixed workloads.
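The two-level partitioning scheme can be illustrated with a minimal sketch. This is not AnalyticDB's actual implementation; the shard count and function names are chosen purely for illustration. A hash of the distribution key picks the first-level shard, and a user-defined function of a time column picks the second-level partition:

```python
import hashlib
from datetime import date

NUM_SHARDS = 4  # illustrative first-level (hash) shard count

def shard_of(distribution_key: str) -> int:
    # First level: hash the distribution key so rows spread evenly
    # across shards, giving the system horizontal scalability.
    digest = hashlib.md5(distribution_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def partition_of(event_day: date) -> str:
    # Second level: user-defined partition, e.g. one partition per day,
    # which lets the engine prune partitions outside a query's time range.
    return event_day.strftime("%Y%m%d")

def route(distribution_key: str, event_day: date):
    """Return (shard, partition) for a row."""
    return shard_of(distribution_key), partition_of(event_day)
```

With this layout, a query filtered to one day only has to scan that day's second-level partition on each shard, while writes for different keys spread across all shards.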
What capabilities does an offline-online unified system need? First, it must support multi-dimensional analysis and ETL. Second, it must support detail queries and data retrieval. Finally, it must support real-time, high-throughput queries and writes. The intersection of these three requirements is exactly what AnalyticDB aims to deliver. We achieve high-performance queries with a fused compute engine that supports mixed workloads; high-concurrency, high-throughput writes with hybrid row-column storage and a deeply optimized write path; and detail queries and text retrieval with smart indexes.
Next, let's look at how we did it, starting with writes. An offline-online unified system has two write requirements: high-concurrency streaming writes of new data, and high-throughput imports of existing stock data into AnalyticDB. The left side of the architecture handles high-concurrency writes. Along the whole path we optimized data encoding and transport so that data flows end to end with zero copies, and we achieve high concurrency through shard-level parallelism plus table-level parallelism within each shard. With this architecture we reach tens of millions of rows written per second. The right side is the high-throughput write path: we read the data source with vectorized source readers and write from the compute engine directly into storage, also vectorized, to achieve high-throughput imports.
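Shard-level write parallelism can be sketched as a toy in Python. This is an assumption-laden illustration, not AnalyticDB's write path: rows are grouped by shard, and each shard's batch is flushed concurrently, so aggregate throughput scales with the number of shards:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def write_batch(rows, num_shards=4):
    # Group incoming (key, value) rows by shard first, then flush each
    # shard's batch in parallel: shard-level parallelism is what lets
    # write throughput scale with the number of shards.
    shards = defaultdict(list)
    for key, value in rows:
        shards[hash(key) % num_shards].append((key, value))

    def flush(shard_id):
        # Stand-in for sending one shard's batch to its storage node.
        return shard_id, len(shards[shard_id])

    results = {}
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        for shard_id, count in pool.map(flush, list(shards)):
            results[shard_id] = count
    return results
```

In a real system the flush step would also handle batching, retries, and replication; the point here is only the shard-parallel shape of the write path.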
This part is about the cost-effectiveness AnalyticDB provides. If users store all their data in AnalyticDB, there will inevitably be both hot and cold data. For example, a user may want to keep three years of data but only needs hot access to the most recent week, because only that week is queried frequently; the rest should be kept in AnalyticDB at low cost, in cold storage. We therefore provide three table types: fully hot tables, where all data lives in hot storage; fully cold tables, where all data lives in cold storage; and mixed hot/cold tables, where part of the data lives in hot storage and the rest in cold storage.
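The mixed hot/cold policy described above amounts to a tiering rule over partitions. As a hedged sketch (the threshold and function are illustrative, not AnalyticDB's API), a per-partition tier decision might look like:

```python
from datetime import date, timedelta

def storage_tier(partition_day: date, today: date, hot_days: int = 7) -> str:
    # Mixed hot/cold table: partitions newer than the hot window stay on
    # hot (fast, more expensive) storage; older partitions are demoted
    # to cold (cheap) storage, keeping three years of data affordable.
    if today - partition_day < timedelta(days=hot_days):
        return "hot"
    return "cold"
```

A fully hot table corresponds to an unbounded hot window, and a fully cold table to a window of zero days.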
Next, detail queries. Detail queries rely on AnalyticDB's smart indexing. We build different indexes for different data types, and the cost-based optimizer (CBO) estimates index selectivity to decide whether an index should be used. AnalyticDB picks different indexes for different filter conditions and progressively merges them into the final set of matching row IDs. Data is stored internally in a hybrid row-column layout and further filtered using rough-set statistics kept in the metadata; string data is compressed with dictionary encoding.
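Dictionary encoding, mentioned above for string compression, is a standard technique: each distinct string is stored once and the column values are replaced by small integer codes. A minimal self-contained sketch (not AnalyticDB's internal format):

```python
def dict_encode(values):
    # Store each distinct string once in `dictionary`; the column itself
    # becomes a list of integer codes pointing into the dictionary.
    dictionary, codes, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        codes.append(seen[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    # Decoding is a simple dictionary lookup per code.
    return [dictionary[c] for c in codes]
```

For low-cardinality string columns (country, status, category), the integer codes are far smaller than the repeated strings, and filters can be evaluated on codes instead of full strings.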
We unified offline and online processing in one compute system. For online query scenarios, users want their queries answered as fast as possible, and we can return analytical results in tens of milliseconds or even single-digit milliseconds. We achieve this low latency by pipelining all stages: operators process data in a streaming fashion without spilling to disk. The right side covers the offline scenario, where latency is not the top priority; instead, users want offline queries to complete reliably within a fixed time window.
ETL-style queries may run for days, so for them we adopt a different, batch execution mode in which the whole process is highly stable: data is materialized to disk in the shuffle between stages. We support failover for both Coordinator and Executor node failures, and we schedule sub-plans at large scale through adaptive batch scheduling. Throughout execution we use code generation (Codegen) to reduce virtual-function overhead and to avoid materializing data into memory, further speeding up queries.
Adaptive Execution addresses the fact that no matter how accurate the optimizer's estimates are, there will always be errors, so the plan it generates may differ from the truly optimal one. The plan therefore needs to be adjusted adaptively while it runs. We implemented adaptive partitioning based on intermediate results and adaptive plan adjustment based on actual data, both of which correct the plan at runtime.
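One classic runtime correction of this kind, shown here as a hedged sketch rather than AnalyticDB's actual rule, is switching join strategy once the true size of an intermediate result is known: if the build side turns out much smaller than estimated, a shuffle join can be replaced by a broadcast join. The threshold below is invented for illustration:

```python
BROADCAST_LIMIT = 1000  # illustrative row-count threshold

def choose_join_strategy(estimated_rows, observed_rows=None):
    # The optimizer plans with an estimate; at runtime, once the actual
    # cardinality of the intermediate result is observed, the plan can
    # be corrected, e.g. shuffle join -> broadcast join for a small side.
    rows = observed_rows if observed_rows is not None else estimated_rows
    return "broadcast" if rows <= BROADCAST_LIMIT else "shuffle"
```

The interesting cases are the corrections: a plan chosen from a wrong estimate gets fixed when `observed_rows` contradicts it, in either direction.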
Having covered compute and storage, let's turn to the optimizer, where we implemented a full suite of intelligent optimizations. The optimizer has two parts. The first is collection of underlying statistics: we automatically collect statistics on selected columns based on query conditions. The second searches for the optimal execution plan within a given time budget using the Cascades framework. A single optimizer supports both offline and online queries, and it connects not only to AnalyticDB's internal data sources but also to external ones such as data stored on OSS and HDFS, achieving unified lake-and-warehouse query optimization.
Beyond the optimizations mentioned above, we made many others: vectorized source reads, vectorized algorithm optimizations, automatic materialized-view rewriting, cost-based reuse of the largest common execution subtree, and more.
AnalyticDB supports elasticity along multiple dimensions. Compute scales from 1 node to 5,000 nodes, and ETL plus online analysis can expand dynamically on demand. Storage elasticity has two dimensions: capacity scales from gigabytes to 100 PB, and storage-node QPS scales from 1 to millions.
Why did we build elasticity? We analyzed all queries run on AnalyticDB during one week last year and found that only 5 in 10,000 queries waited more than 1 second. Viewed at the instance level, however, roughly 10% of instances had queries waiting more than 1 second or even 5 seconds. In other words, those five-in-ten-thousand queued queries are scattered across many different instances: in many business scenarios, query volume briefly exceeds the estimated or expected level for a very short period, causing queries to queue. Elasticity solves this problem well.
AnalyticDB provides three elasticity capabilities. The first is time-scheduled elasticity: for example, if you know a big promotion runs from 4 PM to 8 PM, we spin up the extra compute nodes for you shortly before 4 PM. The second is tenant isolation: if two departments run different queries at the same time, department A's queries will not affect department B's. The third is on-demand elasticity, mainly for unpredictable traffic: we can spin up the nodes a user needs on demand to guarantee the SLA of high-priority workloads.
How do time-scheduled and on-demand elasticity work? We maintain our own resource pool, with a resource manager written on top of it. When a user needs to scale out, we take nodes from the pool and add them to the user's AnalyticDB instance; when the user is done, the nodes are automatically returned to the pool. The whole process is fast: the operation completes in minutes.
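The borrow-and-return lifecycle of pooled elastic nodes can be sketched as a small class. This is a toy model under stated assumptions (names like `ResourcePool` and `scale_out` are invented here), not the actual resource manager:

```python
class ResourcePool:
    """Toy resource manager: elastic nodes are borrowed from a shared
    pool when an instance scales out and returned when it scales in."""

    def __init__(self, nodes):
        self.free = list(nodes)   # nodes available to lend
        self.leased = {}          # instance name -> nodes on loan

    def scale_out(self, instance, count):
        # Lend `count` nodes to an instance, e.g. before a promotion
        # window or in response to unexpected traffic.
        if count > len(self.free):
            raise RuntimeError("resource pool exhausted")
        nodes, self.free = self.free[:count], self.free[count:]
        self.leased.setdefault(instance, []).extend(nodes)
        return nodes

    def scale_in(self, instance):
        # Return all of an instance's borrowed nodes to the pool.
        self.free.extend(self.leased.pop(instance, []))
```

Because nodes come from a pre-provisioned shared pool rather than being created from scratch, scale-out can complete quickly, which matches the minutes-level elasticity described above.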
AnalyticDB also provides resource-group isolation: the resources of different resource groups are physically isolated, so, for example, department A's test queries will not affect department B's marketing queries.
3. Cloud-native data warehouse diagnostics and operations
An excellent data warehouse needs not only a strong engine but also intelligent diagnostics that let users see where problems in their system lie. So we built a complete intelligent diagnostics system, with many technical and functional components deeply integrated into the engine. When a new query arrives, we use a clustering algorithm to detect anomalies; if one is found, the smart alerting system notifies you via DingTalk, phone, or email.
Our intelligent optimization provides automatic analysis and data-warehouse modeling suggestions: based on how the system actually runs, we recommend specific changes to data distribution or partitioning to make it run more smoothly. It also provides intelligent inspection and alerting.
As for the value the AnalyticDB offline-online unified architecture delivers to users: first, a unified platform, so users do not need to build a complex architecture to combine offline and online processing; second, performance 3 to 10 times better than self-built systems, with the whole architecture operating in real time; finally, strong compatibility and ecosystem support, which makes it easy to migrate self-built clusters to AnalyticDB.