HTAP (Hybrid Transaction/Analytical Processing) was first proposed and clearly defined by Gartner in 2014: to support OLTP and OLAP scenarios at the same time, an innovative compute and storage framework is required that guarantees transactions on a single copy of data while supporting real-time analysis, eliminating the time-consuming ETL process. As the world enters the digital age, digital technology is penetrating every industry and generating massive amounts of data, and data storage and applications have become an important basis for enterprise decision-making. These changes in business scenarios have set off a wave of HTAP adoption.
So what exactly is HTAP, and what does it mean for users and developers? The fifth episode of "Dialogue ACE" invited Gong Zijie, head of OceanBase's Internet & Overseas architect team, and Oracle ACE Wu Weilong, product support engineer at Yunhe Enmo. During the live broadcast, the two guests held an in-depth discussion on "the true meaning of HTAP", demystifying HTAP for everyone.
Wu Weilong and Gong Zijie discussed HTAP in depth, each from his own perspective. The following is a transcript of the conversation:
📣 Gong Zijie, Head of OceanBase Internet & Overseas Architects
Q: In what form or architecture is the HTAP capability of OceanBase realized?
A:
Anyone who has looked into OceanBase's architecture will know that OceanBase has used an MPP architecture since version 1.0 inside Ant in 2015, and retains it to this day. It offers theoretically unlimited horizontal scalability, an architecture also chosen by many data warehouse databases such as Greenplum. At the same time, OceanBase had to replace Oracle for online transactions on Ant's core links, where the SQL workload is characterized by point reads and writes, small transactions, and high concurrency. We therefore made many optimizations for TP scenarios, covering the optimizer, transactions, and more, to meet the requirements of mission-critical businesses; the underlying storage is an LSM-tree engine with full ACID transaction properties.
Starting in 2019, it was natural under OceanBase's MPP architecture for us to introduce a distributed execution framework into the executor of the SQL engine for complex query scenarios, using the compute scalability of the MPP framework to speed up both DML and queries. On the storage side we introduced database encoding: when data is compacted to disk, it is stored row by row but encoded column by column, realizing a hybrid row-column store on a single copy of data that is transparent to the business. In OceanBase 3.x, in 2020, we also added vectorized execution that exploits the SIMD instructions of modern hardware, and OceanBase took first place on the TPC-H benchmark list.
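To show why the combination of columnar encoding and vectorized (SIMD-style) execution speeds up analytical queries, here is a minimal, illustrative sketch in Python, with NumPy standing in for SIMD; the data layout and numbers are hypothetical, and this demonstrates the general principle, not OceanBase's engine:

```python
# An illustrative sketch of why columnar, vectorized execution helps AP
# queries. NumPy stands in for SIMD here; this is the principle only,
# not OceanBase's internals.
import numpy as np

N = 100_000
rows = [{"amount": i % 100, "region": i % 8} for i in range(N)]  # row store

# Row-at-a-time aggregation, as a TP-style engine would iterate:
total_rowwise = sum(r["amount"] for r in rows if r["region"] == 3)

# Columnar layout: each column is one contiguous array, so a single
# operation sweeps the whole column in a vectorized (SIMD-friendly) pass.
amount = np.fromiter((r["amount"] for r in rows), dtype=np.int64, count=N)
region = np.fromiter((r["region"] for r in rows), dtype=np.int64, count=N)
total_columnar = int(amount[region == 3].sum())

assert total_rowwise == total_columnar
```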
Q: What do you think are the typical advantageous scenarios of HTAP?
A:
Compared with pure TP or pure AP, HTAP is a smaller market, but one with real business scenarios and real demand. A recently popular term is the real-time data warehouse: upstream online transactions run on MySQL, and the business needs to act on the TP data in real time for fast consumer-side feedback, such as real-time risk control, transaction history queries, fraud monitoring, and personalized recommendations. The ETL pipeline of a traditional data warehouse is long and high-latency, making it hard to satisfy fast-changing business demands.
The popular solution today is to replicate a copy of the TP data to an aggregation database through a log-parsing (CDC) tool and run the queries there. For the business this means extra copies of the data, each needing its own disaster recovery, so costs stay high. This scenario is especially well suited to HTAP, which reduces both IT investment and later operation and maintenance costs; the conventional pattern it replaces is sketched below.
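For concreteness, here is a toy sketch of that conventional log-replication pattern; the event shape, `tail_binlog` source, and apply target are hypothetical stand-ins for a real CDC tool and an aggregation database, not any specific product's API:

```python
# A toy sketch of the TP -> aggregation-database pattern that HTAP
# replaces. tail_binlog() and olap_store are hypothetical stand-ins for
# a real log-parsing (CDC) tool and an analytics database.
def tail_binlog():
    """Pretend stream of change events parsed from the TP database's log."""
    yield {"op": "INSERT", "table": "orders", "row": {"id": 1, "amount": 99}}
    yield {"op": "UPDATE", "table": "orders", "row": {"id": 1, "amount": 88}}

olap_store = {}  # the aggregation library: a second full copy of the data

def apply_event(event):
    # Each event is written again here: an extra copy to store, an extra
    # system to protect with disaster recovery, and end-to-end lag that
    # the business has to tolerate.
    olap_store[(event["table"], event["row"]["id"])] = event["row"]

for event in tail_binlog():
    apply_event(event)

print(olap_store)  # {('orders', 1): {'id': 1, 'amount': 88}}
```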
Q: How can the HTAP architecture be better applied in actual business scenarios?
A:
An HTAP database needs to handle high-concurrency, massive-write scenarios while also handling complex queries in near real time, such as multi-table joins and batch import/export. OceanBase keeps three replicas of the data under the Paxos protocol to achieve lossless disaster recovery. In practice, the SQL of online transaction scenarios must be distinguished from the SQL of complex query scenarios: with OceanBase's multiple replicas, hints can be used to route SQL so as to reduce resource contention between TP and AP, and OceanBase's Resource Manager capability can mark SQL and isolate resources on the leader replica, ensuring analytical queries do not cause jitter in core transaction scenarios. A sketch of hint-based routing follows.
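Here is a minimal sketch of hint-based TP/AP routing, assuming an OceanBase MySQL-mode endpoint reachable through pymysql; the connection parameters and table names are hypothetical, and `READ_CONSISTENCY(WEAK)` is used on the understanding that OceanBase accepts it as a weak-read hint allowing a query to be served by a follower replica:

```python
# A hedged sketch, not a definitive setup: route AP SQL to follower
# replicas with a weak-read hint while TP SQL stays on the leader.
import pymysql

conn = pymysql.connect(host="obproxy.example.com", port=2883,
                       user="app_user", password="secret", database="trade")

with conn.cursor() as cur:
    # TP query: a strong read, served by the leader replica by default.
    cur.execute("SELECT balance FROM accounts WHERE id = %s", (42,))
    print(cur.fetchone())

    # AP query: the weak-read hint lets OceanBase serve it from a
    # follower replica, keeping the heavy scan away from the leader.
    cur.execute(
        "SELECT /*+ READ_CONSISTENCY(WEAK) */ region, SUM(amount) "
        "FROM orders GROUP BY region"
    )
    print(cur.fetchall())

conn.close()
```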
Q: For DBAs and operations staff, what should be kept in mind when adopting an HTAP architecture?
A:
In the original architecture, where TP and AP were separate and data flowed between them through ETL, the DBA had to spend a great deal of energy maintaining the stability and timeliness of the pipeline. After switching to an HTAP database, that pipeline maintenance disappears, and the DBA's role shifts from administrator to database architect: deeply involved in business architecture design, identifying the combination of business architecture and HTAP database that best fits the business logic.
For example, Ctrip's DBAs, working from the open-source version of OceanBase and a deep understanding of their business logic, modified OBProxy. By deploying separate proxy instances independently, the business can reach different proxies without any intrusion into the business code, achieving TP/AP separation. This is a very typical example.
Q: What do you think is the core difference between HTAP with a centralized architecture and HTAP with a distributed architecture?
A:
Centralized HTAP, as in Oracle or SQL Server, was designed from the start for mixed workloads. HTAP can be divided into small HTAP and big HTAP, and the biggest difference between them is the ability to scale horizontally. When a centralized database hits a performance bottleneck, the main remedy is to scale up; a distributed database such as OceanBase scales out instead.
The biggest difference between centralized and distributed HTAP is that scaling up a centralized database has an upper limit, in CPU specification, IO throughput, and IOPS. For a small-scale business, an Oracle or SQL Server deployment may well be enough; as data volume grows over time, providing more compute and IO resources through horizontal scaling keeps pace with the growth of business data.
📣 Wu Weilong, Product Support Engineer at Yunhe Enmo
Q: With a single copy of data, can HTAP truly handle both OLTP and OLAP?
A:
In 2014, Gartner first proposed the concept of HTAP (Hybrid Transaction/Analytical Processing) and gave it a clear definition: supporting OLTP and OLAP scenarios at the same time requires an innovative compute and storage framework that guarantees transactional integrity on a single copy of data while supporting real-time analysis, eliminating the time-consuming ETL process. The HTAP model can indeed serve both OLTP and OLAP, but HTAP is not a panacea, nor does it mean an organization runs only one HTAP database, for both technical and non-technical reasons. An organization has multiple business departments, and the corresponding applications are split apart, so OLTP and OLAP databases answer to different decision-makers; even OLTP databases are split by business line. A single company-wide system is unrealistic in most companies; a more realistic approach is one HTAP database per business. For example, an HTAP database for the trading business can support real-time processing of online transactions and real-time analysis of historical orders.
Q: In which industries do you think HTAP will be applied most, and what business pain points drive those industries to choose HTAP?
A:
Because of its characteristics, HTAP can serve both transactional and analytical database scenarios; it is especially suitable for complex, multi-modal, time-sensitive applications, since data no longer has to be imported from operational databases into decision-support systems. In e-commerce and finance, for example, orders and payment information need to be synchronized in real time with inventory data in the settlement database for settlement and reconciliation, cross-channel transaction statistics, and precise asset-loss prevention and control. All of this requires data synchronization at a speed that traditional ETL simply cannot deliver.
Q: How do you think HTAP can maintain data consistency?
A:
Two technical concepts matter here: the snapshot isolation level and multi-version concurrency control (MVCC). Together they mean that the database maintains multiple version numbers, that is, multiple snapshots: when data is modified, different version numbers distinguish the content being modified from the content before the modification, allowing concurrent access to multiple versions of the same data and avoiding the read-write conflicts caused by classic lock-based implementations. Traditional databases already work this way. Oracle, for example, assigns an SCN (system change number) as the version number; different SCNs represent the committed version of the data at different points in time, realizing the snapshot isolation level. A toy sketch of the mechanism follows.
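The following is a minimal, illustrative MVCC key-value store in Python, a sketch of the general mechanism rather than any real database's implementation; the class and method names are invented for the example:

```python
# Toy MVCC: every write is kept as a (version, value) pair, and a reader
# pinned to snapshot S sees the newest version committed at or before S.
from collections import defaultdict

class MVCCStore:
    def __init__(self):
        self.versions = defaultdict(list)  # key -> [(version, value), ...]
        self.commit_ver = 0                # plays the role of Oracle's SCN

    def write(self, key, value):
        """Commit a new version; the version number only moves forward."""
        self.commit_ver += 1
        self.versions[key].append((self.commit_ver, value))

    def snapshot(self):
        """Pin a read snapshot at the current commit version."""
        return self.commit_ver

    def read(self, key, snap):
        """Newest value with version <= snap; later writes are invisible."""
        for ver, value in reversed(self.versions[key]):
            if ver <= snap:
                return value
        return None

store = MVCCStore()
store.write("balance", 100)
snap = store.snapshot()       # a reader starts here
store.write("balance", 250)   # a concurrent writer commits afterwards
assert store.read("balance", snap) == 100  # no lock, no read-write conflict
```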
There are two main implementations in the distributed database industry:
1) Use special hardware such as GPS and atomic clocks to keep the system clocks of many machines so tightly synchronized that the error is too small for applications to perceive. In that case, the local system timestamp can continue to serve as the version number while still satisfying external consistency on a global scale.
Google Spanner is the typical example, using GPS and atomic clocks to achieve data consistency while supporting several kinds of transactions: snapshot reads, read-only transactions, and read-write transactions.
2) The version number no longer depends on each machine's local system clock: all database transactions obtain a globally consistent version number from a centralized service, which guarantees that version numbers move monotonically forward. The logical model for obtaining version numbers in the distributed architecture then becomes the same as in a single-node architecture, completely eliminating clock skew between machines as a factor. A minimal sketch of such a service follows.
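Here is a minimal sketch of the second approach, a centralized timestamp service handing out monotonically increasing version numbers; it illustrates the idea only and is not OceanBase's actual implementation:

```python
# A toy centralized timestamp service: one authority hands out version
# numbers, so no transaction ever observes time moving backwards.
import threading

class TimestampService:
    """Single source of version numbers, shared by every node."""
    def __init__(self):
        self._lock = threading.Lock()
        self._ts = 0

    def next_ts(self):
        # The lock guarantees strict monotonicity even under concurrency.
        with self._lock:
            self._ts += 1
            return self._ts

gts = TimestampService()

def begin_transaction():
    # Every node, regardless of its own wall clock, asks the same
    # service, so snapshots are globally ordered.
    return gts.next_ts()

print(begin_transaction(), begin_transaction())  # e.g. 1 2
```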
The distributed database OceanBase faces exactly these distributed-architecture challenges when implementing a globally (cross-machine) consistent snapshot isolation level and multi-version concurrency control. To address them, OceanBase introduced "Global Consistent Snapshot" technology in version 2.0.
With the "globally consistent snapshot" technology, the OceanBase database has the ability to achieve "snapshot isolation level" and "multi-version concurrency control" on a global (cross-machine) scale, and can ensure "external consistency" on a global scale. ”, and on this basis implement many functions involving global data consistency, such as global consistent read, global index, etc.
In this way, compared with a traditional single-node database, OceanBase retains the advantages of the distributed architecture without any degradation in global data consistency. Application developers can use OceanBase just like a single-node database, without worrying about low-level consistency issues between machines. It is fair to say that, with the help of Global Consistent Snapshot technology, OceanBase realizes global data consistency under a distributed architecture.
Q: We know that real HTAP makes high demands on both transaction and analysis capabilities, and that this is difficult to realize. So between transactions and analytics, which should be chosen first?
A:
HTAP is designed to meet both demanding scenarios, transactions and analysis. If we are talking about where the greater value lies, I believe it is better to choose HTAP for OLTP scenarios, because transactional data is what changes fastest within the organization. If that data is extracted through traditional ETL, a single failure can affect the business of the entire enterprise, which is more than the organization can bear.
The HTAP capability an organization needs does not have to cover the entire data warehouse workload; it only needs to strengthen the online analysis capability required by the core business. So what belongs in the HTAP database is the OLTP system's own data, plus high-value data pulled in from outside for analysis.
Q: How do you think HTAP can be better integrated with the cloud?
A:
Almost all database vendors and cloud giants are now investing in HTAP, and "new-generation HTAP + cloud" is becoming an important trend in the database market. Cloud databases remove the traditional problems of deployment, operation and maintenance, and scaling; delivering the database as a cloud service makes it much easier to use. Another reason is that as cloud computing spreads, the number of users on the cloud keeps growing, and demand feedback from those users arrives all the time, which is crucial to the evolution and iteration of database products.
Q&A
Q: Do I need to add an extra replica in order to run queries on a standby replica?
A:
At present, many vendors are building HTAP databases, and their implementations differ. On top of the three TP replicas, some introduce an extra technology stack such as ClickHouse: data is converted from row store to column store, and the complex SQL runs in that columnar database.
From OceanBase's point of view, that is not a real HTAP database. In OceanBase, both TP and AP run on the same three replicas of one copy of data, so no additional disaster recovery cost is required.
Q: If no new replica is added and a standby replica is under query pressure when the primary node fails, will the switchover be affected?
A:
Take OceanBase as a typical scenario: the leader replica handles online transactions while the two follower replicas handle complex queries, separating reads from writes. When the leader replica fails, because OceanBase uses the Paxos protocol, one of the two remaining replicas is automatically elected as the new leader. The process requires no manual intervention and guarantees that no data is lost; the quorum arithmetic behind this is sketched below.
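The guarantee rests on majority quorums. Here is a toy sketch of that arithmetic, an illustration of the Paxos majority property rather than OceanBase's election code:

```python
# With 3 Paxos replicas, any 2 form a majority, so losing the leader
# still leaves a quorum that can elect a new one and keep committing.
def has_quorum(alive: int, total: int = 3) -> bool:
    """A leader can be elected as long as a majority of replicas survive."""
    return alive > total // 2

for failed in range(4):
    print(f"{failed} replica(s) down -> quorum: {has_quorum(3 - failed)}")
# 0 or 1 down -> True (automatic re-election); 2 or 3 down -> False
```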
That concludes the sharing from our two guests. If you have any questions, leave a comment at the end of the article to discuss, and we will see you in the next issue of OceanBase's "Dialogue ACE"!