Introduction to "Real-time Data Warehouse Introductory Training Camp" consists of Alibaba Cloud researcher Wang Feng, Alibaba Cloud senior product expert Liu Yiming and other real-time computing Flink version and Hologres technology/product front-line experts joined forces to build the training camp's courses System, carefully polished the content of the course, and directly hit the pain points encountered by the current students. Analyze the architecture, scenarios, and practical applications of real-time data warehouses from the shallower to the deeper, and 7 high-quality courses will help you grow from a small white to a big cow in 5 days!

This article is compiled from the live broadcast "Hologres Data Import/Export Practice" by Wang Huafeng (Jianru).
Video link: https://developer.aliyun.com/learning/course/807/detail/13891

Brief content:
1. Hologres ecosystem introduction
2. Hologres real-time read/write interfaces
3. Hologres real-time read/write scenarios
4. Demo demonstration
5. Frequently asked questions and future outlook

Hologres Ecosystem Introduction

(1) The Hologres ecosystem

[Figure: the Hologres ecosystem]

Hologres is a real-time interactive analytics product compatible with the PostgreSQL protocol, and it is also well integrated with the big data ecosystem. For the most common open-source components, such as Apache Flink, Spark, Hive, and Kafka, Hologres already provides connector implementations, and these connectors are open source.

For real-time pipelines, users can rely on Flink or Spark to import upstream data, such as tracking events or business data, into Hologres with very high performance and millisecond-level latency. For offline pipelines, Hologres supports importing data from external systems with very simple operations, and conversely supports backing data up to external systems such as Alibaba Cloud MaxCompute and OSS.

Once data is imported into Hologres, because Hologres is compatible with the PostgreSQL protocol, a wide range of off-the-shelf query tools can connect to it seamlessly for data display and querying.

(2) Data sources supported by DataWorks Data Integration

In addition to the big data scenarios just mentioned, with Alibaba Cloud's DataWorks Data Integration we can also import data stored in traditional databases into Hologres, enabling convenient and efficient real-time mirroring of an entire database.

[Figure: data sources supported by DataWorks Data Integration]

As shown in the figure above, DataWorks Data Integration currently supports real-time mirror synchronization of MySQL Binlog, SQL Server CDC, and Oracle CDC into Hologres. In addition, DataWorks also supports synchronizing data from Kafka and Alibaba Cloud DataHub to Hologres.

[Figure: DataHub Connector]

It is also worth mentioning that the DataHub product itself provides a feature, called DataHub Connector, that synchronizes data directly into Hologres in real time. With it, users can land data in Hologres without going through Flink or other components, which is a faster path for synchronization scenarios that require no ETL.

Hologres Real-time Read/Write Interfaces

[Figure: Hologres real-time read/write architecture]

Implementation principle of Hologres real-time reads and writes

The figure above shows the architecture of the entire Hologres real-time read/write path.

From top to bottom, the uppermost layer is the application side, that is, the various clients that read and write Hologres, such as Data Integration, Apache Flink, Spark, and so on. These clients usually use the SQL interface to send read and write requests to Hologres. The requests first pass through a load balancer and are then routed to a node called the Frontend. A Hologres instance usually has multiple Frontend nodes, which allows it to sustain very high QPS. The Frontend node is mainly responsible for SQL parsing, optimization, and related functions.

After this processing, the Frontend converts the user's SQL request into a physical execution plan, which is then distributed to backend execution nodes that carry out the actual physical reads and writes. Finally, the written data is persisted to a distributed file system, such as Alibaba's Pangu system or the open-source HDFS.

One point deserves special emphasis: regular SQL parsing followed by query optimization to generate the best execution plan is usually an expensive part of this pipeline, and in high-QPS read/write scenarios it often becomes the performance bottleneck.

Therefore, for some common SQL patterns, listed below, Hologres takes a shortcut.

[Figure: SQL patterns eligible for Fixed Plan]

Fixed Plan

For example, Insert into table values (...) simply inserts one or a few rows; Insert into table values (...) on conflict do update updates existing rows; Select * from table where pk = xxx and Delete from table where pk = xxx look up or delete rows by primary key.

For these common SQL patterns, the Hologres Frontend applies a short-circuit optimization: it skips much of the unnecessary optimization logic, directly generates the optimal execution plan, and sends it to the backend execution nodes, thereby improving overall request throughput. The eligible statement shapes are sketched below.
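For reference, here is a minimal sketch of the four statement shapes listed above that can hit the Fixed Plan path; the table name orders and its columns are illustrative only:

    -- Insert one or a few rows.
    INSERT INTO orders (order_id, amount) VALUES (1, 100);

    -- Insert, or update the row if the primary key already exists.
    INSERT INTO orders (order_id, amount) VALUES (1, 200)
    ON CONFLICT (order_id) DO UPDATE SET amount = EXCLUDED.amount;

    -- Point lookup by primary key.
    SELECT * FROM orders WHERE order_id = 1;

    -- Point delete by primary key.
    DELETE FROM orders WHERE order_id = 1;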

Let's take a look at how the physical execution plan is processed after it is sent to the backend.

[Figures: the LSM-based write path of the Hologres backend]

The Hologres backend storage engine is based on the Log-Structured Merge tree (LSM tree). The LSM tree turns random writes into sequential writes, which greatly improves write throughput.

A write request is first appended to the Write-Ahead Log, the traditional WAL file. Once the append succeeds, the record is considered durably written. The WAL entry is then applied to the MemTable, after which the data becomes visible to queries; the delay is usually within milliseconds.

When the MemTable fills up, an asynchronous thread persists it to storage. The overall process is a fairly standard LSM implementation.

What differs from other LSM-based storage systems such as HBase is that the Hologres backend is fully asynchronous and coroutine-based, which avoids operating-system kernel-thread overhead and greatly improves CPU utilization, giving Hologres excellent read and write performance.

Let's go back to the data access interfaces on the application side. Flink, Spark, and DataWorks all read and write Hologres through an SDK called holo-client.

[Figure: holo-client]

holo-client is built on JDBC and encapsulates best practices for reading and writing Hologres, which reduces the development effort for data integration.
We have also optimized certain specific SQL scenarios, for example:

  • Data writes
    1) Batching, based on the same principle as JDBC's reWriteBatchedInserts;
    2) Data merging: INSERT/DELETE operations on the same primary key within a batch are merged to reduce the number of requests;
    3) Automatic flushing, triggered by batch row count, batch byte size, or a maximum flush interval.
  • Point lookups
    1) An asynchronous point-query interface is provided;
    2) Under high QPS, it automatically switches to batched lookups.
  • Data copy
    A simple interface for concurrent CopyIn is provided.
  • Exception handling
    Exceptions returned by Hologres are categorized so that the client retries correctly and waits for the instance to recover in scenarios such as upgrades and scale-outs.

We strongly recommend the holo-client SDK to any user who needs to read or write Hologres.

Hologres Real-time Read/Write Scenarios

After introducing how Hologres' read/write interfaces are implemented, let's look at several common read/write scenarios that can be built on them.

(1) Real-time write scenario

[Figures: real-time write scenario and example SQL]

The first is the simplest real-time writing scenario, as shown above.

Here we use a Blink SQL example: it generates a random data source and imports that data into Hologres, as sketched below. For real-time writes, Hologres supports both row-store and column-store formats, and it also supports primary-key-based deduplication, a big highlight compared with many other OLAP systems.
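Below is a minimal sketch of such a job, written in current Flink SQL syntax rather than the legacy Blink dialect; the datagen rate, endpoint, credentials, and table names are placeholders:

    -- A random data source.
    CREATE TEMPORARY TABLE random_source (
      id   BIGINT,
      name VARCHAR
    ) WITH (
      'connector' = 'datagen',
      'rows-per-second' = '10'
    );

    -- A Hologres sink; connection options are placeholders.
    CREATE TEMPORARY TABLE holo_sink (
      id   BIGINT,
      name VARCHAR
    ) WITH (
      'connector' = 'hologres',
      'endpoint'  = '<host>:<port>',
      'dbname'    = '<database>',
      'tablename' = '<table>',
      'username'  = '<access-id>',
      'password'  = '<access-key>'
    );

    -- Import the random data into Hologres in real time.
    INSERT INTO holo_sink SELECT * FROM random_source;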

In addition, Hologres real-time writes support both whole-row updates and partial (per-column) updates. As for performance, Hologres ingestion has very low latency, usually within milliseconds. In our own tests on the TPC-H PartSupp table, a single backend core reached about 20,000 records per second (RPS), and performance scales linearly with instance resources.

(2) Real-time wide-table merge scenario

Next we introduce the real-time wide-table merge scenario, which relies on Hologres' partial-update capability.

[Figure: merging two data sources into one wide table]

Take the figure above as an example: a user wants to merge data from multiple sources into one wide table in Hologres. The final table should have six columns A|B|C|D|E|F, but columns A|B|C|D come from one data source while A|B|E|F come from another, and we want to merge the two sources into a single wide table in Hologres.

A common implementation uses Flink's Join: two streams consume the two sources simultaneously, the streams are joined in Flink to widen the data, and the result is written to Hologres.

The problem with this approach is that Flink's Join is usually very expensive, because it has to cache a large amount of state; this is a heavy burden for maintaining the whole job.

Let's take a look at how Hologres solves this problem.

[Figure: widening data via Hologres partial updates, without a Flink Join]

As mentioned above, Hologres natively supports partial updates. As shown in the figure, the two streams can each write to Hologres directly, with no Join in Flink: one stream writes columns A|B|C|D, and the other writes A|B|E|F. Because the two streams share the same primary key, when two such rows arrive Hologres merges them internally, achieving the widening effect while sparing users from writing a Flink Join and maintaining such a complex job. The idea is sketched below.
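Here is a sketch of the merge behavior in plain PostgreSQL-style SQL; the table name wide_table, the column types, and the values are illustrative. Each writer touches only its own columns, and rows sharing a primary key are merged:

    CREATE TABLE wide_table (
      pk BIGINT PRIMARY KEY,
      a TEXT, b TEXT, c TEXT, d TEXT, e TEXT, f TEXT
    );

    -- Stream 1 writes only A|B|C|D; E and F are left untouched.
    INSERT INTO wide_table (pk, a, b, c, d)
    VALUES (1, 'a1', 'b1', 'c1', 'd1')
    ON CONFLICT (pk) DO UPDATE
    SET a = EXCLUDED.a, b = EXCLUDED.b, c = EXCLUDED.c, d = EXCLUDED.d;

    -- Stream 2 writes only A|B|E|F; the row with pk = 1 is merged in place.
    INSERT INTO wide_table (pk, a, b, e, f)
    VALUES (1, 'a1', 'b1', 'e1', 'f1')
    ON CONFLICT (pk) DO UPDATE
    SET a = EXCLUDED.a, b = EXCLUDED.b, e = EXCLUDED.e, f = EXCLUDED.f;

After both writes, the row with pk = 1 contains all six columns. In the Flink Hologres connector, the equivalent is typically achieved by declaring each sink with only its own subset of columns and enabling the connector's insert-or-update write mode.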

(3) Real-time dimension-table Join scenario

Having covered real-time writes to Hologres, let's look at real-time reads.

There are usually two kinds of real-time reads. The first is the common Flink real-time dimension-table Join scenario, where the dimension-table Join is implemented as point lookups.

[Figures: Flink dimension-table Join backed by a Hologres row-store table]

Here, a Hologres row-store table can usually replace HBase as the Flink dimension table, providing very high throughput and very low latency. A sketch of such a lookup join follows.
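Below is a minimal sketch of a Flink SQL lookup join against a Hologres dimension table. The stream table orders (assumed to declare a processing-time column proc_time AS PROCTIME()) and the sink enriched_orders are assumed to be declared elsewhere; connector options are placeholders:

    CREATE TEMPORARY TABLE dim_user (
      user_id   BIGINT,
      user_name VARCHAR
    ) WITH (
      'connector' = 'hologres',
      'endpoint'  = '<host>:<port>',
      'dbname'    = '<database>',
      'tablename' = 'dim_user',
      'username'  = '<access-id>',
      'password'  = '<access-key>'
    );

    -- Each order event triggers a point lookup on the dimension table's primary key.
    INSERT INTO enriched_orders
    SELECT o.order_id, o.user_id, u.user_name
    FROM orders AS o
    JOIN dim_user FOR SYSTEM_TIME AS OF o.proc_time AS u
      ON o.user_id = u.user_id;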

(4) Hologres Binlog scenario

The second real-time read scenario is Hologres Binlog, a concept similar to MySQL Binlog: with Hologres Binlog we can consume the change log of a single Hologres table in real time and track every update to each row of data.

The Hologres CDC source in Realtime Compute for Apache Flink can now consume this Binlog in real time, enabling real-time mirror synchronization of tables, and even real-time ETL from ODS to DWD tables with Flink + Hologres.

The Binlog function of Hologres is disabled by default.

[Figure: table-creation DDL that enables Hologres Binlog]

The figure above shows an example of how to use Hologres Binlog; it is a table-creation DDL.

You can see two additional table properties: binlog.level, set to replica, which turns on Binlog for this table, and binlog.ttl, which sets the life cycle of the Binlog data. A sketch of such a DDL follows; afterwards we will use Hologres Binlog and see what it can achieve.
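A minimal sketch of a Binlog-enabled table DDL; the table name and columns are illustrative, and the binlog.ttl value is assumed to be in seconds:

    BEGIN;
    CREATE TABLE test_message_src (
      id    INT PRIMARY KEY,
      title TEXT,
      body  TEXT
    );
    -- Enable Binlog for this table.
    CALL set_table_property('test_message_src', 'binlog.level', 'replica');
    -- Keep Binlog data for one day (assumed unit: seconds).
    CALL set_table_property('test_message_src', 'binlog.ttl', '86400');
    COMMIT;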

[Figure: querying Binlog through hidden columns]

Since Hologres is a strongly-schematized data warehouse, we can even query Hologres Binlog through the SQL interface. As shown above, the Binlog can be queried via several hidden columns: hg_binlog_lsn, hg_binlog_event_type, and hg_binlog_timestamp_us.

Here, hg_binlog_lsn is the LSN sequence number of each Binlog record; hg_binlog_event_type is the Binlog message type, indicating Delete, Insert, Before Update, or After Update; and hg_binlog_timestamp_us is the time at which the Binlog record was generated.

With these hidden columns, users can easily query the Binlog with SQL and debug their data, for example as follows.
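A sketch of such a query, run against the illustrative table from the DDL sketch above:

    -- Read the change log of a Binlog-enabled table via the hidden columns.
    SELECT hg_binlog_lsn,
           hg_binlog_event_type,
           hg_binlog_timestamp_us,
           *
    FROM test_message_src;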

Demo Demonstration

(1) Demo: reading and writing Hologres in real time with Realtime Compute for Apache Flink

After introducing Hologres' read/write scenarios, let's see, through a hands-on demo, how to use Flink to read and write Hologres in real time.

[Figure: demo architecture, merging the Binlog of two tables into one result table]

As shown in the figure above, we have two Hologres tables, both with Binlog enabled, and we assume both are written to in real time. We then write a Flink job that consumes the Binlog of the two tables, merges the two change streams, performs a group-by aggregation, and finally writes the synchronized result into another Hologres result table.

Next, let's get into the demonstration. First, look at the DDL of the Hologres tables, as shown below.

[Figure: table-creation DDL for table a]

[Figure: table-creation DDL for table b]

The two tables share two identical fields, id and create_time, on which the data will later be aggregated. Each table also has its own value column: value_from_a is unique to table a, and value_from_b is unique to table b.

[Figure: result table DDL]

Finally, we have a result table. It contains the two columns shared by tables a and b, plus two more columns holding the values from a and b respectively. We want to aggregate the values of a and b in real time and write them into this sink table. A sketch of the DDLs follows.
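Based on the fields described above, the DDLs might look roughly like this; the column types are assumptions:

    BEGIN;
    CREATE TABLE table_a (
      id           BIGINT,
      create_time  TIMESTAMPTZ,
      value_from_a BIGINT,
      PRIMARY KEY (id, create_time)
    );
    CALL set_table_property('table_a', 'binlog.level', 'replica');
    COMMIT;
    -- table_b is identical except that its value column is value_from_b.

    -- The result table holds the shared keys plus both aggregated values;
    -- Binlog is enabled here too, so the sink's changes can also be inspected.
    BEGIN;
    CREATE TABLE sink_table (
      id           BIGINT,
      create_time  TIMESTAMPTZ,
      value_from_a BIGINT,
      value_from_b BIGINT,
      PRIMARY KEY (id, create_time)
    );
    CALL set_table_property('sink_table', 'binlog.level', 'replica');
    COMMIT;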

Let's take a look at the entire Flink SQL.

[Figure: Flink SQL declarations of the two source tables]

First, declare the two Hologres source tables separately; the Binlog of both tables will be consumed in real time.

Note that we must set the parameter 'binlog' = 'true' so that Flink consumes Hologres Binlog, and enable CDC mode, roughly as follows.
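A sketch of one source declaration; connection options are placeholders, and source_b is declared the same way against table_b:

    CREATE TEMPORARY TABLE source_a (
      id           BIGINT,
      create_time  TIMESTAMP,
      value_from_a BIGINT
    ) WITH (
      'connector' = 'hologres',
      'endpoint'  = '<host>:<port>',
      'dbname'    = '<database>',
      'tablename' = 'table_a',
      'username'  = '<access-id>',
      'password'  = '<access-key>',
      'binlog'    = 'true',   -- consume Hologres Binlog
      'cdcMode'   = 'true'    -- interpret the Binlog as a CDC changelog
    );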

[Figure: Flink SQL declaration of the result table]

Next, look at the declaration of the result table, as shown above.

Note that we need to set 'ignoreDelete' = 'false', so that Delete and Before-Update messages are not ignored, which would otherwise lead to inconsistent data. A sketch follows.
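A sketch of the sink declaration under the same placeholder conventions:

    CREATE TEMPORARY TABLE sink (
      id           BIGINT,
      create_time  TIMESTAMP,
      value_from_a BIGINT,
      value_from_b BIGINT
    ) WITH (
      'connector'    = 'hologres',
      'endpoint'     = '<host>:<port>',
      'dbname'       = '<database>',
      'tablename'    = 'sink_table',
      'username'     = '<access-id>',
      'password'     = '<access-key>',
      'ignoreDelete' = 'false'   -- keep Delete/Before-Update messages
    );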

[Figure: Flink SQL computation logic]

Let's take a look at the SQL of the entire Flink calculation logic, as shown above.

The logic here is fairly simple: it unions the rows of the two tables, performs a real-time SUM grouped by id and create_time, and writes the result into the Hologres result table, roughly as follows.
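A sketch of that computation; the NULL padding is one assumed way to align the two streams' columns before aggregating:

    INSERT INTO sink
    SELECT id,
           create_time,
           SUM(value_from_a) AS value_from_a,
           SUM(value_from_b) AS value_from_b
    FROM (
      SELECT id, create_time, value_from_a,
             CAST(NULL AS BIGINT) AS value_from_b
      FROM source_a
      UNION ALL
      SELECT id, create_time, CAST(NULL AS BIGINT) AS value_from_a,
             value_from_b
      FROM source_b
    ) AS merged
    GROUP BY id, create_time;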

Once the job is published, we can start and run it directly.

During the startup process, we can look at the status of the current Hologres tables.

[Figure: the Hologres tables are currently empty]

You can see that the Hologres tables are all empty at the moment. We will now update them and watch the data synchronize.

[Figure: a row inserted into table a appears in the result table]

First, insert a row into table a; you can see that the row is synchronized to the result table in real time.
Next, update the data in table b. Illustrative statements for both steps are sketched below.
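The exact statements issued in the demo are not shown in the screenshots, so the following only illustrates their shape, with hypothetical values:

    -- Insert a row into table a.
    INSERT INTO table_a VALUES (1, '2021-01-01 00:00:00', 100);

    -- Update table b for the same id and create_time.
    INSERT INTO table_b VALUES (1, '2021-01-01 00:00:00', 200)
    ON CONFLICT (id, create_time) DO UPDATE
    SET value_from_b = EXCLUDED.value_from_b;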

[Figure: updates from both streams aggregated into the result table]

You can see that the data from both streams has been propagated to the result table in real time, with accurate aggregation.

Next, we update table a.

[Figure: the update to table a reflected in the result table]

You can see that the real-time update of source table a is correctly reflected in the result table; Flink has computed the results of the two streams correctly.

We can even inspect the Binlog of the sink table, since the result table also has Binlog enabled, as shown below.

[Figure: Binlog of the result table]

You can see every change record of the table, which matches the expected behavior.

The above is the demo of Flink reading and writing Hologres in real time.

(2) Demo: DataWorks real-time synchronization to Hologres

Next, let's look at a demo of using DataWorks to synchronize PolarDB data to Hologres in real time.

First, go to Data Integration. Data synchronization requires a data source, so click to add one.

[Figure: adding a data source in Data Integration]

Then create the data source: select Hologres, fill in all the required information, and the data source is added.

[Figure: the new Hologres data source]

Next, we will demonstrate the data synchronization.

[Figure: the user_details table in PolarDB]

As shown above, we have a PolarDB database with a pre-created user_details table; the query result already shows three records. We want to synchronize this table's data to Hologres.

Back in Data Integration, click one-click real-time synchronization to Hologres, as shown below.

[Figure: one-click real-time synchronization to Hologres]

In the basic configuration, select the pre-created PolarDB data source, select the table user_details to synchronize, and click Next.

[Figures: selecting the target Hologres data source and the table-creation mode]

After that, select the target Hologres data source, refresh after adding it, then refresh the user_details table and configure whether the table should be created automatically or an existing table should be used. Here we choose automatic table creation, then click Next.

[Figure: DDL message processing rules]

For DDL message processing rules, we can configure a variety of handling strategies; set the rules as required and proceed to the next step.

[Figure: runtime resource configuration]

Next comes runtime resource configuration. Real-time synchronization in DataWorks usually requires an exclusive resource group, which we have already purchased here. Select the resource group for each synchronization step, complete the configuration, click Execute now, and wait for the job to start.

[Figure: PolarDB data synchronized into the Hologres result table]

You can see that the PolarDB data has been synchronized to the Hologres result table in real time.

[Figure: inserting a row with id 1004 into PolarDB]

Next, we can update this table: insert a new row with id 1004 into user_details. After the insert succeeds, let's look at the Hologres result table.

[Figure: row 1004 synchronized to Hologres]

You can see that the row with id 1004 has been synchronized to Hologres in real time, as shown below.

[Figure: final state of the Hologres result table]

As the demo shows, with DataWorks' real-time synchronization to Hologres we can synchronize database data into Hologres very conveniently.

Frequently Asked Questions and Future Outlook

(1) Frequently asked questions about the Hologres connector in Realtime Compute for Apache Flink

Having walked through Hologres' application scenarios and several demos, let's look at problems commonly encountered in practice.

Q: The job fails to start and cannot connect to Hologres.
A: The Hologres instance must be in the same region as the Flink cluster, and the VPC endpoint must be used.

Q: The data written to the result table in real time does not meet expectations.
A: This is usually caused by retraction messages; the ignoreDelete parameter needs to be set correctly.

Q: Real-time write performance is slow.
A: In high-QPS scenarios, partial (per-column) updates on column-store tables are relatively expensive. Switching to whole-row updates or to row-store writes is recommended.

Q: Dimension-table query performance is poor and the CPU load of the Hologres instance is high.
A: This is usually because a column-store table is used as the dimension table; switching to a row-store table is recommended.

Q: Real-time consumption of Binlog fails.
A: This is usually because Binlog is not enabled on the table; the table needs to be rebuilt with Binlog enabled.

(2) Future prospects

Finally, let's look at the future plans and outlook for Hologres on the real-time read/write path.

- Flink one-to-many dimension-table Join

This feature will be launched soon: we will implement one-to-many dimension-table Join in Flink, so dimension-table lookups will no longer be forced to go through the Hologres table's primary key.
Note, however, that performance in this scenario is usually not particularly good, because such queries inevitably scan the whole table and latency is therefore relatively high; we recommend sticking to one-to-one point-lookup scenarios whenever possible.

- JDBC-based real-time consumption of Hologres Binlog

The current implementation of Hologres Binlog uses a built-in interface that has not yet been exposed publicly. We will later provide a JDBC-based interface that lets users consume Hologres Binlog in real time.

- Real-time consumption of Hologres Binlog in DataWorks Data Integration

Data Integration currently cannot consume Hologres Binlog. We will later support consuming Hologres Binlog in real time through DataWorks Data Integration, so that Hologres data can be mirrored to other databases in real time.

- SQL reads and writes without connection limits

Because of the PostgreSQL model, the number of connections to a Hologres instance is currently limited. We will provide SQL read/write capability that is free of connection-count restrictions.

Some of the Hologres connectors and the holo-client mentioned above have been open-sourced on GitHub; users who need them can visit the link below.

Copyright Notice: The content of this article is contributed spontaneously by Alibaba Cloud real-name registered users, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own the copyright and does not assume the corresponding legal responsibilities. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it; once verified, the community will immediately delete the suspected infringing content.
