Abstract: During the construction of a data warehouse platform, the loading and unloading of data, the flow of data between the layers of the data model, and the implementation of business rules are all realized in the form of ETL tasks.
This article is shared from the HUAWEI Cloud community article "A 'Super Pipeline' for a Powerful 'Heart': the GaussDB (DWS) Data Warehouse Platform ETL System Construction Plan", original author: babu1801.
I. Introduction
During the construction of a data warehouse platform, the loading and unloading of data, the flow of data between the layers of the data model, the implementation of business rules, and other data processing steps are all realized in the form of ETL tasks. Building the ETL subsystem is therefore a critical part of implementing a data warehouse system, and building a complete, standardized ETL subsystem is one of the basic goals of data warehouse platform construction.
ETL is the abbreviation of the three data processing actions Extraction, Transformation and Loading. It also describes the data-flow processing order of early data warehouse construction, which is why the term is still in use today. As the database engine at the core of the data warehouse has continued to develop, the ETL model has evolved as well, gradually giving rise to variants such as ELT and ETLT. For MPPDB data warehouse platforms such as GaussDB (DWS), the ELT or ETLT mode is mainly used to build the ETL subsystem.
II. Logical Reference Architecture of the ETL Subsystem
The purpose of building the ETL subsystem is to integrate business data from the scattered, messy, heterogeneous data sources within the enterprise, and to perform the necessary cleaning and transformation to form either a high-quality unified data model or a dimensional model that is convenient for users to query, analyze and explore.
Figure 1 Data warehouse subsystem reference architecture
2.1 Data extraction (Extraction)
Data extraction is the process of capturing full or incremental data from the upstream systems of the data warehouse (usually core systems, business systems or external systems). As enterprise information architectures have matured and data platform responsibilities have been divided up, different platforms are usually associated with each other in a loosely coupled manner, and the traditional approach of a downstream system extracting data directly from an upstream system no longer fits current technology trends. On the one hand, direct extraction requires opening up permissions on the upstream system, which increases its data security risk. On the other hand, extraction should run in a time window after the business system has completed its normal workload, to avoid competing for resources with normal operations. Therefore, data extraction is usually handled by having the upstream and downstream systems negotiate an interface, with the upstream system unloading data according to the interface specification. More mature enterprises build a unified data exchange platform to carry out all data extraction and unloading work across the enterprise.
For the data warehouse platform, the work of data extraction therefore consists mainly of defining a unified interface specification.
2.2 Data Transformation (Transform)
Data transformation in the broad sense includes data cleaning, data association, data standardization, and data summarization and aggregation. Most transformation operations based on business rules and data models run more efficiently inside the MPPDB database than on an ETL server outside it. Implementing the T process through SQL inside the database is also more standardized and open than implementing it through ETL tools, making it easier for business personnel to participate in developing and verifying the T process.
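As a minimal sketch of such an in-database T process, the transformation below is expressed entirely as SQL and would run inside the MPPDB engine; the wrapper only renders the statement from a template. All table, column and schema names here are hypothetical illustrations, not part of any real model.

```python
from string import Template

# A hypothetical templated "T" step: aggregate one day of detail data
# into a summary table, entirely inside the database engine.
T_STEP = Template("""
INSERT INTO dm.customer_summary (cust_id, txn_cnt, txn_amt, data_dt)
SELECT cust_id,
       COUNT(*)        AS txn_cnt,
       SUM(txn_amt)    AS txn_amt,
       DATE '$data_dt' AS data_dt
FROM   dwd.transaction_detail
WHERE  txn_dt = DATE '$data_dt'
GROUP  BY cust_id;
""")

def render_t_step(data_dt: str) -> str:
    """Render the transformation SQL for one data date."""
    return T_STEP.substitute(data_dt=data_dt)
```

Because the logic lives in plain SQL rather than a proprietary tool, business personnel can read and verify the statement directly.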
2.3 Data Loading (Loading)
For the data warehouse, this step involves not only loading but also unloading data, i.e. both Loading and Unloading. A typical scenario is loading a large number of files into the database during the peak data-arrival period, then, once in-database processing is complete, unloading the results promptly into interface files that are pushed to downstream systems. Efficient batch loading and unloading is therefore one of the main challenges faced by a data warehouse ETL system. With growing customer demand for real-time data warehouses, real-time loading and unloading between databases, message queues and streaming components is another technical challenge in current ETL system construction.
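The load/unload pair can be sketched as two symmetric command builders. This assumes GaussDB (DWS)'s PostgreSQL-compatible `\copy` meta-command in the gsql client; the table names, file paths and database name are hypothetical.

```python
# Sketch of batch load/unload invocations, assuming a psql-style
# \copy meta-command in gsql; connection details are placeholders.
def load_command(table: str, data_file: str) -> list:
    """Build a gsql invocation that bulk-loads a CSV file into a table."""
    sql = f"\\copy {table} FROM '{data_file}' WITH (FORMAT csv)"
    return ["gsql", "-d", "etl_db", "-c", sql]

def unload_command(table: str, out_file: str) -> list:
    """Build a gsql invocation that unloads a table to a CSV interface file."""
    sql = f"\\copy {table} TO '{out_file}' WITH (FORMAT csv)"
    return ["gsql", "-d", "etl_db", "-c", sql]
```

In practice GaussDB (DWS) also offers dedicated parallel-loading mechanisms; this sketch only illustrates the symmetric shape of the Loading/Unloading interface.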
III. Two Implementation Architectures of the ETL Subsystem
An ETL system built on the GaussDB (DWS) database generally has two implementation approaches: the heavy ETL Server solution and the MPPDB solution, as shown in the figure below.
Figure 2 Schematic diagram of the two architectures
3.1 Heavy ETL Server Solution
This solution uses specialized ETL software such as Informatica, DataStage or Kettle, deployed on an ETL server cluster based on distributed or shared storage. When an ETL task executes, data is read from the MPPDB, processed on the ETL server, and the results are pushed back to the database server. Some of these operations can be completed inside the database through SQL pushdown.
The characteristic of this solution is that the entire ETL development and deployment process combines graphical and scripted operation. Tool-based process development also supports metadata-based lineage analysis and impact analysis of the ETL process; in terms of automated job scheduling, however, ETL tools are weaker than professional scheduling software.
ETL-tool-based solutions require professional developers for the ETL development process, with a deep understanding of the ETL tools themselves. Such a specialized tool threshold is not conducive to involving the enterprise's business experts and analysts in the ETL development process. The software and hardware investment cost of the ETL tooling also needs to be taken into consideration.
3.2 MPPDB scheme
In this solution, the ETL server is lightweight, and the production environment generally provides active and standby servers to avoid a single point of failure. The main features are as follows:
- Using the MPPDB parallel processing engine, ETL processing of massive data is more efficient;
- ETL process SQL is templated, keeping the cost of rapid development and iteration low;
- The ETL processing logic of the summary layer and data mart layer is generally strongly tied to business rules, and standard SQL presents a low development threshold for business personnel;
- Decoupling from third-party ETL servers through tool encapsulation avoids excessive reliance on any particular ETL tool;
- The ETL script templates require customized encapsulation and development, providing the underlying data for operations and maintenance, optimization and data governance.
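The last point, recording "underlying data" from every templated run, can be sketched as follows. The run-log store here is an in-memory list for illustration; a real system would persist these records to a database table for operations and governance use. All names are hypothetical.

```python
import time

# Sketch: each run of a templated ETL job records its job name,
# algorithm template code, data date, status and elapsed time.
RUN_LOG = []

def run_templated_job(job_name, template_code, data_dt, job_fn):
    """Execute job_fn and record run metadata for O&M and governance."""
    start = time.time()
    status = "SUCCESS"
    try:
        job_fn()
    except Exception:
        status = "FAILED"
    RUN_LOG.append({
        "job": job_name, "template": template_code,
        "data_dt": data_dt, "status": status,
        "elapsed_s": round(time.time() - start, 3),
    })
    return status
```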
3.3 Simple comparison
The heavy ETL Server solution is suitable for file-based data cleaning work: character-set conversion; preprocessing of interface data according to the interface specification (checking file size, record count and other file information, and performing data quality checks on attributes); file grouping, splitting, compression and decompression; and extended file monitoring and transmission functions.
The MPPDB solution is in essence a SQL-based implementation, suitable for data standardization processing, such as business code conversion, business-logic primary key generation, and transformations that bring data into line with business specifications, as well as data transformation operations such as summarization, aggregation, filtering, association, splitting and conversion.
IV. Key Points of GaussDB (DWS) ETL System Implementation
For GaussDB (DWS), the MPPDB solution is recommended in most situations. Realizing this solution requires encapsulating ETL SQL templates and combining the ETL development process with an external ETL scheduling system for layered processing. By implementing templates for the scheduling software, encapsulating its interface with the operating system, and encapsulating the business-module SQL in the GSQL tool, the system can be opened to business personnel and developers so that they can focus on the business logic itself, without worrying about the external environment's impact on running the ETL processes.
4.1 Logical view of ETL environment based on MPPDB
Figure 3 Logical view
- ETL scheduling
The ETL operating system of the data warehouse platform is a batch data processing system that runs non-interactively in the background. ETL job scheduling automates the various background jobs running in the data warehouse system and monitors and controls their operation. Scheduling software is used to achieve this: jobs can be distributed across multiple server platforms, and job definitions, dependencies, ordering relationships and job group relationships can be configured to facilitate automatic scheduling, operation and management of jobs.
The scheduling and monitoring platform can dynamically monitor and control job execution in a graphical manner, and provide detailed information on errors and warnings during job execution.
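The core of dependency-driven scheduling is deriving a valid run order from the configured job dependencies. A minimal sketch, using Kahn's topological sort with hypothetical job names:

```python
# Sketch: derive a run order from job dependencies (Kahn's algorithm).
def run_order(deps):
    """deps maps job -> set of jobs it depends on; return a valid order."""
    deps = {job: set(d) for job, d in deps.items()}
    order = []
    while deps:
        # Jobs whose dependencies have all completed are ready to run.
        ready = sorted(j for j, d in deps.items() if not d)
        if not ready:
            raise ValueError("circular dependency among jobs")
        for j in ready:
            order.append(j)
            del deps[j]
        for d in deps.values():
            d.difference_update(ready)
    return order
```

Real scheduling software adds triggers, calendars, retries and cross-server dispatch on top of this ordering logic.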
- ETL script encapsulation
GSQL is a tool for executing SQL, but it lacks some functions needed when working with scheduling software, such as parameter parsing, log analysis and exception handling, so GSQL needs to be encapsulated to improve its fit with the scheduling software.
- gsql template
Abstract and summarize common ETL processing flows to form algorithm templates.
Specify the required input parameters, set common session startup parameters, and add instrumentation points for subsequent optimization and performance tracking.
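A gsql algorithm template can be sketched as a parameterized script: required inputs are substituted in, common session settings appear at the top, and timing output serves as a simple instrumentation point. The meta-commands follow psql conventions (which gsql largely shares); the table names, GUC setting and "truncate-and-reload" algorithm body are illustrative assumptions.

```python
from string import Template

# Sketch of a gsql algorithm template: psql-style meta-commands are an
# assumption; table names and session settings are placeholders.
GSQL_TEMPLATE = Template("""\
\\set ON_ERROR_STOP on
\\timing on
SET statement_timeout = '$timeout';
-- algorithm body: full refresh of the target table for one data date
TRUNCATE TABLE $target_table;
INSERT INTO $target_table
SELECT * FROM $source_table WHERE data_dt = DATE '$data_dt';
""")

def render_gsql(target_table, source_table, data_dt, timeout="2h"):
    """Render one .gsql script instance from the algorithm template."""
    return GSQL_TEMPLATE.substitute(
        target_table=target_table, source_table=source_table,
        data_dt=data_dt, timeout=timeout)
```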
- Call form
Scheduling tool -> Python (or other scripting-tool) template -> GSQL -> {.gsql}
4.2 GSQL Encapsulation
Figure 4 Schematic diagram of GSQL encapsulation
Why encapsulation is needed:
Decoupling GSQL from the scheduling software: scheduling software can call Python/Perl/Shell scripts, so encapsulating GSQL in a script decouples it from the scheduling software and reduces the risk of adaptation and compatibility issues between the two;
Function points to consider in the encapsulation template:
- Translating the scheduling command into the GSQL run command:
The scheduling command is kept simple and related only to business logic, e.g. the business subsystem code, algorithm template code and data date;
GSQL's own runtime parameters need not, and should not, be exposed to the scheduling system interface: parameters that are irrelevant to the business, such as the login password, verbose logging levels, and other options added to ease operations, maintenance and performance tracking;
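This translation can be sketched as follows: the scheduler passes only the three business parameters, and the wrapper supplies everything else. The flag names follow psql conventions and are assumptions for gsql; the host, port, database and script-naming convention are hypothetical.

```python
# Sketch: map a simple scheduler call onto a full gsql command line.
# Connection details and the script naming convention are placeholders.
def build_gsql_command(subsys, template, data_dt,
                       host="dws-node1", port="8000", db="etl_db"):
    """Scheduler supplies (subsys, template, data_dt); wrapper adds the rest."""
    script = f"{subsys}_{template}.gsql"  # hypothetical naming convention
    return ["gsql", "-h", host, "-p", port, "-d", db,
            "-v", f"data_dt={data_dt}", "-f", script]
```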
- Login password encryption and decryption:
The GSQL login password must not be stored in plain text; it needs to be stored as cipher text on the ETL server. During execution it must also not be visible to the ps command on the command line;
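One common pattern is to decrypt the stored cipher text inside the wrapper and pass the password through the process environment rather than on the command line, so it never appears in `ps` output. Two assumptions are hedged here: base64 is only a stand-in for a real cipher (e.g. AES backed by a key store), and `PGPASSWORD` is the psql convention, which gsql is assumed to honour.

```python
import base64
import os
import subprocess

def decrypt(cipher_text: str) -> str:
    """Placeholder 'decryption' -- a real deployment would use a proper
    cipher; base64 only illustrates the cipher-text-on-disk pattern."""
    return base64.b64decode(cipher_text).decode()

def run_gsql(cmd, cipher_text):
    """Run gsql with the password in the environment, not on the CLI."""
    env = dict(os.environ)
    env["PGPASSWORD"] = decrypt(cipher_text)  # not visible in `ps` output
    return subprocess.run(cmd, env=env)
```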
- Exception handling:
Exception handling after a GSQL script fails: for example, re-running the job or sending alarm notifications;
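The re-run-then-alarm behaviour can be sketched as a small retry loop; the alarm hook is a hypothetical placeholder for mail/SMS/monitoring integration.

```python
# Sketch: retry a failed job a limited number of times, then alarm.
def run_with_retry(job_fn, max_retries=2, alarm_fn=print):
    """Return job_fn() if any attempt succeeds; alarm and re-raise otherwise."""
    attempt = 0
    while True:
        try:
            return job_fn()
        except Exception as exc:
            attempt += 1
            if attempt > max_retries:
                alarm_fn(f"job failed after {max_retries} retries: {exc}")
                raise
```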
- Run log analysis:
GSQL run log analysis: analyzing and tracking the different statements in a GSQL script, the execution time of each transaction, errors and alarm codes, providing detailed underlying data for performance analysis.
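A first step in such log analysis is extracting per-statement timings. The sketch below assumes the psql-style `Time: <n> ms` lines that `\timing on` produces; the exact gsql log format is an assumption.

```python
import re

# Sketch: pull per-statement timings out of a gsql run log, assuming
# psql-style "Time: <n> ms" lines from `\timing on`.
def statement_times_ms(log_text: str) -> list:
    return [float(m) for m in re.findall(r"Time: ([\d.]+) ms", log_text)]

def total_time_ms(log_text: str) -> float:
    """Total elapsed time across all timed statements in the log."""
    return sum(statement_times_ms(log_text))
```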
V. Summary
This article has given a preliminary introduction to the data warehouse ETL subsystem, described the two current mainstream ETL implementation architectures, and compared them with a focus on the MPPDB-based architecture. Finally, it outlined the key points of implementing the ETL subsystem on GaussDB (DWS), elaborating on the logical view of the ETL implementation and the function points of GSQL encapsulation. In future articles, the author will give a more detailed introduction to best practices for the concrete GSQL encapsulation.