
Author: Kuang Donglin, Data Architect at Aurora

01 Summary

This paper proposes a method for building a big data quality system that monitors the data quality of ETL tasks during data processing and, based on the monitoring results, raises alarms or suspends tasks as needed. Monitoring tasks can be enabled incrementally, no modification to existing ETL tasks is required, and enabling or disabling a monitoring task does not affect the dependencies of the original ETL tasks.

02 Background

As Jiguang's business lines continue to grow, they depend more and more on the data analysis capabilities provided by the data center.

More and more ETL tasks run on the data center's data platform, their processing logic keeps getting more complex, and the dependency chains of complex ETL tasks keep getting longer. These tasks are often developed by business staff from different teams, whose ETL skills vary and who do not always communicate smoothly, so data quality problems occur frequently. Worse, once a quality problem appears, the long dependency chain makes it hard to diagnose: the investigation must work through upstream and downstream tasks one by one and be coordinated across developers from different teams, which greatly reduces the efficiency of data delivery.

In addition, because there were no unified data quality monitoring measures, it was usually the business side that discovered a problem in a data report and fed it back to the developers, who then passively located and fixed the error; there was no proactive way to discover problems.

Finally, for the same reason, data quality problems were often only noticed after they had already spread downstream, wasting considerable computing resources and making the repair process time-consuming.

To reverse this situation completely, a capable data quality monitoring system is essential. Based on Jiguang's actual situation, this paper designs a data quality monitoring system that addresses the problems described above.

03 Design plan

Aurora's data quality monitoring platform relies heavily on underlying functions provided by the ETL task scheduling platform. Its overall architecture is shown in Figure 3-1.

Figure 3-1 The back-end architecture of the data quality monitoring project

The main responsibilities of the data quality service backend are:

1. Receive configuration information from the front end and store it in the backend database;

2. Accept the tracking logs reported during quality inspection, covering indicator collection, indicator evaluation, task alarming and blocking, and so on;

3. Obtain the list of scheduled tasks and the quality-inspection-related task configuration from the scheduling system.

The scheduling node in the figure is essentially a scheduling process started by the scheduling system. During normal scheduling, this process runs the appropriate code according to the configured task type. For example, if the task is a Hive script, the scheduling process submits the script to the Hive server through the Hive client. Here, the client code needed to run each task type is unified and abstracted into a SideCarProxy component that exposes a uniform interface; different types of ETL tasks are submitted and run through different SideCarProxy implementations. For example, the SideCarProxy implementation for Spark provides the ability to submit Spark jobs, so a Spark ETL task only needs to implement its specific business logic.
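To make the SideCarProxy abstraction concrete, here is a minimal sketch. The class and method names (SideCarProxy, submit) and the use of beeline and spark-submit as clients are illustrative assumptions, not the platform's actual API:

```python
import subprocess
from abc import ABC, abstractmethod


class SideCarProxy(ABC):
    """Uniform interface the scheduling process uses to run any ETL task type."""

    @abstractmethod
    def submit(self, task_conf: dict) -> int:
        """Submit the task and return its exit code."""


class HiveSideCarProxy(SideCarProxy):
    def submit(self, task_conf: dict) -> int:
        # Submit the Hive script through a Hive client (beeline here).
        return subprocess.call(
            ["beeline", "-u", task_conf["jdbc_url"], "-f", task_conf["script"]]
        )


class SparkSideCarProxy(SideCarProxy):
    def submit(self, task_conf: dict) -> int:
        # Submit the Spark job via spark-submit; the ETL task itself
        # contains only business logic.
        return subprocess.call(
            ["spark-submit", "--class", task_conf["main_class"], task_conf["jar"]]
        )
```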

To let ETL tasks enable the data quality check function, we extended SideCarProxy. While keeping the original interface compatible, we added four interfaces: the data quality indicator collection interface, the data quality indicator evaluation interface, the data quality check action interface, and the data quality alarm configuration interface. The quality checks these interfaces drive can be further divided into pre-checks, mid-checks, and post-checks according to when they are triggered.

A pre-check is a quality check triggered before the ETL task runs; it is generally used to verify that the task's input data meets requirements. A mid-check monitors the task's running status while the ETL task is executing; it typically checks indicators such as start time, running time, and completion time. A post-check is triggered after the ETL task completes; it is generally used to verify that the task's output data meets requirements.
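Under these definitions, the extended SideCarProxy can wrap task execution roughly as follows. This is a sketch under the same illustrative naming as above; the checks argument and its run() method stand in for checks configured through the four interfaces:

```python
import threading
import time


def run_with_checks(proxy, task_conf: dict, checks: dict) -> int:
    # Pre-checks: validate the input data before the task starts.
    for check in checks["pre"]:
        check.run()

    # Mid-checks: poll runtime indicators (start time, running time, ...)
    # in the background while the task executes.
    stop = threading.Event()

    def mid_loop():
        while not stop.is_set():
            for check in checks["mid"]:
                check.run()
            time.sleep(60)  # assumed polling interval

    threading.Thread(target=mid_loop, daemon=True).start()

    exit_code = proxy.submit(task_conf)
    stop.set()

    # Post-checks: validate the output data after the task completes.
    if exit_code == 0:
        for check in checks["post"]:
            check.run()
    return exit_code
```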

3.1 Data quality index collection interface

Data quality inspection relies on the definition of inspection indicators, which are defined entirely by the business according to its own needs. For convenience, the system has some commonly used indicators built in, such as task startup delay, running time, and completion delay. For business-specific indicators, the system supports defining custom data quality collection indicators through SparkSQL, as shown in Figure 3-2.


Figure 3-2 Pre-defined indicators and custom indicators for quality inspection
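As an illustration of how a custom SparkSQL indicator might be collected and reported, consider monitoring the fraction of events with a missing user id. The table name, the reporting endpoint, and the payload fields are assumptions made for this sketch:

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quality-indicator").getOrCreate()

# The business expresses the indicator as a SparkSQL query; the platform
# substitutes scheduling variables such as ${day} before running it.
indicator_sql = """
SELECT SUM(IF(user_id IS NULL, 1, 0)) / COUNT(1) AS value
FROM dw.events
WHERE day = '${day}'
"""

value = spark.sql(indicator_sql.replace("${day}", "2021-06-01")).first()["value"]

# Report the collected value to the quality service backend, which stores it
# and drives evaluation and alarming.
requests.post(
    "http://quality-service/api/indicators",  # hypothetical endpoint
    json={"task_id": "demo_task", "indicator": "null_user_ratio", "value": value},
)
```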

3.2 Data quality indicators

Using the data quality indicator collection interface, data quality can be monitored from perspectives such as timeliness, completeness, consistency, and statistical characteristics. The following examples illustrate how these indicators are monitored.

3.2.1 Timeliness indicator monitoring

Timeliness indicators are currently implemented mainly through task start and end delays. The start delay measures the lag between the ETL task's configured start time and the time the task actually starts; the end delay measures the lag between the task's configured completion time and the time it actually completes. Both indicators are natively supported by the monitoring system; developers only need to enable the two monitoring items.
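For intuition, the two delays might be computed as below; in practice the configured times would come from the scheduler's task configuration:

```python
from datetime import datetime


def start_delay_minutes(configured_start: datetime, actual_start: datetime) -> float:
    """Start delay: configured start time vs. actual start time."""
    return (actual_start - configured_start).total_seconds() / 60


def end_delay_minutes(configured_end: datetime, actual_end: datetime) -> float:
    """End delay: configured completion time vs. actual completion time."""
    return (actual_end - configured_end).total_seconds() / 60


# Example: a task configured to start at 02:00 that actually started at 02:25
assert start_delay_minutes(
    datetime(2021, 6, 1, 2, 0), datetime(2021, 6, 1, 2, 25)
) == 25.0
```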

3.2.2 Completeness indicator monitoring

Completeness indicators generally monitor whether the data in the target table is complete. The simplest check counts the number of records in a specified partition of the target table. This check can currently be implemented through the system's custom indicator interface; its SQL logic is shown in the table below.

Table 3-1 SQL implementation of completeness indicator collection

Variables in the SQL are automatically replaced with the concrete values in effect when the task is scheduled. The indicator is collected every time the ETL task runs; its value is then processed by the evaluation rules, and alarm actions are carried out according to the alarm rules. We plan to implement some commonly used completeness indicators as native system support to make them more convenient to use.
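A representative shape for the SQL that Table 3-1 describes, assuming a target table partitioned by day and a scheduler variable ${day} (both names are illustrative):

```python
# Completeness indicator: record count of the partition written by this run.
# ${day} is replaced by the scheduler with the concrete partition value.
completeness_sql = """
SELECT COUNT(1) AS record_count
FROM dw.target_table
WHERE day = '${day}'
"""
```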

3.2.3 Consistency indicator monitoring

Consistency indicator monitoring is implemented in exactly the same way as completeness indicators: at present, users must write SQL scripts themselves. We also plan to implement commonly used consistency indicators as native system support in the future.
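For example, a consistency indicator might compare the target table's row count against its source table for the same partition; the table names are illustrative:

```python
# Consistency indicator: ratio of target rows to source rows for one partition.
# A value far from 1.0 suggests records were lost or duplicated by the ETL task.
consistency_sql = """
SELECT t.cnt / s.cnt AS consistency_ratio
FROM (SELECT COUNT(1) AS cnt FROM dw.target_table WHERE day = '${day}') t
CROSS JOIN (SELECT COUNT(1) AS cnt FROM ods.source_table WHERE day = '${day}') s
"""
```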

3.2.4 Data statistical feature monitoring

This type of indicator depends entirely on the business logic of the ETL task itself; it is essentially a check on the business quality of a class of data. At present, monitoring such indicators requires the business to express the collection logic in SQL, and the system only provides storage and management for the collected indicators. The following uses monitoring a task's degree of data skew as an example. Assuming data skew is measured by how far the data distribution deviates from the uniform distribution, the skew indicator can be defined as follows (the formula comes from cross-entropy in information theory).
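One plausible formulation under that assumption is the relative entropy of the observed distribution against the uniform distribution: it is zero exactly when the data is perfectly uniform and grows as the skew worsens. With $n$ buckets and $p_i$ the fraction of records in bucket $i$:

```latex
\mathrm{skew} = \sum_{i=1}^{n} p_i \log\left(n\, p_i\right) = \log n - H(p),
\qquad H(p) = -\sum_{i=1}^{n} p_i \log p_i
```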



Table 3-2 SQL implementation of statistical characteristic indicator collection
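A SparkSQL sketch of such a collection query, assuming the records are distributed over an illustrative key bucket_key and skew is computed as above:

```python
# Data skew indicator: relative entropy of the per-bucket distribution
# against the uniform distribution, computed entirely in SparkSQL.
skew_sql = """
SELECT SUM(p * LN(p * n)) AS skew
FROM (
    SELECT cnt / SUM(cnt) OVER () AS p,   -- fraction of records per bucket
           COUNT(*) OVER ()       AS n    -- number of buckets
    FROM (
        SELECT bucket_key, COUNT(1) AS cnt
        FROM dw.target_table
        WHERE day = '${day}'
        GROUP BY bucket_key
    ) counts
) dist
"""
```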

Once this skew value has been collected, the threshold configured in the evaluation rule determines whether the degree of skew is within a reasonable range, and the corresponding alarm is raised if it is not.

3.3 Data quality index evaluation interface

The data quality indicator evaluation interface is used to evaluate whether collected indicators meet the data quality requirements. The interface accepts two main parameters: an evaluation rule and a threshold range. The evaluation rule determines how a collected indicator value is converted into an evaluation value; the threshold range defines the reasonable range of the evaluation value. Once the evaluation value falls outside the threshold range, the data quality is considered unqualified.

Different evaluation rules can be chosen according to business needs. For convenience, the system has three built-in evaluation rules: pass-through, year-over-year comparison, and period-over-period comparison. The business can also define custom evaluation rules through SparkSQL.

Figure 3-3 Evaluation rules and alarm settings for quality inspection
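A minimal sketch of how the three built-in rules and the threshold check might fit together; the function names and signatures are assumptions, not the platform's actual API:

```python
from typing import Callable, Dict


def passthrough(current: float, history: Dict[str, float]) -> float:
    # Pass-through: the collected value is used as the evaluation value.
    return current


def year_over_year(current: float, history: Dict[str, float]) -> float:
    # Year-over-year: relative change against the same period last year.
    return (current - history["last_year"]) / history["last_year"]


def period_over_period(current: float, history: Dict[str, float]) -> float:
    # Period-over-period: relative change against the previous period.
    return (current - history["previous"]) / history["previous"]


def evaluate(rule: Callable[[float, Dict[str, float]], float],
             current: float, history: Dict[str, float],
             low: float, high: float) -> bool:
    """Return True if the evaluation value lies within the threshold range."""
    value = rule(current, history)
    return low <= value <= high


# Example: the check passes if today's row count is within 20% of yesterday's.
assert evaluate(period_over_period, 9_000_000, {"previous": 10_000_000}, -0.2, 0.2)
```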

3.4 Data quality check action interface

The data quality check action interface configures the measure taken when a data quality check fails; three actions are currently supported: ignore, alarm, and block. Ignore means no action is taken, and the check task and the ETL task continue to run normally. Alarm means an alarm message is sent via DingTalk, SMS, or phone according to the configured alarm template, while the check task and the ETL task continue to run unaffected. Block means an alarm message is sent in the same way, and the ETL task and its quality check task are terminated; the ETL task is deemed to have failed, downstream tasks that depend on it cannot continue to execute, and the task is only considered successful after manual intervention repairs the data and the data quality check passes.
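The three actions could be dispatched roughly as follows; send_alarm and the exception-based blocking are assumptions made for this sketch:

```python
from enum import Enum


class CheckAction(Enum):
    IGNORE = "ignore"
    ALARM = "alarm"
    BLOCK = "block"


class QualityCheckFailed(Exception):
    """Raised to fail the ETL task so that downstream tasks will not run."""


def send_alarm(message: str) -> None:
    # Placeholder for the configured channels (DingTalk / SMS / phone).
    print(f"[ALARM] {message}")


def on_check_failed(action: CheckAction, message: str) -> None:
    if action is CheckAction.IGNORE:
        return  # no-op: the check task and the ETL task keep running normally
    send_alarm(message)
    if action is CheckAction.BLOCK:
        # Terminate the ETL task; it stays failed until the data is repaired
        # manually and the quality check passes.
        raise QualityCheckFailed(message)
```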

3.5 Data quality alarm configuration interface

The data quality alarm configuration interface mainly configures the alarm recipients, the delivery channels, and other advanced alarm-related settings.

04 Conclusion

The data quality monitoring project provides a unified platform for monitoring business data quality, upgrades business data quality assurance from passive notification to active discovery, and gives the business itself a way to participate in data quality assurance work.

At the same time, to further strengthen data quality assurance, the data quality monitoring project is being continuously enhanced and improved. The main planned improvements are:

1. Continue to provide more predefined collection indicators and predefined evaluation rules;

2. Support more ways to implement custom indicators, for example defining custom indicators in Python;

3. Use the dependency lineage of ETL tasks to further assess the scope of impact of a data quality problem, assign alarm levels according to that scope, and notify the relevant business and development staff in advance, based on the affected business lines, so they can coordinate the response.

