Guided reading
Youzan Data Report Center provides merchants with a rich set of data indicators: 30+ pages, 100+ data reports, and 400+ different types of data indicators. These help merchants run their stores more rationally and scientifically, and also feed analysis and decision-making methods directly to the business. Behind them, the underlying tasks and data tables that run every day number in the thousands.
Faced with such a large data system, how should a tester formulate a quality assurance strategy? This article covers four aspects: 1. the Youzan data link, 2. data layer testing, 3. application layer testing, and 4. follow-up planning.
1. Youzan data link
1. Data link introduction
First of all, here is an introduction to Youzan's overall data architecture:
From top to bottom, it can be roughly divided into application service layer, data gateway layer, application storage layer, and data warehouse. Platforms such as job development and metadata management provide basic capabilities for data computing, task scheduling, and data query.
The above is a preliminary introduction to the overall architecture. For quality assurance, the two core parts are the data warehouse and the data application. These two parts are the core links in the data link; compared with other layers, their daily changes are more frequent and the risk of problems is relatively higher.
2. Data layer testing
1. Overall overview
First, the quality assurance of the data layer can be divided into three aspects: data timeliness, integrity, and accuracy.
2. Data timeliness
Data timeliness, as the name suggests, means the data must be produced on time. The three key elements of timeliness are: scheduled run time, priority, and the data deadline. A task's priority determines how many computing resources it gets and therefore affects its execution time. The data deadline is the unified standard for the latest acceptable output time of the data and must be strictly observed.
Among these three elements, the one that is a "universal rule" and deserves focus during quality assurance is the data deadline. Based on the data deadline, the guarantee strategies for timeliness fall into two types:
- Monitor whether the offline data task has finished. This relies on the monitoring and alerting of the Youzan job development platform: if a data task has not finished by its deadline, alerts are sent by email, WeCom (enterprise WeChat), phone, etc. to notify the responsible personnel.
- Check the number of rows in the full table or in the partition. This relies on the interface automation platform: by calling the dubbo interface, we judge whether the indicator returned by the interface is 0, and thus whether the data has been produced. A minimal SQL sketch of this check is shown below.
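For example, a minimal SQL sketch of the partition check (the table name and partition value are illustrative, borrowed from the examples later in this article; the real check at Youzan goes through the dubbo interface):
select
count(1)
from dw.dws_xx_order
where par = '20211025'
If the result is 0, the day's partition has not been produced and an alarm should be raised.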
Second, we can also pay attention to the number of failures and retries. When a task fails and retries multiple times during execution, an alarm can be raised so that the relevant personnel become aware of it. This alarm supplements the deadline alarm and is also integrated into the Youzan job development platform.
3. Data integrity
Data integrity, as the name suggests, is about whether the data is complete, and focuses on two points: no extra data, and no missing data.
- No extra data: generally check the full table and important enumeration values to see whether data is redundant or duplicated, and whether the primary key is unique.
- No missing data: generally check the full table and important fields (such as primary keys, enumeration values, dates, etc.) to see whether field values are empty or null.
As can be seen, data integrity is not closely tied to the business itself; it is mostly generic content verification of data warehouse tables. So, from basic dimensions, we can split the test focus into two directions: table level and field level.
Table level integrity:
- For the full-table dimension, check the total number of rows / total size of the table. If the total stays unchanged or decreases over time, the table data may have a problem.
- For the partition dimension, check the number of rows / size of the current day's partition. If it differs too much (much larger or much smaller) from the previous partition, the table data may have a problem; a minimal SQL sketch follows below.
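As an illustration of the partition-dimension check (the table name and partition values are hypothetical, reused from later examples in this article):
select
par
,count(1)
from dw.dws_xx_order
where par in ('20211024', '20211025')
group by par
If the row count of partition 20211025 deviates sharply from that of partition 20211024 (for example beyond an agreed threshold), the table data may have a problem.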
At present, the Youzan metadata management platform has integrated the related data views.
Field level integrity:
- Uniqueness judgment: ensure that the primary key or certain key fields are unique, to prevent duplicated data, and to prevent data from doubling after joining with other tables and inflating the final statistics.
For example, to determine whether the order number in the ods layer order table is unique, write sql:
select
count(order_no)
,count(distinct order_no)
from ods.xx_order
If the two are equal, it means that the order_no value is unique in the table; otherwise, the order_no value is not unique in the table, and there is a problem with the table data.
- Non-empty judgment: ensure that important fields are not empty, to prevent rows with empty values from being dropped after joins and deflating the final statistics.
For example, to determine whether the order number in the order table of the ods layer is null, write sql:
select
count(*)
from ods.xx_order
where order_no is null
If the result is equal to 0, it means that there is no null in order_no; if the result is greater than 0, it means that there are null values in order_no, and there is a problem with the table data.
- Enumeration type judgment: ensure that enumeration field values are within the expected range, to prevent dirty business data from causing missing or extra categories in the final statistics.
For example, to judge whether all the enumeration values in the shop_type field in the order table of the ods layer meet the expectations, write sql:
select shop_type from ods.xx_order group by shop_type
Analyze whether the query results meet expectations to ensure that there are no missing/redundant enum types.
- Data validity judgment: judge whether the data format meets expectations, to prevent wrongly formatted field values from causing errors or missing statistics. A common check is that dates follow the yyyymmdd format, as in the sketch below.
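For example, a minimal sketch of such a format check (the pay_date field is hypothetical; rlike is the Hive regex operator, other engines may differ):
select
count(*)
from ods.xx_order
where pay_date is not null
and not (pay_date rlike '^[0-9]{8}$')
If the result is greater than 0, some dates do not follow the yyyymmdd format and the table data has a problem.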
Data integrity problems, once they occur, have a large impact on data quality. The integrity strategy is therefore best applied at the ods layer: we want to find and fix unreasonable data at the source, stop the loss in time, and prevent dirty data from flowing downstream and spreading the pollution.
In addition, the logic of these integrity checks is simple and relatively fixed, and can be templated with a little abstraction. As testers, we therefore prefer to turn data integrity verification into tools. Youzan has already implemented a "Data Shape Tool"; here are some of the ideas behind it:
- Universal rules for all tables, such as the uniqueness of the table's primary key.
- Common judgment rules for different data types, such as numeric, String, enumeration, and date formats.
- A severity class for each rule. For example, a non-unique table primary key is recorded as critical; a String field with more than 70% null values is recorded as warning.
- A visual report of whether the table data satisfies the rules above, from which testers can evaluate the data quality.
4. Data accuracy
Data accuracy, as the name implies, means the data must be "accurate". The concept of accuracy is relatively abstract: it is hard to describe how accurate data is with a single strong logical judgment, and much of it lives in perceptual cognition. Accuracy testing is therefore the most divergent direction of thinking in data quality assurance.
In summary, we can control data accuracy through field self-checks, horizontal data comparison, longitudinal (vertical) comparison, code review, and so on. These test points are closely tied to the business.
4.1 Self-check
Self-checking means verifying the accuracy of data against itself, without comparing it with other data; it is the most basic kind of check. Common self-checks include: a numeric indicator is greater than 0, and a ratio indicator falls within the range 0-1. Like data integrity checks, such basic rules can also be covered by the "Data Shape Tool" to assist testing.
For example, for the order table, the payment amount must be greater than or equal to 0, and there will be no negative numbers. Write sql:
select
count(pay_price)
from dw.dws_xx_order
where par = 20211025 and pay_price < 0
If the result is 0, no payment amount is negative, which meets expectations; if the count is greater than 0, the data has a problem.
4.2 Horizontal data comparison in the table
Horizontal comparison within a table means that two or more fields in the same table are related by business and have a definite logical relationship, which can be used for data comparison.
For example, for the order table, actual business analysis easily yields: for any product in any store, the number of orders >= the number of buyers who placed orders. Write sql:
select
kdt_id
,goods_id
,count(order_no)
,count(distinct buyer_id)
from dw.dws_xx_order
where par = '20211025'
group by kdt_id,goods_id
having count(order_no)<count(distinct buyer_id)
If the query returns no records, there is no case where the number of orders < the number of buyers, i.e. the number of orders >= the number of buyers everywhere, which meets expectations; if the query returns any records, the data does not meet expectations.
4.3 Horizontal data comparison between tables
Horizontal comparison between tables means that two or more tables contain fields with a business association or the same business meaning, which can be used for data comparison:
- Comparison between tables of the same type: for payment table A and payment table B in hive, both contain a payment amount field, so under the same dimension, table A's payment amount = table B's payment amount (a minimal SQL sketch of this kind of comparison follows this list).
- Comparison between multiple storage engines: for example, the application layer of the Youzan data report center stores the payment table in both mysql and kylin for active/standby switching, so under the same dimension, the payment amount in the kylin table = the payment amount in the mysql table.
- Comparison between systems: for example, both the data report center and the crm system have customer indicator data, so under the same dimension, the customer indicators in the data report center = the customer indicators in crm.
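Here is a minimal sketch of the first kind of comparison (the table names dw.dws_pay_a and dw.dws_pay_b and their fields are hypothetical; in practice the Data Comparison Tool described below does this work):
select
a.kdt_id
,a.pay_amount as pay_amount_a
,b.pay_amount as pay_amount_b
from
(select kdt_id, sum(pay_price) as pay_amount from dw.dws_pay_a where par = '20211025' group by kdt_id) a
full outer join
(select kdt_id, sum(pay_price) as pay_amount from dw.dws_pay_b where par = '20211025' group by kdt_id) b
on a.kdt_id = b.kdt_id
where coalesce(a.pay_amount, -1) <> coalesce(b.pay_amount, -1)
If the query returns no records, the payment amounts of the two tables are consistent under the store dimension; any returned record points to a discrepancy.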
If we dig into the underlying logic of horizontal comparison, it is essentially comparing fields of two tables with logical operators, which is easy to abstract into a tool. Youzan has already implemented a "Data Comparison Tool"; here are some of the ideas behind it:
- Enter two tables and set the primary keys of the two tables respectively.
- Enter the fields to be compared in the two tables, and set the comparison operators, such as >, =, <.
- According to the configured rules, the tool records which comparisons pass and which fail and produces a visual report, from which testers can evaluate the data quality.
4.4 Longitudinal data comparison
Longitudinal comparison compares upstream and downstream data; the purpose is to ensure that important fields are processed correctly from upstream to downstream.
For example, the data warehouse dw layer has a detailed list of orders, and the data product dm layer has an aggregate table of the number of orders, so the statistical results of the two data under the same dimension should be consistent.
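As an illustration, a minimal sketch of such an upstream/downstream check (the dm table name dm.dm_xx_order_cnt and its fields are hypothetical; the dw table reuses the earlier example):
select
d.kdt_id
,d.order_cnt as dw_order_cnt
,m.order_cnt as dm_order_cnt
from
(select kdt_id, count(order_no) as order_cnt from dw.dws_xx_order where par = '20211025' group by kdt_id) d
join dm.dm_xx_order_cnt m
on d.kdt_id = m.kdt_id
where m.par = '20211025' and d.order_cnt <> m.order_cnt
If the query returns no records, the dw detail table and the dm aggregate table are consistent under the store dimension; otherwise the upstream/downstream processing needs to be checked.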
4.5 Code review
First of all, in the requirements review stage before code review, we must clarify the detailed statistical caliber of the data. Here are two actual requirement examples.
- Requirement 1 (bad example): the payment amount of all users in the store during the statistical period. The problem: the description is too terse, and the time dimension and filter conditions of the statistics are not spelled out, so the statistical caliber is unclear and the product manager needs to clarify it.
- Requirement 2 (good example): the offline payment amount at the store dimension of the Youzan network-wide merchant domain. Supports natural day, natural week, and natural month. Defined as the sum of all payment orders during the statistical period (excluding lucky draw group, gift card, and distribution supply orders).
After clarifying the requirements, some common concerns of code review are described in detail below:
1) Association & filter conditions
- Whether the associated table uses outer join or join depends on whether the data needs to be filtered.
- In the on clause of the association relationship, whether the left and right value types are consistent.
- If the association is expected to be 1:1, check whether the join keys of both tables are unique. If they are not unique, the join will produce a Cartesian product and inflate the data.
- Whether the where condition filters correctly. Taking the above requirement as an example, check whether lucky draw group, gift card, and distribution supply orders are correctly excluded in the SQL (see the sketch after this list).
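The following sketch pulls these review points together (the dimension table dw.dim_xx_shop, the order_type field, and its enumeration values are hypothetical and only illustrate the points above):
select
o.kdt_id
,sum(o.pay_price) as pay_amount
from dw.dws_xx_order o
left outer join dw.dim_xx_shop s -- outer join keeps orders even when the shop dimension row is missing
on o.kdt_id = s.kdt_id -- both sides of the on condition should have the same type; s.kdt_id should be unique to avoid a Cartesian product
where o.par = '20211025'
and o.order_type not in ('lucky_draw_group', 'gift_card', 'distribution_supply') -- exclusions per the requirement caliber
group by o.kdt_id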
2) Statistical caliber processing of indicators
The statistics of data indicators involve two basic concepts:
- Accumulative indicators: such as payment amount, pageviews, etc., indicators that can be counted by simply adding numerical values. For such indicators, the function used in sql is generally sum.
- Non-cumulative indicators: for example, the number of visitors cannot be simply added; it must be deduplicated before counting. For such indicators, count(distinct) is generally used in SQL (see the sketch below).
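A minimal sketch contrasting the two kinds of indicators (reusing the hypothetical order table from the earlier examples):
select
kdt_id
,sum(pay_price) as pay_amount -- cumulative indicator: simple numeric accumulation
,count(distinct buyer_id) as buyer_cnt -- non-cumulative indicator: deduplicate before counting
from dw.dws_xx_order
where par = '20211025'
group by kdt_id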
3) Inserting data
- Whether the task supports rerunning. In essence, check whether the insert statement contains the overwrite keyword. Without it, rerunning the workflow (executing it multiple times) will not overwrite the old data but will keep appending to the table, which can double the final statistics.
- Whether the order of the inserted fields exactly matches the column order of the target table. We must make sure the fields are written in the right order, otherwise the inserted values will be misaligned; a sketch of a rerunnable insert follows below.
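A minimal sketch of a rerunnable insert (the target table dm.dm_xx_order_cnt and its partition are hypothetical):
insert overwrite table dm.dm_xx_order_cnt partition (par = '20211025')
select
kdt_id -- the select column order must match the target table definition
,count(order_no) as order_cnt
from dw.dws_xx_order
where par = '20211025'
group by kdt_id
With the overwrite keyword, a rerun replaces the partition data instead of appending to it.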
3. Application layer testing
1. Overall overview
The basic front-end page + server-side interface testing is the same as that of general business testing, and will not be repeated here. This article focuses on the areas where "data application" testing requires additional attention.
2. Downgrade strategy
- When adding a new data page, confirm in the requirements and technical review stage whether the "blue bar" function needs to be supported; this is part of shifting testing left.
Blue bar introduction: a blue bar at the top of the page tells the merchant that the offline data has not been produced yet; the displayed "expected production time" = current access time + 2 hours, calculated dynamically.
- When testing ratio-type indicators, focus on the special case where the denominator = 0. Pay attention to this point during back-end code review and page function testing. At present, Youzan handles this case by uniformly displaying "-" on the front end.
3. Active/standby strategy
When there is an active/standby switching strategy, verify during testing that the data is double-written normally, and that the active and standby data sources can be switched via configuration when querying.
4. Data security
Pay attention to the permission management and control of data queries, and focus on testing horizontal and vertical privilege-escalation (unauthorized access) scenarios.
4. Follow-up planning
At present, in the data accuracy comparisons of real projects, the Data Comparison Tool can replace only about 50% of manual testing because it does not yet support sql functions; some complex horizontal and vertical comparisons still require hand-written sql. The follow-up plan is to support sql functions such as sum, count, max, and min, raising the tool's coverage to more than 75% and greatly reducing the cost of data comparison.
The "Data Shape Tool" and the "Data Comparison Tool" are being applied in more project tests. The follow-up plan is to turn the shape checks and data comparisons into online inspections, combining automation and data tools to continuously guarantee the quality of data warehouse tables.
At present, sql code review relies mainly on manual work. We plan to turn some basic sql checks, such as the insert statement check, the uniqueness check of the join on condition, and the field insertion order check, into sql static scans, integrate them into the big data testing service, and make them available to other business lines.