Authors:

vivo Internet Data Analysis Team - Dong Chenwei

vivo Internet Big Data Team - Qin Cancan, Zeng Kun

This article shares the vivo Game Center's practical experience with its grayscale data analysis system and presents a relatively complete intelligent grayscale data solution across the four levels of "experimental ideas - mathematical methods - data models - product solutions", aiming to keep version evaluation scientific, keep projects on schedule, and close the loop on grayscale verification quickly. The highlights of the solution are the introduction of root cause analysis for indicator changes and the design of a fully automated product workflow.

1. Introduction

The game business has a large user base, long business links, and complex data logic. As the core user-facing product of the game business platform, the Game Center iterates versions frequently, and each version must go through small-scale grayscale verification before launch. Since 2021, an important version has entered grayscale every 1-2 weeks on average, and sometimes multiple versions are in grayscale testing online at the same time.

The grayscale process mainly raises three questions at the data level:

  1. How do we keep the version grayscale evaluation scientific?
  2. How do we improve the output efficiency of grayscale data to keep the project on schedule?
  3. When an indicator behaves abnormally in the grayscale version, how do we quickly locate the problem and close the loop?

Over the past two years, we have gradually productized the grayscale evaluation method on data products such as agile BI, and the grayscale data system now addresses all three questions well. This article starts from the basic concepts and development history of the version grayscale data system, then uses "methodology + solutions" as the main line to describe the Game Center's practice, and closes with an outlook on future work.

2. Development of Grayscale Data System

2.1 What is Grayscale Release

When the Game Center develops a brand-new homepage, how do we verify whether users accept it, whether its functions are complete, and whether its performance is stable?

The answer is grayscale release: before the new version is pushed to all users, select some users according to a certain strategy and let them experience the new homepage first, collecting feedback on whether the new homepage is easy to use and, if not, where the problems lie. If a major problem appears, roll back to the old version in time; otherwise, fix the gaps revealed by the feedback and keep enlarging the rollout scope until the full upgrade.

2.2 Development stage of grayscale evaluation scheme

The key to a scientific grayscale release lies in controlling variables. The process of solving this problem is also the process by which the grayscale evaluation scheme iterated and developed.

[Image]

Stage 1: The comparison windows cover the same time period, but differences in upgrade speed mean that users who upgrade first and users who have not yet upgraded are non-homogeneous, so the impact of sample differences on the data results cannot be ruled out.

Stage 2: The comparison groups are the same, but user behavior may change over time, so the time factor between the before and after periods cannot be eliminated.

Stage 3: Both the time period and the crowd are the same, which brings three advantages:

  • The old version is packaged as a comparison package and released together with the new grayscale package to two batches of homogeneous users, so the sample attributes and time factors of the two packages stay consistent;
  • A reasonable sample size is calculated from the product goal, avoiding both unreliable results from too few samples and wasted resources from too many;
  • Silent installation speeds up the upgrade and shortens the grayscale verification stage.

2.3 Contents of Grayscale Data System

The grayscale data system usually involves two parts: the traffic strategy in the early stage and the data inspection in the later stage.

The former includes sample size calculation and grayscale duration control; the latter covers the comparison of core indicators between the new and old versions, indicator changes from product optimizations, and the data performance of new functions. On top of the conventional grayscale evaluation, introducing root cause analysis improves the interpretability of grayscale results.

2.4 The practice of vivo game center

We built the "Game Center Intelligent Grayscale Data System" and solved the three problems raised at the beginning of this article through three iterations. The system consists of thematic dashboards such as indicator test results, dimensional drill-down interpretation, user attribute verification, and indicator anomaly diagnosis, plus automatically pushed grayscale conclusion reports.

After the complete solution was deployed, it essentially closed the loop of automatic data production, effect testing, data interpretation, and decision suggestions in the grayscale evaluation stage, which freed up a great deal of manpower.

3. Methodology in Grayscale Data System

Before introducing the data scheme design, we first cover the background knowledge and methodology involved in the grayscale data system, to help you better understand the rest of this article.

3.1 Grayscale experiment

The grayscale experiment includes two parts, sampling and effect testing, which correspond to the idea of hypothesis testing and to the verification of historical differences between samples.

3.1.1 Hypothesis testing

Hypothesis testing first proposes a hypothesized value for a population parameter, then uses the sample's performance to judge whether the hypothesis holds.

[Image]
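As a minimal sketch of this idea, the following hypothetical two-proportion z-test (all numbers invented) checks whether a rate difference between a grayscale group and a comparison group is large enough to reject the "no difference" hypothesis:

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Two-proportion z-test: H0 says the two versions share the same rate.
    Uses the pooled-variance form; returns the z statistic."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical numbers: 12% vs 10% click rate on 5000 users per group
z = two_proportion_z(0.12, 5000, 0.10, 5000)
significant = abs(z) > 1.96  # reject H0 at the 95% confidence level
print(round(z, 2), significant)  # → 3.2 True
```

Here a z statistic beyond ±1.96 means the observed difference would be very unlikely if the two versions truly performed the same.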

3.1.2 Sample historical difference verification

Although the grayscale samples are drawn in advance through a hash algorithm, sampling is random, so the historical difference between the samples is generally verified at the same time as the statistical effect test, to rule out indicator fluctuations caused by the samples themselves. The grayscale period is usually 7 days, so we adopted a 7-day sliding-window sampling method.
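A minimal sketch of this check, with invented daily data: run the same rate test on the two groups' pre-grayscale history and count how many of the 7 window days show a significant difference.

```python
import math

def rate_diff_significant(p1, n1, p2, n2, z_crit=1.96):
    """Unpooled two-proportion z-test on a rate metric."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return abs(p1 - p2) / se > z_crit

# Hypothetical daily (rate, sample size) pairs for the 7 days before grayscale
gray_pre = [(0.100, 9800), (0.102, 9700), (0.099, 9900), (0.101, 9600),
            (0.103, 9800), (0.098, 9700), (0.100, 9900)]
ctrl_pre = [(0.101, 9900), (0.100, 9800), (0.100, 9700), (0.102, 9800),
            (0.101, 9700), (0.099, 9900), (0.101, 9800)]

biased_days = sum(
    rate_diff_significant(g[0], g[1], c[0], c[1])
    for g, c in zip(gray_pre, ctrl_pre)
)
print(biased_days)  # → 0, i.e. the two groups look historically homogeneous
```

If several window days were significantly different, indicator gaps seen during grayscale could stem from the samples themselves rather than from the new version.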

[Image]

3.2 Root cause analysis

Grayscale indicators are often associated with multi-dimensional attributes (such as user attributes, channel sources, and page modules). When a test result shows a significant abnormal difference, locating the root cause of the abnormality is the key step toward eliminating it. This step is often challenging, especially when the root cause is a combination of attribute values across multiple dimensions.

To solve this problem, we introduced root cause analysis to make up for the limited interpretability of grayscale test results, combining the indicator logic analysis method with the Adtributor algorithm and cross-validating the two to ensure reliable results.

3.2.1 Index logic analysis method

The indicator system built for grayscale experiments consists almost entirely of rate indicators and mean indicators. Both types can be decomposed by their formula into two factors, a numerator and a denominator, and each factor is in turn the sum of the values under every dimension. The indicator logic analysis method therefore disassembles an indicator's value along two levels, indicator factor and indicator dimension, according to a chosen disassembly scheme.
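As an illustration (all channel names and numbers are hypothetical), a rate indicator can be rebuilt from per-dimension numerator/denominator pairs, and the contribution of each dimension value to the overall change can be isolated by swapping in that value alone:

```python
# Hypothetical per-channel (numerator, denominator) pairs for a rate indicator
gray = {"push": (290, 2000), "search": (450, 3000), "feed": (250, 5000)}
ctrl = {"push": (320, 2000), "search": (440, 3000), "feed": (240, 5000)}

def overall(groups):
    """Overall rate = sum of per-dimension numerators / sum of denominators."""
    return sum(n for n, _ in groups.values()) / sum(d for _, d in groups.values())

delta = overall(gray) - overall(ctrl)
# Contribution of one dimension value: swap it in alone, hold the rest at control
for dim in gray:
    mixed = dict(ctrl)
    mixed[dim] = gray[dim]
    share = (overall(mixed) - overall(ctrl)) / delta
    print(dim, round(share, 2))
```

Here the overall rate drops by 0.001, and the decomposition shows the "push" channel alone explains the whole drop while the other channels actually moved in the opposite direction.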

[Image]

3.2.2 Adtributor algorithm

Besides the more common dimensional drill-down approach to root cause analysis, we introduced the Adtributor algorithm to better handle indicators affected by combinations of dimensions, and we cross-validate the two methods to ensure the reliability of the analysis results.

Adtributor is a multi-dimensional root cause analysis method for time series anomalies, proposed by Microsoft Research in 2014, and it performs reliably in scenarios with complex multi-dimensional root causes. The complete algorithm has four steps: data preprocessing, anomaly detection, root cause analysis, and simulation visualization; we mainly draw on its root cause analysis step.

[Image]
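The root cause analysis step we borrow can be sketched as follows. This is a simplified single-dimension version with invented numbers: Jensen-Shannon divergence serves as the "surprise" score, explanatory power (EP) as the contribution share, and the TEEP/TEP thresholds are illustrative defaults.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between forecast and actual shares
    (Adtributor's 'surprise' score for one dimension value)."""
    m = (p + q) / 2
    def kl_term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return 0.5 * (kl_term(p, m) + kl_term(q, m))

def adtributor_one_dim(forecast, actual, teep=0.1, tep=0.67):
    """Rank the values of ONE dimension by surprise; keep adding values until
    their cumulative explanatory power exceeds TEP."""
    f_total, a_total = sum(forecast.values()), sum(actual.values())
    delta = a_total - f_total
    scored = []
    for k in forecast:
        surprise = js_divergence(forecast[k] / f_total, actual[k] / a_total)
        ep = (actual[k] - forecast[k]) / delta  # explanatory power
        if ep > teep:
            scored.append((surprise, ep, k))
    scored.sort(reverse=True)
    chosen, ep_sum = [], 0.0
    for surprise, ep, k in scored:
        chosen.append(k)
        ep_sum += ep
        if ep_sum >= tep:
            break
    return chosen

# Hypothetical per-channel downloads: comparison (forecast) vs grayscale (actual)
forecast = {"push": 1000, "search": 1000, "feed": 1000}
actual   = {"push":  700, "search":  990, "feed": 1010}
print(adtributor_one_dim(forecast, actual))  # → ['push']
```

The full algorithm repeats this scoring for every dimension and ranks whole dimensions against each other; here the 300-download drop is attributed almost entirely to the "push" channel.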

4. Grayscale intelligent solution

4.1 Overall Framework

A version grayscale can be divided into three stages: before, during, and after grayscale. The overall productized framework is as follows:

[Image]

4.2 Process Design

Based on the above framework, how do we design and implement it?

Below is a flow chart describing the entire process:

[Image]

4.3 The core content of the scheme

4.3.1 Sample size estimation scheme

Dashboard: under multiple combinations of confidence level and statistical power (95% confidence and 80% power are shown by default), and based on each indicator's recent performance, it estimates the minimum sample size at which the indicator's change can be detected as significant under different expected change magnitudes.

The scheme has three major features:

  1. It outputs multiple sets of standards, so the expected range can be adjusted flexibly;
  2. It automatically selects the data of the latest full-release version as input;
  3. Mean indicators and rate indicators use differentiated calculation logic.
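For a rate indicator, the minimum per-group sample size can be sketched with the standard two-sample formula; the baseline rate and expected relative change below are hypothetical, and the default z values correspond to 95% confidence and 80% power.

```python
import math

def min_sample_size_rate(p, mde_rel, z_alpha=1.96, z_beta=0.84):
    """Minimum per-group sample size to detect a relative change `mde_rel`
    in a rate indicator with baseline p, at the given z values
    (defaults: 95% confidence, 80% power)."""
    delta = p * mde_rel  # absolute minimum detectable effect
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2)

# Hypothetical: baseline rate 10%, want to detect a 5% relative change
n = min_sample_size_rate(0.10, 0.05)
print(n)
```

The formula makes the trade-offs visible: halving the minimum detectable change quadruples the required sample, which is why the dashboard outputs several expected-change standards side by side.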

[Image]

4.3.2 Significance test scheme for effect indicators

The question the indicator significance test model must answer is: compared with the comparison version, is an indicator's change statistically credible or not?

At present, significance judgments between the grayscale version and the comparison version have been implemented for 20 business indicators under three confidence levels.

The implementation process is as follows:

Rate indicator:
 ... ...

import math

# Inputs already obtained:
#   variation_visitors  # grayscale version indicator denominator
#   control_visitors    # comparison version indicator denominator
#   variation_p         # grayscale version indicator value (a rate)
#   control_p           # comparison version indicator value (a rate)
#   z                   # z value at the chosen confidence level (90%/95%/99%);
#                       # the business mainly looks at the 95% result

# Indicator standard deviations
variation_se = math.sqrt(variation_p * (1 - variation_p))
control_se = math.sqrt(control_p * (1 - control_p))

# Indicator change value and change rate
gap = variation_p - control_p
rate = variation_p / control_p - 1

# Confidence intervals
margin = z * math.sqrt(control_se ** 2 / control_visitors + variation_se ** 2 / variation_visitors)
gap_interval_sdown = gap - margin  # lower bound of the change-value interval
gap_interval_sup = gap + margin  # upper bound of the change-value interval
confidence_interval_sdown = gap_interval_sdown / control_p  # lower bound of the change-rate interval
confidence_interval_sup = gap_interval_sup / control_p  # upper bound of the change-rate interval

# Significance judgment: significant iff the interval excludes 0
if (confidence_interval_sdown > 0 and confidence_interval_sup > 0) or (confidence_interval_sdown < 0 and confidence_interval_sup < 0):
    print("significant")
else:  # the interval straddles 0
    print("not significant")
... ...
Mean indicator:
 ... ...

import math
import numpy as np

# Inputs already obtained:
#   variation_visitors  # grayscale version indicator denominator
#   control_visitors    # comparison version indicator denominator
#   variation_p         # grayscale version indicator value (a mean)
#   control_p           # comparison version indicator value (a mean)
#   variation_x         # per-user indicator values, grayscale version
#   control_x           # per-user indicator values, comparison version
#   z                   # z value at the chosen confidence level (90%/95%/99%);
#                       # the business mainly looks at the 95% result

# Indicator standard deviations (sample std of the per-user values)
variation_se = np.std(variation_x, ddof=1)
control_se = np.std(control_x, ddof=1)

# Indicator change value and change rate
gap = variation_p - control_p
rate = variation_p / control_p - 1

# Confidence intervals
margin = z * math.sqrt(control_se ** 2 / control_visitors + variation_se ** 2 / variation_visitors)
gap_interval_sdown = gap - margin  # lower bound of the change-value interval
gap_interval_sup = gap + margin  # upper bound of the change-value interval
confidence_interval_sdown = gap_interval_sdown / control_p  # lower bound of the change-rate interval
confidence_interval_sup = gap_interval_sup / control_p  # upper bound of the change-rate interval

# Significance judgment: significant iff the interval excludes 0
if (confidence_interval_sdown > 0 and confidence_interval_sup > 0) or (confidence_interval_sdown < 0 and confidence_interval_sup < 0):
    print("significant")
else:  # the interval straddles 0
    print("not significant")
... ...

The dashboard display is as follows:

[Image]

4.3.3 Automatic root cause analysis scheme for negative indicators

The automatic root cause analysis scheme for negative indicators in grayscale scenarios includes four steps: abnormal movement detection, sample historical difference verification, indicator logic disassembly, and Adtributor automatic root cause analysis.

[Image]

Among these steps, Adtributor automatic root cause analysis computes which factor contributes most to an indicator's change among dimensions at the same level. We adapt it to specific indicator business scenarios by layering the indicator dimensions and setting their relationships, building a multi-level attribution logic model on top of the algorithm, so that root cause conclusions can be output automatically at the business level.

[Image]
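The multi-level idea can be sketched roughly as follows (the module/card hierarchy and numbers are invented): the single-level contribution ranking is applied at the module level first, then rerun on the child dimension of whichever module it singles out.

```python
# Hypothetical two-level hierarchy: module -> card -> (comparison, grayscale) numerators
tree = {
    "home":   {"banner": (500, 350), "rank_list": (300, 295)},
    "detail": {"intro":  (200, 205), "comments":  (100,  98)},
}

def biggest_contributor(pairs):
    """Pick the value whose comparison->grayscale change explains
    the largest share of the total change at this level."""
    delta = sum(g - c for c, g in pairs.values())
    return max(pairs, key=lambda k: (pairs[k][1] - pairs[k][0]) / delta)

# Level 1: aggregate cards up to modules and attribute there first
modules = {m: (sum(c for c, _ in kids.values()), sum(g for _, g in kids.values()))
           for m, kids in tree.items()}
root_module = biggest_contributor(modules)
# Level 2: drill into the chosen module's child dimension
root_card = biggest_contributor(tree[root_module])
print(root_module, root_card)  # → home banner
```

Layering the dimensions this way keeps each attribution step small and lets the final conclusion read in business terms ("the drop comes from the banner card on the home module") rather than as a flat list of dimension values.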

The dashboard display is as follows:

[Image]

4.3.4 Intelligent grayscale report splicing and push scheme

[Image]

Automatic acquisition of version information:

The version number, actual rollout volume, cumulative days since release, and version-related content are obtained from the release platform and used as the opening of the grayscale report.

Conclusion presentation:

Depending on whether the indicators are all positive, partially negative, or all negative, and on whether the samples are uneven, the statistical results are automatically combined and mapped to preset conclusion copy; more than 10 conclusion templates are preset in total.
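The splicing mechanism can be sketched as a lookup from the outcome pattern to preset copy; the state names and copy below are invented placeholders, not the actual templates.

```python
# Hypothetical outcome states mapped to preset conclusion copy
templates = {
    ("all_positive", "balanced"): "All core indicators are positive; recommend expanding the grayscale.",
    ("partially_negative", "balanced"): "Some indicators are negative; see the negative diagnosis section.",
    ("all_negative", "unbalanced"): "Indicators are negative but the samples are uneven; re-sample before deciding.",
}

def pick_conclusion(indicator_state, sample_state):
    # Fall back to manual review when no template matches the combination
    return templates.get((indicator_state, sample_state), "Manual review required.")

print(pick_conclusion("all_positive", "balanced"))
```

Keeping the templates in data rather than code makes it cheap to grow the set past 10 variants as new outcome patterns appear.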

Interpretation of the significance test of core indicators (different types of indicators are interpreted according to different grayscale stages):

  • T+1~T+2: performance indicators, activity rate indicators
  • T+3~T+6: active performance indicators, distribution performance indicators, download-and-installation conversion indicators
  • T+7: active performance indicators, distribution performance indicators, download-and-installation conversion indicators, and subsequent conversion indicators

Drill-down attribution interpretation at the first-level module dimension:

If the change points of the grayscale version were clearly entered for a specific first-level module in advance, that module is interpreted automatically and the data of other modules with indicator differences is also output; if no module-level change points were entered, interpretation conclusions are output for the first-level modules whose indicators are significant (positively or negatively).

Sample uniformity interpretation:

Business indicators are judged for uniform distribution by significance test; non-business indicators are judged by distribution differences.
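For the non-business case, the distribution-difference check can be sketched as a chi-square homogeneity test over attribute buckets; the buckets and counts below are hypothetical, and the 5.99 threshold is the standard 95% critical value for 2 degrees of freedom.

```python
# Users per attribute bucket (e.g. OS-version segments), both groups
gray_counts = [5200, 3100, 1700]  # grayscale group
ctrl_counts = [5150, 3180, 1670]  # comparison group

def chi_square_stat(obs_a, obs_b):
    """Chi-square statistic for a 2 x k homogeneity test."""
    stat = 0.0
    total_a, total_b = sum(obs_a), sum(obs_b)
    for a, b in zip(obs_a, obs_b):
        col = a + b
        exp_a = col * total_a / (total_a + total_b)  # expected count, group A
        exp_b = col * total_b / (total_a + total_b)  # expected count, group B
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return stat

stat = chi_square_stat(gray_counts, ctrl_counts)
print(stat < 5.99)  # → True: below the critical value, distributions look balanced
```

A statistic above the critical value would flag the attribute as unevenly split, which then feeds the "sample uneven" branch of the conclusion copy.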

Negative diagnosis interpretation:

Based on the output of the multi-level automatic root cause model, the modifiers mapped from the dimension types involved, the number of dimensions (single or multiple), and the sample historical difference verification conclusion, the corresponding template is selected and the negative diagnosis copy is finally spliced together.

5. Final Thoughts

To meet the business needs for scientific evaluation and fast decision-making during grayscale release, we combined a variety of methods to provide a relatively complete intelligent grayscale data system solution across the four levels of "experimental ideas - mathematical methods - data models - product solutions".

We hope this article provides a reference for building grayscale data systems, though each system should still be designed around the characteristics of its own business. The data model design involved in the scheme is not covered in detail here; interested readers are welcome to discuss it with the authors.

In addition, the grayscale data system still has room to improve. Some directions already under research:

  1. When grayscale traffic is split, random grouping is usually used, but because randomness carries uncertainty, the two groups of samples may turn out unevenly distributed on some indicator characteristics. Besides post-hoc sample uniformity verification, stratified sampling can be considered to avoid this problem;
  2. The automatic multi-dimensional root cause analysis model still has room for improvement. At present it depends heavily on how comprehensive the dimensions in its data source are, and it can only detect quantitative causes. In the future we hope to combine the quantitative root cause model with qualitative factors for a more comprehensive and accurate interpretation;
  3. The current grayscale solution in the Game Center is essentially based on a two-sample test model, which requires estimating the minimum sample size in advance from the expected improvement of the grayscale version's core indicators over the comparison version; during grayscale, the core indicators may not meet expectations. In the future, testing methods such as mSPRT can be tried to weaken the minimum-sample-size constraint on significant results.

References:

  1. Mao Shisong, Wang Jinglong, Pu Xiaolong. Advanced Mathematical Statistics (2nd Edition).
  2. "Five Minutes to Master the Principle of A/B Experiments and Sample Size Calculation". CSDN blog.
  3. Ranjita Bhagwan, Rahul Kumar, Ramachandran Ramjee, et al. "Adtributor: Revenue Debugging in Advertising Systems".
