Quality Model and Practice of the Meituan Comprehensive Business Recommendation System

Recommender systems are effect-oriented data application services: between the "yes" and "no" of whether a function works lies a long spectrum of "good" and "bad" effects. This article builds a quality model at the granularity of user requests, uses data lineage to link data tables, algorithm models, system services, and user requests, and generalizes the approach through practice in Meituan's comprehensive business. We hope it is helpful or inspiring.

1 Introduction

Meituan's in-store comprehensive business (hereinafter "Daozong") is an important segment of Meituan's in-store business. It covers dozens of key sub-sectors, including bathhouses, KTV, beauty, medical aesthetics, parent-child, weddings, sports and fitness, leisure and entertainment, education and training, home furnishing, pets, bars, and life services, and meets the diverse local-life needs of hundreds of millions of users. The recommendation system is an important link in matching supply and demand efficiently and an outlet for delivering the value of data; its quality determines how much of the matching effect is lost. As shown in Figure 1 below, data is processed in the data warehouse, processed again by algorithms, and then delivered to the business systems through data services. It finally flows back to the data warehouse through client-side event tracking, forming the "flywheel effect" of data. Quality is exactly where the gears in this chain mesh, and it is an important prerequisite for improving efficiency and safeguarding the effect.

Quality assurance has to be built around measurement so that problems are "visible", "clearly understood", and "correctly fixed". However, traditional back-end service quality indicators do not describe the quality of this "data flywheel" well. By building a quality model for the comprehensive business recommendation system, we hope to offer a new perspective and a practical reference for measuring the quality of similar multi-business, effect-oriented systems.

Figure 1: The "data flywheel" of the recommendation system

2 Analysis of the current situation

The recommendation system is an effect-oriented system, and its quality characteristics differ from those of a function-oriented system. When a function-oriented system is degraded, the impact on user experience is usually obvious, whereas a user can hardly tell whether a recommendation returned A or A'. In fact, a poorer match directly hurts the user's implicit experience, so it needs to be identified. Function-oriented systems generally build their quality indicators around availability. In the practice of the comprehensive business recommendation system, we found that indicators such as availability have the following limitations:

Availability is insensitive to some defects: Availability is a function of the frequency and duration of outages and reflects a system's ability to keep providing service. As long as a defect does not interrupt the external service, it does not affect availability, even though some such defects do affect user experience. These defects may be expected (such as active degradation) or unexpected (such as a delayed model update), and they should be included in the quality measure.
Availability can hardly cover the full data pipeline: The recommendation pipeline spans data production, processing, application, and analysis. First, availability does not cover the quality of data tables; second, even where availability metrics exist, they cannot reflect the full picture of data quality. Data quality has to consider completeness, accuracy, timeliness, security, and so on, which is beyond the scope of availability. The internationally renowned scholar Andrew Ng (Wu Enda) has said that 80% of the value of artificial intelligence depends on data, and the quality delivered by the recommendation system (click-through conversion rate, transaction conversion rate, user dwell time, etc.) likewise depends mainly on data quality.
Availability can hardly reflect business differences: Meituan Daozong covers hundreds of industries and dozens of channel pages. For efficiency and cost reasons, the recommendation system cannot fully isolate services per business, and the series-parallel calculation of availability makes it hard to separate businesses and evaluate them individually. Businesses differ greatly in access frequency, peak traffic periods, and business strategy, so their quality characteristics and problem distributions differ as well. The current availability indicators lack business-dimension information, which does not help guide refined quality operations.

In past quality work, the target was expressed as failure levels. Verification took a long time and was partly a matter of chance, and the logical link between target and action was weak. Moreover, a failure is by nature after the fact, and this problem-driven mindset is not conducive to continuous operation. In short, targeting availability runs into various problems when it is actually calculated and implemented, so we decided to build a quality model for the recommendation system that starts from availability and adjusts its calculation method to guide refined quality operations.

3 Construction ideas

3.1 Quality in a business context

To build a quality model, we first return to what quality essentially is. According to the definition of the International Organization for Standardization (ISO), quality is the totality of characteristics that reflect an entity's ability to satisfy stated or implied "needs". Another commonly used concept is stability, whose core is keeping a system running in an "expected" state over a long period. For both quality and stability, the important question is whose needs and expectations the system has to meet. In the recommendation scenario, the answer is product and algorithm. Business product teams understand user scenarios, abstract user needs, and raise product requirements to the recommendation team, which shows up externally as product iteration. At the same time, the members of the recommendation team work together to find the best model and strategy, which shows up inside the data team as algorithm iteration.

As shown in Figure 2 below, the availability formula emphasizes the long time span, while "needs" and "expectations" are reflected only in whether service is provided externally. This is reasonable to a degree. First, availability is an industry-wide indicator, so its definition has to be general, and providing external service is the common denominator and bottom line of quality. Second, most back-end systems deliver functions, and most of the services they provide are either "present" or "absent", with some room for service degradation in between. But for a recommendation system whose core goal is effect, there is a long spectrum of "good" and "bad" effects between the "yes" and "no" of the function. The core change in our thinking about recommendation quality is the move from "yes" and "no" to "good" and "bad" in providing external service, and this is also the starting point for reworking the availability calculation.

Figure 2: How the perception of defects affects quality measurement

3.2 Defect consideration and selection

Failing to meet "needs" or "expectations" produces defects, and defects are the cause of quality loss. The software quality model in ISO/IEC 25010 (2011) can be regarded as a complete catalogue of defect sources: it defines 8 characteristics (functional suitability, performance efficiency, compatibility, usability, reliability, security, maintainability, and portability) and 31 sub-characteristics. Some of them do not apply to back-end services (user interface aesthetics, learnability, etc.), and some are not, in our current understanding, prominent factors in consumer-facing (C-side) quality (modularity, co-existence, non-repudiation, reusability, etc.). Based on the business characteristics and high-frequency quality problems of the recommendation system, at this stage we focus on the quality characteristics shown in Figure 3 below as sources of defects.

Figure 3: Quality characteristics of the recommendation system

We found that traditional availability measures mostly focus on reliability, functional completeness, and correctness, but lack measures for most aspects of functional accuracy, appropriateness, and security, all of which are closely tied to recommendation quality and effect. The influence of accuracy and appropriateness on the effect is fairly direct, while the others act more indirectly. Take crawler traffic as a security example: because a crawler's access pattern does not match the behavior of real users, it distorts the collection of core metrics such as UV CTR and leads to misjudging the effect. If crawler data cannot be identified and removed, the noise further harms the accuracy of model training. Data quality problems are the "poison pill" in the data "flywheel effect": they create positive feedback that keeps amplifying defects. We quantify these defects and extend the scope of availability in Chapter 4 (Calculation method).

3.3 Selection of Measurements and Calculations

Availability has two aspects, how it is expressed and how it is calculated: it is expressed as what we usually call "N nines", and it is calculated as a function of mean time between failures (MTBF) and mean time to recovery (MTTR). The ways of expressing quality that are common in the industry are shown in Figure 4 below:

Figure 4: Measurement methods

Choosing among these measurement methods is not the focus of the quality score at this stage; the "N nines" already used for availability is simple and comparable enough. We focus on the calculation method. Because the comprehensive business has many business lines and the recommendation system is a platform product, the relationship between systems and businesses is N:N, and system-level availability makes it hard to calculate the availability of each industry, project, and business. A traffic position can belong to the leisure and entertainment business, to the "script killing" (murder-mystery game) project, to the main path of a core display, or to a type of content recommendation. Aggregating by request is therefore the most appropriate calculation. As shown in Figure 5 below, if availability is a function of requests, it can both incorporate the quality characteristics discussed in the previous section and report business-meaningful quality along multiple dimensions.

Figure 5: Measuring quality from the perspective of requests
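The contrast between the two calculations can be sketched as follows. This is a minimal illustration, not the team's implementation: the `Request` record, its field names, and the sample values are assumptions made for the example. The time-based function is the standard MTBF / (MTBF + MTTR) ratio; the request-based one is the share of defect-free requests grouped by a business dimension.

```python
from collections import defaultdict
from dataclasses import dataclass


def time_based_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Classic availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


@dataclass(frozen=True)
class Request:
    # Hypothetical fields; a real request would carry many more dimensions.
    business: str    # e.g. "leisure"
    project: str     # e.g. "script_killing"
    defective: bool  # True if any defect touched this request


def request_based_availability(requests, key):
    """Share of defect-free requests, grouped by one business dimension."""
    total = defaultdict(int)
    ok = defaultdict(int)
    for r in requests:
        k = getattr(r, key)
        total[k] += 1
        if not r.defective:
            ok[k] += 1
    return {k: ok[k] / total[k] for k in total}


requests = [
    Request("leisure", "script_killing", False),
    Request("leisure", "script_killing", True),
    Request("leisure", "ktv", False),
    Request("beauty", "hair", False),
]
by_business = request_based_availability(requests, "business")
by_project = request_based_availability(requests, "project")
```

Because every request carries its own business attributes, the same records can be regrouped by business, project, or any other dimension without changing the definition of a defect.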

4 Calculation method

Following the ideas in the previous chapter (from failures to defects, from "yes" and "no" of recommendation results to "good" and "bad" of recommendation effect, and from the whole system to each business), we have described the characteristics a good quality score should have. This chapter focuses on the calculation logic of the indicators: we select key defects, define a "successful request response", and add business aggregation dimensions to the quality score.

4.1 Calculation formula

Building on the quality characteristics in Section 3.2, we evaluate system quality by the proportion of successful requests. In the actual calculation, defects fall into the following four levels:

System level: If a request triggers a system exception, it is a defective response. Common cases are recall timeout, recall failure, and empty recall results.
Data level: If the data used by a request is abnormal, it is a defective response. Common cases include abnormal supply quantity and abnormal label distribution. Determining the actual impact of data on user requests depends on building data lineage and assessing the impact.
Algorithm level: If the features, models, or strategies used by a request during recall and ranking are abnormal, it is a defective response. Common cases such as delayed model updates and missing features impair how well the recommendation effect is expressed.
Business level: If the results of a request violate business appropriateness or security compliance requirements, the request is a defective response. Common cases are bad cases fed back by operations, involving supply quality and content security.

A request that encounters a defect at any point in its life cycle is counted as a defective response, and the specific stage where the defect occurred is a dimension for drill-down analysis. From the quality characteristics in Section 3.2 and the four defect levels above, we select typical problems (business pain points and high-frequency quality problems) for the calculation, as shown in Figure 6 below:

Figure 6: Quality score calculation method
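The rule that a defect at any of the four levels makes the whole response defective can be sketched as below. The enum members mirror the four levels listed above; the function name, the input shape (one set of defect levels per request), and the sample data are assumptions for illustration only.

```python
from collections import Counter
from enum import Enum


class DefectLevel(Enum):
    SYSTEM = "system"        # e.g. recall timeout, recall failure
    DATA = "data"            # e.g. abnormal supply quantity
    ALGORITHM = "algorithm"  # e.g. delayed model update, missing features
    BUSINESS = "business"    # e.g. content security bad case


def quality_score(request_defects):
    """request_defects: one set of DefectLevel per request (empty set = defect-free).

    Returns (score, drill_down): score is the defect-free share of requests,
    drill_down counts how many requests were touched by each level.
    """
    total = len(request_defects)
    if total == 0:
        return None, Counter()
    defect_free = sum(1 for defects in request_defects if not defects)
    drill_down = Counter(level for defects in request_defects for level in defects)
    return defect_free / total, drill_down


sample = [
    set(),
    {DefectLevel.SYSTEM},
    {DefectLevel.DATA, DefectLevel.ALGORITHM},
    set(),
]
score, drill_down = quality_score(sample)
```

A request with several defects is still counted once in the score, while the drill-down counter keeps each level visible for analysis.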

4.2 Business Generalization

The comprehensive business recommendation system has many business lines, large differences between industries, and many positions for recommended material. For quality measurement, this means we need aggregate analysis at every level to guide refined operations, as shown in Figure 7 below:

Figure 7: Aggregate analysis at each business level

Many of these businesses are low- or medium-frequency, so the ratio fluctuates heavily with the absolute number of requests. In these scenarios, small traffic positions can be aggregated, and minute-level monitoring is performed only at the industry or project level.
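One way to picture this roll-up is sketched below. The threshold value, the position names, and the parent mapping are all hypothetical; the source does not state how the aggregation is configured.

```python
from collections import defaultdict

MIN_REQUESTS = 1000  # hypothetical threshold below which a ratio is too noisy


def roll_up(position_stats, parent_of, min_requests=MIN_REQUESTS):
    """position_stats: {position: (total_requests, defective_requests)}.
    parent_of: {position: industry_or_project}.

    Positions with enough traffic keep their own score; the rest are summed
    into their parent so the ratio is computed on a larger base.
    """
    merged = defaultdict(lambda: [0, 0])
    for position, (total, defective) in position_stats.items():
        key = position if total >= min_requests else parent_of[position]
        merged[key][0] += total
        merged[key][1] += defective
    return {k: 1 - d / t for k, (t, d) in merged.items() if t > 0}


stats = {
    "home_feed": (50000, 5),
    "pet_grooming_list": (40, 1),
    "pet_boarding_list": (60, 1),
}
parents = {
    "home_feed": "platform",
    "pet_grooming_list": "pets",
    "pet_boarding_list": "pets",
}
scores = roll_up(stats, parents)
```

In this example a single defective request would move the 40-request position by 2.5 percentage points on its own, whereas the merged "pets" series moves by 1 point per defect.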

4.3 Indicator system

As shown in Figure 8 below, we treat each request answered by the recommendation system as one product delivery. The proportion of defect-free requests is the quality score of the recommendation system and serves as the top-level output indicator. Following the life cycle of a request, first-level input indicators measure the quality of the core stages, such as the recall defect rate and the ranking defect rate. First-level indicators can be broken down further into second-level input indicators; for example, when the recall defect rate is high, the recall empty rate and recall timeout rate can be examined. User requests can also be aggregated by business along vertical, horizontal, and time dimensions to obtain quality scores with business attributes, which are more targeted and focused.

Figure 8: Quality indicator system
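The relationship between the top-level score and its first- and second-level input indicators can be sketched as a small tree. The flag names (`recall_empty`, `recall_timeout`, `ranking_defect`) and the event shape are assumptions for this example; the actual indicator set is the one shown in the figure.

```python
def rate(count, total):
    """Safe ratio that returns 0.0 for an empty window."""
    return count / total if total else 0.0


def indicator_tree(events):
    """events: one dict of boolean defect flags per request (missing flag = no defect)."""
    total = len(events)
    recall_empty = sum(1 for e in events if e.get("recall_empty"))
    recall_timeout = sum(1 for e in events if e.get("recall_timeout"))
    recall_defect = sum(
        1 for e in events if e.get("recall_empty") or e.get("recall_timeout")
    )
    ranking_defect = sum(1 for e in events if e.get("ranking_defect"))
    any_defect = sum(
        1
        for e in events
        if e.get("recall_empty") or e.get("recall_timeout") or e.get("ranking_defect")
    )
    return {
        "quality_score": 1 - rate(any_defect, total),  # top-level output indicator
        "level1": {
            "recall_defect_rate": rate(recall_defect, total),
            "ranking_defect_rate": rate(ranking_defect, total),
        },
        "level2": {
            "recall_empty_rate": rate(recall_empty, total),
            "recall_timeout_rate": rate(recall_timeout, total),
        },
    }


events = (
    [{"recall_empty": True}]
    + [{"recall_timeout": True}]
    + [{"ranking_defect": True}]
    + [{} for _ in range(7)]
)
tree = indicator_tree(events)
```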

With the request as its basic unit, this improved quality score addresses, to some extent, the limitations of the original availability calculation: it is sensitive to defects, it can include the impact of the data pipeline, and it lends itself to aggregate analysis along multiple business dimensions.

4.4 Lineage extension

The quality score is counted at request granularity, but in a data application service a request is only one form of external data output. Once the basic quality score is in place, the life cycle of a request should be extended to the whole data pipeline so that the quality measurement is complete. At this point we rely on data lineage to link data tables, business systems, and C-side traffic into a panoramic quality portrait, as shown in Figure 9 below:

Figure 9: Data lineage of the recommendation system

Kinship is the relationship created in human society by marriage and birth, such as the relationship between parents and children or between siblings, together with the other family relationships derived from them. Data, too, is merged and transformed to produce new data. Data lineage exists at the database, table, and field levels and is generally used in four areas: data assets (reference popularity calculation, understanding data context), data development (impact analysis, attribution analysis), data governance (pipeline status tracking, data warehouse governance), and data security (security compliance inspection, label propagation). For the recommendation quality score, we mainly use impact analysis to extend the score: every request that passes through a faulty node is marked, and the corresponding score is deducted.

Under the business semantics of the recommendation system, we define six types of business metadata: snapshots, schemes, components, indexes, models, and features. Lineage is built on this metadata in three parts: task access, lineage analysis, and data export. Task access consists of a collection module and a storage module. Once task access is complete, the relationships between nodes are stored in a graph database, and graph algorithms are used to establish the lineage. After that, node anomalies can be detected by the system or marked manually, and impact analysis runs automatically. When a node becomes abnormal, a notification is sent and the anomaly information propagates along the lineage, which in turn affects the quality score of downstream links.
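The propagation step can be illustrated with a plain breadth-first traversal over a directed graph. This is only a sketch: the article stores the graph in a graph database, whereas the example uses an in-memory adjacency dict, and the node names are invented.

```python
from collections import deque


def downstream_of(edges, abnormal):
    """edges: {node: [direct downstream nodes]}; abnormal: iterable of node ids.

    Breadth-first traversal returning every node reachable from an abnormal
    node, including the abnormal nodes themselves.
    """
    affected = set(abnormal)
    queue = deque(affected)
    while queue:
        node = queue.popleft()
        for child in edges.get(node, ()):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


# Hypothetical lineage: table -> feature/index -> model -> scheme
lineage = {
    "table:supply_snapshot": ["feature:poi_tags", "index:poi_recall"],
    "feature:poi_tags": ["model:ctr_v3"],
    "model:ctr_v3": ["scheme:pets_feed"],
    "index:poi_recall": ["scheme:pets_feed", "scheme:ktv_list"],
    "table:user_profile": ["feature:user_pref"],
}
affected = downstream_of(lineage, ["feature:poi_tags"])
```

Requests served by any scheme in `affected` would then be marked as defective when the quality score is computed.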

When an anomaly propagates to the user side, we try to describe the loss in business terms. Using the comprehensive revenue model, we can calculate the value of each intent UV for each business line (a visit to a merchant detail page or a group-buy deal detail page counts as an intent visit), and then use the week-on-week change in visits on the affected traffic to derive the business loss automatically.
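The article does not give the exact loss formula. One plausible reading, shown here purely as an assumption, is to multiply the week-on-week drop in intent UV on the affected traffic by the per-UV value from the revenue model:

```python
def estimated_loss(value_per_intent_uv, intent_uv_last_week, intent_uv_now):
    """Assumed formula: week-on-week drop in intent UV times value per intent UV.

    An increase in intent UV is treated as zero loss rather than a negative one.
    """
    drop = max(0, intent_uv_last_week - intent_uv_now)
    return drop * value_per_intent_uv


# Illustrative numbers only, not real business data.
loss = estimated_loss(value_per_intent_uv=2.5, intent_uv_last_week=1000, intent_uv_now=900)
```

A real comparison would also need to control for seasonality and campaigns, which this sketch ignores.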

5 Indicator Operation

5.1 System Implementation

The systematic implementation of the quality score relies on event tracking and diagnosis. The full recommendation pipeline consists of parameter input, recall pre-processing, recall, recall post-processing, coarse ranking, fine ranking, and re-ranking. Any stage may fail, so data collection has to cover runtime exceptions and the key input and output of each stage. As shown in Figure 10 below, we collect tracking data asynchronously through Kafka and then process it per scenario. In production, ES builds near-real-time indexes and serves fast queries over the most recent 4 days, while logs older than 4 days are archived in Hive; in addition, the tracking data is parsed by the Flink engine and, after the necessary diagnosis, the score is calculated in real time and alarms are pushed. In the test environment, logs are written to MySQL in real time to make testing and troubleshooting easier. Finally, the quality of the recommendation at each stage is presented in a structured form to make the results more readable.

Figure 10: System implementation of the quality score
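A tracking record that captures per-stage output size, latency, and runtime exceptions might look like the sketch below. The `Trace` class, the stage identifiers, and the record fields are hypothetical; the example only builds the JSON payload and leaves out the asynchronous hand-off to a message queue.

```python
import json
import time
from contextlib import contextmanager

# Stage names follow the pipeline described above; identifiers are illustrative.
STAGES = ["param_input", "recall_pre", "recall", "recall_post",
          "coarse_rank", "fine_rank", "rerank"]


class Trace:
    def __init__(self, request_id):
        self.request_id = request_id
        self.events = []

    @contextmanager
    def stage(self, name):
        """Record latency, output size, and any exception for one stage."""
        if name not in STAGES:
            raise ValueError(f"unknown stage: {name}")
        record = {"stage": name, "exception": None, "output_size": None}
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record["exception"] = type(exc).__name__
            raise
        finally:
            record["elapsed_ms"] = (time.perf_counter() - start) * 1000
            self.events.append(record)

    def to_json(self):
        return json.dumps({"request_id": self.request_id, "events": self.events})


trace = Trace("req-1")
with trace.stage("recall") as rec:
    items = ["poi_1", "poi_2"]
    rec["output_size"] = len(items)
try:
    with trace.stage("fine_rank"):
        raise TimeoutError("model timed out")
except TimeoutError:
    pass
payload = trace.to_json()
```

The exception is recorded and then re-raised, so tracking never changes how the pipeline itself handles the failure.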

The score system has to be completed step by step. For a recommendation system, returning no results is the most serious quality problem, so we first collect and calculate empty recommendation results, which correspond to the result defect rate and recall defect rate among the first-level indicators and to the result empty rate and recall empty rate among the second-level indicators. At the same time, the comprehensive business has many industries whose supply is unevenly distributed in time and space, and the many filter conditions that can be combined will also legitimately produce empty results, which distorts the quality score.

How, then, do we exclude the empty results that match business expectations and remove this noise? With tracking in place, diagnosis becomes essential. For empty results, we identify the cause through three steps: parameter diagnosis, data diagnosis, and link diagnosis. In data diagnosis, when an online filter condition returns an empty result, we go back to the source and verify the underlying data a second time to check whether the base table itself is empty. If the base table really has no matching supply, we set an alarm-free rule with a validity period: for that period, the industry in that city is treated as genuinely lacking supply, and the empty result is excluded from the quality score. If the base table does have supply, the data processing or the service flow is abnormal and is preventing recall.
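The data-diagnosis step with its time-limited alarm-free rule can be sketched as follows. The class name, the 24-hour validity period, and the `(city, industry)` key are assumptions for illustration; the lookup of the base table is passed in as a callable so the example stays self-contained.

```python
from datetime import datetime, timedelta


class AlarmFreeRules:
    """Time-limited exemptions keyed by, for example, (city, industry)."""

    def __init__(self, ttl=timedelta(hours=24)):  # validity period is assumed
        self.ttl = ttl
        self._expires = {}

    def is_exempt(self, key, now):
        expiry = self._expires.get(key)
        return expiry is not None and now < expiry

    def exempt(self, key, now):
        self._expires[key] = now + self.ttl


def diagnose_empty_result(key, base_table_count, rules, now):
    """Return 'expected' if the empty result matches missing supply, else 'defect'.

    base_table_count is a callable that returns the supply count for the key.
    """
    if rules.is_exempt(key, now):
        return "expected"
    if base_table_count(key) <= 0:
        rules.exempt(key, now)  # no supply at source: suppress for the ttl
        return "expected"
    return "defect"  # supply exists, so processing or serving lost it


counts = {("city_1", "pets"): 0, ("city_2", "pets"): 12}
rules = AlarmFreeRules()
now = datetime(2022, 1, 1, 12, 0)
no_supply = diagnose_empty_result(("city_1", "pets"), counts.get, rules, now)
lost_supply = diagnose_empty_result(("city_2", "pets"), counts.get, rules, now)
```

Only the `"defect"` outcome would count against the quality score; the exemption expires on its own, so a city that later gains supply is checked again.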

The key to the diagnosis engine is its rule-matching mechanism, i.e. the rule engine. There are many rule engines to choose from, such as EasyRule, Drools, Zools, and Aviator. From the analysis above, the diagnosis engine has to apply rules to request parameters, the recommendation pipeline, and the underlying data. Request parameters and the recommendation pipeline can be diagnosed from in-memory parameters, whereas data diagnosis has to fetch information from third-party storage, so some custom development is unavoidable. Considering maturity and ease of use for the people working with it, the Aviator expression engine is the better fit. The diagnostic expression primitives we designed for these needs are as follows:

//Parameter diagnosis - primitive expression
//Diagnostic primitive: whether the request matches certain parameters
global:check=aviator[cityId !=nil && include(string.split('1,2,3,4,5,6,7,8,9,10,16,17',','),str(cityId))]

//Link diagnosis - primitive expression
//1. Diagnostic primitive for a recall exception
global:recallException=param[${recall#exception#}],
global:check=aviator[recallException!=nil && recallException !='' ]
//2. Diagnostic primitive for an empty recall with no exception
global:recallEmpty=param[${recall#after#}],
global:check=aviator[recallEmpty!=nil && recallEmpty !='' ]
//3. Diagnostic primitive for a non-empty recall that becomes empty after the filter rules run
global:recallEmptyCode=param[${recall#after#}],
global:predictFiltersEmptyCode=param[${predict#after#filters#}],
global:check=aviator[(recallEmptyCode ==nil || recallEmptyCode =='')  && predictFiltersEmptyCode !=nil]
//4. Match for a specific filter rule that leaves no results after it runs
global:filterEmptyCode=param[${PredictStage#filter#after#_compSkRef#}],
global:check=aviator[filterEmptyCode !=nil  && filterEmptyCode =='deleteItemByConditionalFilter' ]

//Data diagnosis - primitive expression (checks whether the underlying store has data: true if it has none, false otherwise)
global:keys=keySpread[@prefix 138_ymtags_][@crossOrder city_${cityId}_platform_${platformNo}_surgery_prj_${genericLvlIds}],
global:cnt=cellar@cellar[@count ${keys}],
global:check=aviator[cnt !=nil && cnt !='' && long(cnt) <= 0 ]

5.2 Alarm follow-up

The quality score can be used for real-time monitoring and operational reviews, and team members need to follow up on changes promptly. The company's general alarm system usually configures alarm recipients per service name. A platform service such as the recommendation system serves everything through a unified interface, yet the models and strategies are maintained by different engineers, and each business has its own domain knowledge and learning curve. Broadcasting alarms by default easily causes alarm storms: people cannot focus on the problems of their own modules, and alarms are sometimes missed.

To improve the follow-up rate (as shown in Figure 11 below), we built a follow-up feature on top of the existing alarms. It routes the alarms of a specific traffic position to the designated owner and records the changes in follow-up status, which supports timely notification and later review. On the operations side, we build a quality dashboard from data reports and regularly review the quality fluctuations of the different businesses.

Figure 11: Alarm follow-up process
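Routing by traffic position and recording status changes can be sketched as below. The `Alarm` fields, the status names, the owner names, and the fallback recipient are all hypothetical; the real feature sits on top of the company's existing alarm system.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Alarm:
    position: str   # traffic position that raised the alarm
    message: str
    owner: str = ""
    history: list = field(default_factory=list)  # (timestamp, status) pairs

    def mark(self, status, now):
        """Append a follow-up status change so it can be reviewed later."""
        self.history.append((now, status))

    @property
    def status(self):
        return self.history[-1][1] if self.history else "new"


def route(alarm, owners, fallback="oncall"):
    """Send the alarm to the owner of its position instead of broadcasting."""
    alarm.owner = owners.get(alarm.position, fallback)
    return alarm


def follow_up_rate(alarms):
    """Share of alarms that have at least one recorded follow-up status."""
    if not alarms:
        return None
    followed = sum(1 for a in alarms if a.status != "new")
    return followed / len(alarms)


owners = {"pets_feed": "alice", "ktv_list": "bob"}  # hypothetical owners
now = datetime(2022, 1, 1, 9, 0)
a1 = route(Alarm("pets_feed", "empty result rate above threshold"), owners)
a2 = route(Alarm("unknown_slot", "recall timeout"), owners)
a1.mark("acknowledged", now)
a1.mark("resolved", now)
```

Keeping the full status history, rather than only the latest status, is what makes the follow-up rate measurable and the later review possible.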

5.3 Governance effect

The quality score starts from the result empty rate. Following the breakdown of the process, we collect the recall empty rate, the model prediction empty rate, and the re-ranking operator empty rate, and aggregate them along multiple dimensions: platform, business, form, project, and traffic position. The governance actions and results are as follows:

Through tracking and diagnosis, we can tell whether an empty result is a supply problem or a quality problem. 98% of empty results are excluded from the quality score calculation to avoid false alarms, and the daily average number of empty-result alarms has dropped from 40 to 5.
Based on the empty rate of each stage in the pipeline, we took governance measures covering data standards (data layer standardization, labeling standards), service architecture (service isolation, dual storage media for underlying data, degradation), and change standards (configuration release pipeline checks, traffic replay), keeping the system discovery rate of empty results above 60%.
With custom alarm routing, alarms are no longer broadcast and follow-up status can be marked. The follow-up rate of empty-result alarms went from unmeasurable to 100% for core traffic positions.

After governing and identifying empty results, the empty rate of core traffic positions is currently 0.01%; in other words, 99.99% of core traffic is guaranteed to return results. While building the quality score, we also secured the system discovery rate and the alarm follow-up rate.

5.4 Asset Accumulation

The recommendation system conveys the value of data, and only when data is turned into assets can this value be sustained and grow. Building the quality model of the recommendation system is in fact also a process of accumulating data assets. For collected data to become an asset, it generally has to meet four conditions: it can flow, it can be measured, it can be controlled, and it can appreciate in value. All four are covered by the calculation method in Chapter 4.

Operating the indicators is also a process of accumulating quality knowledge assets. How exactly does the software defect model affect the quality of the final delivery? Is the relationship one of correlation or causation, and does the effect enter the score calculation explicitly or only indirectly? While operating the quality score, we gradually fill in our mental map of quality and form the topology among indicators, among defects, and between indicators and defects; this is how quality is turned into an asset. For example, in the practice of the recommendation system we found that 80% of online failures are caused by releases, and 80% of release failures are caused by data releases, which tells us to reduce online failures by governing data releases.

6 Future plans

Starting from availability, we adjusted the calculation method, built a multi-level quality score for the recommendation system, and extended it to the various recommended materials and business modules. The shift in thinking from "yes or no" to "good or bad" is also the foundation of refined quality operations. Going forward, we will on the one hand keep enriching the calculation and pipeline coverage of the quality model, and on the other hand do more quality governance work based on it. Directions we will focus on include:

Improve tracking and diagnosis so that the indicators at each level of the quality score system are implemented step by step, enriching what the quality score covers and accommodating more kinds of quality problems.
Build multi-level flexible degradation for recommendations, refine our understanding of the quality score iteratively, and quantify the impact of different degradations on the system.
Improve the accuracy, coverage, and timeliness of data lineage so that the impact of a quality problem at any stage can be assessed more accurately and quickly.

7 Author of this article

Yong Hao, Gengen, Wang Xin, He He, Li Cong, and others, all from the Daozong Business Data Team of the Meituan Daodian Platform Technology Department.

This article is produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content for non-commercial purposes such as sharing and communication; please credit it as "The content is reproduced from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial use, please email tech@meituan.com to apply for authorization.
