In the era of deep learning, the demand for computing power grows by the day, and reducing its cost while improving its efficiency has become an important new topic. The goal of intelligent computing power is to distribute computing power across traffic in a refined, personalized way, so as to maximize business revenue under the system's computing power constraint. This article introduces the technological evolution of intelligent computing power in Meituan takeaway advertising from linear programming to evolutionary algorithms, and presents a multi-action computing power allocation scheme based on an evolutionary algorithm. We hope it brings some help or inspiration to readers.
1 Business Background
With the rapid development of Meituan's food delivery business, the pressure on the food delivery advertising system has grown increasingly severe, and computing power has become a new bottleneck. In the first half of 2021, several business lines of food delivery advertising began to run short of computing power resources, and the efficiency of computing power allocation urgently needed improvement. Takeaway traffic has an obvious double-peak structure: the advertising system faces heavy performance pressure during peak hours, while computing power is largely redundant during off-peak hours. The goal of intelligent computing power is to distribute computing power across traffic in a refined, personalized way, so as to maximize business revenue under the system's computing power constraint.
This article is the second in a series on intelligent computing power in advertising. In the first installment, "Exploration and Practice of Intelligent Computing Power in Meituan Takeaway Advertising" [1], we adapted Alibaba's DCAF [2] linear programming solution to the takeaway scenario and implemented a locally optimal computing power allocation scheme for elastic queues (hereinafter "the first phase"). As shown in the figure above, in the takeaway display advertising link, both the recall channel and model decisions still use fixed strategies, so when computing power is insufficient, part of the revenue from high-quality traffic is lost.
In this article, we propose ES-MACA (Evolutionary Strategies based Multi-Action Computation Allocation), a multi-action computing power decision method based on evolutionary algorithms. On the takeaway advertising link, it jointly decides three actions: elastic channel, elastic queue, and elastic model. In downstream decisions we account for the state changes caused by upstream module decisions, and we use a jointly trained multi-task model to realize system simulation (offline simulation plus revenue estimation, i.e., a revenue evaluation function under different decision actions), thereby achieving optimal computing power allocation across the entire link. Compared with the first phase, ES-MACA achieved CPM +1.x% and revenue +1.x% in the takeaway display advertising business line.
2 Overall Idea
To cope with heavy online traffic and huge candidate sets, the takeaway ad delivery system organizes the entire retrieval process as a funnel-type cascade with progressively shrinking candidate sets, mainly comprising recall, coarse ranking, fine ranking, and mechanism modules. In the first phase, we defined the means of computing power allocation as elastic actions and, for the takeaway scenario, summarized four of them: elastic queue, elastic model, elastic channel, and elastic link. They are defined as follows:
- Elastic queue: online retrieval is a funnel process; traffic of different value can be allocated candidate queues of different lengths in each module of the cascade.
- Elastic model: in the model prediction service, models of different sizes can be selected for traffic of different value; a larger model predicts better than a smaller one but also consumes more computing power.
- Elastic channel: in recall, traffic of different value can choose recall channels of different complexity and different numbers of channels.
- Elastic link: on the retrieval link, traffic of different value can choose retrieval links of different complexity.
2.1 Formal description of computing power distribution problem
In a link containing M computing power decision modules, the goal of full-link optimal intelligent computing power can be stated generally as: intelligently decide the computing power level of each of the M modules so that overall traffic revenue is maximized while overall computing power satisfies its constraint.
A general formalization of the problem is described as:
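One common DCAF-style way to write this formalization (our notation, not necessarily that of the original figure) is:

$$
\max_{\{j_{i,m}\}} \ \sum_{i} \mathrm{Gain}\bigl(i;\, j_{i,1}, \dots, j_{i,M}\bigr)
\qquad
\text{s.t.} \quad \sum_{i} \sum_{m=1}^{M} \mathrm{Cost}_m\bigl(i, j_{i,m}\bigr) \le C
$$

where $j_{i,m}$ denotes the computing power gear chosen for request $i$ in module $m$, $\mathrm{Gain}$ is the traffic revenue of the request under the chosen gears, and $C$ is the overall system computing power budget.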
The above describes the scenario of multiple computing power decision modules. In takeaway display advertising, the modules most sensitive to computing power and revenue are the ad recall strategy, the fine-ranking queue length, and the fine-ranking prediction model, which correspond to the three actions of elastic channel, elastic queue, and elastic model respectively.
In this issue, we therefore consider joint decision-making over the computing power of these three modules: elastic channel, elastic queue, and elastic model.
When multiple modules make joint decisions, the actions of different modules for the same request affect one another. As shown in the figure below, the elastic channel decision determines the actual recall queue (including the candidate queue length, advertisement types, and other information), which directly affects the input state of the elastic queue; similarly, the elastic queue decision affects the input state of the elastic model. Therefore, in multi-action joint modeling we add a request "state" feature, so that decision actions interact with the system and better fit the evolution of the system state.
2.2 Challenge Analysis
In the first phase of takeaway intelligent computing power, we explored and improved the DCAF scheme for the takeaway advertising scenario, made a first attempt at modeling elastic allocation, and obtained good returns. More recently, Alibaba's CRAS [3] scheme presented a linear programming solution for jointly optimizing the queues of pre-ranking, coarse ranking, and fine ranking. From the perspective of elastic action types, CRAS elegantly solves the joint optimization of three elastic queues: through data analysis and reasonable assumptions, it decomposes the original problem into three independent, similar sub-problems and then solves each separately.
However, existing solutions are based on linear programming and focus only on one or more elastic queue optimization problems. When faced with action combinations beyond elastic queues, such as elastic channels and elastic models, they cannot be directly migrated. In particular, when constraints or optimization objectives change, a linear programming solution must be re-modeled and re-solved for the specific business problem, which consumes considerable manpower. Such solutions also often contain strong assumptions tied to business data; these assumptions may not hold in a new business, which further hinders extending and migrating existing solutions to new problems.
Due to the LBS constraint of food delivery, the candidate queue for takeaway ads is shorter than in non-LBS e-commerce scenarios and does not need the complex pre-ranking, coarse-ranking, fine-ranking pipeline. Across the whole link, we pay more attention to the computing power allocation of the recall channel, the fine-ranking queue length, and the fine-ranking prediction model, which in practice are the modules most sensitive to computing power.
Overall, the challenges of full-link optimal computing power allocation in the Meituan takeaway advertising scenario mainly include the following two aspects.
Generalization problem
- Challenge: existing solutions are too tightly coupled with the business. On the one hand, when constraints or optimization goals change, a linear programming solution must re-model the specific business problem; on the other hand, a specific business line often needs strong assumptions added based on its data characteristics. Takeaway advertising currently includes more than ten business lines, each with multiple computing power decision scenarios; modeling every scenario of every business line separately would carry a huge labor cost.
- Countermeasure: adopt a general solution and accumulate basic general capabilities that empower different computing power decision scenarios of the advertising business, reducing cost and increasing efficiency.
Sequential decision problem
- Challenge: during full-link computing power allocation, multiple decision modules are coupled and jointly affect the final computing power and revenue of the current traffic. As shown in the figure below, after an upstream action is decided, the agent must interact with the real environment to obtain the decision's result; system state transitions occur between modules, and traffic revenue is obtained only after the last module completes its decision. This makes conventional modeling difficult.
- Countermeasure: when modeling full-link optimal computing power allocation, add the system's "state" transition process at each stage, so that downstream modules decide based on the decision results of upstream modules and the request state.
Considering these two problems, we model full-link optimal computing power allocation for food delivery ads as a multi-stage decision problem (each decision module corresponds to one stage), deciding the recall plan, truncation queue length, and prediction model in sequence. At each stage the agent interacts with the environment and makes a decision, and the agent parameters can be solved with evolutionary algorithms or reinforcement learning.
The full-link computing power allocation process can be modeled as a Markov Decision Process (MDP) or a Partially Observable Markov Decision Process (POMDP). As shown in the figure above, a state transition occurs between adjacent stages, each stage has its own candidate actions (recall strategy, truncation length, prediction model number, etc.), and the Reward is obtained from system feedback only after the last stage's action is executed.
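In our notation, the trajectory of a single request under this three-stage modeling can be sketched as:

$$
s_1 \xrightarrow{\ a_1\ (\text{recall strategy})\ } s_2 \xrightarrow{\ a_2\ (\text{truncation length})\ } s_3 \xrightarrow{\ a_3\ (\text{model number})\ } r
$$

with intermediate rewards of zero, so the return of the request is simply the terminal Reward $r$ fed back by the system.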
We could collect online log data and solve for the agent with offline reinforcement learning (Offline RL), or use online reinforcement learning (Online RL) if loss of online revenue were not a concern. However, because the business scenarios are complex, it is hard to unify the computing power constraints of each stage, and with either offline or online reinforcement learning it is difficult to model and solve multi-stage hard constraints.
As a widely used and robust class of global optimization methods, evolutionary algorithms have the following advantages:
- Avoiding local optima: the parameter search has a degree of randomness and does not easily get stuck in local optima;
- Parallelizable: the parameter search can be parallelized, which alleviates the time cost of the evaluation process;
- Widely applicable: evolutionary algorithms can handle discontinuous, non-differentiable, and non-convex optimization problems without requiring much prior knowledge;
- Simple and easy to use: some evolutionary algorithms, such as the Cross-Entropy Method (CEM), handle various constrained problems elegantly without solving the constraints directly.
Evolutionary algorithms fit the takeaway advertising scenario well: they are easy to extend to other business lines and convenient for modeling a variety of decision problems. Therefore, in this issue we choose an evolutionary algorithm to solve full-link optimal computing power allocation in the food delivery scenario; in follow-up work we will also try reinforcement learning schemes.
As shown in the iteration path figure in this section, in issue 1.5 we tried ES-SACA (Evolutionary Strategies based Single-Action Computation Allocation), a single-action computing power decision method based on evolutionary algorithms, and verified their effectiveness in the computing power allocation scenario. This article focuses on ES-MACA, the multi-action method.
3 Scheme Design
To achieve optimal computing power allocation on the entire advertising link, we designed the following decision scheme:
Offline training: randomly select decision agent parameters, replay historical traffic in batches, and interact with the advertising simulation system to complete the state transition process; optimize the agent parameters according to the Reward returned by the system, and finally output the offline-optimal agent parameters and synchronize them online.
Online decision-making: for each online request, use the offline-optimal agent to interact with the online system and make decisions.
In this issue, we use an evolutionary algorithm to solve for the agent parameters. The core of evolutionary-algorithm parameter optimization is evaluating the value of combined actions. Because state transitions are involved, this evaluation is no longer a simple supervised learning problem: the agent must interact with the system stage by stage and execute its decisions, obtaining the revenue from the system only after the final stage's action completes. A simple solution is to let the agent learn online, optimizing its parameters while interacting with the system, but online learning would hurt business revenue, which is unacceptable to us. To solve this, we built an ad delivery simulator that mimics the online advertising environment; the simulator interacts with the agent and returns the Reward.
3.1 Optimal computing power decision of the whole link
3.1.1 Problem modeling
According to the delivery scenario of takeaway advertisements, we model the whole problem based on the evolutionary algorithm as follows:
- State: context features, request queue features, etc. (the state of a downstream module depends on upstream decisions; for example, the elastic channel decision directly affects the queue length seen by the elastic queue).
- Action: defined differently at each stage.
- Elastic channel: recall action, a one-dimensional vector $(a_1, a_2, a_3, ...)$, where $a_i \in \{0,1\}$ indicates whether channel $i$ is recalled.
- Elastic queue: truncation length, an integer value.
- Elastic model: model number, an integer value.
- Reward: the revenue target is business revenue. To ensure that the solved parameters satisfy the computing power constraints, the constraints are added to the Reward as penalties; the stricter a constraint, the larger its computing power coefficient $\lambda_n$.
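As a minimal sketch of such a penalized Reward (the symbols below are our assumptions, not the production formula), for one batch of replayed traffic one might write:

$$
\mathrm{Reward} = \sum_{i} r_i \;-\; \sum_{n=1}^{N} \lambda_n \cdot \max\bigl(0,\ \mathrm{Cost}_n - C_n\bigr)
$$

where $r_i$ is the estimated revenue of request $i$, and $\mathrm{Cost}_n$ and $C_n$ are the consumed and budgeted computing power of the $n$-th constraint, so a stricter constraint is given a larger $\lambda_n$.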
3.1.2 Offline parameter solution
The offline parameter solving consists of two modules: evolutionary-algorithm parameter optimization and Reward evaluation.
- Parameter optimization module: implements the general evolutionary-algorithm parameter search, covering parameter initialization, parameter evaluation (which depends on the Reward evaluation module), parameter sampling, and parameter evolution, and finally outputs the optimal parameters.
- Reward evaluation module: given a specific set of agent parameters, replays online traffic in batches, lets the agent interact with the environment (offline simulation), and finally estimates the revenue of the current parameters from the interaction results.
3.1.2.1 Parameter optimization
The parameter optimization module uses an evolutionary algorithm to solve for the parameters. Taking CEM as an example, the parameter solution process is as follows (a minimal code sketch follows the list):
- Parameter initialization: initialize the parameter mean and variance, and randomly sample N groups of parameters from them.
- Reward evaluation:
- Offline simulation: replay traffic, let the agent corresponding to the current parameters interact with the offline simulator, and complete the state transition process; after all module decisions finish, the offline simulation module outputs the interaction results of the replayed traffic.
- Revenue estimation: estimate the expected revenue under the current parameters from the replayed-traffic interaction results.
- Parameter selection: aggregate the expected revenue per parameter group and select the Top-K groups that maximize the overall revenue of all traffic.
- Parameter evolution: compute a new parameter mean and variance from the Top-K parameters.
- Parameter sampling: resample N groups of parameters from the new mean and variance, and return to the Reward evaluation step until the mean and variance converge.
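The following minimal Python sketch illustrates this CEM loop. The function and parameter names are our assumptions; `evaluate` stands for the Reward evaluation module described below.

```python
import numpy as np

def cem_search(evaluate, dim, n_samples=64, top_k=10, n_iters=50,
               init_std=1.0, tol=1e-3):
    """Cross-Entropy Method parameter search (illustrative sketch).

    evaluate: maps one parameter vector (an agent) to the estimated
              overall Reward of the replayed traffic.
    """
    mean, std = np.zeros(dim), np.full(dim, init_std)
    for _ in range(n_iters):
        # Parameter sampling: draw N candidate parameter groups.
        samples = np.random.randn(n_samples, dim) * std + mean
        # Reward evaluation: offline simulation + revenue estimation.
        rewards = np.array([evaluate(p) for p in samples])
        # Parameter selection: keep the Top-K groups by overall revenue.
        elite = samples[np.argsort(rewards)[-top_k:]]
        # Parameter evolution: refit mean and variance on the elite set.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
        if std.max() < tol:  # mean/variance converged
            break
    return mean
```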
Tip: in this scenario NES did not work as well as CEM, because for constrained (especially multi-constrained) problems NES places very high demands on Reward design, and in real scenarios it is difficult for it to solve for parameters that strictly satisfy the constraints.
3.1.2.2 Reward evaluation
Offline Reward evaluation process (during offline training, for a selected agent and historical traffic; a condensed code sketch follows the steps):
- Step 1: the simulator constructs the initial state features of the traffic and feeds them to the agent.
- Step 2: the agent decides the recall channel gear according to the traffic state features given by the simulator.
- Step 3: the simulator performs queue recall according to the agent's recall decision and feeds the recall result back to the agent.
- Step 4: the agent decides the queue length based on the recall result and the initial traffic state.
- Step 5: the simulator simulates the truncation operation according to the agent's queue length decision and feeds the truncated queue state back to the agent.
- Step 6: the agent decides the prediction model number according to the truncated queue.
- Step 7: the simulator outputs an ad list and decision-related features according to the model number decision.
- Step 8: feed the offline-simulated ad list into the revenue estimation model and estimate the offline revenue of each request.
- Step 9: aggregate the Reward of the overall traffic as the evaluation result of the current agent policy.
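A condensed Python sketch of this loop is shown below. All object and method names (`simulator`, `agent`, `revenue_model`, etc.) are hypothetical interfaces chosen for illustration, not the production API.

```python
def evaluate_agent(agent, traffic_logs, simulator, revenue_model):
    """Offline Reward evaluation for one agent (illustrative sketch)."""
    total_reward = 0.0
    for request in traffic_logs:
        state = simulator.initial_state(request)                    # Step 1
        channels = agent.decide_recall_channels(state)              # Step 2
        recall_result = simulator.recall(request, channels)         # Step 3
        length = agent.decide_queue_length(state, recall_result)    # Step 4
        queue = simulator.truncate(recall_result, length)           # Step 5
        model_id = agent.decide_model_id(state, queue)              # Step 6
        ad_list, features = simulator.rank(queue, model_id)         # Step 7
        total_reward += revenue_model.predict(ad_list, features)    # Step 8
    return total_reward  # overall traffic Reward for this policy     Step 9
```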
3.1.2.2.1 Offline Simulation
The dilemma of interacting with the online environment (why offline simulation is necessary): in theory, an agent interacting with the online environment obtains the most realistic Reward feedback, but exploring directly with online traffic causes the following problems:
- Loss of online revenue: online exploration of agent revenue is costly; especially in the early stage of policy learning, decisions are almost random, and neither the online computing power constraint nor revenue can be guaranteed.
- Low traffic utilization: agent learning often requires dozens or even hundreds of training rounds, each containing multiple groups of candidate parameters; to accumulate statistically confident data, each parameter group cannot be given too little traffic, so overall the training time and efficiency would be unacceptable.
The ultimate goal of offline simulation: reproduce the online interaction logic and revenue feedback.
- Basic idea: although the complex online environment cannot be fully reproduced, by following the online interaction logic, an offline ad-system simulator can trade off efficiency against accuracy.
- Supporting module: to this end, we use a supervised learning model to estimate the traffic Reward for given ad queue information.
Offline simulation + revenue estimation solution:
- Online random exploration traffic: reserve a small amount of random exploration traffic online, randomly decide the candidate actions at each stage, and record the traffic logs and the online system's interaction results.
- Offline simulation system: imitating the online logic, generate offline interaction results for historical traffic logs, simulating recall, queue truncation, and coarse CTR estimation.
- Revenue estimation : As the core module of offline Reward evaluation, revenue estimation determines the evolution direction of parameters. We will introduce the revenue estimation scheme in detail in the next section.
3.1.2.2.2 Revenue Estimation
Goals and Challenges
- Goal: based on online blank (non-intervened) traffic and random exploration traffic, estimate the expected revenue of a request under different actions.
- Challenge: unlike the traditional "user-ad" granularity CTR, CVR, and GMV estimation tasks on a local link, this is full-link revenue estimation at request granularity, covering exposure, click, and order (conversion); the problem is more complicated, especially in the face of data sparsity.
- Data sparsity: because the modeled link is long and user conversion data is very sparse, most traffic has no conversion action (meaning merchant revenue is 0).
Model estimation scheme
Model design
- Considering that merchant revenue data is very sparse while exposure and click data are relatively dense, and that exposure (platform revenue), click, and order (merchant revenue) are strongly correlated behaviors, this estimation scheme models them jointly with a multi-task model, as sketched below.
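A minimal PyTorch sketch of such a multi-task model follows; the layer sizes and the ESMM-style probability chaining across exposure, click, and order are our assumptions, not the production architecture.

```python
import torch.nn as nn

class MultiTaskRevenueModel(nn.Module):
    """Jointly models exposure, click, and order (illustrative sketch)."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        # Shared bottom: dense traffic features feed all three tasks.
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.expose_head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())
        self.click_head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())
        self.order_head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, x):
        h = self.shared(x)
        p_expose = self.expose_head(h)            # exposure (platform revenue)
        p_click = p_expose * self.click_head(h)   # click, chained on exposure
        p_order = p_click * self.order_head(h)    # sparse order (merchant revenue)
        return p_expose, p_click, p_order
```

Chaining the sparse order signal behind the denser exposure and click signals lets the dense tasks regularize the sparse one, which is one common way to mitigate the data sparsity described above.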
Feature engineering
- The features of each stage are discretized and added to the model through Embedding.
- Based on the distribution of traffic data under different queue lengths, the queue length and related features are manually bucketed and then added to the model through Embedding.
3.1.3 Online Decision Making
For each online request, the offline-optimal agent interacts with the online system to make decisions. Consistent with the offline evaluation process, decisions proceed in the following order:
- Step 1: the system feeds the initial traffic state to the agent.
- Step 2: the agent decides the recall channel gear according to the system's traffic state.
- Step 3: the system performs queue recall according to the agent's recall decision and feeds the recall result back to the agent.
- Step 4: the agent decides the queue length based on the recall result and the initial traffic state.
- Step 5: the system performs the truncation operation according to the agent's queue length decision and feeds the truncated queue state back to the agent.
- Step 6: the agent decides the prediction model number according to the truncated queue state.
- Step 7: the system calls the prediction service with the model number given by the agent.
3.2 System Construction
In the first phase of intelligent computing power, we completed the basic construction of an intelligent computing power system centered on the decision component and supported by the collection, regulation, and offline components. In this issue, we focus on the core requirement of scaling from single-action locally optimal decisions to multi-action combinatorially optimal decisions. In system construction, beyond the basic capability of multi-action combined decision-making, we pay more attention to the stability and generality of the system, so as to support its application across all takeaway advertising business lines.
3.2.1 Decision Component Agent
As the client of the intelligent computing power system, the decision component (Agent) is embedded in each module of the ad delivery system and is responsible for distributing the system's traffic and computing power and making decisions. In this issue, we mainly made the decision capability lighter and more refined, and standardized the related capabilities.
In terms of decision-making capability
Building lightweight multi-action combined decision-making: we implemented a lightweight multi-action combined decision capability based on the evolutionary algorithm. The algorithm itself was introduced above; here we focus on the lightweight aspect.
- Why lightweight: the ad delivery system has strict online latency requirements, and under multiple actions decisions must be made sequentially; the number of decisions is in principle equal to the number of decision actions, so intelligent computing power decisions must be as lightweight as possible to meet online RT requirements.
- How: (1) localize the model to reduce network latency, which is also the main reason we encapsulate the decision capability as an SDK rather than building a model decision service; (2) make the model lightweight, using feature engineering to reduce the number of features as much as possible and ease the online feature-processing load; (3) parallelize decision-making, overlapping decision actions with the existing online process as much as possible to reduce overall link latency.
- Lightweight effect: compared with single-action decision-making, multi-action combined decision-making adds TP99 +1.8ms and TP999 +2.6ms to the ad link latency, which meets online RT requirements.
Building refined system-state feedback control: based on real-time collection of system state and a PID feedback control algorithm, we fine-tune the computing power gear parameters to keep the advertising system stable during dynamic computing power allocation.
- Why refinement: stability is critical in the ad delivery system. Moving from single-action to complex multi-action decisions, the number of parameter gears for intelligent computing power decisions keeps growing, with an ever greater impact on system stability, and coarse-grained state feedback control can no longer guarantee it. In the first phase's elastic queue solution, stability-control anomalies did occur: when control was based only on coarse-grained cluster-level state data, abnormal behavior of a single machine occasionally caused drastic swings in the overall cluster state, making computing power regulation unstable.
- How: on one hand, refine the system state data, from cluster granularity down to computer room and single-machine granularity, with data indicators supporting fine-grained custom extension. On the other hand, refine the control goals and strategies: the control goals extend from overall cluster stability to computer room and single-machine stability. We define the smallest unit of real-time system-state feedback control as a governor, and each control goal is supported by one or a group of governors. In addition, to better support single-machine feedback control, we migrated the system-state feedback control capability from the regulation component into the decision component: the decision component directly collects some single-machine state indicators by reading container information, and applies the control results to its host machine, forming closed-loop control. Single-machine feedback control thus no longer depends strongly on the link feedback of the collection component, and the state-feedback delay drops from seconds to milliseconds, greatly improving the accuracy and efficiency of feedback control.
In terms of standardized construction
Multi-action combined decision-making places new requirements on online decision-making: we must consider generality and build up reusable basic capabilities. The advertising engineering architecture has completed staged platform construction [4], of which standardization is the foundation; accordingly, the intelligent computing power decision component has been standardized in functions, data, and processes. This standardization is essential as intelligent computing power grows from single-action to multi-action combined decisions and then expands to major business scenarios (point -> line -> surface).
- Functional standardization
We abstract the smallest indivisible functional unit as an Action. Actions on the intelligent computing power decision link mainly include experiment, feature extraction, feature computation, dictionary processing, parameter processing, DCAF decision, ES-MACA decision, system-state feedback control, log collection, and monitoring. Reusing and extending Actions improves access efficiency for new action scenarios and business lines.
- Data standardization
In the platform construction of advertising engineering, a Context describes the environmental dependencies of Action execution, including input, configuration, and environment-parameter dependencies. For online intelligent computing power decisions, we extend an intelligent computing power Context from the basic advertising Context to encapsulate and maintain its environmental dependencies, mainly including standardized inputs and outputs, decision features, decision parameters, and decision strategies; Actions exchange data through the Context.
- Process standardization
A business invocation process is a composition of functions and data, and a unified process design pattern is the core means of reusing business functions and improving efficiency. Based on the platform management console and scheduling engine, DAG orchestration and scheduling of intelligent computing power functions is achieved through visual drag-and-drop of Actions.
3.2.2 Collection and Regulation Components
The collection component collects the status data of the ad delivery system in real time and performs standardized preprocessing. The regulation component relies on this status data to perceive the state of the entire delivery system in real time and to control computing power at module granularity; it also acts as the central control service of the intelligent computing power system, responsible for its system management, including business management, policy management, action management, and meta-information management.
We define the smallest unit of real-time system-state feedback control as a governor. Each action decision changes the computing power of one or more modules, and each module's computing power change affects multiple data indicators, so an action may require multiple governors. From single-action to multi-action decisions, the number of governors grows, so improving governor management and access efficiency becomes a key issue. Here we standardize heterogeneous data and generalize the regulation process, which essentially enables configuration-based access for new regulation scenarios, without development or release.
Heterogeneous data standardization
The collection component has multiple heterogeneous data sources, including business data reported to Meituan's monitoring system CAT, machine indicator data collected by Falcon, and data reported by some decision components. After analyzing the data formats and content, we first partition data by system module (Appkey), with data independent across Appkeys; by data type (Type), into business indicators (Biz) and machine indicators (Host); and by data dimension (Dimension), into cluster granularity (Cluster), computer room granularity (IDC), and single-machine granularity (Standalone). The specific indicators (Metric) include QPS, TP99, FailRate, and other extensible indicators.
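A minimal Python sketch of this unified expression might look as follows (the field and enum names are our assumptions based on the description):

```python
from dataclasses import dataclass
from enum import Enum

class MetricType(Enum):        # data type (Type)
    BIZ = "biz"                # business indicators
    HOST = "host"              # machine indicators

class Dimension(Enum):         # data dimension (Dimension)
    CLUSTER = "cluster"
    IDC = "idc"                # computer room granularity
    STANDALONE = "standalone"  # single-machine granularity

@dataclass
class SystemMetric:
    """One standardized data point from a heterogeneous source."""
    appkey: str                # system module identifier (Appkey)
    type: MetricType
    dimension: Dimension
    metric: str                # e.g. "QPS", "TP99", "FailRate"
    value: float
    timestamp_ms: int
```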
Generalizing the control process
With a unified expression of heterogeneous data, we can design a general regulation process. We use ProductId as the unique identifier of a regulation business scenario and ControllerId as the unique identifier of a governor; a ProductId maps to a set of ControllerIds, and each governor contains input indicators, a control strategy, strategy parameters, and output results. The general regulation process is: obtain the configured input indicators and regulation strategy, select strategy parameters according to the strategy, and execute it to obtain the corresponding output results.
In addition, we optimized the regulation efficiency and stability of the governors. In the double-peak traffic scenario of takeaway, the PID algorithm's error tends to over-accumulate during off-peak hours, leading to long regulation cycles and slow state-feedback adjustment during peak hours; meanwhile, system jitter or data glitches can trigger unnecessary regulation.
To address this, we adopt a sliding-window admission and calibration mechanism to improve efficiency and accuracy. As shown in the figure below, we maintain a sliding statistical window of system indicators for each governor. When the system indicator reaches T - P (the PID target value T minus a threshold P) for M consecutive times, the governor is admitted and error accumulation starts; conversely, when the indicator stays below T - Q for N consecutive times, the governor is calibrated and its accumulated error is cleared.
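The sketch below illustrates this mechanism on top of a basic PID loop; the class and parameter names are our assumptions, with T, P, Q, M, and N as in the text.

```python
from collections import deque

class WindowedPIDGovernor:
    """PID governor with sliding-window admission/calibration (sketch)."""
    def __init__(self, kp, ki, kd, target, p, q, m, n):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.target, self.p, self.q = target, p, q  # T, P, Q from the text
        self.m, self.n = m, n                       # consecutive-hit counts
        self.window = deque(maxlen=max(m, n))
        self.integral = self.prev_error = 0.0
        self.admitted = False

    def update(self, metric):
        self.window.append(metric)
        w = list(self.window)
        # Admission: indicator >= T - P for M consecutive observations
        # starts error accumulation.
        if not self.admitted and len(w) >= self.m and \
                all(v >= self.target - self.p for v in w[-self.m:]):
            self.admitted = True
        # Calibration: indicator < T - Q for N consecutive observations
        # clears the accumulated error and exits regulation.
        if self.admitted and len(w) >= self.n and \
                all(v < self.target - self.q for v in w[-self.n:]):
            self.admitted, self.integral = False, 0.0
        if not self.admitted:
            return 0.0  # no control signal during off-peak lulls
        error = metric - self.target
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```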
3.2.3 Offline Components
The offline component is responsible for offline model training and parameter solving. It mainly consists of three parts: sample collection, model training, and parameter solving.
- Sample collection: reserve a small amount of random exploration traffic online, randomly decide the recall channels, queue lengths, and prediction models, and log the random actions together with the system interaction data.
- Model training: process the random traffic logs offline, generate training samples, and train a DNN revenue estimation model.
- Parameter solving: during CEM solving, for a given strategy, simulate the online interaction environment to generate traffic request information, then use the revenue estimation model to estimate the revenue of the current ad queue, thereby realizing CEM strategy evaluation.
4 Experiments
4.1 Experimental setup
Selection of system computing power capacity
The computing power capacity indicator is consistent with the first phase. On one hand, to ensure the online system can adjust quickly to real-time traffic, 15 minutes is still chosen as the minimum regulation unit; on the other hand, the system capacity used in offline simulation is the computing power of the afternoon-peak traffic over the past week.
Baseline selection
The traffic without intelligent computing power (fixed decision-making) was selected as the control group.
Offline simulator: traffic value estimation
Use non-experimental-group data from the past 14 days as the training set, with two-stage training (stage one on full traffic, stage two on random exploration traffic), and use the current day's random exploration traffic as the test set.
Offline parameter solver
In the takeaway scenario, the traffic trend is basically the same as in the previous month. By replaying the past week's traffic, we solve offline for the optimal parameters of each time slice (15 minutes per slice) and store them as a lookup table.
4.2 Offline experiment
| | Reward |
| --- | --- |
| Baseline (system computing capacity = C) | +0.00% |
| Elastic channel only (system computing capacity = C) | +1.x% |
| Elastic queue only (system computing capacity = C) | +3.x% |
| Elastic model only (system computing capacity = C) | +1.x% |
| Optimal sub-module (system computing capacity = C) | +5.x% |
| ES-MACA (system computing capacity = C) | +5.x% |
Experiment Description:
- Baseline : The fixed decision result under computing power C.
- Elastic channel only: queue and model decisions use the Baseline fixed scheme; the "Elastic queue only" and "Elastic model only" groups are defined analogously.
- Optimal sub-module: learn the elastic channel, elastic queue, and elastic model in turn; while the current module is learning, upstream modules are fixed at their already-learned optimal parameters and downstream modules use the Baseline fixed scheme.
- ES-MACA (full-link optimization): learn the elastic channel, elastic queue, and elastic model simultaneously.
From the effect of offline experiments, we have the following conclusions:
- The sum of the gains of the three single-action optima exceeds the gain of the optimal sub-module scheme and also exceeds ES-MACA, indicating that the three modules' strategies affect one another: the gains of multiple actions under joint optimization are not a simple sum of the single-action gains.
- The optimal sub-module scheme performs worse than ES-MACA (ES-MACA improves on it by 0.53%), indicating that downstream modules' strategies also influence the decision effect of upstream modules.
4.3 Online experiment
Through a week of online A/B testing, we verified the benefits of this scheme on takeaway advertising, as follows:
| | CPM | GMV | Revenue | CTR | CVR | Machine resources |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline (system computing capacity = C) | +0.00% | +0.00% | +0.00% | +0.00% | +0.00% | +0.00% |
| Elastic queue only (system computing capacity = C) | +0.x% | +2.x% | -0.x% | +0.x% | +1.x% | -0.05% |
| ES-MACA (system computing capacity = C) | +1.x% | +2.x% | +1.x% | +0.x% | +1.x% | -0.41% |
Description of the experimental design:
- Baseline : The control group, without any intelligent computing power decision.
- Elastic queue only: experimental group 1, deciding only the elastic queue (same as the first-phase scheme).
- ES-MACA (full-link optimal): experimental group 2, deciding elastic channels, elastic queues, and elastic models simultaneously.
5 Summary and Outlook
This article introduced the technological evolution of intelligent computing power in Meituan takeaway advertising from linear programming to evolutionary algorithms, covering both full-link optimal computing power decision-making and system construction, and presented ES-MACA, a multi-action computing power allocation scheme based on an evolutionary algorithm.
In the future, on the algorithm side we will try reinforcement learning to model and solve full-link combinatorial optimal computing power allocation; on the system side, we will work with Meituan's infrastructure teams to expand from the online system to a three-tier online/nearline/offline system, unifying decision-making and scheduling through intelligent computing power to fully tap the potential of data and computing power.
6 References
- [1] Shunhui, Jiahong, Song Wei, Guoliang, Qianlong, Le Bin, et al. Exploration and Practice of Intelligent Computing Power in Meituan Takeaway Advertising.
- [2] Jiang, B., Zhang, P., Chen, R., Luo, X., Yang, Y., Wang, G., ... & Gai, K. (2020). DCAF: A Dynamic Computation Allocation Framework for Online Serving System. arXiv preprint arXiv:2006.09684.
- [3] Yang, X., Wang, Y., Chen, C., Tan, Q., Yu, C., Xu, J., & Zhu, X. (2021). Computation Resource Allocation Solution in Recommender Systems. arXiv preprint arXiv:2103.02259.
- [4] Le Bin, Guoliang, Yulong, Wu Liang, Leixing, Wang Kun, Liu Yan, Siyuan, et al. Exploration and Practice of Advertising Platformization: Meituan Takeaway Advertising Engineering Practice Series.
7 Authors
Jiahong, Shunhui, Guoliang, Qianlong, Le Bin, et al. are all from the Meituan takeaway advertising technical team.