1. Background
With the rapid growth of the Internet, a huge amount of information is now available online, and recommending information that interests each user is a major challenge. Various recommendation algorithms and systems have emerged in response. The estimation engine is one of the more important parts of a recommendation system, because its quality directly affects how well the algorithm performs. Given OPPO's business scenarios, our estimation engine needs to solve several problems:
(1) Versatility: OPPO has many recommendation scenarios, including information flow, stores, short videos, alliances, lock screen magazines, music, and other services. The estimation engine must support all of them, so the framework must be general-purpose.
(2) Multi-model estimation: In advertising scenarios, oCPX must be supported, so CTR and CVR need to be estimated in the same request, and the CVR part may involve several different conversion models (such as download, payment, etc.).
(3) Dynamic model updates: Some models are updated hourly and others daily, so models must be updated dynamically and transparently, without callers noticing.
(4) Scalability: The features, dictionaries, and model types used by each business may differ, so the engine must be easy to extend.
2. Positioning and technology selection
Because the businesses differ considerably, a unified estimation engine that also supported each business's strategy experiments and traffic-splitting experiments would become bloated and unmaintainable. We therefore decided that the estimation engine does only estimation-related work.
Getting from raw data to an estimated CTR requires feature extraction followed by model prediction on the extracted features. The extracted features are relatively large, about 2M for a single estimation, and transmitting that much data over the network would take too long. The estimation engine therefore performs both steps itself: feature extraction and model estimation.
3. Design and Implementation of Predictor
3.1 The position of the Predictor module in the entire recommendation system
We use the OPPO mobile application market as an example to illustrate the position of the predictor module in the overall system. The ranker module is responsible for sorting, including the various traffic-splitting experiments and operating strategies. It communicates with the predictor module through the gRPC framework.
3.2 The main flow of the Predictor module
The figure shows the processing flow of two requests. The features in the figure cover multiple feature confs, and features are extracted once for each sample under each feature configuration. Estimation may likewise involve multiple models, so an estimate is produced for each combination of sample, conf, and model. Feature extraction relies on external dictionaries, and estimation relies on external model files. Updates to these external files and double-buffer switching are all handled by the dictionary management module. The following sections describe the dictionary management module, sub-task splitting, feature extraction, estimation, and merging in detail.
3.3 Dictionary management module
As shown in the figure below, conf1, conf2, lr_model_x0, etc. represent files on disk. Each file is parsed by its own dictionary parsing class, which manages loading the file and switching between its double buffers. For example, FeatureConfDict is responsible for parsing conf1 and stores two FeatureConfList buffers internally. When the conf1 file is updated, it is loaded into the standby FeatureConfList, while the service continues to use the main FeatureConfList pointer during loading. After loading completes, the main and standby buffers are switched, and the old buffer's memory is released once no request is still using it.
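The double-buffer switch described above can be sketched roughly as follows. This is a minimal illustration, not the predictor's actual code: the class name `DoubleBufferDict`, the `parse` callback, and the assumption of a single updater thread are all hypothetical. In Python, the old buffer is freed by garbage collection once the last in-flight request drops its reference, which stands in for the explicit release step.

```python
import threading


class DoubleBufferDict:
    """Hypothetical stand-in for a dictionary parsing class such as FeatureConfDict."""

    def __init__(self, parse, initial_text):
        self._parse = parse                    # turns file contents into an in-memory object
        self._bufs = [parse(initial_text), None]
        self._main = 0                         # index of the buffer that serves requests
        self._lock = threading.Lock()

    def get(self):
        # A request takes a reference to the main buffer. Holding that reference
        # keeps the buffer alive even if a switch happens mid-request.
        with self._lock:
            return self._bufs[self._main]

    def reload(self, new_text):
        # Assumption: only one updater thread calls reload().
        loaded = self._parse(new_text)         # slow load happens off the request path
        with self._lock:
            old = self._main
            standby = 1 - old
            self._bufs[standby] = loaded
            self._main = standby               # main/standby switch
            self._bufs[old] = None             # drop our reference to the old buffer
```

A request that called `get()` before `reload()` keeps seeing the old contents until it finishes, while new requests see the new contents.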
3.4 Sub-task logic
A received request specifies multiple confs and multiple models to use for estimation, as shown below:
The figure above shows a total of 8 samples and two confs, conf0 and conf1, used to extract features. The features from conf0 are estimated by model0 and model1, and the features from conf1 are estimated by model2 and model3. With 2 samples per task, splitting the request yields 4 tasks, as shown in the following figure:
The request is split into 4 tasks along the sample dimension, and the tasks are submitted to the thread pool for execution.
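The splitting step can be sketched as below, under stated assumptions: `split_tasks`, `run_task`, and `predict` are hypothetical names, and `run_task` returns a placeholder score of 0.0 instead of running real feature extraction and model estimation.

```python
from concurrent.futures import ThreadPoolExecutor


def split_tasks(samples, samples_per_task):
    """Cut the sample list into consecutive chunks of samples_per_task."""
    return [samples[i:i + samples_per_task]
            for i in range(0, len(samples), samples_per_task)]


def run_task(task_samples, conf_to_models):
    """Placeholder for extraction + estimation on one chunk of samples."""
    results = []
    for sample in task_samples:
        for conf, models in conf_to_models.items():
            # Real code would extract features with `conf` once here,
            # then feed them to every model bound to that conf.
            for model in models:
                results.append((sample, conf, model, 0.0))  # dummy score
    return results


def predict(samples, conf_to_models, samples_per_task=2, workers=4):
    tasks = split_tasks(samples, samples_per_task)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_task, t, conf_to_models) for t in tasks]
        return [f.result() for f in futures]  # one result list per task
```

With 8 samples and 2 samples per task, `split_tasks` produces the 4 tasks from the example in the text.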
3.5 Merge process
After estimation is complete, the results of all tasks must be reorganized along the conf/model dimension before they are returned to the ranker. The final response is shown in the right sub-figure of the following figure.
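As a rough sketch of the merge step, assume each task returns a list of `(sample_index, conf, model, score)` tuples. That tuple layout and the `merge` function are assumptions made for illustration, since the actual response format is only shown in the figure.

```python
from collections import defaultdict


def merge(task_results):
    """Regroup per-task results by (conf, model), ordered by sample index."""
    merged = defaultdict(dict)
    for task in task_results:
        for sample_index, conf, model, score in task:
            merged[(conf, model)][sample_index] = score
    # Sort by sample index so each score list lines up with the request order,
    # regardless of the order in which tasks finished.
    return {key: [scores[i] for i in sorted(scores)]
            for key, scores in merged.items()}
```

Each `(conf, model)` pair ends up with one score per sample, which is the shape the ranker needs in order to compare models side by side.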
3.6 Design of Feature Extraction Framework
The feature extraction framework is used both online and offline, which helps keep online and offline results consistent. Both usage scenarios therefore have to be considered in the design.
Online, the data carried by the request is combined with dictionary data into a sample, and features are extracted from that sample. Offline, the framework is compiled into a shared library (.so) and called from MapReduce, and each sample is obtained by deserializing a line of text in HDFS.
3.6.1 Feature configuration file format
The feature configuration file consists of two parts: the schema part and the feature operator part.
3.6.2 schema part
There are 5 schema configurations in the above figure:
user_schema: Describes the current user. It is used only in online mode and is carried by the upstream request.
item_schema: Describes the recommended item. It is used only in online mode; part of it is carried by the request, and part of it is read from the dictionary file.
context_schema: Describes the recommendation context. It is used only in online mode and is carried by the request. For example: whether the current network is Wi-Fi or 4G.
all_schema: Describes the schema of the final sample. In online mode, the fields of user_schema, item_schema, and context_schema are placed at their corresponding positions in all_schema. In offline mode, lines of text in HDFS are deserialized according to the types specified by all_schema_type. In both modes, the fields of the final sample seen by the feature framework are stored in the order given by all_schema.
all_schema_type: Used only in offline mode. It specifies the type of each field in all_schema. The type names are predefined, and in offline mode each field is deserialized according to its type.
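The sketch below shows how the online and offline paths could both end up with a sample in all_schema order. The field names, the type names (`string`, `int`, `float`), and the tab separator are illustrative assumptions, not the framework's actual definitions.

```python
# Hypothetical schema; the real field and type names come from the config file.
ALL_SCHEMA = ["user_id", "age", "item_id", "network"]
ALL_SCHEMA_TYPE = ["string", "int", "string", "string"]
PARSERS = {"string": str, "int": int, "float": float}


def build_online_sample(user, item, context):
    """Online: place request/dictionary fields at their all_schema positions."""
    merged = {**user, **item, **context}
    return [merged.get(name) for name in ALL_SCHEMA]


def build_offline_sample(line, sep="\t"):
    """Offline: deserialize one text line according to all_schema_type."""
    parts = line.rstrip("\n").split(sep)
    return [PARSERS[t](p) for t, p in zip(ALL_SCHEMA_TYPE, parts)]
```

Because both functions produce the same field order, the same feature operators can run unchanged on either kind of sample.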
3.6.3 Feature Operator Configuration Part
Each feature includes the following fields:
Name: Feature name
Class: The feature operator class used for this feature; it matches the class name in the code
Slot: A unique identifier for the feature
Depend: The fields this feature depends on; each field must exist in all_schema above
Args: Parameters passed to the feature operator; the framework converts them to float before passing them to the operator
Str_args: Parameters passed to the feature operator as strings
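To illustrate how the Class, Args, and Str_args fields might be consumed, here is a small sketch of an operator registry keyed by class name. The `Bucket` operator, the `register` decorator, the dictionary-style config entry, and all values in it are hypothetical; the source does not show the real operator interface or file syntax.

```python
OPERATORS = {}


def register(cls):
    """Make an operator class discoverable by its class name."""
    OPERATORS[cls.__name__] = cls
    return cls


@register
class Bucket:
    """Hypothetical operator: maps a number to the count of boundaries it reaches."""

    def __init__(self, args, str_args):
        self.bounds = args          # already converted to float by the framework
        self.str_args = str_args

    def extract(self, values):
        v = values[0]               # value of the single field listed in "depend"
        return sum(1 for b in self.bounds if v >= b)


def build_operator(conf):
    """Instantiate the operator named by conf["class"], converting args to float."""
    args = [float(a) for a in conf.get("args", [])]
    return OPERATORS[conf["class"]](args, conf.get("str_args", []))


# Hypothetical feature entry mirroring the fields listed above.
conf = {"name": "u_age_bucket", "class": "Bucket", "slot": 101,
        "depend": ["age"], "args": ["18", "30", "45"]}
op = build_operator(conf)
```

Looking the class up by name keeps the configuration file as the single place where a feature is bound to its operator, so adding an operator only requires registering a new class.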
3.6.4 Features are divided into groups (common and uncommon)
Within one estimation request, the user and context information is the same for all samples. Features that depend only on user and context information therefore need to be extracted just once and can be shared by all samples.
Some features in the feature configuration depend on other features (such as combined features and cvm features), so the dependencies must be analyzed to determine which fields a feature ultimately depends on.
The i_id feature depends on the item_id field of the item, so it is an uncommon feature.
The u_a and net features depend only on user_schema or context fields and not on any item field, so they are common features.
The u_a-i_id combined feature depends on the i_id feature and therefore indirectly on item_id, so it is an uncommon feature.
The u_a-net combined feature depends only on the u_a feature and the network field, so it is a common feature. During feature extraction it is computed only once per request.
Note: The grouping of features is computed by the asynchronous dictionary update thread, not on the request path.
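The dependency analysis can be sketched as a recursive resolution of each feature down to raw schema fields. The function names and the `age` field are assumptions for illustration; the sketch also assumes that feature names never collide with field names and that dependencies contain no cycles.

```python
def resolve_fields(name, features, cache=None):
    """Return the set of raw schema fields that `name` ultimately depends on."""
    cache = {} if cache is None else cache
    if name in cache:
        return cache[name]
    if name not in features:        # not a feature, so it is a raw schema field
        return {name}
    fields = set()
    for dep in features[name]:
        fields |= resolve_fields(dep, features, cache)
    cache[name] = fields
    return fields


def group_features(features, item_fields):
    """Split features into common (no item field involved) and uncommon."""
    common, uncommon = [], []
    for name in features:
        if resolve_fields(name, features) & item_fields:
            uncommon.append(name)
        else:
            common.append(name)
    return common, uncommon


# Feature name -> direct dependencies (fields or other features).
features = {
    "i_id": ["item_id"],
    "u_a": ["age"],                 # "age" is a hypothetical user field
    "net": ["network"],
    "u_a-i_id": ["u_a", "i_id"],
    "u_a-net": ["u_a", "network"],
}
common, uncommon = group_features(features, item_fields={"item_id"})
```

Running the grouping once at configuration load time means the request path only has to iterate over two precomputed lists.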
3.7 Estimation part
As mentioned earlier, a request specifies which feature configuration file to use for feature extraction and which model to use for estimation. Every model is updated by an asynchronous dictionary update thread. The engine currently supports models such as LR, FM, DSSM, and Deep&Wide, and is relatively easy to extend. The following briefly introduces two models that have been deeply optimized for our business scenarios:
3.7.1 FM model estimation (LR is similar)
In our business scenarios, the user and context information is the same for all samples in a request, so the online FM estimate can be written in this form:
The part highlighted in red is the same for all samples and is computed only once per request.
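The formulas themselves were figures in the original. As a sketch, assuming the standard second-order FM with bias $w_0$, linear weights $w_i$, and $k$-dimensional factors $v_i$, and writing $U$ for the user/context features and $I$ for the item features of one sample (this notation is an assumption, not the source's), the decomposition could look like this:

```latex
\begin{aligned}
\hat{y}(x) &= w_0 + \sum_{i} w_i x_i
  + \frac{1}{2} \sum_{f=1}^{k} \Big[ \Big( \sum_{i} v_{i,f}\, x_i \Big)^2
  - \sum_{i} v_{i,f}^2\, x_i^2 \Big] \\
&= w_0 + A_U + \sum_{i \in I} w_i x_i
  + \frac{1}{2} \sum_{f=1}^{k} \Big[ \big( S_{U,f} + S_{I,f} \big)^2
  - Q_{U,f} - Q_{I,f} \Big], \\[4pt]
A_U &= \sum_{i \in U} w_i x_i, \qquad
S_{U,f} = \sum_{i \in U} v_{i,f}\, x_i, \qquad
Q_{U,f} = \sum_{i \in U} v_{i,f}^2\, x_i^2,
\end{aligned}
```

with $S_{I,f}$ and $Q_{I,f}$ defined the same way over $i \in I$. Under this reading, $A_U$, $S_{U,f}$, and $Q_{U,f}$ depend only on user and context features, so they would be the terms computed once per request and reused for every sample.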
3.7.2 DSSM (Two-Tower) Model
The network structure of the two-tower model is shown in the figure below:
In fact there are three towers: C (context information), U (user information), and I (item information). The dot product of the vectors produced by the user and item towers is computed and then summed with the output of the C tower.
3.7.3 Online serving part
Because item information changes relatively slowly in some scenarios, the item tower is generally computed offline in advance, and the resulting vectors are loaded dynamically as a dictionary.
Online, only the C tower and the U tower are computed, and only once per request; the I tower vector is obtained by a dictionary lookup. Compared with a fully connected network, the amount of computation is greatly reduced and performance improves considerably. However, because user and item information interact only through a single dot product, the offline AUC drops by 1% compared with the fully connected network, so this model is better suited to scenarios that do not require high accuracy, such as recall or coarse ranking. In the information flow advertising and alliance advertising businesses, it replaced the original statistical-CTR coarse ranking, and the comprehensive index increased by 5 to 6 percentage points.
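The online scoring path can be sketched as follows, assuming the C tower outputs a scalar and the U tower outputs a vector with the same dimension as the precomputed item vectors. The function name, argument names, and the absence of any final activation are assumptions, since the source only describes "dot product, then summed with C".

```python
def score_items(user_vec, context_score, item_ids, item_vectors):
    """Score every candidate item for one request.

    user_vec and context_score are computed once per request by the U and C
    towers; item_vectors is the dictionary of offline-computed I tower outputs.
    """
    scores = []
    for item_id in item_ids:
        item_vec = item_vectors[item_id]            # dictionary lookup, no network forward pass
        dot = sum(u * v for u, v in zip(user_vec, item_vec))
        scores.append(dot + context_score)
    return scores
```

The per-item cost is a single dot product, which is why this structure suits recall and coarse ranking over large candidate sets.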
3.8 Performance optimization
The estimation engine has strict latency requirements, yet the feature scale keeps growing in pursuit of better algorithm results, so performance was a major consideration in its design. The main optimizations are:
(1) Object pools are used to reduce memory allocation and improve performance.
(2) The field dependencies of features are converted into array indices in advance. During feature extraction, these indices are used directly to fetch the corresponding fields, which reduces the amount of computation.
(3) Some features depend on the results of other features and frequently look up those results by slot. Because the range of slot values is limited, an array indexed by slot is used for these lookups.
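Optimizations (2) and (3) can be sketched as below. The schema, the `compile_depends` helper, and the `SlotResults` class are illustrative assumptions. The sketch only shows the idea of resolving names to indices once at load time and of indexing results directly by slot.

```python
# Hypothetical schema; field order matches the sample layout.
ALL_SCHEMA = ["user_id", "age", "item_id", "network"]


def compile_depends(depend_names):
    """Done once at config-load time: turn field names into sample indices."""
    return [ALL_SCHEMA.index(name) for name in depend_names]


class SlotResults:
    """Per-sample results stored in an array indexed directly by slot."""

    def __init__(self, max_slot):
        self._values = [None] * (max_slot + 1)

    def set(self, slot, value):
        self._values[slot] = value

    def get(self, slot):
        return self._values[slot]   # plain index, no hashing on the hot path
```

On the request path, an operator then reads its inputs with `sample[i]` for each precomputed index `i` instead of searching for the field by name.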
4. Summary
At present, the predictor module supports most recommendation scenarios, including information flow content, information flow advertising, application market, alliance, search, short video, OPPO lock screen magazine, music, and others. It is deployed on nearly 2000 machines, which indicates that the design is relatively stable and scales well. It also supports smoothly switching services to DNN models, and various business scenarios have seen gains as a result.
Author profile
Xiao Chao, Senior Data Mining Engineer
10+ years of experience in advertising system engineering. At OPPO, he is mainly responsible for the engineering of model feature extraction and inference.