Script kill (jubensha, a live-action murder mystery game) is an emerging business with explosive growth, but it still falls short in merchant listing, user purchase, and supply-demand matching. Standardizing the supply creates value for users, merchants, and the platform, and supports business growth. This article introduces how the Meituan To-Store Integrated Business Data Team built script kill supply standardization from 0 to 1, and the algorithms involved. We extend GENE (GEneral NEeds net), Meituan's to-store integrated knowledge graph, to the script kill industry, build a script kill knowledge graph, and standardize the supply through script kill supply mining, standard script library construction, and supply-to-standard-script association. We hope it brings some help or inspiration to readers.
1. Background
The script kill industry has grown explosively in recent years. However, because it is an emerging industry, the platform's existing category system and product forms increasingly fail to meet the rapidly growing needs of users and merchants, mainly in the following three aspects:
- Missing platform category: the platform lacks a dedicated "script kill" category. Without a centralized traffic entry, user decision paths are confusing and a unified user perception is hard to establish.
- Low user decision efficiency: the core of script kill is the script itself. Without a standard script library, the association between standard scripts and the supply cannot be established, so script information display and supply management are poorly standardized, hurting the efficiency of users' script-selection decisions.
- Cumbersome product listing: product information must be entered manually by merchants, with no standard template for pre-filling, so only a small proportion of merchants' scripts are listed on the platform, leaving large room to improve listing efficiency.
To resolve these pain points, the business needs to standardize the script kill supply: first establish a new "script kill" category and migrate the corresponding supply (merchants, products, and content) into it. On this basis, with scripts at the core, build a standard script library and associate it with the script kill supply, then establish script-level information distribution channels, rating and review mechanisms, and ranking systems to support the user decision path of "finding stores by script".
It is worth pointing out that supply standardization is an important lever for simplifying user perception, aiding user decisions, and promoting supply-demand matching, and the degree of standardization has a decisive impact on the scale of the platform's business. For the script kill industry specifically, supply standardization is an important foundation for continued business growth, and the standard script library is the key to it. Script attributes alone, such as the specification "city limited", the background "ancient style", or the theme "emotion", cannot identify a specific script; only the script name, such as "She Li", can serve as a unique identifier. Therefore, building the standard script library means first building standard script names, and second building standard script attributes such as specification, background, theme, difficulty, and genre.
In summary, the Meituan To-Store Integrated Business Data Team worked closely with the business to standardize the script kill supply. The construction involves multiple entity types, such as script names, script attributes, categories, merchants, products, and content, as well as diverse relationships among them. A knowledge graph, as a semantic network that captures entities and their relationships, is particularly well suited to this problem. In particular, we had already built GENE (GEneral NEeds net), the to-store integrated knowledge graph. Drawing on the experience of building GENE, we quickly built a knowledge graph for the new script kill business and carried out its standardization from 0 to 1, thereby improving supply management and supply-demand matching and creating greater value for users, merchants, and the platform.
2. Solution
GENE revolves around the integrated local-life needs of users and progresses through five layers: industry system, demand objects, concrete needs, scene elements, and scene needs. It covers multiple businesses such as leisure and entertainment, medical aesthetics, education, parent-child, and wedding; for the system design and technical details, please refer to our earlier article on the to-store integrated knowledge graph. As an emerging Meituan to-store integrated business, script kill reflects users' new needs in leisure and entertainment and fits naturally into GENE's architecture. Therefore, we extend GENE to the new script kill business and use the same approach to construct the corresponding knowledge graph and achieve supply standardization.
The key to standardizing the script kill supply with a knowledge graph is to build the graph around standard scripts as its core. The graph design is shown in Figure 1. Specifically, at the industry system layer, we first construct the new script kill category, mine the script kill supply, and establish the affiliation between the supply (merchants, products, and content) and the category. On this basis, at the demand object layer, we mine the core node, the standard script name, together with its script attribute nodes and their relationships, and establish the standard script library; finally, we associate each standard script in the library with the supply and with users. The remaining three layers, concrete needs, scene elements, and scene needs, express users' concrete service needs and scenario-oriented needs for script kill explicitly; since they are less connected to supply standardization, they are not covered here.
A concrete example of the part of the script kill knowledge graph used for supply standardization is shown in Figure 2 below. The standard script name is the core node, surrounded by standard script attribute nodes such as theme, specification, genre, difficulty, background, and alias. Standard scripts may also have relationships such as "same series" between them, for example "She Li" and "She Li 2". In addition, each standard script is associated with products, merchants, content, and users.
We standardize the supply based on these nodes and relationships of the script kill knowledge graph. Building the graph involves three steps: script kill supply mining, standard script library construction, and supply-to-standard-script association. The following sections introduce the implementation details and the algorithms involved in each step.
3. Implementation
3.1 Script kill supply mining
As an emerging business, script kill has no corresponding category in the existing industry category tree, so its supply (merchants, products, and content) cannot be obtained directly by category. We therefore first need to mine the script kill supply, that is, identify script-kill-related supply from the existing supply under categories similar to the script kill industry.
Mining script kill merchants requires judging whether a merchant provides script kill services. The judgment draws on text corpora from three sources: merchant names, product names and details, and merchant UGC. This is essentially a multi-source text classification problem. However, due to the lack of labeled training samples, we did not directly adopt an end-to-end multi-source classification model. Instead, relying on business input, we combined unsupervised matching with supervised fitting to solve it efficiently. The judgment process is shown in Figure 3 below, where:
- Unsupervised matching: first construct a keyword dictionary related to script kill and perform exact matching over the three text sources (merchant names, product names and details, and merchant UGC); a general semantic-drift discrimination model built on BERT [1] then filters the matches, and a matching score per source is computed from the filtered results according to business rules.
- Supervised fitting: to quantify how matching scores from different sources affect the final judgment, operators first manually score a small number of merchants to characterize the strength of their script kill service. We then fit a linear regression model to these labeled scores to learn the weight of each source, achieving accurate mining of script kill merchants (a minimal sketch follows the list).
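The sketch below illustrates this two-stage idea under stated assumptions: the keyword dictionary, scores, and field layout are illustrative, the BERT-based semantic-drift filter and business rules are omitted, and scikit-learn stands in for whatever fitting tooling is used in production.

```python
# Minimal sketch of merchant mining: unsupervised keyword matching per
# source, then a linear regression fit to operator-labeled scores.
# All names and numbers here are illustrative, not production values.
import numpy as np
from sklearn.linear_model import LinearRegression

SCRIPT_KILL_KEYWORDS = {"剧本杀", "剧本推理", "谋杀之谜"}  # example dictionary entries

def match_score(texts: list[str]) -> float:
    """Unsupervised step: fraction of texts containing any keyword.
    (The real pipeline also filters hits with a BERT-based semantic-drift
    discriminator and applies business rules; omitted here.)"""
    hits = sum(any(kw in t for kw in SCRIPT_KILL_KEYWORDS) for t in texts)
    return hits / max(len(texts), 1)

# One matching score per source: merchant name, product name/details, merchant UGC.
X = np.array([
    [1.0, 0.8, 0.6],   # merchant A
    [0.0, 0.1, 0.0],   # merchant B
    [0.5, 0.9, 0.7],   # merchant C
])
# Operator-annotated scores for the strength of the script kill service.
y = np.array([0.95, 0.05, 0.85])

# Supervised step: fit per-source weights, then score all merchants.
reg = LinearRegression().fit(X, y)
print("per-source weights:", reg.coef_)
print("merchant scores:", reg.predict(X))
```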
Using this method, we mined both tabletop and live-action script kill merchants, with precision and recall meeting the requirements. Based on the merchant mining results, we can further mine products and create the script kill category, laying a solid data foundation for the subsequent script kill knowledge graph construction and standardization.
3.2 Standard script library construction
As the core of the entire script kill knowledge graph, standard scripts play a central role in supply standardization. We mine standard scripts by aggregating similar script kill products, combined with manual review, and obtain script authorization from the relevant publishers to build the standard script library. A standard script consists of two parts, the standard script name and the standard script attributes, so the library construction likewise splits into two parts: mining standard script names and mining standard script attributes.
3.2.1 Mining standard script names
Based on the characteristics of script kill products, we successively adopted and iterated on three methods, rule aggregation, semantic aggregation, and multimodal aggregation, and aggregated hundreds of thousands of script kill product names into thousands of standard script names. The three aggregation methods are introduced below.
Rule aggregation
The same script kill product is often named differently across merchants, with many irregular and personalized variants. On one hand, the same script can itself have multiple names; for example, "She Li", "She Li Yi", and "She Li 1" all refer to the same script. On the other hand, beyond the script name itself, merchants often append attribute information such as the script's specification and theme, as well as descriptive text to attract users, e.g. "the emotional masterpiece 'She Li'". Therefore, we first designed cleaning strategies for the naming characteristics of script kill products and cleaned the product names before aggregation.
In addition to curating common non-script words into a lexicon for rule-based filtering, we also tried converting the cleaning into a named entity recognition problem [2], using sequence labeling to distinguish characters that are part of the script name from those that are not. The cleaned product names are then aggregated by a similarity rule based on the longest common subsequence (LCS) combined with threshold filtering, so that, for example, "She Li", "She Li Yi", and "She Li 1" end up in the same cluster. The entire process is shown in Figure 4. Rule aggregation helped the business quickly aggregate script kill product names in the early stage of construction.
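Below is a minimal sketch of LCS-based aggregation; the similarity threshold and the greedy clustering strategy are illustrative placeholders rather than the production settings.

```python
# LCS similarity plus threshold filtering to cluster cleaned product names.
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcs_similarity(a: str, b: str) -> float:
    return lcs_len(a, b) / max(len(a), len(b))

def aggregate(names: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Greedy single-pass clustering: attach each name to the first cluster
    whose representative is similar enough, else start a new cluster."""
    clusters: list[list[str]] = []
    for name in names:
        for cluster in clusters:
            if lcs_similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

print(aggregate(["舍离", "舍离一", "舍离1"]))  # -> one cluster
```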
Semantic aggregation
Although rule aggregation is simple and easy to use, the diversity and complexity of script names leave some problems in its results: 1) products that do not belong to the same script get aggregated, e.g. "She Li" and "She Li 2" are two different scripts of the same series but are clustered together; 2) products that do belong to the same script fail to aggregate, e.g. when the product name uses an abbreviation of the script ("Chinatown Detective and Cat" vs. "Tang Detective Cat") or contains typos ("弗洛伊德的锚" mistyped as "佛洛伊德的锚", both "Floyd's Anchor"), which rules can hardly handle.
To address these two problems, we further aggregate by semantic matching of product names, judging whether two names share the same meaning. Common text semantic matching models fall into two types: interaction-based and two-tower. An interaction-based model feeds both texts into one encoder together, letting them exchange information during encoding before making a judgment; a two-tower model encodes the two texts separately with a shared encoder and then discriminates based on the two vectors.
Given the large number of products, the interaction-based approach would require predicting over all pairwise combinations of product names, which is inefficient. We therefore adopt the two-tower approach, based on the Sentence-BERT [3] model structure: BERT extracts a vector for each of the two product names, and cosine distance measures their similarity. The complete structure is shown in Figure 5 below.
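As a sketch of this two-tower setup, the snippet below uses the open-source sentence-transformers library; the checkpoint name is a public stand-in for the in-house Sentence-BERT model fine-tuned on product-name pairs.

```python
# Two-tower semantic matching: encode each product name independently,
# then compare embeddings with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

names = ["唐人街探案与猫", "唐探猫", "弗洛伊德的锚"]
embeddings = model.encode(names, convert_to_tensor=True)

# Pairwise cosine similarity; pairs above a tuned threshold are aggregated.
scores = util.cos_sim(embeddings, embeddings)
print(scores)
```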
To train the model, we first construct coarse-grained training samples from the rule aggregation results, generating positive pairs within clusters and negative pairs across clusters, to train the first version of the model. On this basis, we further improve the sample data with active learning. In addition, targeting the two problems of rule aggregation mentioned above, we generate samples in batches: appending series numbers to product names yields hard negatives, while substituting typos, homophone characters, and traditional characters yields positives, enabling automatic sample construction.
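A toy version of this batch sample generation might look as follows; the suffix list and the character-variant table are hypothetical examples.

```python
# Illustrative batch sample generation for the two failure modes above:
# serial-number suffixes make hard negatives ("same series, different
# script"), character substitutions make positives simulating typos/variants.
import random

def make_hard_negative(name: str) -> tuple[str, str, int]:
    """Same name with different serial numbers -> label 0 (different scripts)."""
    a, b = random.sample(["1", "2", "3", "二", "Ⅱ"], 2)
    return name + a, name + b, 0

# Hypothetical variant table (typo / traditional-simplified forms).
CHAR_VARIANTS = {"弗": "佛", "离": "離"}

def make_positive(name: str) -> tuple[str, str, int]:
    """Swap one character for a known variant -> label 1 (same script)."""
    for src, dst in CHAR_VARIANTS.items():
        if src in name:
            return name, name.replace(src, dst), 1
    return name, name, 1

print(make_hard_negative("舍离"))
print(make_positive("弗洛伊德的锚"))
```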
Multimodal aggregation
Semantic aggregation achieves synonym aggregation at the semantic level of product name text. Reanalyzing the results, however, we found a remaining problem: some pairs of products belong to the same script but cannot be discriminated from the product name alone. For example, "She Li 2" and "Duan Nian" cannot be aggregated semantically, yet they are essentially one script, "She Li 2 · Duan Nian". Although such product names differ, their images are often identical or similar, so we introduce product image information to assist aggregation.
A simple approach is to use a mature pre-trained CV model as an image encoder for feature extraction and directly compute the image similarity of two products. To unify product image similarity with product name semantic matching, we instead construct a multimodal matching model for script kill products that makes full use of both name and image information. The model follows the two-tower structure used in semantic aggregation, with the overall structure shown in Figure 6 below:
In the multimodal matching model, the name and image of a script kill product are encoded into vectors by a text encoder and an image encoder respectively, then concatenated as the final product vector; cosine similarity between the two product vectors measures product similarity. Specifically:
- Text encoder: the pre-trained BERT [1] model, whose output is mean-pooled as the vector representation of the text.
- Image encoder: the pre-trained EfficientNet [4] model, whose last-layer output is extracted as the vector representation of the image.
During training, the text encoder is fine-tuned while the image encoder's parameters are frozen. For training samples, we start from the semantic aggregation results and use product image similarity to narrow the range requiring manual labeling: within-cluster pairs with high image similarity directly become positives, cross-cluster pairs with low image similarity directly become negatives, and the remaining pairs are labeled manually. Multimodal aggregation compensates for the shortcomings of text-only matching, improving accuracy by 5% over it and further improving standard script mining.
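The sketch below approximates the described model with public components (bert-base-chinese and torchvision's EfficientNet-B0); pooling details and dimensions are illustrative, not the production configuration.

```python
# Multimodal two-tower matcher: mean-pooled BERT text vector concatenated
# with a frozen EfficientNet image vector, compared by cosine similarity.
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer
from torchvision.models import efficientnet_b0

class ProductEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.text_encoder = BertModel.from_pretrained("bert-base-chinese")
        cnn = efficientnet_b0(weights="DEFAULT")
        self.image_encoder = torch.nn.Sequential(cnn.features, cnn.avgpool)
        for p in self.image_encoder.parameters():
            p.requires_grad = False  # image tower stays frozen during training

    def forward(self, input_ids, attention_mask, image):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        mask = attention_mask.unsqueeze(-1)
        text_vec = (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # mean pooling
        img_vec = self.image_encoder(image).flatten(1)
        return torch.cat([text_vec, img_vec], dim=-1)  # final product vector

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = ProductEncoder().eval()

def similarity(name_a, img_a, name_b, img_b):
    ta = tokenizer(name_a, return_tensors="pt")
    tb = tokenizer(name_b, return_tensors="pt")
    va = encoder(ta.input_ids, ta.attention_mask, img_a)
    vb = encoder(tb.input_ids, tb.attention_mask, img_b)
    return F.cosine_similarity(va, vb).item()
```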
3.2.2 Mining standard script attributes
The attributes of a standard script cover more than ten dimensions, such as background, specification, genre, theme, and difficulty. Since merchants enter these attribute values when listing a script kill product, mining a standard script's attributes essentially means mining the attributes of all products aggregated under that standard script.
In practice we mine by voting: for a given attribute of a standard script, the attribute values of its aggregated products vote, the value with the most votes becomes the candidate attribute value, and manual review confirms it, as sketched below. In addition, while mining standard script names we found that the same script has many different names, so to describe standard scripts better we added an alias attribute, obtained by cleaning and deduplicating the names of all the corresponding aggregated products.
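A minimal sketch of the voting step, with illustrative data:

```python
# Majority-vote attribute mining: for each attribute of a standard script,
# take the most common value among its aggregated products (candidates are
# then confirmed by manual review).
from collections import Counter

products = [
    {"background": "ancient", "difficulty": "hard", "spec": "6人"},
    {"background": "ancient", "difficulty": "medium", "spec": "6人"},
    {"background": "modern",  "difficulty": "hard", "spec": "6人"},
]

def vote_attributes(products: list[dict]) -> dict:
    candidates = {}
    for attr in {k for p in products for k in p}:
        values = Counter(p[attr] for p in products if attr in p)
        candidates[attr] = values.most_common(1)[0][0]  # top-voted value
    return candidates

print(vote_attributes(products))
# -> {'background': 'ancient', 'difficulty': 'hard', 'spec': '6人'}
```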
3.3 Supply and standard script association
After the standard script library is built, the three kinds of supply, products, merchants, and content, still need to be associated with standard scripts to complete script kill supply standardization. Since the relationship between a merchant and a standard script can be derived directly from the relationships between the merchant's products and standard scripts, we only need to associate products and content with standard scripts.
3.3.1 Product association
In Section 3.2 we mined standard scripts by aggregating the existing stock of script kill products, which in effect already established the relationships between stock products and standard scripts. For newly added products, we also need to match them against standard scripts to establish this relationship. For products that cannot be associated with any standard script, we automatically mine standard script names and attributes from them and, after manual review, add them to the standard script library.
The entire product association process is shown in Figure 7 below: the product name is first cleaned and then matched for association. In the matching step, we judge whether a product matches a standard script based on the multimodal information of name and image.
Unlike matching between products, the association between a product and a standard script need not be symmetric. To ensure association quality, we modify the multimodal matching model of Section 3.2.1: after the product vector and the standard script vector are concatenated, a fully connected layer and a softmax layer compute the probability that the two are associated. Training samples are constructed directly from the associations between stock products and standard scripts. Through product association, we standardized the majority of script kill products.
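A sketch of such an association head is shown below; the input dimension and hidden size are illustrative, and the vectors are assumed to come from the encoders of Section 3.2.1.

```python
# Asymmetric association head: concatenated product and standard-script
# vectors pass through a fully connected layer plus softmax to yield an
# association probability.
import torch

class AssociationHead(torch.nn.Module):
    def __init__(self, vec_dim: int = 1536, hidden: int = 256):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(vec_dim * 2, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, 2),  # {not associated, associated}
        )

    def forward(self, product_vec, script_vec):
        logits = self.mlp(torch.cat([product_vec, script_vec], dim=-1))
        return torch.softmax(logits, dim=-1)[..., 1]  # P(associated)

head = AssociationHead()
p = head(torch.randn(4, 1536), torch.randn(4, 1536))
print(p.shape)  # torch.Size([4])
```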
3.3.2 Content association
Associating script kill content with standard scripts mainly targets user-generated content (UGC, such as user reviews). Since a UGC text usually contains multiple sentences and only some of them mention a standard script, we refine the matching of UGC against standard scripts to clause granularity. To balance efficiency and effectiveness, the matching process is further divided into two stages, recall and ranking, as shown in Figure 8 below:
In the recall stage, the UGC text is split into clauses, and exact matching against standard script names and their aliases is performed over the clause set; matched clauses enter the ranking stage for fine-grained association discrimination.
In the ranking stage, association discrimination is cast as an aspect-based classification problem. Following the approach of aspect-level sentiment classification [5], we build a matching model on BERT sentence-pair classification: the standard script alias hit by a UGC clause and the clause itself are concatenated with [SEP] as input, and a fully connected layer plus a softmax layer after BERT performs binary classification of whether they are associated. Finally, the model's output probability is thresholded to obtain the standard script associated with the UGC.
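The snippet below sketches this ranking-stage classifier with the public bert-base-chinese checkpoint (the actual model is fine-tuned on labeled pairs); the example text and threshold are illustrative.

```python
# Ranking stage: the matched alias and the UGC clause are fed to BERT as a
# sentence pair (joined by [SEP] internally) and classified as
# associated / not associated.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2
).eval()

alias, clause = "舍离", "周末和朋友玩了《舍离》，剧情反转很精彩"
inputs = tokenizer(alias, clause, return_tensors="pt")  # alias [SEP] clause

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

THRESHOLD = 0.5  # tuned on the labeled set in practice
print("associated" if probs[0, 1] > THRESHOLD else "not associated")
```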
Unlike the model training discussed above, the UGC-to-standard-script matching model cannot quickly obtain a large number of training samples. Given this scarcity, we first manually labeled a few hundred samples. On top of active learning, we also tried consistency regularization: following the Regularized Dropout (R-Drop) [6] method, the outputs of two dropout forward passes are constrained to agree. In the end, with fewer than 1K training samples, the accuracy of associating UGC with standard scripts met the online requirements, and the number of UGC items associated with each standard script increased substantially.
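A compact sketch of the R-Drop loss, assuming a classification model whose dropout is active in training mode; alpha is a tunable weight.

```python
# R-Drop: two forward passes with different dropout masks, cross-entropy on
# both, plus a symmetric KL term between the two predictive distributions.
import torch.nn.functional as F

def r_drop_loss(model, inputs, labels, alpha: float = 1.0):
    logits1 = model(**inputs).logits  # dropout active in train mode
    logits2 = model(**inputs).logits  # second pass, different dropout mask
    ce = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
    kl = F.kl_div(F.log_softmax(logits1, -1), F.softmax(logits2, -1),
                  reduction="batchmean") + \
         F.kl_div(F.log_softmax(logits2, -1), F.softmax(logits1, -1),
                  reduction="batchmean")
    return ce + alpha * kl / 2
```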
4. Application Practice
The current script kill knowledge graph, with thousands of standard scripts at its core, is associated with millions of supply items. As the outcome of script kill standardization, it has seen preliminary application in multiple Meituan business scenarios. The specific applications and their effects are described below.
4.1 Category construction
Script kill supply mining helps the business identify script kill merchants, supporting the construction of the new script kill category and the corresponding script kill list page. The script kill category migration, the script kill entry on the leisure and entertainment channel page, and the script kill list page have all been launched. On the channel page, the script kill icon is fixed at the head of the third row, providing a centralized traffic entry that helps establish a unified user perception. Online examples are shown in Figure 9 ((a) the script kill entry on the leisure and entertainment channel page; (b) the script kill list page).
4.2 Personalized recommendation
The standard script and attribute nodes in the script kill knowledge graph, along with their relationships to the supply and to users, can be applied to the recommendation slots of the script kill pages. They power the popular script recommendations on the script list page (Figure 10(a)), as well as the product recommendations and playable-store recommendations on the script details page (Figure 10(b), left) and the related script recommendation module (Figure 10(b), right). These recommendation slots help cultivate users' habit of finding scripts on the platform, improve user perception and the shopping experience, and raise the matching efficiency between users and the supply.
Taking the popular script recommendation module on the script list page as an example, the nodes and relationships in the script kill knowledge graph can be used directly for script recall, and further applied in the ranking stage. In ranking, based on the knowledge graph and user behavior, we follow the Deep Interest Network (DIN) [7] model structure and model both the user's script visit sequence and product visit sequence, building a dual-channel DIN model that captures user interest in depth for personalized script distribution. The product visit sequence is converted into a script sequence through the product-to-standard-script associations, and each sequence is modeled by attention against the candidate script. The model structure is shown in Figure 11.
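The snippet below sketches one attention channel of such a DIN-style model; the dimensions, the MLP scorer, and the softmax normalization are simplifications rather than the production design. The real model runs two such channels (script visits, and product visits mapped to scripts) and feeds the concatenated interest vectors onward.

```python
# One DIN-style attention channel: score each behavior embedding against the
# candidate script embedding, then take the weighted sum as the user
# interest vector.
import torch

class DinAttention(torch.nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # score from [behavior, candidate, behavior * candidate]
        self.att = torch.nn.Sequential(
            torch.nn.Linear(dim * 3, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
        )

    def forward(self, candidate, behaviors):
        # candidate: (B, D); behaviors: (B, T, D)
        cand = candidate.unsqueeze(1).expand_as(behaviors)
        scores = self.att(torch.cat([behaviors, cand, behaviors * cand], -1))
        weights = torch.softmax(scores, dim=1)   # (B, T, 1)
        return (weights * behaviors).sum(dim=1)  # user interest vector

att = DinAttention()
interest = att(torch.randn(8, 64), torch.randn(8, 10, 64))
print(interest.shape)  # torch.Size([8, 64])
```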
4.3 Information exposure and filtering
Based on the nodes and relationships in the script kill knowledge graph, we add relevant tag filters on the script kill list page and the script list page and expose each script's attributes and associated supply information; related applications are shown in Figure 12 below. These tag filters and exposed information give users a standardized information display, reduce decision costs, and make it easier for users to choose stores and scripts.
4.4 Ratings and rankings
On the script details page, the associations between content and standard scripts feed into the script score calculation (Figure 13(a)). On this basis, script-level rankings such as "classic must-play" and "recently popular" are produced, as shown in Figure 13(b), further assisting users' script selection decisions.
5. Summary and Outlook
Facing the emerging script kill industry, we responded quickly to the business: taking standard scripts as the core node and combining industry characteristics, we built the corresponding knowledge graph through script kill supply mining, standard script library construction, and supply-to-standard-script association, advancing supply standardization from 0 to 1 and striving to solve the business's problems in a simple and effective way.
At present, the script kill knowledge graph has delivered results in multiple script kill business scenarios, enabling continued business growth and significantly improving the user experience. In future work, we will continue to optimize and explore along the following directions:
- Continued improvement of the standard script library: optimize standard script names, attributes, and the corresponding supply associations to ensure both the quality and coverage of the library, and try introducing external knowledge to supplement the current standardization results.
- Scenario-based script kill: the current script kill knowledge graph mainly captures users' concrete needs such as "scripts". Next we will dig deeper into users' scenario-based needs and explore the linkage between script kill and other industries to better support the development of the script kill business.
- More application exploration: apply the graph data to search and other modules to improve supply-demand matching efficiency in more scenarios, thereby creating greater value.
References
[1] Devlin J, Chang M W, Lee K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
[2] Lample G, Ballesteros M, Subramanian S, et al. Neural architectures for named entity recognition[J]. arXiv preprint arXiv:1603.01360, 2016.
[3] Reimers N, Gurevych I. Sentence-bert: Sentence embeddings using siamese bert-networks[J]. arXiv preprint arXiv:1908.10084, 2019.
[4] Tan M, Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks[C]//International Conference on Machine Learning. PMLR, 2019: 6105-6114.
[5] Sun C, Huang L, Qiu X. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence[J]. arXiv preprint arXiv:1903.09588, 2019.
[6] Liang X, Wu L, Li J, et al. R-Drop: Regularized Dropout for Neural Networks[J]. arXiv preprint arXiv:2106.14448, 2021.
[7] Zhou G, Zhu X, Song C, et al. Deep interest network for click-through rate prediction[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018: 1059-1068.
About the Authors
Li Xiang, Chen Huan, Zhihua, Xiaoyang, Wang Qi, and others, all from the To-Store Integrated Business Data Team of Meituan's To-Store Platform Technology Department.
Job Openings
The To-Store Integrated Business Data Team of Meituan's To-Store Platform Technology Department is continuously recruiting for algorithm (natural language processing / recommendation), data warehouse, data science, and system development positions, based in Shanghai. Interested candidates are welcome to send resumes to: licong.yu@meituan.com.