Introduction: This talk presents Open Search's NLP industry models and its lightweight customer customization solution. Both aim to reduce customers' labeling costs (requiring no labeling at all, or only a small amount of simple labeling) and to make customization in the search domain easier to use.
Special guest:
Xu Guangwei (Kunca), Alibaba Algorithm Expert
Video address: https://yqh.aliyun.com/live/opensearch
Search NLP algorithm
Search pipeline
This is the complete pipeline from query words to search results. The NLP algorithms play their role mainly in the second stage, query analysis, which contains multiple NLP modules on the text side, such as word segmentation, error correction, entity recognition, word weighting, synonyms, and semantic vectors. The system is an architecture that combines text and semantic-vector multi-channel recall and ranking, in order to meet the search quality requirements of different business scenarios. Besides query analysis, NLP algorithms are also widely applied in the first stage (search guidance) and the fourth stage (ranking service).
Query analysis
The NLP algorithms mainly play a role in the following sub-modules:
- Word segmentation: precise word segmentation improves retrieval efficiency and makes recall results more accurate.
- Spelling correction: spelling errors in the query entered by the user are corrected automatically, which improves the search experience.
- Entity recognition: each word in the query is tagged with a corresponding entity tag, which provides key features for subsequent query rewriting and ranking.
- Word weighting: each word is assigned a high, medium, or low weight tier.
- Synonyms: words with the same meaning are added to the query to broaden the scope of recall.
- Finally, once the full query analysis module has run, an overall query rewrite converts the query entered by the user into a query string that our search engine can recognize.
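The sub-modules listed above can be sketched end to end as a small pipeline. This is a minimal illustration only: the lexicons, the `Term` class, the function names, and the output query syntax are all hypothetical, and treating low-weight terms as optional is an assumption made for the example, not a description of how Open Search uses word weights.

```python
from dataclasses import dataclass, field

# Hypothetical toy lexicons for illustration; not Open Search resources.
CORRECTIONS = {"nkie": "nike", "shose": "shoes"}
ENTITY_TAGS = {"nike": "brand", "shoes": "category", "red": "color"}
ENTITY_WEIGHTS = {"brand": "high", "category": "high", "color": "medium"}
SYNONYMS = {"shoes": ["sneakers"]}


@dataclass
class Term:
    text: str
    entity: str = "other"
    weight: str = "low"
    synonyms: list = field(default_factory=list)


def segment(query):
    """Word segmentation (a whitespace split stands in for a real segmenter)."""
    return query.lower().split()


def correct(tokens):
    """Spelling correction via a lookup table."""
    return [CORRECTIONS.get(tok, tok) for tok in tokens]


def analyze(query):
    """Run segmentation, correction, entity tagging, weighting, synonym lookup."""
    terms = []
    for tok in correct(segment(query)):
        entity = ENTITY_TAGS.get(tok, "other")
        weight = ENTITY_WEIGHTS.get(entity, "low")
        terms.append(Term(tok, entity, weight, SYNONYMS.get(tok, [])))
    return terms


def rewrite(terms):
    """Query rewriting: a required clause string plus optional low-weight terms."""
    required, optional = [], []
    for term in terms:
        if term.weight == "low":
            # Assumption: low-weight terms are not required for a match.
            optional.append(term.text)
            continue
        alternatives = [term.text] + term.synonyms
        if len(alternatives) == 1:
            required.append(alternatives[0])
        else:
            required.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(required), optional


terms = analyze("Nkie running shose red")
query_string, optional_terms = rewrite(terms)
print(query_string)    # nike AND (shoes OR sneakers) AND red
print(optional_terms)  # ['running']
```

Keeping each stage as a separate function mirrors the module list: any one stage (for example the segmenter) can be swapped for a domain-specific model without touching the rest.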
Open Search now supports not only Alibaba's self-developed search engine but is also compatible with the open-source Elasticsearch (ES) engine, so that users can use our algorithm capabilities more conveniently.
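To make the ES side concrete, here is a hedged sketch of how analyzed query terms could be turned into an Elasticsearch `bool` query body. How Open Search actually maps its analysis results onto ES is not described in the talk; the `build_es_query` helper, the `title` field name, and the decision to put low-weight terms under `should` are assumptions for illustration. Only standard query DSL clauses (`bool`, `must`, `should`, `match`, `minimum_should_match`) are used.

```python
import json


def build_es_query(terms, field="title"):
    """Build an Elasticsearch bool query body from analyzed terms.

    `terms` is a list of (text, synonyms, weight) tuples and `field` is a
    hypothetical index field name.
    """
    must, should = [], []
    for text, synonyms, weight in terms:
        variants = [{"match": {field: word}} for word in [text] + list(synonyms)]
        if len(variants) == 1:
            clause = variants[0]
        else:
            # A term or any of its synonyms may satisfy this clause.
            clause = {"bool": {"should": variants, "minimum_should_match": 1}}
        # Assumption: low-weight terms only influence scoring, not matching.
        (should if weight == "low" else must).append(clause)
    return {"query": {"bool": {"must": must, "should": should}}}


body = build_es_query([
    ("nike", [], "high"),
    ("shoes", ["sneakers"], "high"),
    ("running", [], "low"),
])
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the request body of a search call by whichever ES client is in use.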
Industry model
Customer pain points
1. General models are hard to adapt to specific domains
- The general model mainly solves the problems of the news and information industry;
- The effect will be greatly reduced in specific industries;
For example: the difference between the models in the general field and the e-commerce field
2. Few public industry models
- Cloud service providers basically only provide general models
- Public industry data sets also mainly cover general areas
Difficulties in solving the problem
The process of constructing an industry search NLP model:
- The first step is dataset annotation. This step demands a high level of industry knowledge, and the amount of data needs to reach the million level; labeling that much data can take several months.
- Next is model training. This step requires professional algorithm engineers. If you are not familiar with the algorithms, model iteration will be very inefficient.
- Finally, bringing the model online requires engineers to deploy and maintain it. If deep models are served online, a lot of efficiency optimization work is also needed.
The dataset annotation stage in particular poses many challenges.
Difficulties in word segmentation
1. High domain knowledge requirements
E.g.:
- Drug name: Lidocaine and Chlorhexidine Aerosol | Lidocaine and Chlorhexidine Aerosol
- Address: Wangying Village, Sikeshu Township, Nanzhao County | Wangying Village, Sikeshu Township, Nanzhao County
2. Crossing (overlapping) ambiguity is hard to judge
E.g.:
- Laundry powder | laundry powder
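Crossing ambiguity can be illustrated with a minimal sketch that compares forward and backward maximum matching: when the two greedy segmentations disagree, the string contains an overlapping ambiguity that a dictionary alone cannot resolve. The sentence below is a common textbook example, not one from the talk, and the function names and tiny vocabulary are hypothetical.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right longest-match segmentation."""
    words, start = [], 0
    while start < len(text):
        for size in range(min(max_len, len(text) - start), 0, -1):
            piece = text[start:start + size]
            # Fall back to a single character when nothing in vocab matches.
            if size == 1 or piece in vocab:
                words.append(piece)
                start += size
                break
    return words


def backward_max_match(text, vocab, max_len=4):
    """Greedy right-to-left longest-match segmentation."""
    words, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in vocab:
                words.append(piece)
                end -= size
                break
    return words[::-1]


# Toy vocabulary: "graduate student", "research", "life", "origin".
vocab = {"研究生", "研究", "生命", "起源"}
text = "研究生命起源"  # "study the origin of life"

fmm = forward_max_match(text, vocab)
bmm = backward_max_match(text, vocab)
print(fmm)         # ['研究生', '命', '起源']
print(bmm)         # ['研究', '生命', '起源']
print(fmm != bmm)  # True: the two readings overlap on "生"
```

Deciding which reading is correct requires context or domain knowledge, which is why annotators find these cases hard to label consistently.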
Difficulties in Entity Recognition and Labeling
1. High domain knowledge requirements
E.g:
- Australia Aitamei (maternal and child brand) gold outfit, Kobe (sneakers series) 4
- pytorch implements GAN (algorithm model)
Solution
Building on the data accumulated from Alibaba's internal search, and combining automated data mining with self-developed algorithm models, Open Search has reworked the pipeline for building industry models.
Taking word segmentation and NER as examples again, the following model diagram shows the word segmentation process. We first use an automatic new-word discovery algorithm to mine new words in the target domain. With these new words, we then build distantly supervised ("remotely supervised") training data for the target domain.
Based on this distantly supervised training data, we proposed a model with an adversarial learning network structure. The structure has a noise-reduction effect, and with it we obtained a domain model for our target field last year.
The following model diagram shows the NER process. We use a graph NER model structure that incorporates a graph neural network and can integrate a knowledge base with annotated data. The knowledge base starts from the new words automatically mined by the new-word discovery module in the word segmentation pipeline just described; we then tag these words with entity types automatically to construct the domain knowledge base. The corresponding technical papers have been published at ACL, a top conference in the NLP field.
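The construction of distantly supervised segmentation data can be sketched as follows: unlabeled sentences are tagged automatically by matching them against the mined new-word lexicon. This is a simplified illustration under stated assumptions, not the method from the talk: the B/M/E/S tag scheme, the forward longest-match rule, and the `distant_labels` name are choices made for the example.

```python
def distant_labels(sentence, lexicon, max_len=6):
    """Tag each character B/M/E/S by forward longest match against `lexicon`.

    B/M/E mark the begin/middle/end of a matched multi-character word;
    S marks a character left as a single-character word.
    """
    tags, start = [], 0
    while start < len(sentence):
        for size in range(min(max_len, len(sentence) - start), 0, -1):
            piece = sentence[start:start + size]
            if size == 1:
                tags.append("S")
            elif piece in lexicon:
                tags.extend(["B"] + ["M"] * (size - 2) + ["E"])
            else:
                continue
            start += size
            break
    return tags


# Hypothetical mined lexicon containing one new word ("laundry powder").
lexicon = {"洗衣粉"}
corpus = ["我买洗衣粉"]  # "I buy laundry powder"

training_data = [(s, distant_labels(s, lexicon)) for s in corpus]
print(training_data[0][1])  # ['S', 'S', 'B', 'M', 'E']
```

Labels produced this way are noisy, since any character outside the lexicon defaults to `S` whether or not it is really a standalone word, which is the kind of noise a denoising model structure is meant to absorb.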
In summary, taking the e-commerce industry as an example, let us look at the results that the technical solutions above achieve in the Open Search industry model.
You can see that the enhanced e-commerce version of Open Search is significantly better than the general version.
This solution is not limited to the e-commerce industry. For any industry with accumulated data, a set of industry models can be built quickly.
Open Search lightweight customer customization
Customer pain points
First, using the general model directly achieves roughly 60 points.
The industry model just described can reach about 80 points.
For each individual customer, however, there are customization problems within their sub-domain, and a typical customer's goal may be to reach 90 points.
Consider the following two examples:
- The "Vance Soda Series" on the left is actually the specific brand and series name of a sneaker. The Open Search e-commerce model identifies the brand and the common words correctly, but it does not correctly identify the specific sub-series "soda".
- The example on the right is "Han Ben Cui Bao Wei Drink". Here the Open Search e-commerce model fails to identify the unique brand and its sub-series at all.
If customers carry out their own customized optimization on top of the industry model we provide, they run into the same problems described above for building industry models, and in the end it is hard to get past 85 points.
Our goal is to reduce customers' labeling costs, requiring no labeling or only a small amount of simple labeling, so that customer customization becomes easier to use and directly reaches 85 points.
Solutions
The overall process is similar to the industry model building pipeline. These capabilities must be productized as tools so that customers can take part in tuning on their own.
1. Create a training model
The figure below is a demo of the tool we built. The first step is model creation: customers choose a base industry model, then upload unlabeled data from their own domain, and model training starts automatically.
2. Effect evaluation
After model training, customers can run an intuitive effect evaluation on our system. The changes in results between the base model and the automatically trained model are listed here. Customers can also do a small amount of manual annotation to verify the effect of the model.
This pipeline is already in use inside Alibaba and will be made available to customers in Open Search products in the near future. Previously, a lightweight customer customization with the above results could take us one to two months and required labeling more than 10,000 sentences. With this scheme it now takes only one week, with no labeling at all or with 1,000 labels or fewer.
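The kind of comparison an effect evaluation performs can be sketched with a small scoring function over a handful of manually annotated sentences. The talk does not say which metric the tool reports; word-level precision, recall, and F1 are assumed here as a common choice for segmentation, and the function names and sample data are hypothetical.

```python
def to_spans(words):
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_sents, pred_sents):
    """Word-level precision, recall, and F1 over parallel lists of segmentations."""
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        correct += len(gold_spans & pred_spans)
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical manual annotation and two model outputs for one sentence.
gold = [["我", "买", "洗衣粉"]]
base_pred = [["我", "买", "洗衣", "粉"]]   # base model splits the new word
tuned_pred = [["我", "买", "洗衣粉"]]      # customized model keeps it whole

for name, pred in [("base", base_pred), ("tuned", tuned_pred)]:
    p, r, f1 = segmentation_prf(gold, pred)
    print(f"{name}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Because the score is computed on spans, a single wrongly split word costs both precision and recall, so even a small annotated sample shows whether customization helped.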
Lightweight customization effect display
Our tool automatically discovers new words in the scenario and predicts entity labels for them. The new words in parentheses are predicted in different contexts, and the resulting distribution of labels guides us in deciding whether a new word is a valid new word and which entity tag it belongs to. This gives our model its most critical information.
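The idea of using a label distribution across contexts can be sketched as a simple aggregation step. The thresholds, the `"O"` (non-entity) label, the function name, and the sample predictions are all assumptions for illustration; the talk does not give the actual decision rule.

```python
from collections import Counter, defaultdict


def aggregate_predictions(predictions, min_count=3, min_share=0.6):
    """Decide per candidate word whether it is a valid new word and its tag.

    `predictions` is an iterable of (word, predicted_label) pairs, one per
    context in which the candidate word was seen.
    """
    by_word = defaultdict(Counter)
    for word, label in predictions:
        by_word[word][label] += 1

    accepted = {}
    for word, counts in by_word.items():
        total = sum(counts.values())
        label, top = counts.most_common(1)[0]
        share = top / total
        # Keep words seen often enough whose dominant label is a real entity type.
        if total >= min_count and label != "O" and share >= min_share:
            accepted[word] = (label, share)
    return accepted


# Hypothetical per-context predictions for three candidate words.
predictions = (
    [("soda", "series")] * 4 + [("soda", "O")]
    + [("the", "O")] * 5
    + [("cui", "brand"), ("cui", "series")]
)

print(aggregate_predictions(predictions))  # {'soda': ('series', 0.8)}
```

A candidate with a concentrated distribution is a confident new entity word, while one with too few contexts or a dominant non-entity label is rejected.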
Address scenario
E-commerce scenario
If you need in-depth optimization of search results, you can fill in the expert consultation questionnaire and take part in the trial to get the Open Search general word segmentation capability for free. Questionnaire address: https://c.tb.cn/F3.05Srxl