2020 was an unforgettable year for everyone. Under the impact of the COVID-19 pandemic, every industry changed amid the challenges, and education gave rise to a new business landscape. Online education platforms grew rapidly, and Alibaba Cloud responded actively by providing efficient, stable technical support for many online education customers. This article introduces the technical principles behind photo search in Alibaba Cloud Open Search, an important tool for online education companies to acquire traffic.
Shared by: Xu Guangwei (Kunca), Algorithm Expert, Alibaba DAMO Academy
Learn more about the solution details: https://www.aliyun.com/page-source/data-intelligence/activity/edusearch
Question search is a powerful traffic-acquisition tool for online education companies
As of December 2020, as many as 5 of the top 10 education apps by monthly active users offered question search. As a product capability, it helps customers acquire large numbers of users and large volumes of traffic, which in turn feed their other products. Given this positioning, the overall accuracy and efficiency of photo search are crucial, so Open Search has made many customized optimizations.
Characteristics of Educational Question Search
The educational question search scenario has three characteristics:
The first is a massive question bank. Educational question banks hold tens of millions or even 100 million questions and keep growing. The search business also has pronounced peaks, such as seven or eight in the evening or the last day of a holiday. At these times question search sees very high QPS, and any search delay seriously harms the user experience.
The second is a rich variety of scenarios. Photo search covers more and more scenarios. It spans different age groups: in the lower grades, for example, searches are mostly picture-reading or matching questions, which rely more heavily on image information. It also spans different disciplines, with more than ten currently supported. This variety makes good search results harder to achieve.
The third is the high bar for accuracy. Photo search products generally display only the top 3 or top 5 results, so accuracy matters a great deal. Photo search also requires multimodal and multilingual processing capabilities to handle searches that mix images with text and questions in several languages.
Architecture of the Open Search question-search solution for education
In Alibaba Cloud Open Search's photo search solution, the user takes a photo of a question, OCR converts the photo to text, and the Open Search engine processes that text and returns the top 3 to 5 results to the user. Throughout, the security and privacy of the enterprise's question bank data are strictly guaranteed.
Algorithm capabilities for educational question search
Query analysis: optimizing the complete processing flow
Word segmentation and subject category prediction for the education industry
Word segmentation in the photo search scenario has two major difficulties. The first is that OCR often drops the spaces in English questions. As the first figure on the left shows, the model segments even a long English text without spaces very accurately. The second is segmenting mathematical formulas once they have been converted to text. As the second figure on the left shows, the mathematical symbols are all segmented correctly.
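To make the missing-space problem concrete, here is a minimal sketch of segmenting space-less English text. It is not the Open Search model, which the article describes only as a trained model: it substitutes a simple dictionary-based dynamic program that picks the segmentation with the fewest words. The function name `segment` and the toy vocabulary `VOCAB` are hypothetical.

```python
def segment(text, vocab):
    """Split a space-less string into the fewest vocabulary words.

    A dictionary-based stand-in for a learned segmenter (assumption).
    Returns [text] unchanged when no full segmentation exists.
    """
    n = len(text)
    max_len = max(len(w) for w in vocab)
    # best[i] holds (word_count, words) for the prefix text[:i], or None.
    best = [None] * (n + 1)
    best[0] = (0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if best[j] is not None and piece in vocab:
                candidate = (best[j][0] + 1, best[j][1] + [piece])
                if best[i] is None or candidate[0] < best[i][0]:
                    best[i] = candidate
    return best[n][1] if best[n] is not None else [text]


# Hypothetical toy vocabulary; a production system would learn from data.
VOCAB = {"what", "is", "the", "area", "of", "a", "circle", "are"}

words = segment("whatistheareaofacircle", VOCAB)
print(" ".join(words))
```

Preferring the fewest words is what makes the sketch choose "area" over "are" + "a"; a learned model would instead score candidate splits by probability.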
Category prediction covers both the subject and the question type. We combine the image with the OCR-recognized text to make multimodal predictions, which improves search accuracy.
Multi-channel recall and ranking
Because photo-based question search is an unusual business scenario, Open Search also introduces multi-channel recall and ranking.
Why use multi-channel recall?
Educational photo search differs significantly from traditional web or e-commerce search. First, the search query is extremely long. Second, the query is text produced by OCR from a photo, so an error in recognizing a key term seriously affects recall and ranking.
Traditional plain-text search offers two query approaches: OR logic and AND logic. AND logic relies on the query analysis module described above, which is optimized and customized for the education field; with it, results improve greatly, and accuracy now comes close to that of OR logic.
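The difference between the two query logics can be sketched with a toy inverted index. This is an illustration only: the documents, the query, and the function names are hypothetical, and whitespace tokenization stands in for the real query analysis.

```python
from collections import defaultdict

# Hypothetical miniature question bank.
DOCS = {
    1: "solve the equation x plus 2 equals 5",
    2: "find the area of a circle with radius 3",
    3: "solve for x in the equation 2x equals 10",
    4: "the capital of france",
}


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def or_query(index, terms):
    """Recall documents containing at least one query term."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result


def and_query(index, terms):
    """Recall only documents containing every query term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


INDEX = build_index(DOCS)
clean = "solve the equation".split()
noisy = "so1ve the equation".split()  # one simulated OCR error

print(sorted(or_query(INDEX, clean)), sorted(and_query(INDEX, clean)))
print(sorted(or_query(INDEX, noisy)), sorted(and_query(INDEX, noisy)))
```

In this toy data, OR recalls every document because all of them contain "the", which is why OR is costly to rank, while AND recalls a small set but returns nothing once a single key term is misrecognized.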
How can search computation cost be balanced against search accuracy?
We introduced text vector recall and optimized it in three ways:
First, for the BERT model we use StructBERT, developed in-house by DAMO Academy. We customized it for the education industry and also compressed and accelerated it.
Second, the vector search engine is Proxima, also developed in-house by DAMO Academy. Its accuracy and speed exceed those of open-source systems.
Third, training data accumulates continuously from customers' search logs, so results keep improving.
As the figure on the right shows, the two-tower BERT model ultimately achieves very good results: accuracy exceeds OR logic by 3% to 5%, the number of recalled documents falls by a factor of 40, and latency falls by more than a factor of 10.
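The idea behind vector recall can be sketched as a nearest-neighbor lookup over embeddings. The sketch below makes several assumptions: the three-dimensional vectors are invented by hand, whereas a real system would obtain them from a model such as StructBERT, and the exhaustive cosine scan stands in for a vector engine such as Proxima, whose API is not shown in the article.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vector_recall(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query.

    Brute-force scan (assumption); production engines use approximate
    nearest-neighbor indexes instead.
    """
    scored = ((cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Hypothetical hand-made embeddings for four questions.
DOC_VECS = {
    "q1": [0.9, 0.1, 0.0],
    "q2": [0.1, 0.9, 0.1],
    "q3": [0.8, 0.2, 0.1],
    "q4": [0.0, 0.1, 0.9],
}

top = vector_recall([1.0, 0.1, 0.0], DOC_VECS, k=2)
print(top)
```

Because only the k nearest vectors are returned, the ranking stage sees a small, fixed number of candidates, which matches the article's point that far fewer documents are recalled than with OR logic.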
Search results display
Consider two specific search cases. In the case on the left, the wording of the photographed question differs from the wording in the question bank, so the results returned by a traditional search engine are barely relevant. After we introduce semantic vector recall, the top 3 results on the right fully match the meaning of the question. In the second case, the question contains image information, so a traditional search engine cannot recall it accurately. With our multi-channel recall, once image information is introduced, the top 1 result is exactly the same question.
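One way to picture how several recall channels could feed a single top 3 list is rank fusion. The article does not say how Open Search merges its channels, so the sketch below swaps in reciprocal rank fusion, a common generic technique, purely as an illustration. The channel names, document ids, and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_n=3):
    """Merge ranked candidate lists by summing 1 / (k + rank) per document.

    Generic fusion method used as a stand-in (assumption), not the
    documented Open Search ranking.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for determinism.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


# Hypothetical candidates from three recall channels.
text_channel = ["q7", "q2", "q9"]
vector_channel = ["q2", "q5", "q7"]
image_channel = ["q2", "q9"]

merged = reciprocal_rank_fusion([text_channel, vector_channel, image_channel])
print(merged)
```

A document that appears in several channels, like `q2` here, accumulates score from each one and rises to the top even if no single channel ranks it first everywhere.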
Advantages of the Open Search solution
Case 1: A K12 education customer has tens of millions of users, and its question bank holds around 80 million questions and keeps growing. After the customer adopted Open Search, the accuracy of returned questions increased by 45%, and latency dropped to 50 milliseconds.
Case 2: A higher vocational education customer has a product with 3 million daily active users and 10 million monthly active users. After adopting the system, the customer reported a comparison with its original self-built system: that system took more than two seconds at peak times, whereas Open Search holds steady at 50 milliseconds, a 40-fold reduction. The accuracy of the top 5 results increased by 2.4%, the no-result rate of searches dropped from 40% to less than 1%, and capacity can be expanded smoothly within seconds during peak business periods.
Get expert guidance:
https://survey.aliyun.com/apps/zhiliao/6R4u6vilI
Copyright Statement: The content of this article was contributed voluntarily by real-name registered users of Alibaba Cloud. The copyright belongs to the original author. The Alibaba Cloud Developer Community does not own the copyright and assumes no corresponding legal responsibility. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it. Once verified, the community will immediately delete the suspected infringing content.