"A Thousand Words Dataset: Text Similarity" authoritative evaluation, NetEase Yizhi tops the list

NetEase Shufan

A few days ago, NetEase Yizhi, the artificial intelligence technology and service brand of NetEase Shufan, beat many strong competitors to top the list in the "Thousand Words Dataset: Text Similarity" industry evaluation jointly organized by CCF and Baidu.

Text similarity, the task of identifying whether two pieces of text are semantically similar, is an important research direction in natural language processing (NLP). It is widely used in intelligent customer service, information retrieval, news recommendation and other fields. The intelligent customer service that NetEase Qiyu provides to 400,000 corporate customers is backed by this technology.
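
To make the task concrete, here is a minimal sketch of a text similarity decision: two sentences go in, a similar/not-similar label comes out. It uses character bigram overlap with cosine similarity, which is only a simple lexical baseline and not the pre-trained language model approach described in this article. The function names and the 0.5 threshold are assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams, ignoring whitespace."""
    chars = "".join(text.split())
    if len(chars) < n:
        return Counter([chars]) if chars else Counter()
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_similar(text_a: str, text_b: str, threshold: float = 0.5) -> bool:
    """Label a sentence pair as similar when its score reaches the threshold.

    The threshold is an illustrative assumption, not a tuned value.
    """
    score = cosine_similarity(char_ngrams(text_a), char_ngrams(text_b))
    return score >= threshold
```

A lexical baseline like this fails on pairs that share most of their characters but differ in meaning, which is exactly the kind of case that semantic models are meant to handle.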

Figure: the evaluation leaderboard. "NetEase Hangzhou Research Institute" in the list is the NetEase Yizhi team.

Accumulated knowledge and technology put NetEase Yizhi at the top of the text similarity list

The "Thousand Words Dataset" series of evaluations is a large-scale competition in the field of Chinese natural language processing. Its text similarity open source project brings together public datasets such as LCQMC and BQ Corpus from Harbin Institute of Technology and Google's PAWS-X (Chinese). The aim is to evaluate text similarity models comprehensively and to promote the application and development of text similarity in natural language processing.

It is understood that these public datasets, each supported by a published paper, have been used to assess existing public text similarity models fairly comprehensively. They are regarded as highly authoritative and as representative of the highest level of text similarity research.

Figure: Harbin Institute of Technology (Shenzhen) LCQMC dataset task example

In this evaluation, NetEase Yizhi drew on years of accumulated technical experience, used large-scale pre-trained language models, and optimized specifically for the competition tasks to achieve its current results.

The participating NetEase Yizhi team said that the competition posed two main difficulties. The first is that the BQ Corpus dataset comes from the financial field. It involves a large amount of financial industry knowledge, which a general-purpose pre-trained language model struggles to capture. To address this, the team used semi-supervised learning and other methods to mine broad financial knowledge from multiple business scenarios within NetEase, and then trained a pre-trained language model for the financial field. In the end, the team was significantly ahead of the other participants on this task.

The other difficulty is the quality of the PAWS-X dataset. The data is translated from English, and the translated text differs from natural Chinese. In particular, inconsistent translation of entity words (such as person names and place names) interferes with the algorithm: the same person's name may be kept in English in the first sentence but transliterated into Chinese in the second. To handle this, NetEase Yizhi used its self-developed NER (Named Entity Recognition) service to identify and normalize entity words, and its self-developed Chinese text error correction service to fix typos and language problems, before training the model. It finally took first place on this task.
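
The entity normalization step can be sketched roughly as follows. This is an illustration only: the article's NER service is not public, so a small hand-written alias table (hypothetical, including the example names) stands in for it, and the function name is an assumption.

```python
# Hypothetical alias table mapping each surface form to one canonical form.
# In the article this role is played by an NER service; a hand-written
# dictionary stands in for it here.
ENTITY_ALIASES = {
    "John Smith": "约翰·史密斯",
    "Paris": "巴黎",
}


def normalize_entities(sentence: str, aliases=None) -> str:
    """Rewrite every known entity alias in the sentence to its canonical form."""
    if aliases is None:
        aliases = ENTITY_ALIASES
    # Replace longer aliases first so a short alias never splits a longer one.
    for surface in sorted(aliases, key=len, reverse=True):
        sentence = sentence.replace(surface, aliases[surface])
    return sentence


# A pair in which only the second sentence keeps the English names.
first = "约翰·史密斯住在巴黎"
second = "John Smith住在Paris"
normalized_pair = (normalize_entities(first), normalize_entities(second))
```

After normalization both sentences use the same surface form for each entity, so a downstream similarity model is no longer penalized for a translation artifact.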

NetEase Yizhi helps the Qiyu robot understand customer demands accurately

NetEase Yizhi has built an intelligent dialogue system based on text similarity and a series of other NLP technologies. It serves multiple businesses within the group, such as NetEase Yanxuan customer service and IT consulting, and, together with the Qiyu business, provides intelligent customer service robot products and services to customers outside the group.

Take Joyoung Co., Ltd. as an example. One of its core demands is to safeguard the user's shopping experience through efficient, accurate and user-friendly consulting services, for example when users ask about the functions, operation, prices, promotions, maintenance and repair of small household appliances.

To this end, Joyoung has connected to the NetEase Qiyu online robot, which provides an intelligent service experience that understands users better, with a question matching rate of more than 90%. Based on the NetEase Yizhi text similarity algorithm, the Qiyu online robot performs core semantic matching, which powers functions such as BOT and FAQ. In addition, through semantic matching technology, the Qiyu online robot also supports intelligent mining and generation of the knowledge base. With these capabilities, it can answer customer questions efficiently and accurately in different scenarios.
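
FAQ matching of this kind can be sketched as a nearest-question lookup with a confidence threshold. The example below is a simplified illustration, not the Qiyu implementation: it scores questions with Python's standard `difflib.SequenceMatcher` instead of a semantic model, and the FAQ entries, the function name and the 0.6 threshold are all made up for the example.

```python
from difflib import SequenceMatcher
from typing import Optional

# Made-up FAQ entries for illustration: standard question -> answer.
FAQ = {
    "豆浆机如何清洗": "请先断电, 再用清水冲洗杯体。",
    "保修期是多久": "整机保修一年。",
}


def find_best_answer(question: str, faq=None, threshold: float = 0.6) -> Optional[str]:
    """Return the answer of the most similar FAQ question, or None.

    None signals that no entry is similar enough, so the caller can fall
    back to a human agent. The threshold is an illustrative assumption.
    """
    if faq is None:
        faq = FAQ
    best_score, best_answer = 0.0, None
    for standard_question, answer in faq.items():
        # Character-level similarity ratio in the range [0, 1].
        score = SequenceMatcher(None, question, standard_question).ratio()
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None
```

Calling `find_best_answer("豆浆机怎么清洗")` returns the cleaning answer, while an unrelated question such as `"今天天气如何"` returns `None`. Replacing the scoring line with a semantic similarity model leaves the surrounding lookup logic unchanged.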
In the field of express delivery, Shentong Express has also connected to Qiyu's intelligent customer service to handle delivery inquiries. This is a completely different field from the finance and small household appliance cases above. However, using the same NetEase Yizhi technical principles, the intelligent customer service quickly achieved similar results.

NetEase Yizhi NLP promotes digital business innovation

The commercial value of text similarity technology is not limited to intelligent customer service. According to the person in charge of NetEase Yizhi, text similarity belongs to the broader category of text matching. Beyond the dialogue engine, this technology has further applications within NetEase, such as intelligent mining of NetEase Cloud Music comments, lyrics matching in live broadcasts and short videos, and similarity detection for video topic selection in the knowledge highway business.

Across the technical field as a whole, NLP, the technology that allows machines to understand human language, is known as the "jewel in the crown of artificial intelligence". It is not only a difficult frontier subject but also important for digital business innovation. Beyond text similarity, NetEase Yizhi has been exploring where NLP technology and business innovation best meet, and has achieved some initial results.

For example, semantic analysis technology is used in software testing, where it significantly raises the level of automation and reduces cost while improving efficiency, which helps assure software quality. Text error correction technology is used on a large scale in manuscript review scenarios such as NetEase News to find and correct spelling and grammar errors in time, greatly improving the reading experience and reducing the workload of content production.

In the future, NetEase Yizhi will also work with several teams within NetEase to explore the application of NLP in big data systems, such as supporting natural language interaction between business personnel and analysis systems, so that enterprises can make better use of the value of big data.

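NetEase Shufan originated from the NetEase Hangzhou Research Institute and is NetEase's innovation vehicle and technology incubator for the digital economy.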
