OPPO Xiaobu Assistant algorithm system exploration, practice and thinking

1 Introduction

Dialogue interaction is a highly promising key technology direction for the era that follows the traditional PC, the PC Internet, and the mobile Internet, and it receives a great deal of attention from both academia and industry. It is also one of the key nodes in OPPO's integration strategy and carries a great and arduous mission.

Algorithms are one of the core capabilities of dialogue interaction. They determine the level of intelligence that a voice assistant can achieve and have extremely high technical value. This article introduces the goals of dialogue interaction, the key problems the algorithms must solve, the status quo and trends of the industry, the main practice and progress of Xiaobu, and the challenges and outlook for the future.

2 Dialogue interaction goals and key issues

In layman's terms, the goal of dialogue interaction is to complete human-computer interaction processes such as task execution, information acquisition, and emotional communication in a natural dialogue manner through voice or text. For example, intelligent assistants such as Jarvis and Dabai in science fiction movies represent people's expectations about the ideal state of dialogue and interaction capabilities.

Dialogue interaction has received more and more attention in recent years. What is the reason behind this? Looking back at the past 40 years of information technology development makes it easy to understand. Information technology has gone through the eras of the traditional PC, the PC Internet, and the mobile Internet. Each era was closely tied to its devices, which brought about a revolution in entry points and interaction methods. We are now moving towards the AIoT era with high expectations. Because of its huge potential as a new generation of search engine, a super service distribution center, and a new interaction method, dialogue interaction carries the mission and vision of driving the next entry-point and interaction transformation in this new era.

However, it is very difficult to achieve the ideal dialogue interaction effect, mainly because it requires moving beyond today's mature perceptual intelligence technology towards cognitive intelligence. At present, many problems in the field of cognitive intelligence have not been fundamentally resolved, and some are not even clearly defined. Typical cognitive problems include how to represent and understand common sense, how to give machines reasoning and planning capabilities, and how to give machines the same imagination and autonomy as humans. To a certain extent, solving the problem of cognitive intelligence is basically equivalent to realizing strong artificial intelligence, which shows how difficult dialogue interaction is.

The main process of dialogue interaction is shown in the figure below. It is not difficult to find that almost all key nodes are related to algorithms, and algorithms are the core ability to achieve better dialogue interaction effects.

For OPPO's self-developed Xiaobu Assistant, the algorithm capabilities are shown in the following table. Voice wake-up is mainly supported by third parties and the software engineering system. On new phones its effect is currently on par with the industry's top competitors, but older models have a high technical upgrade cost and some low-end models cannot support voice wake-up at all. Speech recognition uses the capabilities of both third parties and the OPPO Research Institute. Because speech recognition technology is mature, the overall effect is good and the word error rate can be kept below 6%; the main problem is audio quality. Speech synthesis is similar to speech recognition and is also supported by third parties and the OPPO Research Institute. It has good accuracy and fluency, but the evaluation of dimensions such as naturalness and emotional expressiveness is highly subjective, and user personalization is not supported for the time being. Semantic understanding and dialogue capabilities are mainly provided by the business technical team. In semantic understanding, accuracy and recall can both exceed 90%, although open-domain long-tail queries remain hard to understand. In dialogue capabilities, the system currently supports immersive strong multi-turn dialogue, freely switchable weak multi-turn dialogue, and multi-turn reasoning over the preceding context. The main difficulties of multi-turn dialogue are that it is hard to evaluate, user habits are weak, and online penetration is low.
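
To make the word error rate figure concrete, the following is a minimal sketch of how the metric is commonly computed: the word-level edit distance between the recognized text and a reference transcript, divided by the reference length. The function name and the sample sentences are illustrative assumptions, not part of Xiaobu's evaluation tooling; for Chinese text the same calculation is often run over characters instead of words.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Edit distance between the token lists, divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] is the edit distance between the reference prefix processed so
    # far and hypothesis[:j]; start with the empty reference prefix.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        curr = [i]  # deleting all i reference tokens matches an empty hypothesis
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_token != hyp_token)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical example: one of five reference words is dropped by the recognizer.
reference = "turn on the bluetooth please".split()
hypothesis = "turn on bluetooth please".split()
print(word_error_rate(reference, hypothesis))
```

Under this definition, a word error rate below 6% means fewer than six word-level edits per hundred reference words on average.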

Semantic understanding and dialogue capabilities are the focus of this article. After receiving the user's query, the main task is first to understand what the user wants, then to decide what to give the user, and finally to assemble suitable resources that satisfy the user. The semantic algorithm system, composed of semantic understanding and dialogue capabilities, exists to achieve these goals. The system involves two major categories of problems, systemic and technical, as shown in the following figure.

Systemic issues include: how to decouple and decompose a complex system that must support queries across all domains, hundreds of skills, and multiple devices and channels; how to iterate efficiently given numerous product requirements, many modules, long processes, and high algorithmic uncertainty; how to safeguard the experience through effect monitoring when diverse spoken queries cannot be exhaustively enumerated; and how to avoid low-level defects, unanswered questions, and an excess of glaringly unintelligent responses.

Technical problems include the selection of algorithms, the modeling and solving of key problems, the control of multiple rounds of dialogue, and the guarantee of performance.

3 Industry status and algorithm trends

First of all, dialogue interaction has become increasingly mature in its application scenarios, covering many fields such as smart homes, in-vehicle systems, daily life and travel, and professional services. Convenience and speed are the natural advantages of natural language dialogue interaction, which is being accepted by more and more users. It is estimated that more than 7 billion devices will be equipped with voice assistants in 2020.

In addition, in terms of development trends, top technology companies have never given up investing in this direction over the past ten years. Foreign companies, represented by Apple, Amazon, and Google, all regard dialogue interaction as a very important direction. The domestic situation is similar: Baidu, Xiaomi, and Alibaba are all actively deploying in this area, aiming to seize the future traffic entry point of dialogue interaction.

A trend worthy of attention is that intelligent dialogue assistants built for third-party devices are gradually fading out, and each company is concentrating on developing assistants for its own devices. Besides the close coupling between the related technologies and the devices, there is a more important reason: this entry point is so important that no leading device manufacturer is willing to hand it over to a third-party technology provider.

Dialogue interaction is also a hot topic in academic research. From the trend analysis of ACL papers, it can be seen that the direction of dialogue interaction has emerged in the past five years, and it has become the most popular research direction in 2019 and 2020.

Reference: Trends of ACL: https://public.flourish.studio/visualisation/2431551/

In terms of core cognitive understanding algorithms, the solution paradigm has evolved from the traditional multi-module pipeline, which relies heavily on language, problem type, and manual customization experience, to a simpler, more general, and more efficient end-to-end integrated solution. This evolution greatly simplifies the problem-solving process. It not only avoids cumulative errors effectively, but also allows big data, large models, and large computing power to be put into practice, significantly improving the effect.

In the past two years, large-scale pre-trained models, represented by Google BERT, have emerged at the model level. They have swept the leaderboards of major language modeling tasks and unlocked huge potential for developing more advanced semantic understanding models. This will undoubtedly provide solid technical support for the development of dialogue interaction.

In summary, both industry and academia pay close attention to dialogue interaction, which reflects their shared prediction of future trends. Breakthroughs in algorithm technology have further accelerated the launch of dialogue interaction products, so that the future will arrive sooner.

4 Practice and progress of Xiaobu algorithm system

As mentioned earlier, semantic understanding and dialogue capabilities together constitute the core semantic algorithm system of Xiaobu. The following sections will present our practice and key progress in this direction in detail.

First of all, in terms of business requirements, we mainly consider four dimensions: business boundaries, dialogue capabilities, user volume, and evaluation indicators. In terms of business boundaries, Xiaobu Assistant is a full-scenario, open-domain dialogue interaction system. The areas it needs to support include system control, information query, audio-visual entertainment, life services, and intelligent chat, covering hundreds of skills, and the breadth of user queries is very large. In terms of dialogue capabilities, in addition to simple command control and single-turn questions, it needs to support multi-turn task-oriented dialogue, weak multi-turn dialogue, and context understanding, as well as higher-level capabilities such as dialogue recommendation and proactive dialogue. In terms of user volume, Xiaobu needs to cover the mobile phones of Oga Group's three brands as well as devices such as watches and TVs, amounting to billion-level devices and tens of millions of daily active users. In terms of evaluation indicators, the main considerations are demand coverage, intent call accuracy, skill satisfaction, and response time.

Generally speaking, the mission of Xiaobu Assistant is to establish a dialogue connection. One end of the connection is the large user base of Oga Group's device ecosystem, and the other end is high-quality conversational services. This connection is used to realize user value, marketing value, and technical value.

In order to support the above business requirements, we abstractly summarized four design principles to guide the design of the algorithm system:

Domain divide and conquer: Divide the problem by domain, decomposing the complex full-domain problem into simpler sub-problems that are solved group by group, which reduces the difficulty of solving them and improves the controllability of the system;

Effect priority: To avoid glaringly unintelligent experiences as far as possible, the design is not tied to any single technology; algorithm design is driven by effect first so that low-level defects are avoided;

Closed-loop monitoring: Establish a complete closed-loop monitoring mechanism. In the R&D phase, test coverage is improved through test cases designed by multiple parties, including product, testing, and R&D; online, real-time dynamic test set monitoring and manual evaluation are used to safeguard the experience;

Platform-based efficiency: To support the many mid- and long-tail skills, promote the construction of a skill platform and reduce the R&D and maintenance cost of mid- and long-tail skills with a consistent, common, platform-based solution.

With reference to the business requirements and design principles, the overall architecture of Xiaobu Assistant's current algorithm system is shown in the figure below. First, in terms of platforms and tools, the basic algorithms are the mainstream deep learning algorithms in the industry. Algorithm solutions are built on them for different problem types and further packaged into modules such as the NLU framework, general graph question answering, the skill platform, and the open platform. Then, in terms of business, the top layer applies symbolic, structured, and numerical processing to queries in general, and the business is split along the dimensions of system applications, life services, audio-visual entertainment, information query, and intelligent chat, so that each business line can iterate independently. Finally, dialogue generation and fusion ranking are combined to select the best skill to meet the user's needs.

In terms of the processing flow, the system can be divided into pre-processing, intent recognition, multi-class ranking, resource acquisition, and post-processing. The first three nodes are mainly responsible for the recall rate of the intent, the latter two nodes are responsible for resource coverage and the relevance of the results, and the entire process is responsible for the final skill satisfaction.
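
The five stages can be sketched as a small pipeline. This is an illustrative toy under stated assumptions, not Xiaobu's implementation: the skill names, the keyword-based recognizer, the scores, the `RESOURCES` table, and the fallback to the next-ranked skill when no resource is found are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    skill: str
    score: float


# Hypothetical resource store; "music" deliberately has no content available.
RESOURCES = {"weather": "Sunny, 25 degrees. ", "chat": "I am here to help."}


def preprocess(query: str) -> str:
    """Pre-processing: normalize case and whitespace."""
    return " ".join(query.lower().split())


def recognize_intents(query: str) -> list[Candidate]:
    """Intent recognition: every skill may propose a scored candidate."""
    candidates = [Candidate("chat", 0.3)]  # chat always offers itself as a fallback
    if "weather" in query:
        candidates.append(Candidate("weather", 0.9))
    if "play" in query:
        candidates.append(Candidate("music", 0.8))
    return candidates


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Multi-class ranking: order the competing skills by score."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)


def handle(query: str) -> tuple[str, str]:
    normalized = preprocess(query)
    for candidate in rank(recognize_intents(normalized)):
        resource = RESOURCES.get(candidate.skill)  # resource acquisition
        if resource is not None:
            return candidate.skill, resource.strip()  # post-processing
    raise LookupError("no skill could answer the query")


print(handle("  What is the WEATHER today "))
print(handle("play some jazz"))
```

In this toy, the first three functions decide whether the right intent is recalled at all, while the lookup and clean-up inside `handle` decide whether a relevant result actually reaches the user.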

The key algorithm modules involved in the semantic algorithm system are shown in the following figure. The three core modules of semantic understanding, dialogue management, and dialogue generation will be introduced separately in the follow-up.

Intent recognition is the core module of semantic understanding. Its main task is to infer what the user wants to do through the analysis of the user's current query and interaction history, including typical scenarios of closed domain, open domain, and context.

Slot extraction is a task closely related to intent recognition. The main task is to extract key information from the user's current query and interaction history to assist in accurately obtaining the answer/content required by the user.

Intent recognition and slot extraction together constitute the semantic understanding module. The main difficulties are the diversity of spoken language (on the order of 100 million distinct queries), ambiguity (for example, Peppa Pig is a cartoon but also an App), and dependence on knowledge (for example, "can it" turns out to be a song title).

Dialogue management is another key module of the semantic algorithm system. Its task is to deduce the dialogue state based on the current Query and dialogue context, and then infer the best response of the dialogue system in the next step.

After semantic understanding and dialogue management are complete, dialogue generation is needed to deliver the final, appropriate execution feedback of the skill. The task of dialogue generation is to produce a suitable response, by a suitable method, based on the results of semantic understanding and the actions to be performed.

In terms of algorithm models, Xiaobu is mainly driven by deep learning. Models of this type deliver better results, and the technical solutions are relatively mature, with many successful cases.

However, it is worth emphasizing that there is basically no single one-size-fits-all model in this field that solves every technical problem. Generally, the main deep learning model secures the baseline effect, and customized rules are still needed to handle corner-case bad cases.

For system application control skills, we mainly adopt a scheme that fuses rules with deep learning models to improve semantic understanding. Reverse rules are used to quickly reject out-of-domain queries, forward rules are used to cover strongly patterned expressions, and the deep learning model is responsible for generalized recognition of general cases. In addition, multi-task joint learning is introduced to improve the joint accuracy of intent and slot.
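
The order in which rules and model are consulted can be sketched as follows. The regular expressions, the intent name `bluetooth.on`, the confidence threshold, and the stub `toy_model` are hypothetical placeholders; a real system would plug a trained classifier in where the stub is.

```python
import re
from typing import Callable, Optional

# Reverse rules: patterns that mark a query as out of domain for this skill.
REJECT_RULES = [re.compile(r"\bhow (do|to)\b")]

# Forward rules: strongly patterned expressions mapped directly to an intent.
ACCEPT_RULES = [
    (re.compile(r"^(turn on|open) (the )?bluetooth$"), "bluetooth.on"),
]


def fused_intent(
    query: str,
    model: Callable[[str], tuple[str, float]],
    threshold: float = 0.5,
) -> Optional[str]:
    text = query.lower().strip()
    # 1. Reverse rules reject out-of-domain queries before any model call.
    if any(rule.search(text) for rule in REJECT_RULES):
        return None
    # 2. Forward rules cover fixed expressions with a deterministic result.
    for pattern, intent in ACCEPT_RULES:
        if pattern.search(text):
            return intent
    # 3. The model generalizes to everything the rules did not decide.
    intent, confidence = model(text)
    return intent if confidence >= threshold else None


def toy_model(text: str) -> tuple[str, float]:
    """Stand-in for a deep learning classifier (an assumption for this sketch)."""
    if "bluetooth" in text:
        return "bluetooth.on", 0.8
    return "unknown", 0.1


print(fused_intent("Turn on Bluetooth", toy_model))
print(fused_intent("how to turn on bluetooth", toy_model))
print(fused_intent("please enable bluetooth for me", toy_model))
```

Placing the rules in front of the model keeps the behavior on fixed commands predictable and gives a cheap place to patch individual bad cases without retraining.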

Multi-task joint learning can disambiguate intentions and slots. It is mainly applied to skills such as phone calls, text messages, and schedules. Compared with single-task independent learning, the general accuracy rate can be increased by 1% to 3%. Combined with meticulous data-driven optimization and rule verification, it can basically achieve a recall rate of more than 95%.
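
One common way to write the joint objective is shown below, as a sketch rather than the exact loss used in Xiaobu. A shared encoder with parameters \theta is trained on the sum of an intent classification loss and a slot tagging loss; the weight \lambda, the utterance x with T tokens, the intent label y, and the slot labels s_1, ..., s_T are notation assumed here for illustration.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\text{intent}}(\theta) + \lambda \, \mathcal{L}_{\text{slot}}(\theta),
\qquad
\mathcal{L}_{\text{intent}}(\theta) = -\log p_\theta(y \mid x),
\qquad
\mathcal{L}_{\text{slot}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(s_t \mid x)
```

Because both terms update the same encoder, evidence for a slot (for example, a contact name) also shifts the intent prediction and vice versa, which is the disambiguation effect described above.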

For knowledge-dependent skills, such as music, radio, and film, we mainly adopt a knowledge-integrated intent recognition scheme, as shown in the figure below. The main difficulty of this kind of skill is that the sentence pattern alone cannot determine the intent, so it is very important to extract the resource fields from the query accurately. Performing intent recognition after fusing in the resource association results significantly reduces the difficulty of the problem.

Unlike the closed domain, intent recognition in the open domain is difficult to model as a classification problem and generally needs to be solved with a semantic matching scheme. For this kind of problem, we mainly adopt deep semantic matching, as shown in the figure below. Compared with traditional matching based on text symbols, the effect is better, and the matching accuracy can reach more than 95%. However, there are also problems such as subject recognition and semantic inclusion, which need to be controlled by downstream verification strategies. At present, it is mainly used for QA matching in information query and chit-chat.
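
The matching step itself can be illustrated with a small sketch. To stay self-contained it uses bag-of-words counts as a stand-in for the sentence vectors that a deep semantic model would produce; the `QA_PAIRS` table, the `embed` function, and the threshold value are assumptions made for illustration only.

```python
import math
from collections import Counter
from typing import Optional

# Hypothetical QA library: standard question -> answer.
QA_PAIRS = {
    "who wrote hamlet": "William Shakespeare.",
    "what is the capital of france": "Paris.",
}


def embed(text: str) -> Counter:
    """Toy sentence representation; a deep model would return a dense vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match(query: str, threshold: float = 0.6) -> Optional[str]:
    """Return the answer of the most similar standard question, if close enough."""
    query_vec = embed(query)
    best_question = max(QA_PAIRS, key=lambda q: cosine(query_vec, embed(q)))
    best_score = cosine(query_vec, embed(best_question))
    return QA_PAIRS[best_question] if best_score >= threshold else None


print(match("who wrote the play hamlet"))
print(match("play some music"))
```

The threshold plays the role of a simple downstream check: a query that resembles nothing in the library is left unanswered rather than forced onto the nearest entry.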

In addition, in order to further enhance the effect of semantic understanding, we are also exploring the implementation of large-scale complex models. In the direction of large-scale pre-training language models, the team has improved, retrained and fine-tuned based on the open source model, and achieved rapid improvement in effect. It is currently ranked sixth in the total ranking of the Chinese Language Understanding Evaluation Benchmark (CLUE).

However, this type of model has high computational complexity and generally struggles to meet the latency requirements of online inference. It needs to be combined with acceleration schemes such as knowledge distillation before it can be applied.

Common knowledge distillation schemes can be divided into two types: data distillation and model distillation. Data distillation assumes that the simple model is less effective than the complex model because it lacks annotated data; if the complex model is used to provide enough pseudo-annotated data, the simple model can gradually approach the effect of the complex model. Model distillation assumes that the simple model lacks not only enough data but also good guidance; if the intermediate results obtained while training the complex model are also used to guide the training of the simple model, the simple model can approximate the effect of the complex model. Both data distillation and model distillation are applied in Xiaobu Assistant's business.
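
The two ideas can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the training code used for Xiaobu: `pseudo_label` shows data distillation by letting the teacher annotate an unlabeled example, and `distillation_loss` shows one widely used form of model distillation, in which the teacher's temperature-softened output distribution serves as the guidance signal. The temperature, the weight `alpha`, and the sample logits are hypothetical.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(teacher_logits: list[float]) -> int:
    """Data distillation: the teacher's top class becomes the training label."""
    return max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])


def distillation_loss(
    student_logits: list[float],
    teacher_logits: list[float],
    gold_index: int,
    temperature: float = 2.0,
    alpha: float = 0.5,
) -> float:
    """Model distillation: mix a soft-target loss with the usual hard-label loss."""
    soft_targets = softmax(teacher_logits, temperature)
    soft_preds = softmax(student_logits, temperature)
    # Cross-entropy between the softened teacher and student distributions.
    soft_loss = -sum(t * math.log(p) for t, p in zip(soft_targets, soft_preds))
    # Standard cross-entropy against the annotated label.
    hard_loss = -math.log(softmax(student_logits)[gold_index])
    return alpha * soft_loss + (1 - alpha) * hard_loss


teacher = [4.0, 1.0, 0.0]
close_student = [3.5, 1.0, 0.2]
far_student = [0.0, 1.0, 4.0]
print(pseudo_label(teacher))
print(distillation_loss(close_student, teacher, gold_index=0))
print(distillation_loss(far_student, teacher, gold_index=0))
```

A student whose outputs resemble the teacher's receives the lower loss, which is what pushes the simple model towards the complex model during training.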

The dialogue system is also considered to be the next-generation search engine. Users make many knowledge question-answering requests and expect accurate answers. To meet these needs, we build our own knowledge base through data acquisition and data mining, and then provide question-answering services by combining online semantic matching, KBQA, and other methods.

In addition, to answer fact-related questions in vertical domains accurately, we have also built a general question-answering capability based on the knowledge graph. For premium vertical categories, a domain graph is constructed through data cooperation and in-house crawling, and accurate question answering is then performed based on templates and the graph.
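
A minimal sketch of template-plus-graph question answering is given below. The tiny `GRAPH`, the two question templates, and the relation names are hypothetical; a production system would use a graph store and a far richer set of templates and entity linking.

```python
import re
from typing import Optional

# Hypothetical domain graph stored as (entity, relation) -> value.
GRAPH = {
    ("Hamlet", "author"): "William Shakespeare",
    ("Japan", "capital"): "Tokyo",
}

# Each template maps a question pattern to the relation it asks about.
TEMPLATES = [
    (re.compile(r"^who wrote (?P<entity>.+?)\??$", re.IGNORECASE), "author"),
    (re.compile(r"^what is the capital of (?P<entity>.+?)\??$", re.IGNORECASE), "capital"),
]


def answer(question: str) -> Optional[str]:
    text = question.strip()
    for pattern, relation in TEMPLATES:
        found = pattern.match(text)
        if found:
            # Naive entity normalization; real systems use entity linking.
            entity = found.group("entity").strip().title()
            return GRAPH.get((entity, relation))
    return None  # no template matched, so leave the query to other skills


print(answer("Who wrote Hamlet?"))
print(answer("what is the capital of japan"))
print(answer("Who wrote Dune?"))
```

Because an answer is returned only when both a template and a graph fact match, this style of question answering favors precision, which suits fact-related questions.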

At present, Xiaobu Assistant's self-built knowledge question-answering service handles more than 50% of online head traffic, while the long-tail part is also backed by strong search providers such as DuMi and Sogou.

In terms of dialogue management, commonly used solutions include those based on finite state machines, those based on Slot-Filling, and end-to-end solutions. The difficulties are flexible process control, context inheritance and forgetting, intent jumps, and exception handling. Currently, the Slot-Filling mode is mainly used.
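
The Slot-Filling idea can be sketched as follows: the dialogue state accumulates slots across turns, and the manager keeps asking until every required slot is present. The intent name, the required slots, and the prompts are hypothetical examples, not Xiaobu's configuration.

```python
from dataclasses import dataclass, field

# Hypothetical skill definition: required slots per intent, in asking order.
REQUIRED_SLOTS = {"set_alarm": ["date", "time"]}
PROMPTS = {"date": "Which day?", "time": "What time?"}


@dataclass
class DialogueState:
    intent: str
    slots: dict = field(default_factory=dict)


def next_action(state: DialogueState, new_slots: dict) -> tuple:
    """Merge the slots from the current turn, then ask or execute."""
    state.slots.update(new_slots)  # slots from earlier turns are inherited
    for name in REQUIRED_SLOTS[state.intent]:
        if name not in state.slots:
            return ("ask", PROMPTS[name])
    return ("execute", dict(state.slots))


state = DialogueState("set_alarm")
print(next_action(state, {"time": "7am"}))       # the user said "wake me at 7am"
print(next_action(state, {"date": "tomorrow"}))  # the user answered "tomorrow"
```

The difficulties listed above show up as extensions of this loop: deciding when inherited slots should be forgotten, and what to do when the next turn carries a different intent.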

To achieve better context understanding in multi-turn dialogue, Xiaobu Assistant has implemented a context understanding solution based on reference resolution, which handles the common problems of reference and omission in multi-turn dialogue.

Reference: ACL'2019 Improving Multi-turn Dialogue Modelling with Utterance ReWriter

With the help of dialogue management and context understanding, Xiaobu Assistant supports immersive strong multi-turn, freely switchable weak multi-turn, context-reasoning multi-turn, and other modes, covering business scenarios such as task-oriented dialogue, information query, and multi-turn chat.

In terms of dialogue generation, there are currently three types of solutions in the industry: template-based, retrieval-based, and model-based. Because generative models are hard to control, Xiaobu currently relies mainly on template-based and retrieval-based solutions, while generative models are still in the pre-research stage.
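
The template-based approach can be sketched with Python's standard `string.Template`. The action name, the template text, and the slot values are hypothetical; the point is only that every response is produced from text written in advance, which is what makes this approach easy to control.

```python
from string import Template
from typing import Optional

# Hypothetical response templates keyed by the action chosen by dialogue management.
RESPONSE_TEMPLATES = {
    "weather.query": Template("It is $condition in $city today, with a high of $high degrees."),
}


def render(action: str, slots: dict) -> Optional[str]:
    """Fill the template for an action; return None if it cannot be rendered."""
    template = RESPONSE_TEMPLATES.get(action)
    if template is None:
        return None
    try:
        return template.substitute(slots)  # raises KeyError if a slot is missing
    except KeyError:
        return None


print(render("weather.query", {"condition": "sunny", "city": "Shenzhen", "high": "28"}))
print(render("weather.query", {"city": "Shenzhen"}))
```

Returning `None` instead of a half-filled sentence lets a caller fall back to another strategy, such as a retrieval-based reply.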

In terms of algorithm engineering, a Python-based service framework was provided in the early stage in order to go online quickly, with multiple instances deployed to compensate for the weak concurrency of a single service. Currently, for services with high computational complexity, we are exploring operator-based engineering refactoring and optimization, and are working with the machine learning platform team to explore a simpler and more efficient service model.

In terms of skill building, the early focus was on customized research and development of individual skills in order to go online quickly. Construction of the skill platform began at the end of last year. The main idea is to standardize the offline model generation and online inference processes, turn key algorithms into operators, complete skill development through data import and process configuration, and reduce the cost of supporting and maintaining mid- and long-tail skills.

Finally, to safeguard the dialogue interaction experience, we built a full-process closed-loop monitoring program together with the data team and the evaluation team. First, R&D self-testing ensures that the algorithm model performs as expected. Then, when a version is released, a round of batch testing ensures that no new risks are introduced. After launch, routine monitoring and real-time monitoring safeguard the overall effect and the normal operation of key functions respectively. In addition, manual sampling evaluation and third-party evaluation are introduced to further monitor the experience.
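
The batch-testing step at release time can be sketched as a simple regression gate: run the candidate model over a fixed test set and block the release if accuracy falls below the current baseline by more than a tolerance. The test queries, the intent labels, the baseline values, and the stub predictor are all hypothetical.

```python
from typing import Callable


def evaluate(predict: Callable[[str], str], test_set: list[tuple[str, str]]) -> float:
    """Share of test queries for which the predicted intent equals the expected one."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    correct = sum(predict(query) == expected for query, expected in test_set)
    return correct / len(test_set)


def release_gate(
    predict: Callable[[str], str],
    test_set: list[tuple[str, str]],
    baseline: float,
    tolerance: float = 0.01,
) -> tuple[bool, float]:
    """Pass only if accuracy has not dropped more than `tolerance` below baseline."""
    accuracy = evaluate(predict, test_set)
    return accuracy >= baseline - tolerance, accuracy


# Hypothetical regression set and a stub model that gets one case wrong.
TEST_SET = [
    ("call mom", "phone.call"),
    ("text dad", "sms.send"),
    ("set an alarm", "alarm.set"),
    ("play music", "music.play"),
]
STUB_OUTPUT = {
    "call mom": "phone.call",
    "text dad": "sms.send",
    "set an alarm": "alarm.set",
    "play music": "video.play",
}

print(release_gate(STUB_OUTPUT.get, TEST_SET, baseline=0.75))
print(release_gate(STUB_OUTPUT.get, TEST_SET, baseline=0.90))
```

The same comparison can be run continuously against a dynamically refreshed test set after launch, which corresponds to the routine and real-time monitoring described above.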

5 Challenges and future thinking

Although dialog interaction has made great progress in algorithm technology in recent years, there are still many challenges compared to the Jarvis and Dabai expected by users.

First of all, in terms of semantic understanding, the current model is essentially statistical induction based on data, and lacks robustness and completeness in extreme cases.

Secondly, as a candidate with the potential to replace search engines, a dialogue assistant is bound to take on the role of a "know-it-all". However, low-frequency question answering is open-domain, shows an obvious long-tail effect, and relies heavily on knowledge content, so it is very difficult and costly to build.

In addition, unlike the relatively mature search and recommendation scenarios, the iterative optimization of dialogue interaction capabilities mainly relies on humans, and it is difficult to connect to the self-feedback and self-learning high-speed engine driven by big data, and it is difficult to improve quickly.

The challenges ahead go far beyond these. We will continue to explore actively and work tirelessly on stronger semantic understanding, richer knowledge, smoother dialogue, and dialogue management across more domains, as well as on self-feedback, weakly supervised, and self-evolving learning capabilities, in order to create the intelligent assistant with the best user experience in the Chinese-language field.

Colleagues who are interested in intelligent assistant and dialogue interaction technology are welcome to exchange and discuss together!

Author profile

Zhenyu, Head of NLP and dialogue algorithms, OPPO Xiaobu Smart Center

A candidate of the Shenzhen High-level Talents Program, he received his bachelor's and doctoral degrees in computer science from the University of Science and Technology of China.

In recent years, he has focused on the research, development, and implementation of key dialogue AI algorithm technology. In 2018, he joined OPPO to lead the construction of the Xiaobu Assistant NLP and dialogue algorithm system. In his research work, a single representative academic paper of his has been cited more than 800 times. He has once won the Second Prize for Scientific and Technological Progress of Higher Education Institutions (science and technology) and twice won the Second Prize for Scientific and Technological Progress of Hunan Province.

