As a next-generation foundational technology, real-time interaction supports and drives innovative ways for people, objects, and spaces to communicate and interact.
Voice processing is an important scenario in the field of real-time interaction. At the "RTC Dev Meetup 丨 Technical Practice and Application of Voice Processing in the Field of Real-Time Interaction", technical experts from Baidu, Universal Technology, and Yitu shared their work on this topic.
This article is based on the talk given by Huang Shuo, architect of Baidu's Natural Language Processing Department, at the event. Follow the official account "Shengwang Developer" and reply with the keyword "DM0428" to download the slides for the event.
Application of the Baidu PaddlePaddle Wenxin Large Model in Voice and Text Review
Huang Shuo, Architect, Baidu Natural Language Processing Department
Deep learning pre-trained large models have developed rapidly in recent years, upending many fields previously addressed with traditional machine learning techniques. Thanks to the development of large-model technology at Baidu, voice and text review, a traditional Internet business, has also made great technical progress.
This article compares large models with traditional models in terms of overall effect, versatility, adaptation to individual needs, and service performance, in the hope of helping everyone understand the advantages of large models, as well as recent development trends in review technology and its effects in business applications.
01 Development of the Baidu PaddlePaddle Wenxin Large Model
1. History of large pre-trained models in the industry
In 2018, after Google released the BERT pre-trained large model, many traditional practices in natural language processing changed completely. Before that, if you wanted machines to understand human language, you usually had to solve a series of linguistic problems first: basic word segmentation, part-of-speech tagging, entity recognition, and core-word extraction, and in Chinese even the dependency relations among words in complex sentences. Only by understanding those relationships could a computer accurately grasp the logic of a sentence and complete tasks such as search, relevance computation, or recommendation.
After BERT and other large-scale pre-trained text models were proposed, model bases such as GPT, T5, and Baidu's ERNIE were launched one after another. With a pre-trained base that already understands language, tasks related to language and text can be built directly on top of it.
As shown in Figure 1, word vectors in the style of word2vec were already in industrial use around 2014, and Baidu's major upgrades to web search ranking and semantic computation also went live around that time. These technologies elegantly solved the problem of incomplete keyword matching in search and ranking, allowing the computer to understand the semantics behind the words, but their effect and generalization fall short of the later pre-trained models that combine Attention, Transformer, and other network structures.
■Figure 1
2. The development of deep learning technology at Baidu
As mentioned above, Baidu started practical work on semantic vector computation around 2013. The timeline in Figure 2 shows the development of deep learning at Baidu. Around 2012, Baidu began developing deep learning technology for speech recognition and OCR, and deep learning was applied to search around 2013. At the same time, Baidu independently developed PaddlePaddle, its own deep learning framework. Deep learning has since been deployed at scale across Baidu's main businesses, including image, text, voice, search and recommendation, and autonomous driving.
■Figure 2
3. The development of the Baidu PaddlePaddle Wenxin pre-trained model in recent years
Baidu launched the PaddlePaddle Wenxin large-scale pre-trained model in 2019. Today, I will share the various ways we have applied the Wenxin large model in review technology. Over the past two or three years, we have successively released versions 2.0, 3.0, and other variants of the Wenxin large model for different domains, different languages, and different scales.
Figure 3 shows the Wenxin model family, which is organized in several layers from bottom to top; most similar large-model families in the industry, not just PaddlePaddle Wenxin, are structured this way. At the bottom are versions of different granularities and versions optimized for different task types, for example language-generation models and information-extraction models, since a model base performs differently on different tasks. Above that is a layer of domain models: for different domains, the large model is pre-trained on different corpora to obtain different effects. The next layer covers cross-modal and cross-lingual models, meaning that besides text, modalities such as voice, images, and documents can be integrated into a single pre-trained model. The top layer represents applications of pre-trained large models with different orientations, which have been validated in businesses such as search, recommendation, voice, documents, and customer service.
■Figure 3
What role can the Wenxin pre-trained large model play in the voice and text review business? I will cover this from several angles: What effect does the large model have as a model base? How can distillation solve the performance problems of large models? What can large models do for training-sample enhancement? How do large models address the individual needs of different customers? And how can large models optimize the matching-rule strategies of traditional review systems?
02 Application of the Wenxin Large Model in Voice and Text Review
1. Review business characteristics
(1) Background introduction to text review and voice review
Text review is one of the foundations of voice review. In the industry, content review is roughly divided into categories such as pornography, politics, advertising, violence, and abuse, and each category has different review targets at a finer data level. Different data sources also present different difficulties for review technology: articles published on news sites tend to use relatively regular wording, while user comments or forum posts are much more casual in wording and sentence structure. The sub-categories under each review type likewise carry different requirements. On the technical side, the most common practice is to combine a lexicon with model-based semantic judgment.
Beyond the classic approach of converting speech to text with ASR and then applying text review, speech data has characteristics of its own. For example, with voiceprints we can tell whether a sentence was spoken in an angry or a calm mood, which cannot be determined from the text alone. In addition, speech segmentation, translation, error correction, and robot-synthesized voice advertisements and dialogues are all aspects of voice review that differ from plain text.
(2) Common difficulties in voice auditing and text auditing technology
Figure 4 illustrates the technical difficulties typically involved in review. The first is the diversity of data: press releases, user bullet-screen comments, and robot voices differ greatly in content. The second is the diversity of review requirements: across categories such as politics, pornography, and advertising, the focus of review, the prevalence in the data, and the depth of semantic understanding required all vary in difficulty. The third is that review services usually face strict performance requirements: applications such as voice live streaming, voice chat, and bullet-screen comments are highly latency-sensitive, and many businesses require real-time interception and cannot accept offline batch review and filtering.
■Figure 4
(3) Individualized needs of review business customers
In addition to the general technical difficulties above, customers also have common personalized needs. One is the difference in review strictness: in more serious forums, customers essentially demand zero tolerance, while in chat scenarios customers may accept mildly offensive habitual phrases. Even with the same review categories, the required strictness can differ.
There are also differences in review categories. For the same pornography review, for instance, the focus of voice review and plain-text review may differ, and requirements involving the protection of minors shift the focus of content review again; these are characteristics of different customers. So how do we try to solve these technical difficulties with large models? I will introduce the following aspects.
2. Wenxin large model base
(1) Review model combining voice and text
First, let's talk about the model itself. Figure 5 shows a review model that combines voice and text, with three levels of combination from left to right. We generally model the review model as a classification problem; in some scenarios, to distinguish different degrees of severity, we also bring in regression modeling, but it is usually a classification problem. The left side of Figure 5 shows the simplest combination: the voice review model and the text review model each make their own predictions, and the results are then combined through rule strategies. This is the simplest approach, but its effect is not ideal. The model in the middle integrates the two at the feature level: the text and speech models each produce features, and cross-modal modeling is done on top of that feature layer. The advantage is that, for example, when judging whether a sentence is pornographic, the decision is not simply a weighted score between the voice model's result and the text model's result; instead, the final judgment can combine the gender and intonation identified by the voice model, special review features on the speech side, and detailed features from the text model such as the overall degree and characteristic wording of pornographic content. This is a good modeling method that we have arrived at so far. On the far right is an end-to-end multimodal approach, which fuses speech and text at the semantic-understanding embedding layer and models directly on top of the crossed layers. In the long run, this end-to-end approach is more general, more elegant, and extends better to video: image features can also be introduced at the embedding layer, which is how our large models for document understanding and video/image understanding currently work.
■Figure 5
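To make the middle architecture in Figure 5 concrete, here is a minimal PyTorch sketch of feature-level fusion. It assumes the speech and text encoders are already trained and expose fixed-size feature vectors; all module names, dimensions, and the classifier head are illustrative, not Baidu's actual implementation.

```python
import torch
import torch.nn as nn

class FusionReviewModel(nn.Module):
    """Feature-level fusion of speech and text features for review classification.

    Sketch only: both encoders are assumed to return fixed-size feature vectors
    (e.g. gender/intonation features from the speech side, semantic features
    from the text side), which are crossed in a shared fusion layer.
    """
    def __init__(self, speech_dim=128, text_dim=768, hidden_dim=256, num_labels=2):
        super().__init__()
        # Cross-modal layer: model interactions between speech and text features
        self.fusion = nn.Sequential(
            nn.Linear(speech_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
        )
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, speech_feat, text_feat):
        joint = torch.cat([speech_feat, text_feat], dim=-1)
        return self.classifier(self.fusion(joint))

# Usage: logits = FusionReviewModel()(speech_feat, text_feat)
```

Compared with the rule-based combination on the left of Figure 5, the fusion layer lets the classifier weigh cross-modal cues jointly rather than merging two independent verdicts after the fact.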
(2) Effect of the pre-trained model base
Figure 6 shows the effect of using the Wenxin pre-trained large model as the base. Using this large language-understanding model as a base improves results far more than simply adding training data. The purple line in Figure 6 represents the baseline model, roughly the model we used a year or two ago; the horizontal axis lists different review dimensions, and the vertical axis is the improvement over the baseline. Two methods are compared: the yellow line shows the effect of continuously adding training data, which does clearly improve the model; the orange line shows the trend after replacing the base with the pre-trained large model, where the effect improves across the board and goes far beyond what can be achieved by spending a long time accumulating training data.
■Figure 6
Besides directly replacing the semantic-understanding layer, which already brings a significant gain, we also use domain pre-training, adapting the pre-trained model to a domain, to build scenario-specific review models. As shown in Figure 7, the left side is a simple flow chart of domain pre-training: starting from the existing Wenxin pre-trained model, we add a large unlabeled domain corpus and continue pre-training, so the model base understands the semantics of the specific domain better; when that base is then used to train the upper-level review model, the resulting model adapts better to the scenario. On the far right is a comparison of results from our evaluation in a game scenario. The first four colors compare the performance of our general model with several competitors' models evaluated internally; across different review dimensions, each has its strengths. After domain pre-training for the game scenario, however, the model is clearly ahead in every dimension, whether compared with competitors or with our previous general model.
■Figure 7
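A minimal sketch of the two-stage recipe in Figure 7, written with the Hugging Face transformers API rather than Baidu's internal PaddlePaddle tooling; the checkpoint name, file paths, and hyperparameters are placeholders, not the actual setup.

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForSequenceClassification,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

base = "ernie-base"  # placeholder checkpoint name
tok = AutoTokenizer.from_pretrained(base)

# Stage 1: continue masked-language-model pre-training on unlabeled domain text
domain = load_dataset("text", data_files={"train": "game_chat_unlabeled.txt"})
domain = domain.map(lambda x: tok(x["text"], truncation=True, max_length=128),
                    batched=True, remove_columns=["text"])
mlm = AutoModelForMaskedLM.from_pretrained(base)
Trainer(
    model=mlm,
    args=TrainingArguments("domain_pretrain", num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=domain["train"],
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
).train()
mlm.save_pretrained("domain_pretrain")

# Stage 2: fine-tune the domain-adapted base on the labeled review data
clf = AutoModelForSequenceClassification.from_pretrained("domain_pretrain", num_labels=2)
# ...tokenize the labeled review set the same way and train clf with another Trainer
```

The key point is that Stage 1 needs no labels: the domain corpus only teaches the base the vocabulary and phrasing of the scenario, and the labeled review data is still used only in Stage 2.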
3. Large model distillation
As we all know, large models demand huge amounts of computation for both training and prediction. How can a large model be used in a scenario like review services, where performance requirements are high? Next, we introduce large-model distillation for the performance problem. Distillation, whether data distillation or model distillation, trades a certain loss in effect for a significant gain in performance. In our business, see Figure 8: the orange line in the left figure shows that after the large model is distilled and compressed into a smaller structure for serving predictions, its effect does drop slightly compared with the fully trained large model, but compared with the small model, or the small model with a large amount of added training data, the improvement is significant.
■Figure 8
The upper right corner is our evaluation of serving performance. I do not list specific numbers here, because the structure of the distilled small model is identical to the baseline model, so its prediction performance is the same as before the large model was introduced. The gap relative to the complete large model is on the order of tens of times.
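As a minimal sketch of the distillation objective, assuming the teacher (the fine-tuned large model) and the student (a small model with the baseline structure) output logits over the same review labels; the temperature and loss weighting are illustrative values, not the ones used in production.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label KL term from the teacher plus the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# In the training loop the teacher runs in eval mode with no gradients:
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits, batch_labels)
```

Because the student keeps the baseline structure, serving latency stays the same as before, which is why the performance gap with the full large model is so large.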
4. Sample Enhancement
Beyond using large models directly in the model layer, large-model technology enables some interesting operations at the data-sample layer. Here are two examples of sample enhancement. As mentioned earlier, because customer needs are so diverse, we cannot label a large amount of training data for every customer's requirements, so obtaining a large number of valid training samples at low cost is a critical issue. Figure 9 shows how we use the Wenxin model to enhance samples from both labeled training data and unlabeled data. The left side of Figure 9 starts from labeled training data and a Wenxin large model pre-trained for generation tasks, in this case ERNIE-Gen. With this model, a large number of samples similar to the training data can be generated; combined with simple filtering rules based on similarity or matching, a large amount of generated training data can be obtained at low cost.
■Figure 9
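A minimal sketch of the filtering step on the left of Figure 9: generated candidates are kept only if they are close to some labeled seed sample but not near-duplicates. Here TF-IDF character n-grams stand in for real semantic embeddings, and the thresholds are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def filter_generated(seed_samples, generated, low=0.6, high=0.95):
    """Keep generated samples close to a labeled seed sample, drop near-duplicates.

    TF-IDF over character n-grams is a simple stand-in for the semantic
    similarity a pre-trained encoder would provide.
    """
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3)).fit(seed_samples + generated)
    sims = cosine_similarity(vec.transform(generated), vec.transform(seed_samples))
    best = sims.max(axis=1)                      # closest seed for each candidate
    return [g for g, b in zip(generated, best) if low <= b <= high]
```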
The process on the right side of Figure 9 starts by collecting a large amount of unlabeled data from the online business, adds a small amount of labeled data, and clusters the unlabeled data using semantic clustering computed with the pre-trained large model. Combining the clustering result with the distribution of the labeled data shows which clusters share the same label with high probability, yielding a large batch of cluster-based training data from online traffic.
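A minimal sketch of that cluster-then-propagate step, assuming sentence embeddings from any pre-trained encoder are already computed; the cluster count and purity threshold are illustrative.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def propagate_labels(embeddings, labels, n_clusters=50, purity=0.9):
    """labels: class id for labeled samples, None for unlabeled.
    Returns pseudo-labels for unlabeled samples in label-consistent clusters."""
    assign = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    pseudo = {}
    for c in range(n_clusters):
        idx = np.where(assign == c)[0]
        seen = [labels[i] for i in idx if labels[i] is not None]
        if not seen:
            continue                              # no labeled anchor in this cluster
        top, count = Counter(seen).most_common(1)[0]
        if count / len(seen) >= purity:           # labeled points agree strongly
            for i in idx:
                if labels[i] is None:
                    pseudo[i] = top
    return pseudo
```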
Beyond generation and clustering, large models can also be used for more targeted sample-enhancement tasks. For example, review models are often required to generalize to "variants", the deliberately altered spellings commonly used in the industry to evade review. For this problem, we exploit the generalization ability of the large model and borrow from modeling techniques in text error correction: in addition to semantic modeling of individual characters, we also model pinyin and stroke information, so the large model can relate words with the same pronunciation or similar strokes and gains some ability to recognize variants. Of course, using this model directly for the review business does not give satisfactory results. Instead, on top of the variant-recognition large model, we mine suspicious variants and samples from external data through data enhancement, identify them offline with the variant-detection model, and, after verification, add the samples to the training set, so the review model continuously improves its recognition of variants.
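To illustrate the pinyin half of that idea only (stroke similarity is not shown), here is a small sketch that mines homophone variant candidates for offline verification. It uses the third-party pypinyin library; the scanning strategy and word lists are illustrative, not the production pipeline.

```python
from pypinyin import lazy_pinyin  # third-party library for Chinese pinyin

def same_pronunciation(a: str, b: str) -> bool:
    """True if two strings share the same pinyin sequence, i.e. a homophone
    'variant' of a sensitive word written with different characters."""
    return lazy_pinyin(a) == lazy_pinyin(b)

def mine_variant_candidates(texts, sensitive_words):
    """Scan unlabeled text for spans whose pinyin matches a sensitive word even
    though the surface characters differ; candidates are verified offline
    before being added to the training set."""
    candidates = set()
    for text in texts:
        for word in sensitive_words:
            n = len(word)
            for i in range(len(text) - n + 1):
                span = text[i:i + n]
                if span != word and same_pronunciation(span, word):
                    candidates.add((span, word))
    return candidates
```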
5. Extension for personalized needs
As mentioned in the discussion of technical difficulties, different customers have different data sources. Over time this has led to a large number of sub-models, each effective in a different scenario, accumulating alongside a few general review models. Making the system intelligently select the optimal combination of models for each customer is a hard problem for the system as a whole, and for this we tried an adaptive multi-model scheduling framework.
The first step is to semantically cluster customer data with the large model so that similar types of data are grouped together. The middle yellow layer in Figure 10 is an illustration: we cannot know exactly what each type of data looks like, but their length, text-feature distribution, and other properties have certain regularities, so data with similar requirements cluster together. Once the data is grouped, the optimal model is selected per cluster. This addresses customers' individual needs while preventing the whole system from expanding without bound, since we cannot keep adding effect-optimized sub-models for every customer. Through this semantic clustering, operators can also intervene at the model-selection layer, which lets them analyze and optimize the effect for specific customer needs in a more targeted way.
■Figure 10
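A minimal sketch of what such a scheduling layer could look like, assuming request embeddings from a pre-trained encoder and an external `evaluate` callback that scores a sub-model on a slice of data; the class name, cluster count, and routing logic are all illustrative, not Baidu's framework.

```python
import numpy as np
from sklearn.cluster import KMeans

class ModelRouter:
    """Route each request to the sub-model that scored best on the semantic
    cluster the request falls into (sketch under stated assumptions)."""
    def __init__(self, models, n_clusters=20):
        self.models = models                  # {name: model}
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10)
        self.best_model = {}                  # cluster id -> model name

    def fit(self, embeddings, texts, labels, evaluate):
        clusters = self.kmeans.fit_predict(embeddings)
        for c in set(clusters):
            idx = np.where(clusters == c)[0]
            scores = {name: evaluate(m, [texts[i] for i in idx],
                                     [labels[i] for i in idx])
                      for name, m in self.models.items()}
            self.best_model[c] = max(scores, key=scores.get)

    def route(self, embedding):
        c = int(self.kmeans.predict(embedding.reshape(1, -1))[0])
        return self.models[self.best_model[c]]
```

The per-cluster choice is also the point where operators can step in and override the selection, as described above.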
In addition, to meet different customers' needs regarding review strictness, the output of the review service supports adjustable thresholds. But we often hit a problem when training the model: the probability distribution of its predictions tends to cluster in a small interval. For example, as shown in Figure 11, 90% of the model's predictions may fall between probabilities 0.4 and 0.6, so when a customer sets a threshold of 0.8 or 0.9, they cannot obtain satisfactory precision and recall. We therefore tried different modeling approaches: for the severity of abusive language, for instance, we introduced pairwise modeling, and we tried finer-grained bin labels instead of simple 0/1 labels. This makes the model more sensitive to degrees of severity and, to a certain extent, stretches the distribution of prediction results over a wider range.
■Figure 11
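A minimal sketch of the pairwise idea, assuming a model that outputs a scalar severity score and training pairs where one text is labeled as more severe than the other; the margin and function names are illustrative.

```python
import torch
import torch.nn as nn

# For a pair where text_a is labeled as more severe than text_b, the model's
# score for text_a should exceed the score for text_b by at least a margin.
# Training on ordered pairs (or fine-grained severity bins) pushes scores
# apart instead of letting them cluster in a narrow band.
ranking_loss = nn.MarginRankingLoss(margin=0.2)

def pairwise_step(model, batch_more_severe, batch_less_severe):
    score_a = model(batch_more_severe).squeeze(-1)   # scalar severity score
    score_b = model(batch_less_severe).squeeze(-1)
    target = torch.ones_like(score_a)                # +1 means score_a should rank higher
    return ranking_loss(score_a, score_b, target)
```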
6. Using large models to optimize matching rules
On the far left of Figure 12 is a simple word-matching rule flow. Word matching is an essential link in traditional review systems; it has existed ever since Internet content review became a requirement more than a decade ago. Word matching is simple, quick to take effect, and precise, but it does not generalize. Moreover, with long-term maintenance it is easy for rules to conflict, and inconsistent historical rules and word-update standards are persistent maintenance problems.
■Figure 12
Figure 12 illustrates three ways to use large models to optimize a word-matching system. The upper part of the flowchart uses unlabeled data, and the lower part uses labeled data. There are two ways to optimize word-matching rules from unlabeled data. The first is to run the unlabeled data through the existing matching rules, which yields a large number of matches of uncertain correctness; the large model then performs a secondary review of these matches, producing high-confidence false matches, and these erroneous samples are used to clean the lexicon and matching rules in reverse. This is a direct and efficient approach. The second method is to take the uncertain samples obtained after matching and cluster them semantically; note that both matched samples and some unmatched samples, i.e. samples that did not hit any rule, should be included. After semantic clustering, the label distribution of each cluster can be analyzed: some clusters are highly consistent, such as 100% positive or 100% negative samples, while others contain conflicting labels. Since the semantics within a cluster are very similar, clusters that are half hits and half misses expose samples with contradictory labels, and these can likewise be used to clean the lexicon and matching rules. The third method uses a small amount of existing labeled data to generate, with large-model technology, a batch of samples with the same labels and similar text, which are then used to check the word-matching rules to similar effect. In practice we usually use the first two methods, because the latest online business data can then be used to validate the accuracy of historical lexicons and rule policies.
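A minimal sketch of the first method, assuming a hypothetical `rule_matcher` helper that returns the rule ids hit by a text and a `classifier` that returns the large model's probability of a real violation; the threshold is illustrative.

```python
from collections import defaultdict

def audit_matching_rules(samples, rule_matcher, classifier, threshold=0.95):
    """Flag rules whose hits the large model rejects with high confidence.

    rule_matcher(text) -> list of matched rule ids   (assumed helper)
    classifier(text)   -> probability the text really violates policy
    """
    suspect_hits = defaultdict(list)          # rule id -> example false matches
    for text in samples:
        rules = rule_matcher(text)
        if not rules:
            continue
        p_violation = classifier(text)
        if p_violation < 1 - threshold:       # model is confident this is a false match
            for r in rules:
                suspect_hits[r].append(text)
    # Rules with many confident false matches are candidates for cleanup
    return sorted(suspect_hits.items(), key=lambda kv: len(kv[1]), reverse=True)
```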
03 Industrialization of Baidu Voice and Text Review
Figure 13 shows the technical panorama the review business relies on, from the data layer to basic algorithms including lexical analysis, syntactic analysis, and semantic computation. The blue part details the functions supported by the review business, and the top layer shows the products the review technology supports. Beyond external customers, the review technology also supports important Baidu products such as the input method and Baijiahao. Externally, Baidu's content review is widely used in common content production and distribution scenarios such as live video streaming, community social networking, and online education. For service access, Baidu's content review supports both public-cloud access and private deployment.
■Figure 13
About RTC Dev Meetup
"RTC Dev Meetup" is a technology sharing and exchange activity initiated by Shengwang. It invites outstanding front-line technical experts in the industry to share practical experience around the key technologies involved in the development of real-time audio and video applications, involving mobile development, audio and video technology, computer vision , etc.
Click here at the end of the article to visit the Shengwang developer community for more information about the event.