
1. Background introduction

Since the launch of the Youdao Translation Egg in October 2017, NetEase Youdao has released more than 20 smart learning hardware products, including the Youdao Translation King, Youdao Pocket Printer, Youdao Super Dictionary, Youdao Dictionary Pen, Youdao Tao Hearing Bao, and more.

Among them, the Youdao Dictionary Pen created the smart dictionary pen category; it has ranked first in sales on Tmall and JD.com for two consecutive years and has been widely praised by users.

Recently, the Youdao Dictionary Pen received a software upgrade (related reading: The new software upgrade is really useful), which brought two important optimizations:

  1. Pronunciation that is close to a real human voice, saying goodbye to mechanical-sounding speech.

  2. Words, including polyphonic characters and heteronyms, are read aloud correctly.

Application effect:

First of all, we further upgraded the pronunciation system so that Chinese and English pronunciation is as close as possible to a real human voice.

To provide users with a better experience, the Youdao AI team prepared a variety of recordings from real speakers and distributed questionnaires to a sufficiently large sample of company employees, real users, and native speakers. Respondents scored the candidates on pronunciation accuracy, timbre preference, and other aspects; the results were compared against professional narration, and the timbre in the current version was chosen on that basis.


Among the voices we selected, there is even a celebrity voice. Can you guess who she is?

Who is she? (01)

Who is she? (02)

(The answer is announced at the end of the article)

In language learning scenarios, mechanical pronunciation is not only boring, it also hurts the effectiveness of spoken language practice. The most natural and ideal interaction is to communicate through a human voice, so making the pronunciation of smart learning hardware sound like a real person is an important topic.

Chinese:

Mechanical pronunciation (Chinese audio sample)

Youdao's near-human pronunciation (Chinese audio sample)

English:

Mechanical pronunciation (English audio sample)

Youdao's near-human pronunciation (English audio sample)

At the same time, through the Youdao AI team's continuous training of the language model, the pronunciation accuracy of the Youdao Dictionary Pen has made another breakthrough. While scanning a sentence, the pen can quickly infer the semantics and correctly read English words whose pronunciation depends on context, such as heteronyms.

Take a sentence containing the past tense of "read" as an example, and compare the Youdao Dictionary Pen's pronunciation with traditional mechanical pronunciation:

She picked up the letter and read it.

In this sentence, the verb "read" is in the past tense and should be pronounced /red/.

Traditional scheme (audio sample)

Youdao: accurately read (audio sample)

Behind these capabilities is Youdao's TTS speech synthesis technology. This article introduces the thinking and practice behind Youdao TTS in detail.

2. Youdao TTS speech synthesis technology

The Youdao TTS modeling pipeline includes a text analysis module, an acoustic model module, and a vocoder module.
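To make the division of labor concrete, here is a minimal sketch of how the three modules could be chained together; the class and method names are illustrative placeholders, not Youdao's actual interfaces.

```python
class TTSPipeline:
    """Illustrative three-stage TTS pipeline: front end -> acoustic model -> vocoder."""

    def __init__(self, front_end, acoustic_model, vocoder):
        self.front_end = front_end            # text -> phonemes + prosody features
        self.acoustic_model = acoustic_model  # linguistic features -> mel-spectrogram
        self.vocoder = vocoder                # mel-spectrogram -> waveform

    def synthesize(self, text: str):
        linguistic_features = self.front_end(text)      # text analysis module
        mel = self.acoustic_model(linguistic_features)  # acoustic model module
        return self.vocoder(mel)                        # time-domain speech signal
```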

2.1 Unified TTS text analysis front end based on open-source BERT multi-task modeling

The main function of the text analysis front end is to convert sentences into linguistic features, chiefly the phoneme sequence and prosody features. The phoneme sequence determines whether the TTS reads the text correctly, while the prosody features determine pause positions, naturalness, and so on. This is also the key to Youdao TTS achieving near-human pronunciation and the correct reading of polyphonic words.

A traditional text analysis module models each task separately, and processing them serially is inefficient. This approach struggles to balance performance and quality in embedded scenarios, and keeping multiple tasks separate also increases the system's maintenance cost.

In contrast to the traditional solution, the Youdao AI team performed multi-task modeling on top of a pre-trained BERT model, unifying multiple tasks in one model and greatly improving efficiency.

These optimizations support TTS front-end tasks such as text normalization, polyphonic character disambiguation, and prosody prediction, allowing the Youdao system to synthesize high-quality speech on the device, with few pronunciation errors, natural rhythm, and rich emotion.
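As a rough illustration of this unified front end, the sketch below shows one way a shared BERT encoder can feed several lightweight task heads. The head names, label sizes, and the bert-base-chinese checkpoint are assumptions made for the example, not Youdao's actual configuration.

```python
import torch.nn as nn
from transformers import BertModel  # open-source BERT, as mentioned above

class MultiTaskFrontEnd(nn.Module):
    """Sketch of a shared-encoder, multi-head TTS text analysis front end."""

    def __init__(self, n_polyphone=600, n_prosody=4, n_seg=4, n_pos=30):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-chinese")
        hidden = self.encoder.config.hidden_size
        # One lightweight classification head per front-end task,
        # all sharing the same BERT token representations.
        self.polyphone_head = nn.Linear(hidden, n_polyphone)  # reading of polyphonic chars
        self.prosody_head = nn.Linear(hidden, n_prosody)      # prosodic break level
        self.seg_head = nn.Linear(hidden, n_seg)              # word segmentation (BMES)
        self.pos_head = nn.Linear(hidden, n_pos)              # part of speech

    def forward(self, input_ids, attention_mask):
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state  # (batch, seq_len, hidden)
        return {
            "polyphone": self.polyphone_head(hidden_states),
            "prosody": self.prosody_head(hidden_states),
            "segment": self.seg_head(hidden_states),
            "pos": self.pos_head(hidden_states),
        }
```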

The TTS front end in the dictionary pen scenario also faces several challenges:

  1. Meeting the requirement of close to 100% pronunciation accuracy. In both Chinese and English, the large number of polyphonic characters and heteronyms is the key to accuracy, and for the dictionary pen's education scenario, the correct pronunciation of ancient poems and classical Chinese must also be fully covered.
  2. Modeling prosody features well enough that the synthesized speech sounds natural and conveys the semantics clearly.
  3. The dictionary pen's device resources are limited; the system has to meet performance requirements while satisfying the two quality points above.

To address these problems, our work fell into three parts: resource collection, model experiments, and system integration.

Resource collection: In this stage, we drew on Youdao's unique teaching and research resources to collect and organize polyphonic word lists, and combined part of speech, word meaning, and other information to refine the polyphonic character labels, making the modeling more efficient. For the pronunciation of ancient Chinese poems and classical Chinese, the dictionary pen's large store of authoritative pronunciation dictionary resources is applied to TTS pronunciation through SSML technology (see the sketch below).
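For illustration, SSML's standard <phoneme> element is one way a dictionary-supplied reading can be attached to the text sent to the synthesizer; how Youdao's system actually encodes its dictionary readings is not described here.

```python
# A dictionary-supplied IPA reading injected via standard SSML markup
# (illustrative example only, reusing the "read" sentence from above).
ssml = (
    "<speak>"
    "She picked up the letter and "
    '<phoneme alphabet="ipa" ph="rɛd">read</phoneme> it.'
    "</speak>"
)
print(ssml)
```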

Model experiments: The front end covers tasks such as polyphonic character disambiguation, prosody prediction, word segmentation, and part-of-speech prediction. We built a BERT multi-task model to predict them jointly; the tasks reinforce one another, which improves the accuracy of the polyphonic character and prosody models while reducing the parameter count. Finally, through distillation (sketched below), a small multi-task model maintains the quality while meeting the embedded performance requirements.
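The distillation step can be pictured with a standard soft-label objective like the one below; the temperature and weighting values are generic defaults, not the settings used in the dictionary pen.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label knowledge distillation for one task head.

    student_logits / teacher_logits: (N, num_classes) token logits,
    labels: (N,) ground-truth class indices. Illustrative defaults only.
    """
    # Soft targets from the large multi-task teacher model.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```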

System integration: The engineering team used self-developed BERT pipeline technology to further optimize memory usage and inference time.

Through this work, a multi-task architecture based on the pre-trained model was established, ensuring correct pronunciation and prosodic pauses in TTS synthesis.

2.2 Non-autoregressive VAE acoustic model

The main function of the acoustic model is to convert linguistic features into the corresponding acoustic features. Common neural-network acoustic models fall roughly into two categories:

The first is the autoregressive acoustic model, e.g. Tacotron and Tacotron2. Its advantage is high naturalness; its disadvantage is poor performance. Attention-based autoregressive models also struggle to model long utterances and are more prone to dropped or repeated words.

The second is the non-autoregressive acoustic model, e.g. FastSpeech and FastSpeech2. Its advantage is that acoustic features are generated in parallel, so performance is good and the modeling of long sentences is robust; its disadvantage is that prosody modeling is slightly weaker than with autoregressive models.

Weighing quality against performance, the Youdao AI team ultimately chose a VAE-based non-autoregressive acoustic model, for the following reasons:

In terms of robustness: better than Tacotron2;

In terms of performance: as fast as FastSpeech, faster than Tacotron2;

In terms of quality: close to Tacotron2, and easier to train than FastSpeech.
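The VAE component can be pictured as a small latent module attached to a FastSpeech-style parallel backbone: a latent prosody vector is sampled with the reparameterization trick and folded back into the phoneme encoding. The dimensions and layer choices below are illustrative assumptions, not Youdao's architecture.

```python
import torch
import torch.nn as nn

class ProsodyVAE(nn.Module):
    """Sketch of the VAE piece of a non-autoregressive acoustic model."""

    def __init__(self, enc_dim=256, latent_dim=16):
        super().__init__()
        self.to_mu = nn.Linear(enc_dim, latent_dim)
        self.to_logvar = nn.Linear(enc_dim, latent_dim)
        self.from_latent = nn.Linear(latent_dim, enc_dim)

    def forward(self, reference_encoding):
        # reference_encoding: (batch, enc_dim), e.g. a pooled mel encoding.
        mu = self.to_mu(reference_encoding)
        logvar = self.to_logvar(reference_encoding)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL regularizer
        return self.from_latent(z), kl  # residual added to the phoneme encoding
```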

At the same time, we made engineering optimizations to the operators that account for the largest share of the total computation time, further improving the overall real-time factor of the system.
In addition, the model is quantized, which reduces its memory footprint.
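As a generic example of the kind of quantization involved, PyTorch's post-training dynamic quantization converts linear-layer weights to int8; the article does not say which quantization scheme or toolchain Youdao actually uses, so this is only representative.

```python
import torch
import torch.nn as nn

# Stand-in model; the real acoustic model is of course much larger.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 80))

# Post-training dynamic quantization: int8 weights for Linear layers.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```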

2.3 Vocoder based on GAN

The role of the vocoder is to convert the acoustic features output by the acoustic model into a time-domain speech signal. It directly affects the sound quality of the synthesized speech, so it matters a great deal to the user experience.
In the actual development of Youdao's smart hardware products, vocoder R&D faces several difficulties:

The first is sound quality. Insufficient modeling capacity in the vocoder directly causes noise or robotic artifacts in the synthesized speech, but simply increasing the model's parameters hurts inference speed.

The second is performance. The vocoder accounts for a large share of the computation in the whole speech synthesis pipeline, and synthesizing high-quality speech in an embedded scenario requires a vocoder large enough to have strong modeling capability.

However, because the device chip has weak computing power and little memory, a large vocoder significantly increases latency. From the user's point of view, waiting too long for audio obviously does not make for a good experience.

To solve these problems, after extensive experiments and comprehensive comparisons, the Youdao AI team finally chose a GAN-based vocoder.

Turning any academic approach into an industrial product requires a lot of experimentation and polishing.

The first step is to use different model configurations for different scenarios. The Youdao AI team tuned the parameters of the generator module in the GAN vocoder so that it can run in embedded scenarios. Unlike traditional parametric vocoders, the GAN-based neural vocoder can synthesize highly natural, high-definition audio, narrowing the quality gap between offline and online TTS.
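As a rough picture of what such a generator looks like, the sketch below upsamples an 80-band mel-spectrogram to a waveform with stacked transposed convolutions, in the spirit of MelGAN/HiFi-GAN generators. All channel counts and upsampling rates are illustrative assumptions, not the configuration shipped in the dictionary pen.

```python
import torch.nn as nn

class TinyGANVocoderGenerator(nn.Module):
    """Sketch of a GAN-vocoder generator: mel-spectrogram -> waveform."""

    def __init__(self, mel_channels=80, base_channels=256):
        super().__init__()
        layers = [nn.Conv1d(mel_channels, base_channels, kernel_size=7, padding=3)]
        channels = base_channels
        for rate in (8, 8, 4):  # total upsampling = 256x, i.e. a 256-sample hop
            layers += [
                nn.LeakyReLU(0.2),
                nn.ConvTranspose1d(channels, channels // 2,
                                   kernel_size=rate * 2, stride=rate, padding=rate // 2),
            ]
            channels //= 2
        layers += [nn.LeakyReLU(0.2),
                   nn.Conv1d(channels, 1, kernel_size=7, padding=3),
                   nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):       # mel: (batch, 80, frames)
        return self.net(mel)      # waveform: (batch, 1, frames * 256)
```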


In addition, we have done a great deal of work on quantizing and compressing the model, which greatly improves synthesis speed and significantly reduces the system's resource usage.

3. Summary

Speech synthesis plays a very important role in the human-computer interaction of smart hardware products, but it faces many challenges in deployment, chief among them the tension between limited hardware computing resources and the quality of the synthesized speech.

Providing high-quality speech synthesis faster and more stably under limited resources is the goal and focus of the Youdao AI team.

At present, Youdao TTS has been applied in many internal and external online and embedded scenarios, showing more stable and robust synthesis results than traditional solutions.

-- END --

Easter egg answer


