Abstract: The deep neural network described in this article follows the architecture from the paper "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions".
This article is shared from the HUAWEI CLOUD Community post "What? You can't get the open-source speech synthesis code to run? Let me show you how to run Tacotron-2", by Baima Guopingchuan.
Tacotron-2:
TTS papers: https://github.com/lifefeel/SpeechSynthesis
A TensorFlow implementation of Tacotron-2, the architecture described in "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions".
GitHub repository: https://github.com/Rookie-Chenfy/Tacotron-2
Some other open-source Tacotron-2 implementations:
https://github.com/Rayhane-mamah/Tacotron-2
https://github.com/NVIDIA/tacotron2
This repository contains improvements and experiments that go beyond the paper. The paper_hparams.py file holds the exact hyperparameters needed to reproduce the paper's results without any extra features, while the hparams.py file used by default contains hyperparameters with extra tweaks that give better results in most cases. Modify the parameters freely according to your needs; the differences are highlighted in the file.
Repository Structure:
Step (0): Get the dataset. The examples here use LJSpeech, plus the en_US and en_UK voices (from M-AILABS).
Step (1): Preprocess your data. This creates the training_data folder.
Step (2): Train your Tacotron model. This generates the logs-Tacotron folder.
Step (3): Synthesize/evaluate with the Tacotron model. This produces the tacotron_output folder.
Step (4): Train your WaveNet model. This generates the logs-Wavenet folder.
Step (5): Synthesize audio with the WaveNet model. This produces the wavenet_output folder.
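Putting these steps together, after a complete run the cloned repository should roughly contain the following folders (a sketch derived from the steps above; exact contents depend on your settings):

Tacotron-2/
    training_data/      (step 1)
    logs-Tacotron/      (step 2)
    tacotron_output/    (step 3)
    logs-Wavenet/       (step 4)
    wavenet_output/     (step 5)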
Notes:
Steps 2, 3, and 4 can be completed in a single run that covers both Tacotron and WaveNet (the Tacotron-2 model, step (*)).
The original preprocessing only supports LJSpeech and LJSpeech-like datasets (such as the M-AILABS speech data)! If your dataset is stored differently, you will need to write your own preprocessing script.
If both models are trained at the same time (as Tacotron-2), the resulting folder and checkpoint structure will be different.
Some pre-trained models and demos:
You can view some primary insights into the model's performance (at early stages of training) here.
Model architecture:
Figure 1: Tacotron-2 model structure diagram
The model described by the authors can be divided into two parts:
Spectrogram prediction network
Wavenet Vocoder
To explore the model architecture, training process, and preprocessing logic in depth, please refer to the author's wiki.
How to start
Environment settings:
First, you need to install Python 3 along with TensorFlow.
Next, you need to install some Linux dependencies to ensure the audio libraries work properly:
apt-get install -y libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg libav-tools
Finally, you can install the packages listed in requirements.txt (depending on your setup, you may need to use pip3 instead of pip and python3 instead of python):
pip install -r requirements.txt
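As a quick sanity check after installation, you can try importing the main dependencies from Python. This is only a minimal sketch; it assumes that TensorFlow and librosa are among the packages pulled in by requirements.txt:

# Minimal post-install sanity check.
# Assumes TensorFlow and librosa are installed via requirements.txt.
import numpy as np
import librosa
import tensorflow as tf

print('numpy:', np.__version__)
print('librosa:', librosa.__version__)
print('TensorFlow:', tf.__version__)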
Docker:
Alternatively, you can build a Docker image so that everything is set up automatically, and run the project inside a Docker container.
The Dockerfile is inside the "docker" folder.
The Docker image can be built with:
docker build -t tacotron-2_image docker/
Then the container can be run with:
docker run -i --name new_container tacotron-2_image
Dataset:
The code above has been tested on the LJSpeech dataset, which contains nearly 24 hours of annotated recordings from a single female speaker. (More information about the dataset is provided in its README file when you download it.)
Tests are also being run on the new M-AILABS speech dataset, which contains more than 700 hours of speech (over 80 GB of data) in more than 10 languages.
After downloading a dataset, unzip the archive and place the resulting folder inside the cloned repository.
Hparams settings:
Before proceeding, you must choose the hyperparameters that best suit your needs. Although it is possible to change the hyperparameters from the command line during preprocessing/training, I still recommend making the changes directly in the hparams.py file once and for all.
To help choose the best FFT parameters, there is a griffin_lim_synthesis_tool notebook that you can use to invert the extracted mel/linear spectrograms and judge the quality of the preprocessing. All other options are well explained in hparams.py and have meaningful names, so feel free to experiment with them.
AWAIT DOCUMENTATION ON HPARAMS SHORTLY!!
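The griffin_lim_synthesis_tool notebook in the repository is the intended way to tune these values, but as a rough illustration of what it does, here is a minimal Griffin-Lim round trip using librosa (and soundfile for writing the result). The input file name and the parameter values (sample rate, FFT size, hop length, number of mel bins) are placeholders; use the values from your hparams.py:

# Minimal sketch of a mel analysis -> Griffin-Lim inversion round trip.
# All parameter values below are placeholders; take the real ones from hparams.py.
import librosa
import soundfile as sf

sr, n_fft, hop_length, n_mels = 22050, 2048, 275, 80

y, _ = librosa.load('some_utterance.wav', sr=sr)              # any wav from your dataset
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                     hop_length=hop_length, n_mels=n_mels)
y_inv = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length)
sf.write('griffin_lim_reconstruction.wav', y_inv, sr)          # listen and compare

If the reconstruction sounds acceptable with your chosen parameters, the same settings are a good starting point for hparams.py.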
Preprocessing:
Before running the following steps, make sure you are in the Tacotron-2 folder:
cd Tacotron-2
Then you can use the following command to start preprocessing:
python preprocess.py
You can use the --dataset argument to select a dataset. If you use the M-AILABS dataset, you also need to provide the language, voice, reader, merge_books, and book arguments to match your setup. The default dataset is LJSpeech.
Example M-AILABS:
python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=False --book='northandsouth'
Or, if you want to use all of a speaker's books:
python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=True
This should not take more than a few minutes.
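Conceptually, preprocessing walks through the dataset and turns each (audio, transcript) pair into spectrogram features plus a metadata line. The sketch below is not the repository's actual preprocess.py; it only illustrates the idea, and the folder layout, file names, and parameter values are assumptions:

# Illustrative sketch of per-utterance preprocessing (NOT the repo's preprocess.py).
# Folder layout, file names, and parameter values are assumptions.
import os
import numpy as np
import librosa

def preprocess_utterance(wav_path, text, out_dir, sr=22050,
                         n_fft=2048, hop_length=275, n_mels=80):
    """Load one wav, compute its mel spectrogram, save it as .npy,
    and return a metadata line tying the features to the transcript."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    name = os.path.splitext(os.path.basename(wav_path))[0]
    os.makedirs(os.path.join(out_dir, 'mels'), exist_ok=True)
    mel_path = os.path.join(out_dir, 'mels', 'mel-%s.npy' % name)
    np.save(mel_path, mel.T.astype(np.float32))   # stored as (frames, n_mels)
    return '%s|%d|%s' % (mel_path, mel.shape[1], text)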
Training:
To train both models sequentially, one after the other:
python train.py --model='Tacotron-2'
The feature prediction model (Tacotron) can also be trained and used on its own:
python train.py --model='Tacotron'
Checkpoints are saved every 5,000 steps and stored in the logs-Tacotron folder.
Naturally, WaveNet can also be trained on its own:
python train.py --model='WaveNet'
Logs will be stored inside the logs-Wavenet folder.
Notes:
If no model argument is provided, training defaults to the Tacotron-2 model (whose structure differs from the standalone Tacotron model).
For the available training arguments, refer to train.py; there are many options to choose from.
WaveNet preprocessing may need to be run separately, using the wavenet_preprocess.py script.
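If you want to confirm that checkpoints are actually being written while training runs, a quick way is to list them from Python. The glob pattern below assumes the default logs-Tacotron folder and standard TensorFlow checkpoint file names, both of which may differ in your setup:

# Quick look at which Tacotron checkpoints exist so far.
# Assumes the default logs-Tacotron folder and standard TF checkpoint naming.
import glob

for path in sorted(glob.glob('logs-Tacotron/**/*.ckpt*', recursive=True)):
    print(path)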
Synthesis:
To synthesize audio in an end-to-end (text-to-audio) manner (both models at work):
python synthesize.py --model='Tacotron-2'
For the spectrogram prediction network alone, there are three ways to synthesize mel spectrograms:
Evaluation (synthesis on custom sentences). This is what we usually use once we have a full end-to-end model.
python synthesize.py --model='Tacotron'
Natural synthesis (letting the model predict on its own by feeding the last decoder output into the next time step).
python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=False
Ground-truth-aligned (GTA) synthesis (default: the model is assisted by the true labels in a teacher-forcing manner). Use this synthesis method when predicting the mel spectrograms that will be used to train WaveNet (as stated in the paper, it yields better results).
python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=True
To synthesize waveforms from the previously synthesized mel spectrograms:
python synthesize.py --model='WaveNet'
Notes:
If no model argument is provided, synthesis defaults to the Tacotron-2 model (end-to-end TTS).
For the available synthesis arguments, refer to synthesize.py.
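Before (or instead of) running the WaveNet step, it can be useful to eyeball the mel spectrograms that Tacotron wrote to the tacotron_output folder. The sketch below assumes the predictions are stored as .npy arrays shaped (frames, mel channels); the exact subfolders and file names depend on the synthesis mode you used:

# Plot one predicted mel spectrogram from the tacotron_output folder.
# The glob pattern and the (frames, n_mels) array layout are assumptions.
import glob
import numpy as np
import matplotlib.pyplot as plt

mel_files = sorted(glob.glob('tacotron_output/**/*.npy', recursive=True))
mel = np.load(mel_files[0])
plt.imshow(mel.T, aspect='auto', origin='lower')   # mel channels on the y axis
plt.xlabel('frames')
plt.ylabel('mel channels')
plt.title(mel_files[0])
plt.show()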
References and source code:
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Original Tacotron paper
Attention-Based Models for Speech Recognition
WaveNet: A Generative Model for Raw Audio
Fast Wavenet
r9y9/wavenet_vocoder
keithito/tacotron
If you want to learn more practical AI techniques, you are welcome to visit the AI section of HUAWEI CLOUD, where six free hands-on camps, including AI programming with Python, are currently available. (Link to the six hands-on camps: http://su.modelarts.club/qQB9)
Click "Follow" to be the first to learn about HUAWEI CLOUD's latest technologies.