A Tour of Speech Recognition Technology: Stepping into the World of Speech Recognition

The ancients came before, and Xiao Wang comes after. Hello everyone, I'm Senior Xiao Wang, your upperclassman who loves to think. Today I'll take you on a tour of speech recognition, a technology that is currently booming. The article is easy to follow and full of practical content, so be sure to read to the end!


When you see the words "speech recognition", you may think of intelligent voice assistants: Apple's "Siri", Huawei's "Little E", OPPO's "Xiaoou", Xiaomi's "Xiaoai Classmate". You have probably used at least one of them. There are also the smart speakers developed more recently, such as "Xiaodu Xiaodu" and Tmall Genie, as well as WeChat's voice-to-text function, smart home appliances, and in-vehicle human-computer interaction systems. All of these rely on speech recognition technology.

[Figure: Application scenarios]

Most of the computers we use every day run Microsoft Windows, whose voice assistant Cortana (known in China as "Xiaona") is even more familiar to everyone. So what exactly is speech recognition technology?

1. What is speech recognition technology?

Speech recognition is a technology that converts the words a person speaks into text. It is also known as Automatic Speech Recognition (ASR). Simply put, it lets you talk to a machine so that the machine understands what you mean. More broadly, speech recognition is the collective name for all the technical means involved between a person speaking and the computer understanding what was said.

In technical terms, it is the technology that lets a machine transform a speech signal into the corresponding text or command through a process of recognition and understanding.

Some readers may ask how speech recognition differs from natural language processing (NLP). Speech recognition is a relatively basic branch of NLP: in many cases the machine first has to know what you said before it can go on to understand it and respond appropriately. Other branches include machine translation, search, summarization, question answering, and so on. In short, speech recognition is one part of natural language processing.

Okay, let's continue the tour. Now that we know the basic concept of speech recognition, let's take a brief look at its history.

2. The history of speech recognition

Ever since the computer was born (in the 1950s), speech recognition has been one of humanity's technological dreams. In early science fiction films, humans already gave instructions to computers by voice. In the American film "2001: A Space Odyssey", released in 1968, the spacecraft's onboard computer HAL 9000 talks with the crew by voice. In the American TV series "Star Trek", broadcast since 1966, the protagonist can obtain data on the planet he intends to explore simply by asking the computer aloud. Since the computer was invented, people have firmly believed that the era of controlling computers by voice would eventually arrive.

Research on speech recognition formally began in the 1960s. During this period, people tried to extract rules relating the spectrogram of speech to its phonemes. A prototype typewriter based on spectrogram analysis was exhibited at the World Exposition held in Osaka in 1970.

In the 1970s, people developed the dynamic programming (DP) matching method. This method stretches the features of the input speech and of a sample (template) utterance along the time axis so that the two can be matched. Based on this technique, the recognition speed for short sentences containing a small number of words was improved by a large margin.
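To make the idea of stretching two sequences along the time axis concrete, here is a minimal Python sketch of DP matching in its best-known form, dynamic time warping (DTW). It is only an illustration under simplifying assumptions: each frame is a single number rather than a feature vector, the local distance is the absolute difference, and the names `dtw_distance`, `template`, `slow` and `other` are made up for this example.

```python
def dtw_distance(a, b):
    """Return the dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local distance between frames
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] is matched to b[j-1] again (b stretched)
                cost[i][j - 1],      # b[j-1] is matched to a[i-1] again (a stretched)
                cost[i - 1][j - 1],  # both sequences advance by one frame
            )
    return cost[n][m]


template = [1, 2, 3, 4, 3, 2, 1]                       # stored sample "utterance"
slow = [1, 1, 2, 2, 3, 3, 4, 4, 3, 3, 2, 2, 1, 1]      # same shape, spoken twice as slowly
other = [4, 3, 2, 1, 2, 3, 4]                          # a different pattern

print(dtw_distance(template, slow))    # 0.0: the time axis is stretched to fit
print(dtw_distance(template, other))   # larger than zero
```

The slowed-down input matches the template at zero cost even though it is twice as long, which is exactly the property that made DP matching useful for short utterances.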

After the 1990s, statistical approaches to speech recognition became mainstream, and dictation software for ordinary computer users appeared on the market, converting input speech into text output.

3. The principle of speech recognition

Since the 1980s, speech recognition has used the basic framework of pattern recognition, which is divided into five steps: data preparation, signal processing, feature extraction, model training, and testing and application. To make this easier to understand, I have drawn a flow chart, shown in the figure:

[Figure: Speech recognition processing flow]

The diagram gives an overview of the general speech recognition pipeline, which breaks down as follows:

Step 1: Collecting the sound signal

First of all, we need to collect the speech signal, which is what we commonly call recording. The sound is captured and stored by the microphone and audio capture module in a mobile phone, a computer, or another electronic device.

Step 2: Processing the sound signal

As everyone knows, sound is actually a wave. Common formats such as mp3 and wma are compressed formats and must be converted into uncompressed pure waveform files before processing, for example Windows PCM files, commonly known as wav files. Apart from a file header, a wav file simply stores the sample points of the sound waveform. The following figure shows an example waveform:

[Figure: Sound waveform]
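To see that a wav file really is just a header plus waveform points, here is a small sketch using Python's standard `wave` module. It writes a short synthetic tone to a temporary file and reads it back. The sampling rate, tone frequency and file name are assumptions chosen for the demo, not values from any particular recording.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16000   # samples per second (assumed for this demo)
N_SAMPLES = 160       # 10 ms of audio at 16 kHz
FREQ = 440.0          # tone frequency in Hz

# Build a sine tone as 16-bit signed integers (the usual PCM sample format)
tone = [
    int(32767 * 0.5 * math.sin(2 * math.pi * FREQ * n / SAMPLE_RATE))
    for n in range(N_SAMPLES)
]

path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)            # mono
    w.setsampwidth(2)            # 2 bytes = 16 bits per sample
    w.setframerate(SAMPLE_RATE)
    w.writeframes(struct.pack("<%dh" % N_SAMPLES, *tone))

# Read it back: the header gives the format, the rest is the waveform points
with wave.open(path, "rb") as r:
    channels = r.getnchannels()
    rate = r.getframerate()
    n_frames = r.getnframes()
    raw = r.readframes(n_frames)

samples = list(struct.unpack("<%dh" % n_frames, raw))
print(channels, rate, n_frames)
print(samples[:5])
```

The list `samples` is the sequence of points that a waveform plot would draw, and it is what the later processing steps operate on.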

Signal processing is divided into two parts: noise reduction and pre-processing. The sound data we collect contains a great deal of noise and useless frequency bands. First, we use spectral subtraction or other noise reduction methods to denoise it, leaving the useful sound signal. A before-and-after comparison is shown below:

[Figure: Before denoising]

[Figure: After denoising]
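For readers curious how spectral subtraction works, the sketch below shows the basic idea with NumPy: estimate the average magnitude spectrum of the noise from a noise-only stretch, subtract it from the magnitude spectrum of each frame, and rebuild the signal with the original phase. This is a deliberately simplified version (non-overlapping frames, no smoothing), and the function name, frame length and synthetic test signal are all assumptions for illustration.

```python
import numpy as np


def spectral_subtraction(noisy, noise_sample, frame_len=256):
    """Denoise `noisy` by subtracting an average noise magnitude spectrum."""
    # Estimate the average noise magnitude spectrum from a noise-only segment
    n_noise = len(noise_sample) // frame_len
    noise_frames = noise_sample[:n_noise * frame_len].reshape(n_noise, frame_len)
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)

    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, frame_len):
        spec = np.fft.rfft(noisy[start:start + frame_len])
        # Subtract the noise magnitude, floor at zero, and keep the noisy phase
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        cleaned = mag * np.exp(1j * np.angle(spec))
        out[start:start + frame_len] = np.fft.irfft(cleaned, n=frame_len)
    return out


# Synthetic demo: a 500 Hz tone buried in white noise (8 kHz sampling assumed)
rng = np.random.default_rng(0)
t = np.arange(4096) / 8000.0
clean = np.sin(2 * np.pi * 500 * t)
noisy = clean + 0.3 * rng.standard_normal(len(t))
noise_only = 0.3 * rng.standard_normal(4096)   # stands in for a silent stretch

denoised = spectral_subtraction(noisy, noise_only)
err_before = np.mean((noisy - clean) ** 2)
err_after = np.mean((denoised - clean) ** 2)
print(err_before, err_after)   # the error after denoising should be smaller
```

Practical implementations usually add overlapping windowed frames and some smoothing to avoid audible artifacts, but the core subtraction step is the same.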

Next, pre-processing steps such as pre-emphasis make the characteristics of the speech signal we want to recognize more pronounced. Pre-processing also includes framing with windowing and endpoint detection. One purpose of this stage is to remove the DC offset and some low-frequency noise from the signal. For now, it is enough to understand that all of this makes it easier for the next step to extract feature parameters accurately; I will explain the related technical terms later.
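As a rough illustration of these three pre-processing terms, here is a NumPy sketch. Pre-emphasis is a simple first-order filter, framing cuts the signal into short overlapping pieces with a Hamming window applied to each, and endpoint detection here is a naive energy threshold. The parameter values (0.97, 25 ms frames, 10 ms hop, 10% energy threshold) are common choices that I am assuming, and the function names are hypothetical.

```python
import numpy as np


def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]: attenuates low frequencies (and DC offset)."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])


def frame_and_window(signal, frame_len=400, hop=160):
    """Split into overlapping frames and apply a Hamming window to each one."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)


def detect_endpoints(frames, ratio=0.1):
    """Return (first, last) indices of frames whose energy exceeds ratio * max."""
    energy = (frames ** 2).sum(axis=1)
    active = np.where(energy > ratio * energy.max())[0]
    return int(active[0]), int(active[-1])


# Synthetic demo: 1 s at 16 kHz, faint noise with a 1 kHz tone from 0.3 s to 0.7 s
sr = 16000
rng = np.random.default_rng(0)
signal = 0.001 * rng.standard_normal(sr)
signal[4800:11200] += np.sin(2 * np.pi * 1000 * np.arange(6400) / sr)

frames = frame_and_window(pre_emphasis(signal))
first, last = detect_endpoints(frames)
print(frames.shape)
print(first * 160 / sr, (last * 160 + 400) / sr)   # approximate start and end in seconds
```

The detected start and end times land close to the 0.3 s and 0.7 s boundaries of the tone, which is the part of the recording that would be passed on to feature extraction.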

Step 3: Feature extraction

Feature extraction is the method and process of using a computer to extract characteristic information from a sound signal. For example, suppose I say "I like you" (in Chinese pinyin, "wo xihuan ni"). During recognition, the utterance is represented in a coded form and segmented into syllables and phonemes. To recognize the syllable "wo", the system has to pick out "w" and "o" from the audio waveform, and that is, roughly speaking, feature extraction.
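To give a feel for what "characteristic information" can look like in code, the sketch below computes two very simple per-frame features with NumPy: log energy and zero-crossing rate. These are my own choice for illustration, since the article does not name a specific feature; real systems typically use richer features such as MFCCs or filterbank energies. The function name and the two synthetic sounds are hypothetical.

```python
import numpy as np


def frame_features(signal, frame_len=400, hop=160):
    """Return one [log-energy, zero-crossing rate] feature vector per frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        log_energy = np.log(np.sum(frame ** 2) + 1e-10)
        # Fraction of adjacent sample pairs whose signs differ
        zcr = np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:]))
        feats.append([log_energy, zcr])
    return np.array(feats)


sr = 16000
t = np.arange(1600) / sr                 # 100 ms of audio
low = np.sin(2 * np.pi * 200 * t)        # low-pitched, vowel-like sound
high = np.sin(2 * np.pi * 3000 * t)      # high-pitched, hiss-like sound

f_low = frame_features(low)
f_high = frame_features(high)
print(f_low.shape)                            # (frames, features)
print(f_low[:, 1].mean(), f_high[:, 1].mean())  # zero-crossing rate separates the two
```

Each frame of audio is reduced to a short vector of numbers, and it is these vectors, not the raw waveform, that the classifier in the next step works with.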

[Figure: Continuous speech recognition framework]

Step 4: Classification and recognition

Speech recognition systems can be classified according to the restrictions they place on the input speech.

Considering how the system relates to the speaker, recognition systems can be divided into three categories:

(1) Speaker-dependent systems: only the speech of one specific person needs to be recognized;

(2) Speaker-independent systems: the speech to be recognized does not depend on who is speaking, and the system is usually trained on a large database of speech from many different people;

(3) Multi-speaker systems: these can usually recognize the speech of a particular group of people, and are also called group-specific speech recognition systems; they only need to be trained on the voices of the people in that group.

Speech recognition techniques fall mainly into three categories:

The first category is template matching methods, including vector quantization (VQ), dynamic time warping (DTW), etc.;

The second category is probabilistic and statistical methods, including the Gaussian Mixture Model (GMM), the Hidden Markov Model (HMM), etc.;

The third category is discriminative classifier methods, such as the support vector machine (SVM), the artificial neural network (ANN), the deep neural network (DNN), etc., along with various combinations of these methods.

In terms of classification and recognition methods, there are traditional models such as the HMM, as well as machine learning and deep learning algorithms such as the SVM and DNN, which are hot topics right now. I hope to help everyone learn about them in an easy-to-understand way!
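As a taste of the statistical approach, here is a toy sketch of Viterbi decoding, the standard way to find the most likely hidden state sequence in an HMM. Everything in the model is made up for illustration: the two states stand for the phonemes "w" and "o", the observations "A" and "B" stand for quantized acoustic symbols, and the probabilities are arbitrary numbers rather than trained values.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence for `obs`."""
    # best[s] = (probability, path) of the best path that ends in state s
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # Pick the best predecessor state for s
            prob, path = max(
                (best[p][0] * trans_p[p][s], best[p][1]) for p in states
            )
            new_best[s] = (prob * emit_p[s][o], path + [s])
        best = new_best
    return max(best.values())


# Hypothetical two-phoneme model for the syllable "wo"
states = ["w", "o"]
start_p = {"w": 0.9, "o": 0.1}
trans_p = {"w": {"w": 0.6, "o": 0.4}, "o": {"w": 0.1, "o": 0.9}}
emit_p = {"w": {"A": 0.8, "B": 0.2}, "o": {"A": 0.3, "B": 0.7}}

prob, path = viterbi(["A", "A", "B", "B"], states, start_p, trans_p, emit_p)
print(path)   # ['w', 'w', 'o', 'o']
print(prob)
```

The decoder labels the first two frames as "w" and the last two as "o", which is how an HMM-based recognizer lines up a stream of acoustic frames with a sequence of phonemes.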

[Figure: Speech encoding and decoding]

Finally, to summarize: speech recognition is essentially a process of encoding followed by decoding, with signal processing and feature extraction forming the encoding stage. Put another way, it is pattern recognition based on speech feature parameters: through learning, the system classifies the input speech according to certain patterns and then finds the best-matching result according to a decision criterion.

4. The main online development platforms for speech recognition

1. iFLYTEK Voice

2. Baidu Voice

3. Microsoft Speech API

4. Google Speech API

5. IBM ViaVoice

6. Nuance NVP

7. Agora (Shengwang) API

5. Learning resources for speech recognition

Books

"Illustrated Speech Recognition", by Masahiro Araki, translated by Shuyang Chen and Wengang Yang.

This book is very beginner-friendly and covers the basics. Its illustrated format makes it easy to get started.

"Analyzing Deep Learning: Speech Recognition Practice", by Yu Dong and Deng Li.

This book is considered one of the better tutorials written in Chinese. The content is very up to date and goes deep into deep learning. It is recommended for readers who like algorithms.

"Spoken Language Processing: A Guide to Theory, Algorithm and System Development", by Huang Xuedong et al.

This book is basically a complete collection of traditional ASR methods and devotes considerable space to both theory and engineering practice.

Tutorials

Readers who have the capacity to go further can study the following tutorials:

http://tts.speech.cs.cmu.edu/courses/11492/schedule.html

Speech Processing. This CMU course covers three main areas: ASR (Automatic Speech Recognition), TTS (Text To Speech) and SDS (Spoken Dialog Systems).

http://www.cs.cmu.edu/~awb/

The homepage of Alan W Black, a Scottish computer scientist and speech processing expert. It hosts many speech and NLP tutorials.

http://www.inf.ed.ac.uk/teaching/courses/asr/index.html

Automatic Speech Recognition. This course has been running since at least 2012 and is updated every year.


RTE Developer Community