[AI full-stack SOTA review] You don’t know these, how dare you say that you know AI? [Speech recognition principle + actual combat]

【AI全栈SOTA综述】这些你都不知道，怎么敢说会AI？【语音识别原理+实战】

Chapter Directory

前言
语音识别原理
    信号处理，声学特征提取
    识别字符，组成文本
    声学模型
    语言模型
    词汇模型
语音声学特征提取：MFCC和LogFBank算法的原理
实战一 ASR语音识别模型
        系统的流程
        基于HTTP协议的API接口
        客户端
        未来
实战二 调百度和科大讯飞API
实战三 离线语音识别 Vosk

Preface

Principles of Speech Recognition

The first is voice tasks, such as voice recognition and voice wake-up. Hearing these, you will think of Chinese platforms such as iFlytek and Baidu. Because these two companies occupy 80% of China's voice market, they are doing very well. But due to high-precision technology, they cannot open source, other companies have to spend a lot of money to buy their API, but speech recognition and other applications are difficult to learn (I trained a speech recognition project, 10 graphics cards need to run for 20 days) , Which led to the slow development of folk speech recognition. Chen Jun collected a large number of SOTA principles and actual combat parts in the current field. Let us feast our eyes today!

Voice sampling

In the digitization process after voice input, first determine the start and end of the voice, and then perform noise reduction and filtering (in addition to human voice, there are many noises) to ensure that the computer can recognize the filtered voice information. For further processing, the audio signal frame needs to be processed. At the same time, from a microscopic point of view, people's voice signals are generally relatively stable for a period of time, which is the so-called short-term stationarity, so it is necessary to perform inter-frame processing on the voice signal for ease of processing.

Usually a frame takes 20-50ms, and there is overlap and redundancy between frames, which avoids the weakening of the signals at both ends of the frame and affects the recognition accuracy. Next is the key feature extraction. Since the recognition of the original waveform cannot achieve a good recognition effect, it is necessary to extract the characteristic parameters through frequency domain transformation. The commonly used transformation method is to extract MFCC features and transform each frame of waveform into an original waveform vector matrix according to the physiological characteristics of the human ear.

The frame-by-frame vector is not very intuitive. You can also use the spectrogram in the figure below to represent speech. Each column is a block of 25 milliseconds from left to right. Compared with the original sound wave, it is much easier to find the law from this kind of data.

However, the spectrogram is mainly used for speech research, and speech recognition also requires the use of feature vectors frame by frame.

Recognize characters and compose text

After the feature extraction is completed, feature recognition and character generation are performed. The work of this part is to find the current phoneme from each frame, then construct words from multiple phonemes, and then construct words from the words. Of course, the most difficult thing is to find the current phoneme from each frame, because each frame is less than one phoneme, and only multiple frames can form a phoneme. If it was wrong at first, it will be difficult to correct it later. How to determine which phoneme each frame belongs to? The simplest method is probability, which phoneme has the highest probability. What if the probability of multiple phonemes in each frame is the same? After all, it is possible. Everyone's accent, speaking speed and intonation are different, and it is difficult for people to understand whether you are talking about hello or Hall. However, there is only one text result of speech recognition, and it is impossible for people to participate in the choice of error correction. At this time, multiple phonemes constitute the statistical decision of the word, and the word constitutes the text

This allows us to get three possible transcriptions-"Hello", "Hula" and "Olo". Finally, according to the probability of the word, we will find that hello is the most likely, so we output the text of hello. The above example clearly describes how probability determines everything from frame to phoneme, and then from phoneme to word. How to obtain these probabilities? Can we count all the phonemes, words and sentences that humans have spoken for thousands of years in order to recognize a language and then calculate the probability? This is impossible. what should we do? Then we need the model:

Acoustic model

cv I believe that everyone must know what an acoustic model is~ According to the basic state and probability of the speech, try to obtain speech corpus of different people, age, gender, accent, and speaking speed, and try to collect various quiet, noisy, and distant speech corpora. To generate an acoustic model. In order to achieve better results, different languages and dialects will use different acoustic models to improve accuracy and reduce the amount of calculation.

Language model

Then a lot of text training is performed on the basic language model, the probability of words and sentences. If there are only two sentences "Today Monday" and "Tomorrow Tuesday" in the model, we can only recognize these two sentences. If we want to recognize more sentences, we only need to cover enough corpus, but the model will increase and the amount of calculation will increase. Therefore, the models in our actual applications are usually limited to application areas, such as smart home, navigation, smart speakers, personal assistants, medical care, etc., which can reduce the amount of calculation and improve the accuracy.

Vocabulary model

Finally, it is a more commonly used vocabulary model, a supplement to the language model, a language dictionary and notes of different pronunciations. For example, place names, personal names, song names, popular vocabulary, special vocabulary in certain fields, etc. will be updated regularly. At present, there are many simplified but effective calculation methods, such as HMM hidden Markov model. The hidden Markov model is mainly based on two assumptions: one is that the internal state transition is only related to the previous state, and the other is that the output value is only related to the current state (or the current state transition). The problem is simplified, that is, the probability of a word sequence in a sentence is only related to the previous word, so the calculation amount is greatly simplified.

Finally, the speech is recognized as text. Speech acoustic feature extraction: MFCC and logfbank algorithm principle

In almost all automatic speech recognition systems, the first step is to extract the characteristics of the speech signal. By extracting the relevant features of the speech signal, it is helpful to identify the relevant speech information and eliminate irrelevant information such as background noise and emotion.

1 MFCC

Mr. cv just talked about MFCC. This is very classic. The full name of MFCC is "Mel Frequency Cepstrum Coefficient". This speech feature extraction algorithm has been one of the commonly used algorithms for decades. This algorithm is obtained by linearly transforming the logarithmic energy spectrum of the nonlinear Mel in the sound frequency.

1.1 Framing

Since the original wav audio file stored in the computer hard disk is of variable length, we first need to cut it into several fixed-length pieces, namely frames. According to the characteristics of rapid changes in the voice signal, the duration of each frame is generally 10-30ms to ensure that there are enough cycles in a frame and the changes will not be too drastic. Therefore, this Fourier transform is more suitable for the analysis of stationary signals. Due to the different sampling rates of digital audio, the dimensionality of each frame vector is also different.

1.2 Pre-emphasis

Since the sound signal emitted by the human glottis has 12dB/octave attenuation, and the sound signal emitted by the lips has 6dB/octave attenuation, there are few components in the high-frequency signal after the fast Fourier transform. Therefore, the main purpose of the voice signal pre-emphasis operation is to enhance the high-frequency part of each frame of the voice signal, thereby improving the resolution of the high-frequency signal.

1.3 Windowing

In the previous framing process, a continuous speech signal is directly divided into several segments, which will cause spectrum leakage due to the truncation effect. The purpose of the windowing operation is to eliminate the problem of short-term signal discontinuity at both ends of each frame. In the MFCC algorithm, the window functions are usually Hamming window, rectangular window and Hanning window. It should be noted that pre-emphasis must be performed before opening the window.

1.4 Fast Fourier Transform

After the above series of processing, we still get the time domain signal, and the amount of voice information that can be directly obtained in the time domain is less. In the further feature extraction of the speech signal, it is necessary to convert the time domain signal of each frame into its frequency domain signal. For the speech signal stored in the computer, we need to use the discrete Fourier transform. Due to the high computational complexity of ordinary discrete Fourier transform, it is usually realized by fast Fourier transform. Since the MFCC algorithm is divided into frames, each frame is a short-time domain signal, so this step is also called short-time fast Fourier transform.

1.5 Calculating the amplitude spectrum (modulo complex numbers)

After the fast Fourier transform is completed, the speech feature is a complex matrix, which is an energy spectrum. Since the phase spectrum in the energy spectrum contains very little information, we generally choose to discard the phase spectrum and retain the amplitude spectrum.

There are generally two methods for discarding the phase spectrum and retaining the amplitude spectrum. They are to find the absolute value or square value of each complex number.

1.6 Mel filter

The process of Mel filtering is one of the keys of MFCC. The Mel filter is composed of 20 triangular band-pass filters, which convert linear frequencies into non-linearly distributed Mel frequencies.

2 logfBank

The log bank feature extraction algorithm is similar to the MFCC algorithm, which is based on the log bank feature extraction results. But the main difference between logfBank and MFCC algorithm is whether to perform discrete cosine transform.

With the advent of DNN and CNN, especially the development of deep learning, neural networks can better use the correlation between fBank and logfBank features to improve the accuracy of final speech recognition and reduce WER, so the discrete cosine transform can be omitted. step.

SOTA Principle + Actual Combat 1 Deep Fully Convolutional Neural Network Speech Recognition

In recent years, deep learning has emerged in the field of artificial intelligence, and it has also had a profound impact on speech recognition. Deep neural networks have gradually replaced the original HMM hidden Markov model. In human communication and knowledge dissemination, about 70% of information comes from voice. In the future, voice recognition will inevitably become an important part of smart life. It can provide the necessary foundation for voice assistance and voice input, which will become a new way of human-computer interaction. Therefore, we need to make machines understand human voices.

The acoustic model of the speech recognition system uses a deep fully convolutional neural network, directly taking the spectrogram as input. In the structure of the model, the best network configuration VGG in image recognition is used for reference. This network model has a strong expressive ability, can see a long history and future information, and is more robust than RNN. At the output end, the model can be perfectly combined with the CTC solution to achieve end-to-end training of the entire model, and directly transcribe the sound waveform signal into a Mandarin Pinyin sequence. In the language model, the pinyin sequence is converted into Chinese text through the maximum entropy hidden Markov model. And, in order to provide services to all users through the network. Feature extraction converts ordinary wav speech signals into two-dimensional spectrum image signals required by neural networks through framing and windowing operations

CTC decoding in the output of the acoustic analysis model of the speech recognition information system often contains a large number of continuous repeated symbols used by enterprises. Therefore, we need to merge consecutive identical matches into the same symbol, and then remove the mute separation mark Characters have been developed to finally solve the actual phonetic learning phonetic symbol sequence.

The language model uses a statistical language model to convert Pinyin into the final recognized text and output it. The essence of Pinyin to text is modeled as a hidden Markov chain, which has a high accuracy rate. Below is an in-depth analysis of the code, the package will series~

Import the Keras series.

import platform as plat
import os
import time

from general_function.file_wav import *
from general_function.file_dict import *
from general_function.gen_func import *
from general_function.muti_gpu import *

import keras as kr
import numpy as np
import random

from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Input, Reshape, BatchNormalization # , Flatten
from keras.layers import Lambda, TimeDistributed, Activation,Conv2D, MaxPooling2D,GRU #, Merge
from keras.layers.merge import add, concatenate
from keras import backend as K
from keras.optimizers import SGD, Adadelta, Adam

from readdata24 import DataSpeech

The default output size of the pinyin of the imported acoustic model is 1428, that is, 1427 pinyin + 1 blank block.

abspath = ''
ModelName='261'
NUM_GPU = 2

class ModelSpeech(): # 语音模型类
  def __init__(self, datapath):
    '''
    初始化
    默认输出的拼音的表示大小是1428，即1427个拼音+1个空白块
    '''
    MS_OUTPUT_SIZE = 1428
    self.MS_OUTPUT_SIZE = MS_OUTPUT_SIZE # 神经网络最终输出的每一个字符向量维度的大小
    #self.BATCH_SIZE = BATCH_SIZE # 一次训练的batch
    self.label_max_string_length = 64
    self.AUDIO_LENGTH = 1600
    self.AUDIO_FEATURE_LENGTH = 200
    self._model, self.base_model = self.CreateModel()

Conversion path

  self.datapath = datapath
  self.slash = ''
  system_type = plat.system() # 由于不同的系统的文件路径表示不一样，需要进行判断
  if(system_type == 'Windows'):
    self.slash='\\' # 反斜杠
  elif(system_type == 'Linux'):
    self.slash='/' # 正斜杠
  else:
    print('*[Message] Unknown System\n')
    self.slash='/' # 正斜杠
  if(self.slash != self.datapath[-1]): # 在目录路径末尾增加斜杠
    self.datapath = self.datapath + self.slash

Define CNN/LSTM/CTC model, use functional model, design input layer, hidden layer and output layer.

def CreateModel(self):
  '''
  定义CNN/LSTM/CTC模型，使用函数式模型
  输入层：200维的特征值序列，一条语音数据的最大长度设为1600（大约16s）
  隐藏层：卷积池化层，卷积核大小为3x3，池化窗口大小为2
  隐藏层：全连接层
  输出层：全连接层，神经元数量为self.MS_OUTPUT_SIZE，使用softmax作为激活函数，
  CTC层：使用CTC的loss作为损失函数，实现连接性时序多输出
  
  '''
  
  input_data = Input(name='the_input', shape=(self.AUDIO_LENGTH, self.AUDIO_FEATURE_LENGTH, 1))
  
  layer_h1 = Conv2D(32, (3,3), use_bias=False, activation='relu', padding='same', kernel_initializer='he_normal')(input_data) # 卷积层
  #layer_h1 = Dropout(0.05)(layer_h1)
  layer_h2 = Conv2D(32, (3,3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h1) # 卷积层
  layer_h3 = MaxPooling2D(pool_size=2, strides=None, padding="valid")(layer_h2) # 池化层
  
  #layer_h3 = Dropout(0.05)(layer_h3) # 随机中断部分神经网络连接，防止过拟合
  layer_h4 = Conv2D(64, (3,3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h3) # 卷积层
  #layer_h4 = Dropout(0.1)(layer_h4)
  layer_h5 = Conv2D(64, (3,3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h4) # 卷积层
  layer_h6 = MaxPooling2D(pool_size=2, strides=None, padding="valid")(layer_h5) # 池化层
  
  #layer_h6 = Dropout(0.1)(layer_h6)
  layer_h7 = Conv2D(128, (3,3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h6) # 卷积层
  #layer_h7 = Dropout(0.15)(layer_h7)
  layer_h8 = Conv2D(128, (3,3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h7) # 卷积层
  layer_h9 = MaxPooling2D(pool_size=2, strides=None, padding="valid")(layer_h8) # 池化层
  
  #layer_h9 = Dropout(0.15)(layer_h9)
  layer_h10 = Conv2D(128, (3,3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h9) # 卷积层
  #layer_h10 = Dropout(0.2)(layer_h10)
  layer_h11 = Conv2D(128, (3,3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h10) # 卷积层
  layer_h12 = MaxPooling2D(pool_size=1, strides=None, padding="valid")(layer_h11) # 池化层
  
  #layer_h12 = Dropout(0.2)(layer_h12)
  layer_h13 = Conv2D(128, (3,3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h12) # 卷积层
  #layer_h13 = Dropout(0.3)(layer_h13)
  layer_h14 = Conv2D(128, (3,3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h13) # 卷积层
  layer_h15 = MaxPooling2D(pool_size=1, strides=None, padding="valid")(layer_h14) # 池化层
  
  #test=Model(inputs = input_data, outputs = layer_h12)
  #test.summary()
  
  layer_h16 = Reshape((200, 3200))(layer_h15) #Reshape层
  
  #layer_h16 = Dropout(0.3)(layer_h16) # 随机中断部分神经网络连接，防止过拟合
  layer_h17 = Dense(128, activation="relu", use_bias=True, kernel_initializer='he_normal')(layer_h16) # 全连接层
  
  inner = layer_h17
  #layer_h5 = LSTM(256, activation='relu', use_bias=True, return_sequences=True)(layer_h4) # LSTM层
  
  rnn_size=128
  gru_1 = GRU(rnn_size, return_sequences=True, kernel_initializer='he_normal', name='gru1')(inner)
  gru_1b = GRU(rnn_size, return_sequences=True, go_backwards=True, kernel_initializer='he_normal', name='gru1_b')(inner)
  gru1_merged = add([gru_1, gru_1b])
  gru_2 = GRU(rnn_size, return_sequences=True, kernel_initializer='he_normal', name='gru2')(gru1_merged)
  gru_2b = GRU(rnn_size, return_sequences=True, go_backwards=True, kernel_initializer='he_normal', name='gru2_b')(gru1_merged)
  
  gru2 = concatenate([gru_2, gru_2b])
  
  layer_h20 = gru2
  #layer_h20 = Dropout(0.4)(gru2)
  layer_h21 = Dense(128, activation="relu", use_bias=True, kernel_initializer='he_normal')(layer_h20) # 全连接层
  
  #layer_h17 = Dropout(0.3)(layer_h17)
  layer_h22 = Dense(self.MS_OUTPUT_SIZE, use_bias=True, kernel_initializer='he_normal')(layer_h21) # 全连接层
  
  y_pred = Activation('softmax', name='Activation0')(layer_h22)
  model_data = Model(inputs = input_data, outputs = y_pred)
  #model_data.summary()
  
  labels = Input(name='the_labels', shape=[self.label_max_string_length], dtype='float32')
  input_length = Input(name='input_length', shape=[1], dtype='in
                       
  label_length = Input(name='label_length', shape=[1], dtype='int64')
  # Keras doesn't currently support loss funcs with extra parameters
  # so CTC loss is implemented in a lambda layer
  
  #layer_out = Lambda(ctc_lambda_func,output_shape=(self.MS_OUTPUT_SIZE, ), name='ctc')([y_pred, labels, input_length, label_length])#(layer_h6) # CTC
  loss_out = Lambda(self.ctc_lambda_func, output_shape=(1,), name='ctc')([y_pred, labels, input_length, label_length])

Model loading method

  model = Model(inputs=[input_data, labels, input_length, label_length], outputs=loss_out)
  
  model.summary()
  
  # clipnorm seems to speeds up convergence
  #sgd = SGD(lr=0.0001, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=5)
  #ada_d = Adadelta(lr = 0.01, rho = 0.95, epsilon = 1e-06)
  opt = Adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999, decay = 0.0, epsilon = 10e-8)
  #model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer=sgd)
  
  model.build((self.AUDIO_LENGTH, self.AUDIO_FEATURE_LENGTH, 1))
  model = ParallelModel(model, NUM_GPU)
  
  model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer = opt)

Define ctc decoding

  # captures output of softmax so we can decode the output during visualization
  test_func = K.function([input_data], [y_pred])
  
  #print('[*提示] 创建模型成功，模型编译成功')
  print('[*Info] Create Model Successful, Compiles Model Successful. ')
  return model, model_data
  
def ctc_lambda_func(self, args):
  y_pred, labels, input_length, label_length = args
  
  y_pred = y_pred[:, :, :]
  #y_pred = y_pred[:, 2:, :]
  return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

Define training model and training parameters

def TrainModel(self, datapath, epoch = 2, save_step = 1000, batch_size = 32, filename = abspath + 'model_speech/m' + ModelName + '/speech_model'+ModelName):
  '''
  训练模型
  参数：
    datapath: 数据保存的路径
    epoch: 迭代轮数
    save_step: 每多少步保存一次模型
    filename: 默认保存文件名，不含文件后缀名
  '''
  data=DataSpeech(datapath, 'train')
  
  num_data = data.GetDataNum() # 获取数据的数量
  
  yielddatas = data.data_genetator(batch_size, self.AUDIO_LENGTH)
  
  for epoch in range(epoch): # 迭代轮数
    print('[running] train epoch %d .' % epoch)
    n_step = 0 # 迭代数据数
    while True:
      try:
        print('[message] epoch %d . Have train datas %d+'%(epoch, n_step*save_step))
        # data_genetator是一个生成器函数
        
        #self._model.fit_generator(yielddatas, save_step, nb_worker=2)
        self._model.fit_generator(yielddatas, save_step)
        n_step += 1
      except StopIteration:
        print('[error] generator error. please check data format.')
        break
      
      self.SaveModel(comment='_e_'+str(epoch)+'_step_'+str(n_step * save_step))
      self.TestModel(self.datapath, str_dataset='train', data_count = 4)
      self.TestModel(self.datapath, str_dataset='dev', data_count = 4)
      
def LoadModel(self,filename = abspath + 'model_speech/m'+ModelName+'/speech_model'+ModelName+'.model'):
  '''
  加载模型参数
  '''
  self._model.load_weights(filename)
  self.base_model.load_weights(filename + '.base')

def SaveModel(self,filename = abspath + 'model_speech/m'+ModelName+'/speech_model'+ModelName,comment=''):
  '''
  保存模型参数
  '''
  self._model.save_weights(filename+comment+'.model')
  self.base_model.save_weights(filename + comment + '.model.base')
  f = open('step'+ModelName+'.txt','w')
  f.write(filename+comment)
  f.close()

def TestModel(self, datapath='', str_dataset='dev', data_count = 32, out_report = False, show_ratio = True):
  '''
  测试检验模型效果
  '''
  data=DataSpeech(self.datapath, str_dataset)
  #data.LoadDataList(str_dataset) 
  num_data = data.GetDataNum() # 获取数据的数量
  if(data_count <= 0 or data_count > num_data): # 当data_count为小于等于0或者大于测试数据量的值时，则使用全部数据来测试
    data_count = num_data
  
  try:
    ran_num = random.randint(0,num_data - 1) # 获取一个随机数
    
    words_num = 0
    word_error_num = 0
    
    nowtime = time.strftime('%Y%m%d_%H%M%S',time.localtime(time.time()))
    if(out_report == True):
      txt_obj = open('Test_Report_' + str_dataset + '_' + nowtime + '.txt', 'w', encoding='UTF-8') # 打开文件并读入
    
    txt = ''
    for i in range(data_count):
      data_input, data_labels = data.GetData((ran_num + i) % num_data)  # 从随机数开始连续向后取一定数量数据
      
      # 数据格式出错处理 开始
      # 当输入的wav文件长度过长时自动跳过该文件，转而使用下一个wav文件来运行
      num_bias = 0
      while(data_input.shape[0] > self.AUDIO_LENGTH):
        print('*[Error]','wave data lenghth of num',(ran_num + i) % num_data, 'is too long.','\n A Exception raise when test Speech Model.')
        num_bias += 1
        data_input, data_labels = data.GetData((ran_num + i + num_bias) % num_data)  # 从随机数开始连续向后取一定数量数据
      # 数据格式出错处理 结束
      
      pre = self.Predict(data_input, data_input.shape[0] // 8)
      
      words_n = data_labels.shape[0] # 获取每个句子的字数
      words_num += words_n # 把句子的总字数加上
      edit_distance = GetEditDistance(data_labels, pre) # 获取编辑距离
      if(edit_distance <= words_n): # 当编辑距离小于等于句子字数时
        word_error_num += edit_distance # 使用编辑距离作为错误字数
      else: # 否则肯定是增加了一堆乱七八糟的奇奇怪怪的字
        word_error_num += words_n # 就直接加句子本来的总字数就好了
      
      if(i % 10 == 0 and show_ratio == True):
        print('Test Count: ',i,'/',data_count)
      
      txt = ''
      if(out_report == True):
        txt += str(i) + '\n'
        txt += 'True:\t' + str(data_labels) + '\n'
        txt += 'Pred:\t' + str(pre) + '\n'
        txt += '\n'
        txt_obj.write(txt)

Define the prediction function and return the prediction result.

    #print('*[测试结果] 语音识别 ' + str_dataset + ' 集语音单字错误率：', word_error_num / words_num * 100, '%')
    print('*[Test Result] Speech Recognition ' + str_dataset + ' set word error ratio: ', word_error_num / words_num * 100, '%')
    if(out_report == True):
      txt = '*[测试结果] 语音识别 ' + str_dataset + ' 集语音单字错误率： ' + str(word_error_num / words_num * 100) + ' %'
      txt_obj.write(txt)
      txt_obj.close()
    
  except StopIteration:
    print('[Error] Model Test Error. please check data format.')

def Predict(self, data_input, input_len):
  '''
  预测结果
  返回语音识别后的拼音符号列表
  '''
  
  batch_size = 1 
  in_len = np.zeros((batch_size),dtype = np.int32)
  
  in_len[0] = input_len
  
  x_in = np.zeros((batch_size, 1600, self.AUDIO_FEATURE_LENGTH, 1), dtype=np.float)
  
  for i in range(batch_size):
    x_in[i,0:len(data_input)] = data_input

  base_pred = self.base_model.predict(x = x_in)
  
  #print('base_pred:\n', base_pred)
  
  #y_p = base_pred
  #for j in range(200):
  #  mean = np.sum(y_p[0][j]) / y_p[0][j].shape[0]
  #  print('max y_p:',np.max(y_p[0][j]),'min y_p:',np.min(y_p[0][j]),'mean y_p:',mean,'mid y_p:',y_p[0][j][100])
  #  print('argmin:',np.argmin(y_p[0][j]),'argmax:',np.argmax(y_p[0][j]))
  #  count=0
  #  for i in range(y_p[0][j].shape[0]):
  #    if(y_p[0][j][i] < mean):
  #      count += 1
  #  print('count:',count)
  
  base_pred =base_pred[:, :, :]
  #base_pred =base_pred[:, 2:, :]
  
  r = K.ctc_decode(base_pred, in_len, greedy = True, beam_width=100, top_paths=1)
  
  #print('r', r)

  r1 = K.get_value(r[0][0])
  #print('r1', r1)
  #r2 = K.get_value(r[1])
  #print(r2)
  
  r1=r1[0]
  
  return r1
  pass

def RecognizeSpeech(self, wavsignal, fs):
  '''
  最终做语音识别用的函数，识别一个wav序列的语音
  '''
  
  #data = self.data
  #data = DataSpeech('E:\\语音数据集')
  #data.LoadDataList('dev')
  # 获取输入特征
  #data_input = GetMfccFeature(wavsignal, fs)
  #t0=time.time()
  data_input = GetFrequencyFeature3(wavsignal, fs)
  #t1=time.time()
  #print('time cost:',t1-t0)
  
  input_length = len(data_input)
  input_length = input_length // 8
  
  data_input = np.array(data_input, dtype = np.float)
  #print(data_input,data_input.shape)
  data_input = data_input.reshape(data_input.shape[0],data_input.shape[1],1)
  #t2=time.time()
  r1 = self.Predict(data_input, input_length)
  #t3=time.time()
  #print('time cost:',t3-t2)
  list_symbol_dic = GetSymbolList(self.datapath) # 获取拼音列表

Finally, it is used as a function for speech recognition to recognize the speech of a wav sequence

  r_str=[]
  for i in r1:
    r_str.append(list_symbol_dic[i])
  
  return r_str
  pass
  
def RecognizeSpeech_FromFile(self, filename):
  '''
  最终做语音识别用的函数，识别指定文件名的语音
  '''
  
  wavsignal,fs = read_wav_data(filename)
  
  r = self.RecognizeSpeech(wavsignal, fs)
  
  return r
  
  pass

@property
def model(self):
  '''
  返回keras model
  '''
  return self._model
if(__name__=='__main__'):

main function, start

datapath =  abspath + ''
modelpath =  abspath + 'model_speech'
if(not os.path.exists(modelpath)): # 判断保存模型的目录是否存在
  os.makedirs(modelpath) # 如果不存在，就新建一个，避免之后保存模型的时候炸掉

system_type = plat.system() # 由于不同的系统的文件路径表示不一样，需要进行判断
if(system_type == 'Windows'):
  datapath = 'E:\\语音数据集'
  modelpath = modelpath + '\\'
elif(system_type == 'Linux'):
  datapath =  abspath + 'dataset'
  modelpath = modelpath + '/'
else:
  print('*[Message] Unknown System\n')
  datapath = 'dataset'
  modelpath = modelpath + '/'

ms = ModelSpeech(datapath)

Principle + Actual Combat Two Baidu and HKUST IFLYTEK Voice Recognition

The end-to-end deep cooperative learning research method can be used to identify English or Mandarin Chinese, which are two completely different languages. Because each component of the entire process is manually designed using neural networks, end-to-end learning allows us to deal with a variety of sounds, including noisy environments, stress, and different languages. The key to our method is to improve the HPC technology we can apply. The experiment that used to be developed that took several weeks to complete can now be implemented within a few days. This allows students to iterate faster on our own to identify superior architectures and algorithms. Finally, the technology called Batch Dispatch with GPU is used in the data information center. Our research shows that our system analysis can be configured online at the same time through the network, deployed at low cost, and provide a service for the management of a large number of users with low latency.

End-to-end speech recognition is an active research area, and it has been used to re-evaluate the output of DNN-HMM with convincing results. The Rnn codec uses the encoder rnn to map the input to a fixed-length vector, and the decoder network maps the fixed-length vector to the output prediction sequence. The RNN encoder with its own attention-the decoder performs well in predictive phoneme teaching. Combining the ctc loss function and rnn to simulate the time information, and achieved good results in the end-to-end speech recognition of character output. The Ctc-rnn model can also predict phonemes well, although a dictionary is still needed in this case.

Data technology is also the key to the success of the end-to-end speech recognition system. Hannun et al. used more than 7000 hours of Chinese markup language speech. Data enhancement is very effective for improving the performance of deep learning such as computer vision and speech recognition. Existing voice systems can also be used to guide new data collection. We have gained inspiration from the previous methods to guide larger data sets and data increase in order to increase the amount of effective labeling data in the Baidu system.

The present demonstration is to identify the content of audio files. See the official website for token acquisition. There is nothing gold in the package here. Python technology article-Baidu performs voice API authentication and authentication information to obtain Access Token. Note: The following tokens can be applied for by myself. The article yourself applies for exclusive.

import requests
import os
import base64
import json

apiUrl='http://vop.baidu.com/server_api'
filename = "16k.pcm"   # 这是我下载到本地的音频样例文件名
size = os.path.getsize(filename)   # 获取本地语音文件尺寸
file1 = open(filename, "rb").read()   # 读取本地语音文件   
text = base64.b64encode(file1).decode("utf-8")   # 对读取的文件进行base64编码
data = {
    "format":"pcm",   # 音频格式
    "rate":16000,   # 采样率，固定值16000
    "dev_pid":1536,   # 普通话
    "channel":1,   # 频道，固定值1
    "token":"24.0c828682d414bf79b08f89c4c7dcd83a.2592000.1562739150.282335-16470175",   # 重要，鉴权认证Access Token，需要自己来申请
    "cuid":"DC-85-DE-F9-08-59",   # 随便一个值就好了，官网推荐是个人电脑的MAC地址
    "len":size,   # 语音文件的尺寸
    "speech":text,   # base64编码的语音文件
}
try:
    r = requests.post(apiUrl, data = json.dumps(data)).json()
    print(r)
    print(r.get("result")[0])
except Exception as e:
    print(e)

For the same way of iFlytek, please refer to the official website tutorial.

Actual combat three offline voice recognition Vosk

Due to space issues, Mr. cv introduced a lot of principles and Sota algorithm today, so I will share one last one now. You need to know more. Welcome to continue to pay attention to this series.

Vosk supports more than 30 languages, and it is doing well now, it is good in offline voice, https://github.com/alphacep/vosk-api

The pc version with Android python, c++, etc. For the web deployment plan Android, you need to install the Android package, and then download the compilation tool, gradle, and compile it through Gradle.

It can be compiled. After the compilation is successful, the apk installation package will be generated, and the mobile phone can be installed and used offline.

  /**
     * Adds listener.
     */
    public void addListener(RecognitionListener listener) {
        synchronized (listeners) {
            listeners.add(listener);
        }
    }
/**
 * Removes listener.
 */
public void removeListener(RecognitionListener listener) {
    synchronized (listeners) {
        listeners.remove(listener);
    }
}

/**
 * Starts recognition. Does nothing if recognition is active.
 * 
 * @return true if recognition was actually started
 */
public boolean startListening() {
    if (null != recognizerThread)
        return false;

    recognizerThread = new RecognizerThread();
    recognizerThread.start();
    return true;
}

The actual combat here is relatively simple. I have done a lot of optimizations in the follow-up to support the deployment of Android, python, c++, java languages, etc. Welcome to consult cv.

Intelligent voice interaction diagram

Summarize

I have said a lot today, and welcome everyone to watch the official look. This article introduces more Tricks. Interactivity and fun are in the actual combat part later~ and it is the voice algorithm. Today, I can’t show you a lot of interesting things.

In the following article, I can give you a more advanced voice from the shallower to the deeper in other areas:

One: wake word algorithms and models such as Siri, Xiao Ai, and SOTA;

speaker distinction (identification of thoughts)

3: Multilingual thinking + less language + difficult language thinking and SOTA

IV: SOTA plan for each voice competition

[AI full-stack SOTA review] You don’t know these, how dare you say that you know AI? [Speech recognition principle + actual combat]

Chapter Directory

Preface

Principles of Speech Recognition

Voice sampling

Recognize characters and compose text

Acoustic model

Language model

Vocabulary model

1 MFCC

1.1 Framing

1.2 Pre-emphasis

1.3 Windowing

1.4 Fast Fourier Transform

1.5 Calculating the amplitude spectrum (modulo complex numbers)

1.6 Mel filter

2 logfBank

SOTA Principle + Actual Combat 1 Deep Fully Convolutional Neural Network Speech Recognition

Principle + Actual Combat Two Baidu and HKUST IFLYTEK Voice Recognition

Actual combat three offline voice recognition Vosk

Summarize

RTE开发者社区

引用和评论

Hume 推出 Octave TTS 即时模式，250 毫秒响应；客服语音智能体 Sona：简单集成、高度自定义丨日报

【AI日志分析】基于机器学习的异常检测：告别传统规则的智能进阶

🔥全程不用写代码，我用 AI 程序员写了一个飞机大战

从 DeepSeek 看25年前端的一个小趋势

DeepSeek(私有化)+IDEA+Dify+微信搭建AI助手保姆级教程

Open WebUI：开源AI交互平台的全面解析

大模型中的Token究竟是什么？从原理到作用深度解析

[AI full-stack SOTA review] You don’t know these, how dare you say that you know AI? [Speech recognition principle + actual combat]

Chapter Directory

Preface

Principles of Speech Recognition

Voice sampling

Recognize characters and compose text

Acoustic model

Language model

Vocabulary model

1 MFCC

1.1 Framing

1.2 Pre-emphasis

1.3 Windowing

1.4 Fast Fourier Transform

1.5 Calculating the amplitude spectrum (modulo complex numbers)

1.6 Mel filter

2 logfBank

SOTA Principle + Actual Combat 1 Deep Fully Convolutional Neural Network Speech Recognition

Principle + Actual Combat Two Baidu and HKUST IFLYTEK Voice Recognition

Actual combat three offline voice recognition Vosk

Summarize

RTE开发者社区

引用和评论

Hume 推出 Octave TTS 即时模式，250 毫秒响应；客服语音智能体 Sona：简单集成、高度自定义丨日报

【AI日志分析】基于机器学习的异常检测：告别传统规则的智能进阶

🔥全程不用写代码，我用 AI 程序员写了一个飞机大战

从 DeepSeek 看25年前端的一个小趋势

DeepSeek(私有化)+IDEA+Dify+微信 搭建AI助手保姆级教程

Open WebUI：开源AI交互平台的全面解析

大模型中的Token究竟是什么？从原理到作用深度解析

DeepSeek(私有化)+IDEA+Dify+微信搭建AI助手保姆级教程