Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that recognizes spoken language and translates it into text. It is also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT).
Machine learning (ML) is an application of artificial intelligence (AI) that enables a system to learn and improve from experience automatically, without explicit programming. Machine learning has driven most of the breakthroughs in speech recognition in this century. Nowadays, voice recognition technology is everywhere, from Apple Siri to Amazon Echo and Google Nest.
Both speech recognition and speech response (also known as speech synthesis, or text-to-speech (TTS)) are supported by the Web Speech API.
In this article, we focus on speech recognition in JavaScript applications. Speech synthesis is covered in a separate article.
Speech recognition interface
SpeechRecognition is the controller interface for the recognition service; in Chrome it is called webkitSpeechRecognition. SpeechRecognition handles the SpeechRecognitionEvent sent from the recognition service. SpeechRecognitionEvent.results returns a SpeechRecognitionResultList object, which represents all the speech recognition results of the current session.
You can initialize SpeechRecognition with the following lines of code:
// Create a SpeechRecognition object
const recognition = new webkitSpeechRecognition();
// Configure it to return continuous results for each recognition
recognition.continuous = true;
// Configure it to return interim results
recognition.interimResults = true;
// Event handler invoked when a word or phrase is recognized
recognition.onresult = function (event) {
  console.log(event.results);
};
recognition.start() starts speech recognition, recognition.stop() stops it, and recognition.abort() aborts it.
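To make the difference concrete, here is a minimal sketch of the three lifecycle calls (the stop/abort distinction is discussed again after the full example):
// Begin capturing audio and sending it to the recognition engine.
recognition.start();
// Stop capturing; results recognized so far may still be delivered.
recognition.stop();
// Stop immediately and discard pending results.
recognition.abort();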
When the page is accessing your microphone, a microphone icon will appear in the address bar to show that the microphone is turned on and running.
We speak to the page in sentences, for example, "Hello comma I'm talking period." The onresult handler displays all the interim results while we are talking.
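Because interim results are enabled, each result in the list carries an isFinal flag. Here is a minimal sketch (a variation on the handler in the example below) that separates interim guesses from final transcripts:
recognition.onresult = (event) => {
  let interimText = '';
  let finalText = '';
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const transcript = event.results[i][0].transcript;
    // isFinal marks a finished utterance rather than an interim guess.
    if (event.results[i].isFinal) {
      finalText += transcript;
    } else {
      interimText += transcript;
    }
  }
  console.log('final:', finalText, 'interim:', interimText);
};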
This is the HTML code for this example:
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Speech Recognition</title>
    <script>
      window.onload = () => {
        const button = document.getElementById('button');
        button.addEventListener('click', () => {
          if (button.style['animation-name'] === 'flash') {
            recognition.stop();
            button.style['animation-name'] = 'none';
            button.innerText = 'Press to Start';
            content.innerText = '';
          } else {
            button.style['animation-name'] = 'flash';
            button.innerText = 'Press to Stop';
            recognition.start();
          }
        });
        const content = document.getElementById('content');
        const recognition = new webkitSpeechRecognition();
        recognition.continuous = true;
        recognition.interimResults = true;
        recognition.onresult = function (event) {
          let result = '';
          for (let i = event.resultIndex; i < event.results.length; i++) {
            result += event.results[i][0].transcript;
          }
          content.innerText = result;
        };
      };
    </script>
    <style>
      button {
        background: yellow;
        animation-name: none;
        animation-duration: 3s;
        animation-iteration-count: infinite;
      }
      @keyframes flash {
        0% {
          background: red;
        }
        50% {
          background: green;
        }
      }
    </style>
  </head>
  <body>
    <button id="button">Press to Start</button>
    <div id="content"></div>
  </body>
</html>
The script creates the SpeechRecognition object and configures it to return continuous, interim results. When a word or phrase is recognized, the onresult handler concatenates the transcripts and displays them in the content element.
Clicking the button calls recognition.start() to begin voice recognition, and clicking it again calls recognition.stop(). After you click to stop, a few more messages may still be printed. This is because recognition.stop() tries to return the SpeechRecognitionResult captured so far. If you want it to stop completely, use recognition.abort() instead.
You will notice that the code for the animated button takes up more space than the voice recognition code. This is a video clip of this example: https://youtu.be/5V3bb5YOnj0
Browser compatibility varies: web speech recognition relies on the browser's own speech recognition engine. In Chrome, this engine performs recognition in the cloud, so it only works online.
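Because support varies, it is safer to feature-detect before constructing the recognizer. A minimal sketch:
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;
if (SpeechRecognition) {
  const recognition = new SpeechRecognition();
  // Configure and start as shown above.
} else {
  console.log('Speech recognition is not supported in this browser.');
}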
Speech recognition libraries
There are several open-source speech recognition libraries; the following list is based on npm trends:
1. Annyang
Annyang is a JavaScript speech recognition library for controlling websites through voice commands. It is built on the SpeechRecognition Web API. In the next section, we will illustrate how annyang works.
2. artyom.js
artyom.js is a JavaScript speech recognition and speech synthesis library. It is built on the Web Speech API; in addition to voice commands, it also provides voice responses.
3. Mumble
Mumble is a JavaScript speech recognition library for controlling websites through voice commands. It is built on the SpeechRecognition Web API, which is similar to how annyang works.
4. julius.js
Julius is high-performance, large-vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers. It can perform real-time decoding on a variety of computers and devices, from microcomputers to cloud servers. Julius is written in the C language, and julius.js is an opinionated port of Julius to JavaScript.
5. voice-commands.js
voice-commands.js is a JavaScript speech recognition library for controlling websites through voice commands. It is built on the SpeechRecognition Web API, which is similar to how annyang works.
Annyang
Annyang initializes a SpeechRecognition
object, which is defined as follows:
var SpeechRecognition = root.SpeechRecognition ||
  root.webkitSpeechRecognition ||
  root.mozSpeechRecognition ||
  root.msSpeechRecognition ||
  root.oSpeechRecognition;
There are several APIs to start or stop annyang (a minimal usage sketch follows this list):
- annyang.start: starts listening, with options such as autoRestart, continuous, or paused, for example annyang.start({ autoRestart: true, continuous: false }).
- annyang.abort: stops listening (stops the SpeechRecognition engine and turns off the microphone).
- annyang.pause: stops listening (without stopping the SpeechRecognition engine or turning off the microphone).
- annyang.resume: starts listening without any options.
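Here is a minimal sketch of these calls (assuming annyang has loaded and commands have already been added):
// Start listening; restart automatically whenever recognition ends.
annyang.start({ autoRestart: true, continuous: true });
// Temporarily stop responding to commands; the engine keeps running.
annyang.pause();
// Resume listening with the previously used options.
annyang.resume();
// Stop the engine and turn off the microphone.
annyang.abort();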
This is the HTML code for this example:
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Annyang</title>
    <script src="//cdnjs.cloudflare.com/ajax/libs/annyang/2.6.1/annyang.min.js"></script>
    <script>
      window.onload = () => {
        const button = document.getElementById('button');
        button.addEventListener('click', () => {
          if (button.style['animation-name'] === 'flash') {
            annyang.pause();
            button.style['animation-name'] = 'none';
            button.innerText = 'Press to Start';
            content.innerText = '';
          } else {
            button.style['animation-name'] = 'flash';
            button.innerText = 'Press to Stop';
            annyang.start();
          }
        });
        const content = document.getElementById('content');
        const commands = {
          hello: () => {
            content.innerText = 'You said hello.';
          },
          'hi *splats': (name) => {
            content.innerText = `You greeted ${name}.`;
          },
          'Today is :day': (day) => {
            content.innerText = `You said ${day}.`;
          },
          '(red) (green) (blue)': () => {
            content.innerText = 'You said a primary color name.';
          },
        };
        annyang.addCommands(commands);
      };
    </script>
    <style>
      button {
        background: yellow;
        animation-name: none;
        animation-duration: 3s;
        animation-iteration-count: infinite;
      }
      @keyframes flash {
        0% {
          background: red;
        }
        50% {
          background: green;
        }
      }
    </style>
  </head>
  <body>
    <button id="button">Press to Start</button>
    <div id="content"></div>
  </body>
</html>
The annyang library is loaded from a CDN in the head of the page.
Clicking the button starts annyang with annyang.start(), and clicking it again pauses annyang with annyang.pause().
Annyang provides voice commands to control the web page; they are defined in the commands object.
hello is a simple command. If the user says "hello", the page replies "You said hello."
hi *splats is a command with a splat, which greedily captures multi-word text at the end of the command. If you say "Hi, Alice", the response is "You greeted Alice." If you say "Hi, Alice and John", the response is "You greeted Alice and John."
Today is :day is a command with a named variable. The day of the week is captured as day and echoed back in the response.
(red) (green) (blue) is a command with optional words. If you say "yellow", it is ignored. If you mention any primary color, it responds with "You said a primary color name."
All the commands defined in the commands object are registered with annyang via annyang.addCommands(commands).
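Beyond commands, annyang also exposes event callbacks, which are handy for debugging, for example logging phrases that did not match any command. A minimal sketch using annyang's addCallback:
// Log every recognition result, matched or not.
annyang.addCallback('result', (phrases) => {
  console.log('Heard:', phrases);
});
// Log what annyang heard when no registered command matched.
annyang.addCallback('resultNoMatch', (phrases) => {
  console.log('No command matched. Possible phrases:', phrases);
});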
Conclusion
We have learned about speech recognition in JavaScript applications. Chrome provides the best support for the Web Speech API, and all our examples were implemented and tested in the Chrome browser.
One tip while exploring the Web Speech API: if you don't want to be listened to in your daily life, remember to close your voice recognition applications when you are done.