In recent years, voice recognition technology has gradually matured, and more and more Internet companies and hardware manufacturers are investing in intelligent voice. The wave of the Internet of Everything is unstoppable, and intelligent voice technology is blossoming across fields such as automobiles, smart homes, and education.

How far has intelligent voice come? What are its current opportunities and challenges? What form will it take in the future? This time we interviewed Elon, a senior voice architect at OPPO, who walks us through the development path of intelligent voice technology.

Q1: Can you briefly introduce the development history of voice technology?

Long before the invention of the computer, there was an early prototype of speech recognition: the "Radio Rex" toy dog of the 1920s, which can be regarded as humanity's first exploration of intelligent speech technology. Computer-based speech technology in the true sense dates back to the 1950s; it has been nearly 70 years since the birth of the first speech recognition system, Audrey, at Bell Labs in 1952. In the early days, research was driven mainly by academic institutions such as Bell Labs and University College London.

Around the 1990s, Sphinx appeared: the world's first large-vocabulary, speaker-independent continuous speech recognition system. It was followed by open-source toolkits such as Cambridge's HTK, which became widely used in academia. In the same period, China launched its high-tech development plan, the 863 Program, which listed speech recognition as a research topic under intelligent computer systems.

From the end of the 20th century to the beginning of the 21st, speech recognition moved rapidly from academia toward industrialization. Around 2009, deep learning made great progress in the speech field, and recognition accuracy improved dramatically. In 2011, Apple's phone-based virtual assistant Siri was born. Over the following decade, voice technologies and teams moved from academia into industry; both Internet companies and traditional hardware manufacturers began to invest in intelligent voice, gradually shipping a series of well-known voice interaction products such as Alexa, Google Assistant, Tmall Genie, Xiaodu, and Xiaoai.
Over the course of this development, voice interaction went from supporting only very simple command recognition to handling far more complex speech understanding, deployed at scale across many scenarios and devices, steadily shortening the path between users and services. Breeno, the predecessor of Xiaobu Assistant, was born in December 2018 against this background.

Q2: What are the reasons for the rapid development of voice technology in recent years?

First of all, voice is a natural way for humans to exchange information. By recognizing speech and understanding what is expressed in it, machines can meet user needs more quickly; in essence, it makes the exchange of information between humans and smart devices more efficient. In scenarios such as driving and the home in particular, voice can greatly improve the human-computer interaction experience.
In addition, the development of the technology is closely tied to the development of the industry. Domestic manufacturers started making smart speakers largely under the influence of Amazon's Alexa, which let users abroad experience the convenience of voice interaction in the home. Domestically, Xiaoai and Tmall Genie were the first to ship products, so some users tried them; this in turn changed the industry, drew more entrants onto the track, and let more users feel the convenience of smart speakers. As smart speakers entered the home and more home devices began to support AIoT, users could control more smart devices through the speaker as a hub, and they grew increasingly fond of intelligent interactive products. This is a bit like the Matthew effect: once users perceive the convenience of one product, they are prompted to buy more, an ecological closed loop forms, and more and more users become willing to use voice interaction to control devices and obtain services.
Finally, as the usage of smart assistants keeps rising and the scale of online data keeps expanding, we can use more real data for model optimization and iteration, making the models work better and better. From the perspective of algorithm evolution, the past 10-20 years have basically relied on labeled data for model training: to recognize a sentence, you first need people to transcribe many sentences, word by word, into text, feed those pairs into training, and optimize the model through supervised learning. Now the industry is starting to try unsupervised learning; Facebook has already published research showing that training on massive amounts of unlabeled data can also produce a working speech recognition model.
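To make the supervised setting concrete, here is a deliberately tiny toy sketch (not a real ASR system, and not OPPO's method): each training example is a pair of an acoustic feature vector and its human-provided transcript, and the "model" is just a per-word centroid of the labeled features. All function names and the two-dimensional features are illustrative assumptions.

```python
# Toy illustration of supervised learning from labeled (features, word) pairs.
# A real speech recognizer would use neural acoustic models, not centroids.

def train(labeled_data):
    """labeled_data: list of (feature_vector, word) pairs -> word centroids."""
    sums, counts = {}, {}
    for vec, word in labeled_data:
        acc = sums.setdefault(word, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[word] = counts.get(word, 0) + 1
    return {w: [x / counts[w] for x in acc] for w, acc in sums.items()}

def recognize(model, vec):
    """Return the word whose centroid is nearest to the input features."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda w: dist(model[w], vec))

# Every training vector must be labeled by a human transcriber:
model = train([([1.0, 0.0], "yes"), ([0.9, 0.1], "yes"),
               ([0.0, 1.0], "no"),  ([0.1, 0.9], "no")])
print(recognize(model, [0.8, 0.2]))  # -> yes
```

The point of the toy is the dependency it makes visible: every example consumed by `train` carries a label, which is exactly the transcription cost that unsupervised approaches try to remove.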

Q3: What is the starting point for different manufacturers to make intelligent voice?

In China, many manufacturers are doing this, such as Xiaomi, Alibaba, and Baidu, but each has a different starting point.
Baidu's intelligent voice effort, Xiaodu, is really about changing the product form of search: from a pure text-box search on the web to a more natural input form that combines search with voice interaction. Through the Xiaodu speaker, Baidu collects user information, builds user profiles, and then recommends to users content that previously could only be surfaced through web search.
Alibaba's Tmall Genie aims to occupy the traffic entrance of the home. While completing its AIoT ecosystem, it channels users toward content and services within the Alibaba ecosystem, such as Xiami Music, Youku, Tmall, and Ele.me.
Xiaomi's starting point for smart speakers is clearly different from these two: through "Mijia + Xiaoai", Xiaomi wants to build an AIoT ecosystem for its own Internet of Everything, covering every aspect of smart life.
OPPO's starting point with Xiaobu Assistant is to build on its phone hardware and software products and, through the assistant's growing capabilities, let users continually perceive that the product is "smart and understands you", while also building the company's technology brand. As the company's multi-device ecosystem keeps improving, the strategic goal of the integration of everything will finally be realized.

Q4: What kind of opportunities are currently facing voice technology?

I think the opportunity is great. First, the cost of user education is falling. More and more users now come from Generation Z, a generation that grew up surrounded by intelligent products. Unlike our parents' generation, or even ours, they did not have to cross from a pre-intelligence era into an intelligent one; voice interaction and AI-style interaction feel natural and familiar to them. Moreover, Generation Z entered the digital world directly and is thoroughly at home in it. Just as very young children today will pick up a phone and operate it by touch, they become familiar with the virtual side of hardware products from a very early age.
On the other hand, the emotional connection between users and smart products is growing closer. In real life, some children grieve for a long time over the death of a game character on their phone, yet rarely grieve as long over a sad event, or even the passing of a real person, around them. This reflects something real: many things in the digital world now engage human feelings. I think smart assistants have a great opportunity here. People are increasingly immersed in the virtual worlds inside their hardware products, the so-called sense of immersion. As the pressures of life and society grow, some people would actually rather talk to virtual characters than to the people around them. In this situation, the smart assistant may become a virtual counterpart that more and more users want to communicate with, and voice technology is the most critical emotional and informational link in that relationship.

Q5: What kind of dilemma does the current voice technology face?

First of all, users worry about privacy leaks. While using intelligent interactive products, users gradually become aware of privacy issues. In recent years we have seen users on major platforms questioning whether their devices are listening: for example, "I only chatted with you about an umbrella, yet that evening Taobao or Tmall recommended an umbrella to me." So many users want the convenience of obtaining services by voice, but at the same time fear the device is continuously monitoring them. I think this is a challenge the entire industry faces; regulations such as the EU's GDPR are aimed precisely at protecting privacy and data security across the smart ecosystem.
In addition, there is a gap between users' expectations of voice assistants and what the technology can actually deliver. Behind a voice assistant are services. Users expect the assistant to behave like a real person, but it is digital, so their expectations always run high. Users tend to assume that anything "intelligent" is omnipotent, but technology has bottlenecks: it can only accomplish what lies within its capabilities. Users, however, hold smart products to stricter standards: they want them to check the weather, hold a conversation, and have both high EQ and high IQ. Back in reality, very few people have both. "Hackers and Painters" makes the point that a product ends up resembling the people who built it, because they determine what its soul looks like. A smart assistant is built by engineers, product managers, and an R&D team; if the team has 100 people, the collective intelligence of those 100 people determines what the assistant will be like.

Q6: What will be the application scenarios and forms of intelligent voice in the future?

First, at the level of user perception, products initially satisfied text-based interaction, gradually transitioned to voice interaction, and are now, and will increasingly be, moving toward multimodal interaction.
In terms of application scenarios, AIoT is used more and more widely in smart homes, where users can control devices throughout the house by voice. Then there is smart driving. Back in 2016, Alibaba cooperated with Banma Network (Zebra), together with SAIC Motor, on a smart car that already shipped with a voice assistant. For new energy vehicles such as Tesla, Xiaopeng, and Weilai, voice assistants have become standard equipment. The underlying logic is that in the in-vehicle environment, users must stay focused on driving safety: you cannot check your phone while driving and must concentrate on the driving itself. When you want to listen to music or make a call on the road, voice interaction is the only way to do it that keeps driving safe while making the whole driving experience better. Now every car manufacturer is working on this, and some have even set up in-house teams to build their own technology.
In addition, what smart assistants need to do is shorten the interaction path between the user and the machine. In the past, getting a service might take several steps of UI touch; now, simple operations such as checking the weather or making a phone call can be completed in one sentence. Yet the current path is still not short, because the execution logic today is: speech is first recognized into text, the text is then used for intent understanding, and finally the result goes to dialogue management. Going forward, we will keep shortening this path so that the machine can understand speech directly, without the intermediate text conversion.
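The three-stage pipeline described above can be sketched as a chain of stubs. This is an illustrative sketch only: the function names, the keyword-based "NLU", and the intents are all hypothetical placeholders, not the actual architecture of Xiaobu Assistant or any named product.

```python
# Hypothetical sketch of the pipeline: speech recognition -> text ->
# intent understanding -> dialogue management. Each stage is a stub.

def speech_to_text(audio: bytes) -> str:
    # Placeholder ASR: a real system would decode the audio here.
    return "what is the weather in Shenzhen"

def understand_intent(text: str) -> dict:
    # Placeholder NLU: keyword rules standing in for an intent classifier.
    if "weather" in text:
        return {"intent": "query_weather", "slots": {"city": text.split()[-1]}}
    if "call" in text:
        return {"intent": "make_call", "slots": {}}
    return {"intent": "chitchat", "slots": {}}

def dialogue_manager(frame: dict) -> str:
    # Placeholder policy: map each intent to an action or response.
    if frame["intent"] == "query_weather":
        return f"Looking up the weather for {frame['slots']['city']}."
    if frame["intent"] == "make_call":
        return "Placing the call."
    return "Let's chat!"

# The full path the interview describes: every request passes through text.
reply = dialogue_manager(understand_intent(speech_to_text(b"...")))
print(reply)  # -> Looking up the weather for Shenzhen.
```

The "shorter path" the interviewee describes would collapse the first two stages, mapping audio to an intent frame directly, so the intermediate text hand-off (and its latency and error propagation) disappears.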
As for the ultimate form of intelligent voice, we hope it can break free of any specific product form and become fully digital. That is why I find the "integration of everything" in OPPO's corporate strategy quite imaginative. In the end, you will not care whether the thing is a phone, a speaker, or some other smart device. From the user's point of view only one thing matters: whenever I need a service, I just speak, with no need for other input media or more complicated operations.

Q7: What do you think about the ecological empowerment of voice assistants?

I think it comes back to the user. Whether we develop the ecosystem or develop a particular scenario, the point is to help users solve their core needs in that scenario. For example, as AIoT develops in the home, more and more devices, such as traditional lights and air conditioners, are beginning to support voice control. The logic behind this is to remove the inconvenience of controlling these devices at home and make the whole home smarter. A voice assistant is essentially the medium through which services reach the user, and the most natural way for users to express what they want. Its direction of development has always been to solve users' core needs.

For more content, follow the [OPPO Digital Intelligence Technology] official account.

