In recent years, speech recognition technology has gradually matured, and more and more Internet companies and hardware manufacturers are mapping out their intelligent voice businesses. The trend toward the Internet of Everything is unstoppable, and intelligent voice technology has blossomed in automobiles, smart homes, education, and other fields.

How far has intelligent speech come? What opportunities and challenges does it face now? What form will it take in the future? This time we interviewed Elon, a senior speech architect at OPPO, who walks us through the development path of intelligent speech technology.

Q1: Can you briefly introduce the development history of voice technology?

As early as 1920, long before the invention of the computer, the toy dog “Radio Rex” served as an early prototype of speech recognition, which can be regarded as humanity’s first exploration of intelligent speech technology. Computer-based intelligent speech technology in the real sense can be traced back to the 1950s: it has been nearly 70 years since the birth of Audrey, the first speech recognition system, in 1952. In the early days, Bell Labs, London College, and other academic institutions were laying the groundwork in this direction.

By the 1990s, Sphinx, the world’s first large-vocabulary, speaker-independent continuous speech recognition system, along with Cambridge’s HTK and other open-source tools, was widely used in academic circles. Around the same time, China’s National High-Tech R&D Program (the 863 Program) was launched, and speech recognition, as an important direction in intelligent computer systems research, was listed as a dedicated research topic. The end of the 20th century and the beginning of the 21st was a stage of rapid development, as speech recognition moved from academia toward industrialization. Around 2009, deep learning entered the field of speech technology and brought major breakthroughs in recognition accuracy. Siri, Apple’s virtual assistant, was born in 2011. Over the past 10 years, voice technology and teams have moved from academia to industry: both Internet companies and traditional hardware manufacturers have begun deploying intelligent voice technology, gradually bringing to market a series of well-known intelligent voice interaction products such as Alexa, Google Assistant, Tmall Genie, Xiaodu, and Xiaoai.
Looking across the development of intelligent voice interaction technology, it has gone from very simple command recognition at the beginning to more complex spoken language understanding, and has been deployed at scale across multiple scenarios and devices, gradually shortening the path between users and services. It was against this backdrop that Breeno, the predecessor of Xiaobu Assistant, was born in December 2018.

Q2: What is the reason for the boom in voice technology in recent years?

First of all, voice is the natural way for humans to convey information. Through speech recognition and understanding, machines can meet user needs more quickly, which essentially makes the exchange of information between people and smart devices more efficient. In certain scenarios, such as driving or at home, voice technology can significantly improve the human-computer interaction experience.

In addition, technology development is highly correlated with industry development. Domestic manufacturers built smart speakers largely under the influence of Amazon’s Alexa, which let overseas users perceive the convenience of voice interaction in home scenarios. In China, Xiaoai and Tmall Genie shipped products first and got users on board, changing the industry, drawing more players onto the track, and letting more users experience the convenience of smart speakers. With the smart speaker as the entry point, and more household devices supporting AIoT, users can control more smart devices in the home through this hub and grow increasingly fond of smart interactive products. It is a bit like the Matthew effect: once users perceive the convenience of one product, they buy more products, an ecological closed loop is established, and more and more users become willing to use voice interaction to control devices and access services.

Finally, with the growing use of intelligent assistants and the expansion of online data, we can use more real data to optimize and iterate better models, making the results better still. From the perspective of algorithm evolution, model training over the past 10-20 years has been based mostly on annotated data. For example, to recognize a sentence, many sentences must be transcribed word by word into text and fed into model training, completing optimization through supervised learning.
Now the industry has begun to try self-supervised learning, and Facebook has shown that pretraining on massive unlabeled data can also produce very good speech recognition models.
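The self-supervised approach mentioned above (as in Facebook’s wav2vec 2.0) replaces transcript labels with a contrastive objective: the model predicts the latent representation of a masked time step and must distinguish the true latent from distractors sampled elsewhere in the audio. The following is a toy, pure-Python sketch of that InfoNCE-style loss; the vectors and numbers are made up for illustration, not real speech features:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(prediction, positive, distractors, temperature=0.1):
    """InfoNCE-style loss: the prediction for a masked time step should be
    closer to the true latent (positive) than to distractor latents taken
    from other time steps. No transcripts are needed anywhere."""
    sims = [cosine(prediction, positive)] + [cosine(prediction, d) for d in distractors]
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    # Cross-entropy with the positive as the target class (index 0).
    return -math.log(exps[0] / sum(exps))

# Toy latents: a "good" prediction points the same way as the true latent,
# a "bad" one points the opposite way.
positive = [1.0, 0.2, -0.5]
distractors = [[0.3, -0.7, 0.1], [-0.2, 0.5, 0.9]]
good_pred = positive
bad_pred = [-x for x in positive]
print(contrastive_loss(good_pred, positive, distractors))  # small
print(contrastive_loss(bad_pred, positive, distractors))   # much larger
```

Training drives the network toward the low-loss case; the resulting representations are then fine-tuned for recognition with only a small amount of labeled data.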

Q3: What are the starting points for the different vendors doing intelligent voice?

In China, there are many manufacturers in this space, such as Xiaomi, Alibaba, and Baidu, but each has a different starting point. Baidu’s goal with Xiaodu is to evolve the product form of search: from a purely web-based text search box into a more natural input form combined with voice interaction. Through the Xiaodu speaker it collects user information, builds user profiles, and then recommends content that users could previously reach only through web search. Alibaba’s Tmall Genie hopes to occupy the traffic entrance of the home scenario while completing its AIoT ecosystem, drawing users to Xiami Music, Youku, Tmall, Ele.me, and other content services in the Alibaba ecosystem. Xiaomi’s starting point for smart speakers is obviously different from those two: it wants to build its Internet-of-Everything AIoT ecosystem through “Mijia + Xiaoai,” covering every aspect of smart life. OPPO’s starting point with Xiaobu Assistant is to build capabilities on top of its hardware-plus-software products, so that users constantly perceive the products as “smart” and “understanding you,” while building the company’s technology brand; as the company’s device ecosystem keeps improving, it will ultimately realize the strategic goal of the integration of all things.

Q4: What are the current opportunities for voice technology?

I think the opportunities are great. First, the cost of user education is falling. More and more users now come from Generation Z, who have grown up surrounded by intelligent technology. Unlike our parents’ generation, or even ours, which entered the intelligent era from a non-intelligent one, these users have a natural familiarity with voice interaction and AI-style interaction. Gen Z has entered the digital world directly and is very comfortable in it; even a very young child can operate a mobile phone by touch and is at ease with the virtual elements of hardware products.

On the other hand, users are becoming more emotionally connected to smart products. In real life, some children have grieved for a long time over the death of a game character on their phone, yet are rarely as sad about the people around them, even the passing of a real person nearby. It reflects how deeply the digital world has engaged the senses. I feel intelligent assistants have a great opportunity here: people are increasingly fusing with hardware products and the virtual world. With rising life pressure and social pressure, many people actually prefer to communicate with virtual characters rather than with the people around them. In this context, intelligent assistants may become the virtual counterparts that more and more users want to communicate with and reach out to, with voice technology serving as the most critical emotional and informational link.

Q5: What are the difficulties facing voice technology at present?

First, users are increasingly worried about privacy leaks. As users adopt intelligent interactive products, they gradually become aware of privacy issues. In the past few years we have seen users on various platforms questioning whether their devices are listening in: for example, “I mentioned an umbrella in conversation during the day, and that night Taobao or Tmall recommended an umbrella to me.” Many users want to access services more easily by voice, but at the same time they fear their devices are constantly monitoring them. I think this is a challenge for the whole industry; regulations such as the EU’s GDPR are designed precisely to protect privacy across the smart ecosystem.

In addition, there is a gap between users’ expectations of voice assistants and what the technology can actually deliver. Behind a voice assistant are services; users expect it to behave like a real person, but it is digital, so expectations always run high. Users tend to think of “intelligence” as something that can do anything, but the technology has bottlenecks: it can only do what it can. Yet users hold intelligent products to stringent standards: the product must check the weather, but also chat, with both high EQ and high IQ. Back in reality, people with both high EQ and high IQ are rare. There is a point in Hackers and Painters that every product ends up resembling the people who made it, because they determine what the soul of the product looks like. For an intelligent assistant, that means the engineers, product managers, and R&D team behind it: if there are 100 people on the team, the IQ and EQ of those 100 people will determine what the intelligent assistant becomes.

Q6: What will be the application scenarios and forms of intelligent speech in the future?

First of all, at the level of user perception, the earliest stage was text-based interaction, which gradually transitioned to voice interaction; now and in the future it is transitioning further to multimodal interaction. In terms of application scenarios, AIoT is already widely used in the smart home, where users can control all the devices in their home by voice. There is also intelligent driving. As early as 2016, Alibaba cooperated with Banma (Zebra) and SAIC Motor on an intelligent car equipped with a voice assistant, and voice assistants have since become standard in new energy vehicles such as Tesla, Xpeng, and NIO. The underlying logic is that in the in-vehicle environment, users must focus on driving safety: you cannot check your phone while driving. When you want to listen to music or make a phone call on the road, you can only do it through voice interaction, which makes driving safer and the whole driving experience better. Every car maker is now doing this, and some have even set up their own research teams to build the technology in-house.

In addition, the intelligent assistant needs to make the interaction path between user and machine shorter. A service that used to take several steps of UI touch can now be completed in a single sentence, such as checking the weather or making a phone call. However, the current path is still not short, because the implementation logic is still: speech recognition first converts audio into text, then the text is parsed to understand the intent, and finally dialogue management takes over. We have to keep shortening this path so that the machine can directly understand what people say, without the intermediate conversion to text. The ultimate form of intelligent speech, we hope, can be separated from any specific product form and become completely digital.
So I think OPPO’s strategy of the integration of all things is quite imaginative. In the end, the user doesn’t actually care whether the thing in front of him is a mobile phone, a speaker, or some other smart device. From the user’s point of view, he only cares about one thing: when I need a service, all I have to do is speak, without going through input methods or other third-party media to complete more complex operations.
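The cascade described above (speech recognition → intent understanding → dialogue management) can be sketched as a toy pipeline. The function names and rules here are illustrative stand-ins, not any vendor’s actual implementation:

```python
def speech_to_text(audio: bytes) -> str:
    """Stand-in ASR stage: a real system would decode audio with acoustic
    and language models; here we pretend it already returned a transcript."""
    fake_transcripts = {b"<audio-1>": "what's the weather in Shenzhen"}
    return fake_transcripts[audio]

def understand_intent(text: str) -> dict:
    """Stand-in NLU stage: keyword rules instead of a trained model."""
    if "weather" in text:
        # Naive slot filling: take the last word as the city.
        return {"intent": "query_weather", "city": text.split()[-1]}
    if "call" in text:
        return {"intent": "make_call", "contact": text.split()[-1]}
    return {"intent": "chitchat"}

def dialogue_manager(frame: dict) -> str:
    """Stand-in dialogue management: map intent frames to actions/replies."""
    if frame["intent"] == "query_weather":
        return f"Sunny, 26°C in {frame['city']}"  # would call a weather service
    if frame["intent"] == "make_call":
        return f"Calling {frame['contact']}..."
    return "Let's chat!"

def assistant(audio: bytes) -> str:
    # Each stage's output feeds the next -- the "long path" that
    # end-to-end speech understanding aims to shorten.
    return dialogue_manager(understand_intent(speech_to_text(audio)))

print(assistant(b"<audio-1>"))  # -> Sunny, 26°C in Shenzhen
```

Shortening the path, as the answer suggests, means collapsing these stages so intent comes directly from audio, removing the intermediate text representation and the errors it can compound.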

Q7: How do you view the ecological empowerment of voice assistants?

I think it comes back to the user. Whether development is ecosystem-oriented or scenario-oriented, the point is to help users solve their core needs within a scenario. For example, as AIoT develops in the home, more and more devices, such as traditional lights and air conditioners, are beginning to support voice control. The logic behind this is to solve the problem that users find it hard to control these devices in their homes, making the whole home smarter. In essence, the voice assistant is the medium of service access and the most natural way for users to obtain services; its direction of development will always be to solve users’ core needs.

For more exciting content, follow the [OPPO Digital Intelligence Technology] official account.