Edited by Cynthia

In 2017, consumer artificial intelligence products centered on smart speakers. Google and Amazon launched smart speakers one after another, Alibaba launched the Tmall Genie, and Xiaomi launched the Mi AI Speaker. Smart speakers, which accept spoken commands, could one day become the gateway to the smart home, controlling the other smart devices in the house by voice.

A few months after Google’s voice recognition app added support for personalized voice recognition, Amazon’s Echo speaker gained the same feature on Wednesday, Oct. 11.

The speaker automatically identifies different people as they speak to it, enabling features such as personalized music playlists and personalized shopping. In short, people can now be recognized by their voices, taking voice control one step further.

Behind the Amazon Echo speaker is Amazon’s Alexa intelligent voice technology. Chen Ya, a Chinese engineer, is a senior engineer on the Amazon Alexa machine learning team, responsible for building and optimizing its speech recognition and semantic understanding models. The TOP100 Summit team spoke with Chen Ya about the technology behind Alexa.

The technical principles of speech recognition

How does Alexa know who is talking in a room full of people? It uses the idea of anchored speech detection. When the wake word starts the system, one RNN extracts an anchor embedding from the wake word, recording the speaker’s voice characteristics; another RNN then extracts features from the request that follows and, compared against the anchor, produces an endpointing decision for that speaker’s utterance.
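
A minimal PyTorch sketch of that two-RNN idea is below. It is an illustration only, not Amazon’s implementation: the module names, feature dimensions, and the per-frame speaking-probability objective are all assumptions.

```python
# Illustrative sketch of anchored speech detection; names and dimensions are assumptions.
import torch
import torch.nn as nn

class AnchorEncoder(nn.Module):
    """RNN that turns the wake-word frames into an anchor embedding
    capturing the target speaker's voice characteristics."""
    def __init__(self, feat_dim=40, hidden_dim=128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, wake_word_frames):              # (batch, time, feat_dim)
        _, (h_n, _) = self.rnn(wake_word_frames)
        return h_n[-1]                                 # (batch, hidden_dim)

class AnchoredEndpointer(nn.Module):
    """Second RNN that reads the frames of the following request, conditioned on the
    anchor embedding, and predicts per frame whether the anchored speaker is still
    talking -- the basis for an end-of-request (endpointing) decision."""
    def __init__(self, feat_dim=40, anchor_dim=128, hidden_dim=128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim + anchor_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, request_frames, anchor):         # (batch, time, feat_dim), (batch, anchor_dim)
        time_steps = request_frames.size(1)
        anchor_tiled = anchor.unsqueeze(1).expand(-1, time_steps, -1)
        rnn_in = torch.cat([request_frames, anchor_tiled], dim=-1)
        out, _ = self.rnn(rnn_in)
        return torch.sigmoid(self.classifier(out)).squeeze(-1)   # (batch, time)

# Toy usage: 50 wake-word frames and 300 request frames of 40-dim features.
encoder, endpointer = AnchorEncoder(), AnchoredEndpointer()
anchor = encoder(torch.randn(1, 50, 40))
speaking_prob = endpointer(torch.randn(1, 300, 40), anchor)
# Toy heuristic: treat the first frame whose probability drops below 0.5 as the endpoint.
endpoint_frame = int((speaking_prob[0] > 0.5).float().argmin())
```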

Chen said Alexa was the first AI voice assistant driven by voice commands. Simply saying “Alexa” lets users give work commands to Alexa, which is already connected to hundreds of applications, such as playing music, finding information, starting other smart devices, or shopping.

Today Alexa is no longer just a voice recognition tool; it has grown into a mature operating system that can be operated by voice instead of the traditional phone screen.



The technical principles of Alexa’s deep learning

Developing Alexa required deep learning at a large scale. By the time a person reaches the age of 16, their ears will have heard only about 14,016 hours of sound. Alexa’s deep learning stores thousands of hours of real speech training data in S3 and trains the deep learning models on distributed GPU clusters on EC2.
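
As a rough illustration of that workflow, rather than Alexa’s actual pipeline, the sketch below pulls a shard of speech features from S3 with boto3 and trains a placeholder acoustic model across GPUs with PyTorch’s DistributedDataParallel. The bucket, key names, and model are hypothetical, and the script assumes it is launched with torchrun on EC2 instances with AWS credentials configured.

```python
# Hypothetical bucket/keys/model; assumes launch via torchrun and configured AWS credentials.
import os

import boto3
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Each EC2 worker downloads its own shard of speech features from S3.
    rank = int(os.environ["RANK"])
    boto3.client("s3").download_file("example-speech-bucket", f"shards/train_{rank}.pt", "shard.pt")
    features, labels = torch.load("shard.pt")        # e.g. (N, 40) float features, (N,) int labels

    # One process per GPU, joined into a single distributed training job.
    dist.init_process_group(backend="nccl")
    device = torch.device(f"cuda:{os.environ['LOCAL_RANK']}")
    model = DDP(torch.nn.Linear(40, 4000).to(device), device_ids=[device.index])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(10):
        optimizer.zero_grad()
        logits = model(features.to(device))
        loss = torch.nn.functional.cross_entropy(logits, labels.to(device))
        loss.backward()                               # DDP averages gradients across all GPUs here
        optimizer.step()

if __name__ == "__main__":
    main()   # e.g. torchrun --nproc_per_node=8 train.py on each EC2 instance
```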

To train these models, Alexa uses several approximation algorithms to reduce the size of the updates, and training speeds up as GPU threads are added, processing roughly 90 minutes of speech per second. The human ear needs 16 years to hear 14,000 hours of speech; Alexa can process that much in about three hours.
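
The article does not say which approximation algorithms are used, so the following is only a hedged sketch of one common way to shrink gradient updates in distributed training: threshold quantization with a local residual, where each worker transmits only a coarse ±tau value for large gradient entries and carries the rest forward to later steps.

```python
# Hedged sketch of threshold-based gradient quantization; not Alexa's documented algorithm.
import torch

def quantize_gradient(grad, residual, tau=0.01):
    """Return a sparse {-tau, 0, +tau} update plus the residual to carry to the next step."""
    accumulated = grad + residual                  # fold in previously unsent gradient mass
    update = torch.zeros_like(accumulated)
    update[accumulated >= tau] = tau               # send +tau where the gradient is large positive
    update[accumulated <= -tau] = -tau             # send -tau where the gradient is large negative
    new_residual = accumulated - update            # everything not sent is remembered locally
    return update, new_residual

# Toy usage: the quantized update is what each GPU would exchange over the network.
grad = torch.randn(5) * 0.02
update, residual = quantize_gradient(grad, torch.zeros(5))
print(update, residual)
```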

The Alexa speech recognition system consists of four main modules: signal processing, the acoustic model, the decoder, and post-processing. The collected speech signal is processed first and transformed into the frequency domain, and a feature vector is extracted from every 10 milliseconds of speech for the acoustic model, which maps the audio to phonemes. The decoder then derives the most probable sequence of words, and post-processing combines those words into easy-to-read text.
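
The sketch below illustrates just the front of that pipeline under simple assumptions (16 kHz audio, plain log power spectrum features) rather than Alexa’s actual signal processing: it slices the waveform into 10 millisecond frames, moves each frame into the frequency domain, and produces one feature vector per frame for an acoustic model.

```python
# Illustrative front-end only: 10 ms framing plus log power spectrum features.
import numpy as np

def frame_features(audio, sample_rate=16000, frame_ms=10, n_fft=256):
    frame_len = sample_rate * frame_ms // 1000           # 160 samples per 10 ms frame
    n_frames = len(audio) // frame_len
    feats = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        frame = frame * np.hamming(frame_len)             # taper the frame edges
        spectrum = np.abs(np.fft.rfft(frame, n=n_fft))     # frequency-domain magnitudes
        feats.append(np.log(spectrum ** 2 + 1e-10))        # log power spectrum feature vector
    return np.stack(feats)                                 # (n_frames, n_fft // 2 + 1)

# Toy usage: one second of random "audio" yields 100 feature vectors, one per 10 ms.
features = frame_features(np.random.randn(16000))
print(features.shape)    # (100, 129)
```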

Alexa vs. other voice recognition apps

Chen explained that Alexa has been able to capture roughly 70% of the smart-speaker terminal market because of Amazon’s customer-first culture. Alexa succeeds because, from product design to development and management, it adheres to the principle of putting customers first, innovating on user experience, lowering the barrier to the smart home, and building out the Alexa ecosystem.

At the 6th TOP100 Global Software Case Study Summit, to be held on November 9th, Chen Ya will appear as a guest speaker and share Amazon’s customer-first approach to product design, as well as Amazon’s experience exploring artificial intelligence and machine learning from the product design perspective.






www.top100summit.com/?qd=juejin