Dag Kittlaus, CEO of Viv Labs (viv.ai), has recently given some impressive demonstrations of his latest AI technology, but one thing always looks odd during the Viv demonstrations: he holds the phone very close to his mouth. He’s almost eating his phone every time – it’s like something from Trigger Happy TV. Dag needs to make sure the captured speech is clear enough for the software to decipher his commands, but it’s not a real-world scenario. We’re not all going to want to "eat" our phones in order to book a flight for next week.
Voice interfaces are sometimes called conversational interfaces for good reason – they allow users to talk to the device instead of using touch screens and buttons to execute a command. Conversations take place between two or more people who may be next to each other but are often several meters apart. If we’re going to have conversations with our new digital assistant friends, we have to be able to talk to them at a distance as well as up close. Dag needs something on the podium that he can talk to in a normal voice.
Microphone arrays
Well, everyone knows the problems with microphones, particularly in mobiles, when used in a noisy environment such as an auditorium, a busy restaurant or a factory – or even in the apparent quiet at the top of a windy mountain. The quality of the captured signal is never great because of all the other noise going on around it. How does the phone know to capture just your voice and not the surrounding noise? Can it shield out the reverberation generated by nearby hard surfaces that the sound bounces off? And how is it going to capture someone speaking three meters away, even in a very quiet room?
The good news is that there’s a solution to this: far-field microphones that can capture individual voices several meters away. But we don’t hear much about them – all the chat is about AI engines like Viv, home hubs like Google Home and Amazon Echo, and chatbots. It’s time the actual voice capture interface started to get some attention, because it’s the key to unlocking the potential of all these voice applications.
There has been a lot of research into far-field microphones – search for "far-field microphones" and you’ll find plenty of academic papers in the top results, with many more below them. At the root of many of these papers is a beamformer, created by combining the input from two or more microphones into a lobe that focuses on the voice source while excluding surrounding sounds. The beam can be steered to follow a voice source as it moves around a room, or to capture a different voice, without moving the voice interface itself.
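To make the idea concrete, here is a minimal delay-and-sum beamformer – the simplest kind – sketched in C. It’s an illustration rather than production code: the steering delays are assumed to have been precomputed in whole samples from the array geometry and the estimated direction of arrival, and fractional-sample interpolation, microphone weighting and adaptive variants are all omitted.

```c
#include <stddef.h>

/* Minimal delay-and-sum beamformer sketch.
 * Each microphone's signal is delayed so that sound arriving from the
 * target direction lines up in time, then the aligned samples are
 * averaged. Sound from other directions stays misaligned and partially
 * cancels; that cancellation pattern is what forms the "lobe".
 *
 * mics:    mics[m][n] is sample n of microphone m
 * delays:  per-microphone steering delay in whole samples, assumed
 *          precomputed from the array geometry and the estimated
 *          direction of arrival of the talker
 * out:     beamformed output, num_samples long
 */
void delay_and_sum(const float *const *mics, const size_t *delays,
                   size_t num_mics, size_t num_samples, float *out)
{
    for (size_t n = 0; n < num_samples; n++) {
        float acc = 0.0f;
        for (size_t m = 0; m < num_mics; m++) {
            /* Read the sample that corresponds to the steered wavefront;
             * clamp at the start of the buffer for the first samples. */
            size_t idx = (n >= delays[m]) ? n - delays[m] : 0;
            acc += mics[m][idx];
        }
        out[n] = acc / (float)num_mics; /* averaging keeps unity gain on-axis */
    }
}
```

Steering the beam then just means recomputing the delays for a new target angle; the microphones themselves never move.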
How complex are microphone arrays?
Sounds great, but before we jump in and create a beamformer, let’s consider some of the features a far-field microphone array includes. First, microphone arrays can have different topologies, such as linear or circular, and we need to choose the best one for our particular product; generally an array with more microphones delivers a narrower, more precise lobe with better gain, but it also requires more processing and more power, which can be an issue for a consumer device. Then there’s the beamformer implementation itself. Microphone arrays involve a lot of complicated trigonometry and calculations about the direction of arrival of the sound source, running to millions of cycles per second. The one-bit PDM signal delivered by a typical MEMS microphone also has to be converted to a standard PCM signal and decimated down to a useful sample rate, all of which requires multiple filter stages and a lot more processor compute. And before you think of using an array of any old microphones – you can’t; microphones are built and calibrated to many different levels of performance, so if you pick the wrong microphone at the start you’ll spend all your time trying to fix a fundamentally broken solution.
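As a rough illustration of the PDM-to-PCM step, the sketch below decimates a one-bit PDM stream by counting the ones in each 64-sample window, which is equivalent to a first-order CIC (boxcar) filter. The 64x ratio (say, 3.072 MHz PDM down to 48 kHz PCM) is just an assumption, and a real front end cascades higher-order CIC and FIR stages to suppress the sigma-delta quantization noise properly.

```c
#include <stdint.h>
#include <stddef.h>

/* Crude first-stage PDM-to-PCM decimation: count the one bits in each
 * 64-bit window of the pulse-density stream and center the result on
 * zero. Counting ones over a window is a first-order CIC (boxcar)
 * filter; a real front end cascades higher-order CIC and FIR stages to
 * remove the sigma-delta quantization noise properly.
 *
 * pdm:       packed one-bit samples, 64 per word (e.g. a 3.072 MHz stream)
 * num_words: number of 64-bit words, i.e. the number of output samples
 * pcm:       output samples at 1/64th of the PDM rate (e.g. 48 kHz)
 */
void pdm_to_pcm_64x(const uint64_t *pdm, size_t num_words, int16_t *pcm)
{
    for (size_t i = 0; i < num_words; i++) {
        int ones = 0;
        for (uint64_t w = pdm[i]; w != 0; w >>= 1)
            ones += (int)(w & 1u);     /* population count of the window */
        /* 0..64 ones map to roughly full-scale negative..positive */
        pcm[i] = (int16_t)((ones - 32) * 1023);
    }
}
```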
The microphone array must be smart so it can differentiate between voice sources and music sources; there’s no point in capturing a voice source if the microphone keeps picking up background music as well.
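By way of illustration only, the toy gate below flags frames as speech-like using short-term energy and zero-crossing rate – two of the cheapest features available. It’s enough to reject silence and steady noise, but it will happily pass music too; genuine voice/music discrimination needs richer features or a trained classifier, and the thresholds here are arbitrary assumptions.

```c
#include <stddef.h>
#include <stdbool.h>
#include <math.h>

/* Toy speech-likeness gate: short-term energy plus zero-crossing rate.
 * Enough to reject silence and steady hiss; NOT enough to reject music,
 * which needs richer features or a trained classifier. Thresholds are
 * arbitrary placeholders for 16 kHz audio in 16 ms frames. */
bool frame_is_speech_like(const float *frame, size_t n)
{
    float energy = 0.0f;
    size_t crossings = 0;
    for (size_t i = 0; i < n; i++) {
        energy += frame[i] * frame[i];
        if (i > 0 && (frame[i] >= 0.0f) != (frame[i - 1] >= 0.0f))
            crossings++;
    }
    float rms = sqrtf(energy / (float)n);
    float zcr = (float)crossings / (float)n;

    /* Speech tends to sit in a band of moderate zero-crossing rates:
     * higher than low-frequency rumble, lower than broadband hiss. */
    return rms > 0.01f && zcr > 0.02f && zcr < 0.35f;
}
```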
The architecture on which the microphone array is built must also have low latency to enable a natural speech experience, especially in applications featuring bi-directional communication. Too much buffering in the system will introduce lag and degrade the overall customer experience; at a 16 kHz sample rate, for example, every 160 samples of buffering adds another 10 ms of delay.
Why do you need acoustic DSP?
But beamformers are just part of the product we need to build to stop Dag eating his phone. The captured voice streams are subject to a lot of additional interference, such as echo and reverberation caused by signals bouncing off the hard surfaces in the surrounding environment. These effects change from one environment to another; the amount of reverberation in an auditorium is very different from that in your kitchen, bedroom or office. The voice interface must apply additional signal processing for echo cancellation and de-reverberation to the voice data before it can be passed to the speech engine, and the level of DSP required will be dictated by the environment it is tuned for. Even once the audio stream is clean, some gain control is probably necessary to boost the signal before it’s passed to the speech engine.
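Echo cancellation is usually done with an adaptive filter that learns the path from the device’s own loudspeaker back into the microphones and subtracts its estimate of the echo. Here is a minimal sketch of the standard NLMS (normalized least-mean-squares) update; the tap count and step size are assumptions, and real cancellers add double-talk detection and residual echo suppression on top.

```c
#include <stddef.h>

#define TAPS 256                /* assumed echo-path length in samples */

/* Minimal NLMS acoustic echo canceller sketch.
 * x: reference sample (what the loudspeaker just played)
 * d: microphone sample (talker's voice plus loudspeaker echo)
 * Returns the echo-cancelled sample e = d - estimated_echo, and nudges
 * the filter weights so the echo estimate improves over time. */
typedef struct {
    float w[TAPS];              /* adaptive weights (echo-path model)  */
    float x[TAPS];              /* delay line of recent reference data */
} aec_t;

float aec_process(aec_t *s, float x, float d)
{
    /* Shift the reference delay line and insert the new sample. */
    for (size_t i = TAPS - 1; i > 0; i--)
        s->x[i] = s->x[i - 1];
    s->x[0] = x;

    /* Estimate the echo as the filter's response to the reference. */
    float y = 0.0f, power = 1e-6f;   /* small floor avoids divide-by-zero */
    for (size_t i = 0; i < TAPS; i++) {
        y     += s->w[i] * s->x[i];
        power += s->x[i] * s->x[i];
    }

    float e  = d - y;                /* what remains after removing echo */
    float mu = 0.5f / power;         /* normalized step size (assumed)   */
    for (size_t i = 0; i < TAPS; i++)
        s->w[i] += mu * e * s->x[i]; /* NLMS weight update               */
    return e;
}
```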
We’re told that speech recognition engines need to be at least 95% accurate before users will adopt them, but that cannot happen if the signal passed to them isn’t the best possible quality; anything less than 99% clarity will have an adverse effect on the performance of the speech engine and the digital assistant.
Taking control of voice interfaces
Above all, voice controllers should only be active when they are explicitly addressed, usually by a keyword. In our case there will be lots of background chatter, but when Dag wants to do his demonstration he needs to be able to activate the device and get it to lock onto his voice. And when he says something that shouldn’t be transmitted to the Cloud, he needs to be able to shut the device down. Which raises another requirement for voice controllers – how does Dag know the device is active and listening to him? The device must be able to communicate back to Dag; it may play back a reply or the clarifying questions the digital assistant needs (just like a real conversation), or it might use some other visual cue such as LEDs to show whether the device is active and, if so, its current state. Many personal robots in Japan use facial expressions to indicate a response, so the voice controller may also have to drive an LED screen.
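The control logic around all of this can be as simple as a small state machine: stay idle until the keyword fires, stream while locked onto the talker, and stop transmitting on a mute request, signaling the current state on an LED throughout. The sketch below is hypothetical – keyword_detected(), mute_requested(), stream_to_cloud() and set_led() are stand-ins for whatever detector, transport and indicator a real device provides.

```c
#include <stdbool.h>

/* Hypothetical hooks: stand-ins for a real device's keyword detector,
 * privacy/mute control, cloud transport and status indicator. */
extern bool keyword_detected(const float *frame, int n);
extern bool mute_requested(void);
extern void stream_to_cloud(const float *frame, int n);
extern void set_led(bool listening);

typedef enum { IDLE, STREAMING } vc_state_t;

/* Called once per captured audio frame. Audio only leaves the device in
 * the STREAMING state, entered explicitly by the keyword and left
 * explicitly by the mute command, and the LED always reflects the
 * current state, so the user can see whether the device is live. */
vc_state_t voice_control_step(vc_state_t state, const float *frame, int n)
{
    if (mute_requested()) {              /* shut-down request wins */
        set_led(false);
        return IDLE;
    }
    if (state == IDLE) {
        if (keyword_detected(frame, n)) {
            set_led(true);               /* tell the user we're live */
            return STREAMING;
        }
        return IDLE;
    }
    stream_to_cloud(frame, n);           /* beamformed, cleaned audio */
    return STREAMING;
}
```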
Don't forget the AI connection
Finally, Dag needs his instructions to connect to Viv in the Cloud, which means the voice controller needs to implement a secure, robust WiFi or Bluetooth connection. Alternatively, Viv might be running on a local system, in which case the audio would have to be piped to an application processor over a standard interface such as USB or TDM. Whichever is most suitable, the voice controller must have flexible connectivity to enable the backhaul of the voice data to Viv.
Putting it all together
So that’s the shopping list for a voice controller that we can use to build a product that stops Dag eating his phone: lots of compute, DSP, very low latency, power efficiency, audio playback, WiFi/Bluetooth/USB connectivity and flexible GPIO support.
Each of these features can be implemented using discrete devices, but remember that each device adds to the complexity of the final design – more timing issues, more PCB real estate, more cost.
Multicore microcontrollers built on a deterministic architecture, such as xCORE, have the potential to offer an integrated, and therefore lower cost solution. Such devices can integrate microphone DSP, acoustic DSP and flexible connectivity to standard interfaces with nanosecond response time and low latency. They can also provide audio playback and secure thin-client connectivity to Cloud services, all in a single device.
Conclusion
Speech recognition is coming to your home, office and leisure activities soon – the big technology companies wouldn’t be investing billions of dollars if they thought otherwise. The potential of technologies like Viv will be unlocked by voice interfaces that provide accurate and consistent far-field voice capture, so Dag Kittlaus won’t have to eat his phone every time he does another demo.