Dag Kittlaus, CEO of Viv Labs (viv.ai), has recently given some impressive demonstrations of his latest AI technology, but one thing always looks odd during the Viv demonstrations: he holds the phone very close to his mouth. He’s almost eating his phone every time – it’s like something from Trigger Happy TV. Dag needs to make sure the captured speech is clear enough for the software to decipher his commands, but it’s not a real-world scenario. We’re not all going to want to "eat" our phones in order to book a flight for next week.
Voice interfaces are sometimes called conversational interfaces for good reason – they allow users to talk to the device instead of using touch screens and buttons to execute a command. Conversations take place between two or more people who may be next to each other but are often several meters apart. If we’re going to have conversations with our new digital assistant friends, we have to be able to talk to them at a distance as well as up close. Dag needs something on the podium that he can talk to in a normal voice.
Microphone arrays
Well, everyone knows the problems with microphones, particularly in mobiles, when used in a noisy environment such as an auditorium, a busy restaurant or a factory – or even in the apparent quiet at the top of a windy mountain. The quality of the captured signal is never great because of all the other noise going on around it. How does the phone know to capture just your voice and not the surrounding noise? Can it shield out the reverberation generated by nearby hard surfaces that the sound bounces off? And how is it going to capture someone speaking three meters away, even in a very quiet room?
The good news is that there’s a solution to this: far-field microphones that can capture individual voices several meters away. But we don’t hear much about them – all the chat is about AI engines like Viv, home hubs like Google Home and Amazon Echo, and chatbots. It’s time the actual voice capture interface started to get some attention, because it’s the key to unlocking the potential of all these voice applications.
There has been a lot of research into far-field microphones – search for "far-field microphones" and you’ll find plenty of academic papers in the top results, with many more below them. At the root of many of these papers is a beamformer, created by combining the input from two or more microphones into a lobe that focuses on the voice source while excluding surrounding sounds. The beam can be steered to follow a voice source as it moves around a room, or to capture a different voice, without moving the voice interface itself.
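To make the idea concrete, here is a minimal delay-and-sum beamformer – the simplest kind – sketched in C. It’s an illustration rather than production code: the steering delays are assumed to have been precomputed in whole samples from the array geometry and the estimated direction of arrival, and fractional-sample interpolation, microphone weighting and adaptive variants are all omitted.

```c
#include <stddef.h>

/* Minimal delay-and-sum beamformer sketch.
 * Each microphone's signal is delayed so that sound arriving from the
 * target direction lines up in time, then the aligned samples are
 * averaged. Sound from other directions stays misaligned and partially
 * cancels; that cancellation pattern is what forms the "lobe".
 *
 * mics:    mics[m][n] is sample n of microphone m
 * delays:  per-microphone steering delay in whole samples, assumed
 *          precomputed from the array geometry and the estimated
 *          direction of arrival of the talker
 * out:     beamformed output, num_samples long
 */
void delay_and_sum(const float *const *mics, const size_t *delays,
                   size_t num_mics, size_t num_samples, float *out)
{
    for (size_t n = 0; n < num_samples; n++) {
        float acc = 0.0f;
        for (size_t m = 0; m < num_mics; m++) {
            /* Read the sample that corresponds to the steered wavefront;
             * clamp at the start of the buffer for the first samples. */
            size_t idx = (n >= delays[m]) ? n - delays[m] : 0;
            acc += mics[m][idx];
        }
        out[n] = acc / (float)num_mics; /* averaging keeps unity gain on-axis */
    }
}
```

Steering the beam then just means recomputing the delays for a new target angle; the microphones themselves never move.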
How complex are microphone arrays?
Sounds great, but before we jump in and create a beamformer, let’s consider some of the features a far-field microphone array includes. First, microphone arrays can have different topologies, such as linear or circular, and we need to choose the best one for our particular product; generally an array with more microphones delivers a narrower, more precise lobe with better gain, but it also requires more processing and more power, which can be an issue for a consumer device. Then there’s the beamformer implementation itself. Microphone arrays involve a lot of complicated trigonometry and calculations about the direction of arrival of the sound source, running to millions of cycles per second. The one-bit PDM signal delivered by a typical MEMS microphone also has to be converted to a standard PCM signal and decimated down to a useful sample rate, all of which requires multiple filter stages and a lot more processor compute. And before you think of using an array of any old microphones – you can’t; microphones are built and calibrated to many different levels of performance, so if you pick the wrong microphone at the start you’ll spend all your time trying to fix a fundamentally broken solution.
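As a rough illustration of the PDM-to-PCM step, the sketch below decimates a one-bit PDM stream by counting the ones in each 64-sample window, which is equivalent to a first-order CIC (boxcar) filter. The 64x ratio (say, 3.072 MHz PDM down to 48 kHz PCM) is just an assumption, and a real front end cascades higher-order CIC and FIR stages to suppress the sigma-delta quantization noise properly.

```c
#include <stdint.h>
#include <stddef.h>

/* Crude first-stage PDM-to-PCM decimation: count the one bits in each
 * 64-bit window of the pulse-density stream and center the result on
 * zero. Counting ones over a window is a first-order CIC (boxcar)
 * filter; a real front end cascades higher-order CIC and FIR stages to
 * remove the sigma-delta quantization noise properly.
 *
 * pdm:       packed one-bit samples, 64 per word (e.g. a 3.072 MHz stream)
 * num_words: number of 64-bit words, i.e. the number of output samples
 * pcm:       output samples at 1/64th of the PDM rate (e.g. 48 kHz)
 */
void pdm_to_pcm_64x(const uint64_t *pdm, size_t num_words, int16_t *pcm)
{
    for (size_t i = 0; i < num_words; i++) {
        int ones = 0;
        for (uint64_t w = pdm[i]; w != 0; w >>= 1)
            ones += (int)(w & 1u);     /* population count of the window */
        /* 0..64 ones map to roughly full-scale negative..positive */
        pcm[i] = (int16_t)((ones - 32) * 1023);
    }
}
```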
The microphone array must be smart so it can differentiate between voice sources and music sources; there’s no point in capturing a voice source if the microphone keeps picking up background music as well.
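By way of illustration only, the toy gate below flags frames as speech-like using short-term energy and zero-crossing rate – two of the cheapest features available. It’s enough to reject silence and steady noise, but it will happily pass music too; genuine voice/music discrimination needs richer features or a trained classifier, and the thresholds here are arbitrary assumptions.

```c
#include <stddef.h>
#include <stdbool.h>
#include <math.h>

/* Toy speech-likeness gate: short-term energy plus zero-crossing rate.
 * Enough to reject silence and steady hiss; NOT enough to reject music,
 * which needs richer features or a trained classifier. Thresholds are
 * arbitrary placeholders for 16 kHz audio in 16 ms frames. */
bool frame_is_speech_like(const float *frame, size_t n)
{
    float energy = 0.0f;
    size_t crossings = 0;
    for (size_t i = 0; i < n; i++) {
        energy += frame[i] * frame[i];
        if (i > 0 && (frame[i] >= 0.0f) != (frame[i - 1] >= 0.0f))
            crossings++;
    }
    float rms = sqrtf(energy / (float)n);
    float zcr = (float)crossings / (float)n;

    /* Speech tends to sit in a band of moderate zero-crossing rates:
     * higher than low-frequency rumble, lower than broadband hiss. */
    return rms > 0.01f && zcr > 0.02f && zcr < 0.35f;
}
```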
The architecture on which the microphone array is built must also have low latency to enable a natural speech experience, especially in applications featuring bi-directional communication. Too much buffering in the system will introduce lag and degrade the overall customer experience; at a 16 kHz sample rate, for example, every 160 samples of buffering adds another 10 ms of delay.
Why do you need acoustic DSP?
But beamformers are just part of the product we need to build to stop Dag eating his phone. The captured voice streams are subject to a lot of additional interference, such as echo and reverberation caused by signals bouncing off the hard surfaces in the surrounding environment. These effects change from one environment to another; the amount of reverberation in an auditorium is very different from that in your kitchen, bedroom or office. The voice interface must apply additional signal processing for echo cancellation and de-reverberation to the voice data before it can be passed to the speech engine, and the level of DSP required will be dictated by the environment it is tuned for. Even once the audio stream is clean, some gain control is probably necessary to boost the signal before it’s passed to the speech engine.
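Echo cancellation is usually done with an adaptive filter that learns the path from the device’s own loudspeaker back into the microphones and subtracts its estimate of the echo. Here is a minimal sketch of the standard NLMS (normalized least-mean-squares) update; the tap count and step size are assumptions, and real cancellers add double-talk detection and residual echo suppression on top.

```c
#include <stddef.h>

#define TAPS 256                /* assumed echo-path length in samples */

/* Minimal NLMS acoustic echo canceller sketch.
 * x: reference sample (what the loudspeaker just played)
 * d: microphone sample (talker's voice plus loudspeaker echo)
 * Returns the echo-cancelled sample e = d - estimated_echo, and nudges
 * the filter weights so the echo estimate improves over time. */
typedef struct {
    float w[TAPS];              /* adaptive weights (echo-path model)  */
    float x[TAPS];              /* delay line of recent reference data */
} aec_t;

float aec_process(aec_t *s, float x, float d)
{
    /* Shift the reference delay line and insert the new sample. */
    for (size_t i = TAPS - 1; i > 0; i--)
        s->x[i] = s->x[i - 1];
    s->x[0] = x;

    /* Estimate the echo as the filter's response to the reference. */
    float y = 0.0f, power = 1e-6f;   /* small floor avoids divide-by-zero */
    for (size_t i = 0; i < TAPS; i++) {
        y     += s->w[i] * s->x[i];
        power += s->x[i] * s->x[i];
    }

    float e  = d - y;                /* what remains after removing echo */
    float mu = 0.5f / power;         /* normalized step size (assumed)   */
    for (size_t i = 0; i < TAPS; i++)
        s->w[i] += mu * e * s->x[i]; /* NLMS weight update               */
    return e;
}
```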
We’re told that speech recognition engines need to be at least 95% accurate before users will adopt them, but that cannot happen if the signal passed to them isn’t the best possible quality; anything less than 99% clarity will have an adverse effect on the performance of the speech engine and the digital assistant.
Taking control of voice interfaces
Above all, voice controllers should only be active when they are explicitly addressed, usually by a keyword. In our case there will be lots of background chatter, but when Dag wants to do his demonstration he needs to be able to activate the device and get it to lock onto his voice. And when he says something that shouldn’t be transmitted to the Cloud, he needs to be able to shut the device down. Which raises another requirement for voice controllers – how does Dag know the device is active and listening to him? The device must be able to communicate back to Dag; it may play back a reply or the clarifying questions the digital assistant needs (just like a real conversation), or it might use some other visual cue such as LEDs to show whether the device is active and, if so, its current state. Many personal robots in Japan use facial expressions to indicate a response, so the voice controller may also have to drive an LED screen.
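The control logic around all of this can be as simple as a small state machine: stay idle until the keyword fires, stream while locked onto the talker, and stop transmitting on a mute request, signaling the current state on an LED throughout. The sketch below is hypothetical – keyword_detected(), mute_requested(), stream_to_cloud() and set_led() are stand-ins for whatever detector, transport and indicator a real device provides.

```c
#include <stdbool.h>

/* Hypothetical hooks: stand-ins for a real device's keyword detector,
 * privacy/mute control, cloud transport and status indicator. */
extern bool keyword_detected(const float *frame, int n);
extern bool mute_requested(void);
extern void stream_to_cloud(const float *frame, int n);
extern void set_led(bool listening);

typedef enum { IDLE, STREAMING } vc_state_t;

/* Called once per captured audio frame. Audio only leaves the device in
 * the STREAMING state, entered explicitly by the keyword and left
 * explicitly by the mute command, and the LED always reflects the
 * current state, so the user can see whether the device is live. */
vc_state_t voice_control_step(vc_state_t state, const float *frame, int n)
{
    if (mute_requested()) {              /* shut-down request wins */
        set_led(false);
        return IDLE;
    }
    if (state == IDLE) {
        if (keyword_detected(frame, n)) {
            set_led(true);               /* tell the user we're live */
            return STREAMING;
        }
        return IDLE;
    }
    stream_to_cloud(frame, n);           /* beamformed, cleaned audio */
    return STREAMING;
}
```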
Don't forget the AI connection
Finally, Dag needs his instructions to connect to Viv in the Cloud, which means the voice controller needs to implement a secure, robust WiFi or Bluetooth connection. Alternatively, Viv might be running on a local system, in which case the audio would have to be piped to an application processor over a standard interface such as USB or TDM. Whichever is most suitable, the voice controller must have flexible connectivity to enable the backhaul of the voice data to Viv.
Putting it all together
So that’s the shopping list for a voice controller that we can use to build a product that stops Dag eating his phone: lots of compute, DSP, very low latency, power efficiency, audio playback, WiFi/Bluetooth/USB connectivity and flexible GPIO support.
Each of these features can be implemented using discrete devices, but remember that each device adds to the complexity of the final design – more timing issues, more PCB real estate, more cost.
Multicore microcontrollers built on a deterministic architecture, such as xCORE, have the potential to offer an integrated, and therefore lower cost solution. Such devices can integrate microphone DSP, acoustic DSP and flexible connectivity to standard interfaces with nanosecond response time and low latency. They can also provide audio playback and secure thin-client connectivity to Cloud services, all in a single device.
Conclusion
Speech recognition is coming to your home, office and leisure activities soon – the big technology companies wouldn’t be investing billions of dollars if they thought otherwise. The potential of technologies like Viv will be unlocked by voice interfaces that provide accurate and consistent far-field voice capture, so Dag Kittlaus won’t have to eat his phone every time he does another demo.