Voice Command

I recently became curious about what it’d be like to build a product around voice and how accessible voice recognition is to hobbyist developers. As a result, I decided to build a browser-based version of Alexa / Google Home, called Voice Command.

Voice Command is a Chrome extension that lives on your browser’s new tab page. It accepts voice commands and executes them in the browser; a sketch of how such commands might be dispatched appears after the list. Example commands that are supported include:

  • What’s today’s weather?
  • CNN
  • Play The Beatles
  • Search for El Toro’s address
  • New document
  • New spreadsheet
  • Close all tabs
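
Internally, each recognized transcript has to be matched to a browser action. Here’s a minimal sketch of what such a dispatcher could look like; the command table, patterns, and fallback behavior are my own illustration, not Voice Command’s actual source. It assumes an extension page with the `tabs` permission:

```typescript
// Hypothetical command dispatcher: maps a transcript to a browser action.
type Handler = (match: RegExpMatchArray) => void;

const commands: Array<[RegExp, Handler]> = [
  // "new doc" / "new document" -> open a blank Google Doc
  [/^new doc(ument)?$/i, () => {
    chrome.tabs.create({ url: "https://docs.new" });
  }],
  // "play <query>" -> a YouTube search (the real extension uses
  // YouTube's data API to pick a video)
  [/^play (.+)$/i, (m) => {
    chrome.tabs.create({
      url: "https://www.youtube.com/results?search_query=" +
        encodeURIComponent(m[1]),
    });
  }],
  // "close all tabs" -> close every tab in the current window
  [/^close all tabs$/i, () => {
    chrome.tabs.query({ currentWindow: true }, (tabs) => {
      chrome.tabs.remove(
        tabs.flatMap((t) => (t.id !== undefined ? [t.id] : []))
      );
    });
  }],
];

// Try each pattern in order; fall back to a web search.
function dispatch(transcript: string): void {
  for (const [pattern, handler] of commands) {
    const match = transcript.trim().match(pattern);
    if (match) {
      handler(match);
      return;
    }
  }
  chrome.tabs.create({
    url: "https://www.bing.com/search?q=" + encodeURIComponent(transcript),
  });
}
```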

The extension does not log any queries out of respect for the half dozen people who use it.

Technical overview

Voice Command is powered by a React frontend, an AWS Lambda API, and a few public APIs (e.g., YouTube’s Data API, Bing’s Search API). Astonishingly, the browser handles most of the heavy lifting for speech recognition: it turns out there’s a browser specification for voice-to-text transcription, the Web Speech API. Back in 2013, Chrome introduced its SpeechRecognition interface into the browser. (Other browsers have not yet followed suit.)
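
The API itself is compact. Here’s a minimal sketch of the Chrome-prefixed interface; the hand-off to `dispatch` assumes a command dispatcher like the one sketched earlier:

```typescript
// Chrome exposes the constructor as webkitSpeechRecognition; the `any`
// casts sidestep missing type declarations in a plain TS setup.
const recognition = new (window as any).webkitSpeechRecognition();
recognition.continuous = true;      // keep listening across utterances
recognition.interimResults = false; // only deliver final transcripts
recognition.lang = "en-US";

recognition.onresult = (event: any) => {
  // Each result holds a list of alternatives; take the top hypothesis
  // of the most recent (final) result.
  const result = event.results[event.results.length - 1];
  const transcript: string = result[0].transcript;
  dispatch(transcript); // hand off to a dispatcher like the one above
};

recognition.onerror = (event: any) => {
  console.error("speech recognition error:", event.error);
};

recognition.start(); // Chrome prompts for microphone permission
```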

User experience

Voice Command would obviously be useful to someone who feels uncomfortable typing or who cannot type.

However, as someone who does feel comfortable typing, I find Voice Command most useful for invoking multi-step commands (e.g., ‘new doc’) and for using the computer with my wife. The first use-case makes sense (faster is better), but the second was a mild surprise and helped me realize how much of a solo activity using a keyboard is. If you’re using a computer with someone, one of you has to grab control of the keyboard to get anything done. That action immediately designates one person as the user and the other as the follower. When speaking to a computer, by contrast, both people are on equal footing, which feels more comfortable when accomplishing tasks jointly.

The Future

Speech recognition is now at human-level accuracy for many tasks, and progress will only continue. In 2016 and 2017, Microsoft announced speech transcription systems that achieved parity with human transcribers, with word error rates as low as 5.1%. In other words, we can now correctly transcribe roughly 19 out of 20 words. These error rates are low enough to unlock new product use-cases, and in fact, they have: as of 2019, Amazon has sold more than 100M Alexa-enabled devices.
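
For context, word error rate (WER) is the standard metric here: the number of word substitutions (S), deletions (D), and insertions (I) needed to turn the system’s transcript into the reference, divided by the reference length (N). The 19-out-of-20 figure falls out directly:

```latex
\mathrm{WER} = \frac{S + D + I}{N}, \qquad
\mathrm{WER} = 0.051 \;\Rightarrow\; \text{about one error every } \tfrac{1}{0.051} \approx 19.6 \text{ words}
```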

I expect that voice will continue to work its way into our computing experience, but I would be surprised if it completely supplanted typing as our primary interface. Voice technology will likely emerge as a way to control computing devices in domains where typing is inconvenient and where shared control of the device is useful. The in-car experience is an example of a domain where voice might become dominant (that is, if self-driving cars don’t arrive first). Drivers have little ability to type or tap, and they might also like to extend control of music, air conditioning, or heat to other passengers. Similarly, voice products will likely become extremely popular with children, who may not yet have mastered typing.