It’s tough for a would-be Iron Man to work on creating their personal AI assistant on the weekends. Like any other time-pressured inventor without a PhD in computer science and linguistics, I decided to use a library for speech recognition and synthesis. Fortunately, Python offers several choices. Unfortunately, many of them simply don’t work anymore. I will discuss the ones that are still functional and can be used with Python 2.7 and Python 3 (up to Python 3.5 at the time of writing).
Jarvis’s Mouth: Text-to-Speech
We’ve all heard of Google’s AI initiatives (maybe Larry Page aspires to Tony Stark status?), so it should come as little surprise that they offer a RESTful way to do voice recognition and speech synthesis. The Python library that nicely wraps their text-to-speech API is gTTS.
gTTS takes advantage of Google Translate’s voice capability and downloads the response to a parameterized GET request into an mp3 file (or an in-memory file). However, you need one of the pygame, pyglet + AVBin, or VLC Player Python libraries to play that mp3, which adds complexity and dependency bloat. Additionally, you need an API key to use Google Translate (no longer free because of the “substantial economic burden caused by unintended abuse”).
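As a rough illustration of the in-memory route, here is a minimal sketch assuming gTTS is installed and the network is reachable. The helper names synthesize and save_mp3 are my own, and the sketch passes no API key; whether one is needed depends on the gTTS version and Google's terms at the time you run it.

```python
import io


def synthesize(text, lang="en"):
    """Return the MP3 bytes gTTS produces for `text` (requires network access)."""
    from gtts import gTTS  # third-party; imported lazily so this file loads without it

    buffer = io.BytesIO()
    # write_to_fp streams the downloaded audio into any file-like object.
    gTTS(text=text, lang=lang).write_to_fp(buffer)
    return buffer.getvalue()


def save_mp3(text, path, lang="en"):
    """Write the synthesized speech to `path` and return the path."""
    with open(path, "wb") as handle:
        handle.write(synthesize(text, lang))
    return path
```

Calling save_mp3("Good evening, sir.", "greeting.mp3") should leave a file you can hand to whichever player library you settled on.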
I also found pyttsx, which is a great offline option. JPercent has updated it for Python 3. Install JPercent’s version by downloading the repository and running pip install from that local folder; otherwise, pip install pyttsx directly. For Windows, you need to install PyWin32 and make sure you have the Microsoft Speech API installed. If you’re using virtualenv, you must copy the win32 folders and the files with win32 in their names from your Python install directory’s Lib/site-packages to your virtualenv’s Lib/site-packages. Additionally, pyttsx has difficulty detecting the installed libraries on Windows 10 and must be manually initialized to use ‘sapi5’.
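The manual initialization can be sketched roughly as follows, assuming pyttsx is installed. The pick_driver and speak helpers are hypothetical names of mine, and forcing 'sapi5' on every Windows version (not only Windows 10) is a simplifying assumption.

```python
import sys


def pick_driver(platform=sys.platform):
    """Return the pyttsx driver name to request, or None to let pyttsx auto-detect."""
    # Assumption: always ask for 'sapi5' on Windows, where auto-detection can fail.
    return "sapi5" if platform.startswith("win") else None


def speak(text, driver=None):
    """Say `text` aloud through the local speech engine."""
    import pyttsx  # third-party; imported lazily so pick_driver works without it

    engine = pyttsx.init(driver)  # driver=None falls back to pyttsx's own detection
    engine.say(text)
    engine.runAndWait()  # blocks until the queued speech has finished
```

A call such as speak("Good evening, sir.", pick_driver()) should then work the same way on Windows and elsewhere.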
Jarvis’s Ears: Speech Recognition
SpeechRecognition is a wonderful, up-to-date library that lets you use CMU’s open source Sphinx project, Google services, or Wit.ai to convert audio input into text. If you intend to recognize microphone input, you’ll also need PyAudio. Sphinx is enormously powerful, but SpeechRecognition includes a wheel package for a simple language definition and handles all of the complexity for us. While Google speech recognition is a paid service, Wit.ai is free as of last January, when it was acquired by Facebook. I found Google’s offering to be of the highest quality, and you have 50 calls/day free with your developer API key. Sphinx’s voice recognition was by far the poorest and often garbled words, though it still tended to produce useful recognized words in the middle and end of the phrase.
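Switching between the three backends might look something like the sketch below, assuming SpeechRecognition and PyAudio are installed (plus PocketSphinx for the offline route). The transcribe and listen_once helpers and the backend labels are my own naming, not part of the library.

```python
def transcribe(recognizer, audio, backend="sphinx", key=None):
    """Convert captured audio to text with the chosen backend."""
    if backend == "sphinx":
        return recognizer.recognize_sphinx(audio)  # offline, no key needed
    if backend == "google":
        return recognizer.recognize_google(audio, key=key)
    if backend == "wit":
        return recognizer.recognize_wit(audio, key=key)
    raise ValueError("unknown backend: %r" % (backend,))


def listen_once(backend="sphinx", key=None):
    """Record one phrase from the default microphone and return the recognized text."""
    import speech_recognition as sr  # third-party; imported lazily

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate against background noise
        audio = recognizer.listen(source)
    try:
        return transcribe(recognizer, audio, backend, key)
    except sr.UnknownValueError:
        return None  # the backend could not make sense of the audio
```

Keeping the backend choice in one small function makes it easy to start with Sphinx and swap in a hosted service later without touching the microphone code.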
Jarvis’s Brain: The Code
I ultimately opted to use pyttsx and SpeechRecognition/Sphinx because they are offline and free, with great open source licenses. Whichever route you may choose, you can now assemble these libraries to create your own Frankensteinian AI assistant:
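One way to wire the pieces together is sketched here, assuming pyttsx, SpeechRecognition, PocketSphinx, and PyAudio are all installed. The replies, the "goodbye" stop word, and the function names are hypothetical choices for illustration.

```python
def respond(heard):
    """Map recognized text to a spoken reply (illustrative commands only)."""
    if heard is None:
        return "Sorry, I did not catch that."
    if "hello" in heard.lower():
        return "Hello, sir."
    return "You said: " + heard


def run_assistant(listen, speak, stop_word="goodbye"):
    """Listen and reply in a loop until the stop word is heard."""
    speak("Jarvis online.")
    while True:
        heard = listen()
        if heard and stop_word in heard.lower():
            speak("Goodbye.")
            break
        speak(respond(heard))


def main():
    """Connect the loop to a real microphone and speech engine."""
    import pyttsx  # third-party; imported lazily
    import speech_recognition as sr  # third-party; imported lazily

    engine = pyttsx.init()  # pass 'sapi5' here if detection fails on Windows
    recognizer = sr.Recognizer()

    def speak(text):
        engine.say(text)
        engine.runAndWait()

    def listen():
        with sr.Microphone() as source:
            audio = recognizer.listen(source)
        try:
            return recognizer.recognize_sphinx(audio)
        except sr.UnknownValueError:
            return None  # Sphinx could not recognize the phrase

    run_assistant(listen, speak)
```

Calling main() starts the loop. Because run_assistant only needs two callables, you can swap in a different recognizer or voice without changing the loop itself.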
The code itself is straightforward, as anyone would hope after spending a couple of hours researching this on a weekend (and getting it to work on Windows, no less). Try it out!