You might have many reasons to do speech-to-text (STT) transformations locally - privacy, custom-trained models, or maybe you just don’t want the latency that comes with online services. I have a podcast that I want to transcribe and generate captions for, and I wanted to do that blazingly fast. One of the choices for STT is DeepSpeech - a library developed by Mozilla that does just that. Better yet, it comes with a pre-trained English speech model that you can start using right away.
Here is the problem, though. As I started exploring the library, I realized that it had Windows builds, but no concrete instructions on how to get things running on the OS. My primary machine is no longer UNIX-based, so I had a personal interest in getting it working properly - I could finally put my RTX 2080 to good use.
It all starts pretty trivially, as outlined in the official instructions:
- Create a Python virtual environment.
- Install the `deepspeech-gpu` package (if you don’t have a beefy GPU, no worries - just use `deepspeech` instead).
- Make sure that the right version of CUDA and the matching cuDNN are installed.
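On Windows, those steps boil down to something like this (the package names come from the DeepSpeech docs; `deepspeech-venv` is just a name I picked):

```shell
# Create and activate a virtual environment (PowerShell)
python -m venv deepspeech-venv
.\deepspeech-venv\Scripts\Activate.ps1

# GPU-enabled package; use "deepspeech" instead if you are CPU-only
pip install deepspeech-gpu
```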
Easy, right? Or so I thought. When I fed my WAV file through DeepSpeech, as follows:
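For reference, the invocation looked roughly like this - the flags are from the DeepSpeech 0.9.x README, and the model/scorer filenames are the ones shipped with the 0.9.3 release (adjust to whatever you downloaded):

```shell
deepspeech --model deepspeech-0.9.3-models.pbmm `
           --scorer deepspeech-0.9.3-models.scorer `
           --audio podcast.wav
```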
This produced no results whatsoever, even though I could clearly see my GPU doing the heavy lifting. That was puzzling, because in theory I should have gotten some kind of output. If I sliced the file into smaller chunks, the transcription showed up in the terminal just fine.
My first hunch was that the file was simply too big (30+ minutes). Since I was running everything from a PowerShell console, I took a peek at `$LastExitCode` and got `-1073741571`. That’s the signed integer representation of `0xC00000FD`, which is none other than a `STATUS_STACK_OVERFLOW` (more on that in Microsoft Docs). Yikes - so maybe the file really is too big.
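The conversion is easy to check - here’s a quick sketch of reinterpreting the unsigned NTSTATUS value as the signed 32-bit integer that PowerShell reports:

```python
import struct

def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as signed (two's complement)."""
    return struct.unpack("<i", struct.pack("<I", value))[0]

STATUS_STACK_OVERFLOW = 0xC00000FD

# PowerShell's $LastExitCode shows the signed view of the process exit code
print(to_signed32(STATUS_STACK_OVERFLOW))  # -1073741571
```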
But that’s when I learned that there is a variant of DeepSpeech built as a native client that supports a `--stream` argument. This should give me a hint as to whether the text is being properly identified as it passes through the model. I tried it by downloading the latest native client (available in both CPU and CUDA flavors), extracting the contents, and running:
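Roughly like this - I’m going from memory on the exact shape, and the `--stream` behavior has varied between releases (whether it takes a buffer-size value is an assumption on my part; check the client’s `--help` output):

```shell
.\deepspeech.exe --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio podcast.wav --stream 320000
```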
The output came back very fast - and empty again; it didn’t seem like much happened this time either. What could be wrong? Well, as it turns out, someone else had the same issue I did, and the problem was hiding in the format of the WAV file. The way I was converting my WAV from stereo to 16-bit, 16 kHz mono, `ffmpeg` was appending LIST-INFO metadata, and the DeepSpeech client is very particular about the WAV files it processes. No big deal - that can be addressed by converting the file as such:
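My understanding of the fix is to keep `ffmpeg` from writing metadata at all - `-map_metadata -1` drops the source metadata, and `-fflags +bitexact` stops the muxer from adding its own LIST-INFO chunk (this exact flag combination is my reconstruction; verify against your ffmpeg version):

```shell
ffmpeg -i podcast-source.wav -ac 1 -ar 16000 -acodec pcm_s16le -map_metadata -1 -fflags +bitexact podcast.wav
```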
I ran the same DeepSpeech native client command, and lo and behold, I actually started getting output from the large file. But a couple of minutes in, the process exited once again with the familiar `STATUS_STACK_OVERFLOW`.
Looks like the CLI is really not intended for processing large files. Reading through the DeepSpeech forums, I stumbled (on several occasions) across statements that the CLI is for demo purposes only.
I guess I’ll need to write the code myself. I quickly whipped up a new Console (.NET Framework) application in Visual Studio, added the necessary libraries (I’ll have a separate blog post with the actual C# sample soon), and wrote this masterpiece:
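The C# sample will have to wait for that later post, but the same idea in Python looks roughly like this - a sketch against the `deepspeech` package’s API, with the 0.9.3 model filenames assumed:

```python
import wave

import numpy as np
from deepspeech import Model

# Load the pre-trained English model and its external scorer
# (filenames assumed from the 0.9.3 release)
ds = Model("deepspeech-0.9.3-models.pbmm")
ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# DeepSpeech expects 16-bit, 16 kHz, mono PCM samples
with wave.open("podcast.wav", "rb") as f:
    audio = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)

print(ds.stt(audio))
```

Like the console app, this swallows the whole file in one go; the streaming API (`createStream`/`feedAudioContent`) is the more sustainable route for long recordings.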
With the CPU (AMD Ryzen 9 3900XT 12-Core Processor) library, it took about 24 minutes to generate the transcript. With the GPU (NVIDIA GeForce RTX 2080 SUPER) library - 27 minutes. I do need to write a better sample snippet and explain it in a future blog post, because taking one giant file and waiting for it to complete is not really sustainable, but it works for now.
That was my adventure running DeepSpeech on Windows. It’s a great project in dire need of documentation, samples, and explanations of how and why to get things working on Windows. I hope to add some of that context here on my blog.