Speech-to-text with Parakeet 0.6b v2
May 12th, 2025
Nvidia announced their Parakeet-TDT-0.6b-v2 model last week and I was immediately intrigued.
It’s a relatively small 600-million-parameter model which boasts some impressive stats: a 6.05% Word Error Rate (WER) and a ludicrous inverse Real-Time Factor (RTFx) of 3386. This means the model can transcribe 60 minutes of audio in just one second on an A100, an approximately $10,000 USD GPU. That’s fast. So fast, in fact, that it has taken first place on Hugging Face’s Open ASR Leaderboard.
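For a quick sanity check of that claim (RTFx here is audio duration divided by processing time), the arithmetic works out as you’d hope:

# RTFx = audio duration / processing time, so an hour of audio at RTFx 3386
# should take roughly this many seconds on the benchmark hardware (an A100):
rtfx = 3386
audio_seconds = 60 * 60
print(f"{audio_seconds / rtfx:.2f} s")  # ~1.06 s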
The model is released under a Creative Commons Attribution 4.0 license, making it suitable for both commercial and non-commercial use. Perhaps a key limitation for wider adoption, though, is that it’s currently English-only.
Running on Apple Silicon with MLX
Parakeet was unsurprisingly optimised for Nvidia’s GPU architectures. However, I do most of my daily work at the moment on an M3 MacBook Pro with 36GB of unified memory. So I was delighted to discover that GitHub user Senstella had ported the model to Apple’s MLX framework in their parakeet-mlx repository.
My understanding is that this 0.6B parameter model requires a minimum of 2GB of unified memory, which is impressively small and makes it accessible on even lower-spec Macs. Users have posted on the repository showing successful runs on 8GB MacBook Airs.
Installation
I installed parakeet-mlx using pipx, my preferred tool for installing Python CLI applications in isolated environments. One day I’ll find the time to move to uv.
pipx install parakeet-mlx
pipx handles creating an isolated environment, installing the package and its dependencies, and adding the CLI tool to my PATH.
Of course, things are never that simple. I immediately ran into this error when trying to use the CLI:
ModuleNotFoundError: No module named '_lzma'
This is a classic Python-on-macOS-via-pyenv issue. It usually means the development libraries for LZMA weren’t present when Python itself was compiled. Since I use pyenv to manage my Python versions, the fix involved:
- Installing the xz package (which provides the LZMA libraries) via Homebrew.
- Reinstalling my Python version so it could pick up the newly available libraries.
brew install xz
pyenv uninstall 3.12.2
pyenv install 3.13.3
pyenv global 3.13.3
I took the opportunity to bump to Python 3.13.3 while I was at it.
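A quick way to confirm the rebuilt interpreter actually picked up the LZMA bindings is to import the module that was failing before:

# If the rebuild worked, this imports cleanly instead of raising
# ModuleNotFoundError: No module named '_lzma'.
import lzma

print(lzma.LZMAFile)  # the stdlib lzma module is usable again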
Transcribing
The parakeet-mlx CLI makes it straightforward to transcribe a file:
parakeet-mlx <audio_file>
By default, it outputs an srt subtitle file, but you can specify txt or json using the --output-format flag.
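Since the CLI takes one file per invocation, a small wrapper makes it easy to batch a folder of recordings. This is just a sketch: the recordings directory and the *.mp3 pattern are hypothetical, and the only options used are the positional file argument and the --output-format flag described above.

# Minimal batch wrapper around the parakeet-mlx CLI.
import subprocess
from pathlib import Path

for audio_file in sorted(Path("recordings").glob("*.mp3")):
    print(f"Transcribing {audio_file.name}...")
    subprocess.run(
        ["parakeet-mlx", str(audio_file), "--output-format", "txt"],
        check=True,  # stop if any transcription fails
    )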
To put it through its paces, I fed it an MP4 video file that was 1 hour and 8 minutes long. Even with a mess of other applications open (including a small Linux VM chugging away), the transcription completed in just 1 minute and 2 seconds. This is dramatically faster than running something like Whisper large locally on my M3 for the same file.
One of the touted capabilities of this model is song-to-lyrics transcription. I was curious about this, as I’d been experimenting with generating AI music videos last year. I ran a copy of one of those videos through parakeet-mlx.
The transcription was… okay. Not terrible, but not amazing either. Google’s Gemini models actually produced a more accurate transcription. To be fair, this is a sample size of one, and music transcription is notoriously difficult.
Two notable features are missing for now:
- Diarization: the ability to distinguish between different speakers. This is a common request for ASR models used for meetings or interviews, and the only option right now is some kind of post-processing.
- Built-in streaming: the parakeet-mlx port doesn’t have a direct audio streaming feature yet, but it is listed as a to-do item on the GitHub repository.
A quick streaming test
The model’s speed is its golden feature. It clearly processes audio much faster than it arrives in real-time, which is the fundamental prerequisite for streaming transcription.
Implementing robust, low-latency streaming with accurate voice activity detection (VAD) and intelligent merging of text segments is non-trivial. But I did put together this quick Python script using the sounddevice library to capture audio from my microphone in 5-second chunks and feed it to the parakeet-mlx library:
import sounddevice as sd
import numpy as np
import mlx.core as mx
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import (
    get_logmel,
)

MODEL_ID = "mlx-community/parakeet-tdt-0.6b-v2"
MODEL_DTYPE = mx.bfloat16
CHUNK_SECONDS = 5

print(f"Loading model: {MODEL_ID}...")
model = from_pretrained(MODEL_ID, dtype=MODEL_DTYPE)
# The model's preprocessor dictates the sample rate it expects (16 kHz here).
TARGET_SAMPLE_RATE = model.preprocessor_config.sample_rate
print("Model loaded.")


def audio_callback(indata, frames, time, status):
    # Called by sounddevice once per captured block of audio.
    if status:
        print(status)
    # Take the single channel and hand it to MLX.
    audio_np = indata[:, 0]
    audio_mx = mx.array(audio_np, dtype=MODEL_DTYPE)
    try:
        # Convert raw samples to the log-mel spectrogram the model consumes,
        # then decode this chunk in complete isolation from its neighbours.
        mel_chunk = get_logmel(audio_mx, model.preprocessor_config)
        mx.eval(mel_chunk)
        results_list = model.generate(mel_chunk)
        mx.eval(results_list)
        if results_list:
            result = results_list[0]
            if result.text.strip():
                print(
                    f"[{len(audio_np)/TARGET_SAMPLE_RATE:.2f}s chunk] Transcribed: {result.text.strip()}"
                )
    except Exception as e:
        print(f"Error processing chunk: {e}")


try:
    # Capture mono float32 audio; the blocksize means the callback fires
    # once for every CHUNK_SECONDS of microphone input.
    with sd.InputStream(
        samplerate=TARGET_SAMPLE_RATE,
        channels=1,
        dtype="float32",
        blocksize=int(TARGET_SAMPLE_RATE * CHUNK_SECONDS),
        callback=audio_callback,
    ):
        print(
            f"Streaming from microphone ({TARGET_SAMPLE_RATE} Hz, {CHUNK_SECONDS}s chunks)... Press Ctrl+C to stop."
        )
        while True:
            sd.sleep(1000)
except KeyboardInterrupt:
    print("\nStreaming stopped.")
except Exception as e:
    print(f"An error occurred: {e}")
Running this script gave me output like:
(venv) ➜ parakeet-test python script.py
Loading model: mlx-community/parakeet-tdt-0.6b-v2...
Model loaded.
Streaming from microphone (16000 Hz, 5s chunks)... Press Ctrl+C to stop.
[5.00s chunk] Transcribed: Hello?
[5.00s chunk] Transcribed: Testing one, two, three.
[5.00s chunk] Transcribed: Hello?
It works! It’s obviously very basic:
- Each chunk is transcribed in complete isolation;
- There’s no voice activity detection, so it processes silent chunks too; and
- There’s no overlap between chunks or merging strategy for words that span chunk boundaries; overlapping windows are a common technique for more accurate streaming (see the sketch below).
Still, for a very quick and naive hack, it’s exciting to see it responding in near real-time.
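Here’s a minimal sketch of that overlap idea, independent of parakeet-mlx itself: carry the tail of each chunk into the next one so boundary words are heard twice, and skip chunks that look silent. The overlap length and RMS threshold are illustrative guesses, and the prepare_chunk helper is hypothetical rather than part of any library.

import numpy as np

SAMPLE_RATE = 16_000        # matches the model's expected input rate
CHUNK_SECONDS = 5
OVERLAP_SECONDS = 1.0       # illustrative guess, not a tuned value
SILENCE_RMS = 0.01          # illustrative energy threshold for "probably silence"

_tail = np.zeros(0, dtype=np.float32)  # audio carried over from the previous chunk


def prepare_chunk(chunk: np.ndarray) -> np.ndarray | None:
    """Prepend the previous chunk's tail and drop near-silent chunks.

    Returns the padded audio to transcribe, or None if the chunk looks silent.
    """
    global _tail
    padded = np.concatenate([_tail, chunk])
    # Keep the last OVERLAP_SECONDS of this chunk for the next call, so words
    # straddling the boundary are seen twice.
    _tail = chunk[-int(OVERLAP_SECONDS * SAMPLE_RATE):]
    # Crude voice-activity check: skip chunks with very little energy.
    if np.sqrt(np.mean(chunk ** 2)) < SILENCE_RMS:
        return None
    return padded


if __name__ == "__main__":
    # Simulate two adjacent 5-second chunks to show the overlap being added.
    rng = np.random.default_rng(0)
    for chunk in (
        rng.uniform(-0.1, 0.1, SAMPLE_RATE * CHUNK_SECONDS).astype(np.float32),
        rng.uniform(-0.1, 0.1, SAMPLE_RATE * CHUNK_SECONDS).astype(np.float32),
    ):
        padded = prepare_chunk(chunk)
        print("skip (silence)" if padded is None else f"transcribe {len(padded) / SAMPLE_RATE:.1f}s")

Feeding the padded audio into get_logmel and model.generate as in the script above would then produce overlapping transcripts; deduplicating the repeated words at the seams is the part this sketch deliberately leaves out.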
What does this unlock?
Fast, local speech-to-text isn’t a completely new capability. I’ve been using various Whisper implementations (like whisper.cpp) for a while now. What Parakeet-TDT-0.6b-v2, especially via this MLX port, brings to the table is a significant leap in speed for high-accuracy transcription on Apple Silicon.
This dramatic speed-up shifts ASR on my Mac from something I’d occasionally run when necessary, taking a walk to stretch my legs while it processed, towards an instant utility layer that I can see myself engaging with far more frequently.