Hardware: 2024 MacBook Pro, M4 Pro, 48GB RAM, macOS Tahoe 26.0 (unless noted).
## Transcription (Parakeet TDT v3)
FLEURS results for 24 of the model's 25 supported European languages:
| Language | WER% | CER% | RTFx | Files |
|---|---|---|---|---|
| Italian | 4.0 | 1.3 | 236.7 | 350 |
| Spanish | 4.5 | 2.2 | 221.7 | 350 |
| English (US) | 5.4 | 2.5 | 207.4 | 350 |
| French | 5.9 | 2.2 | 199.9 | 350 |
| German | 5.9 | 1.9 | 220.9 | 350 |
| Russian | 7.2 | 2.2 | 209.7 | 350 |
| Ukrainian | 7.2 | 2.5 | 201.9 | 350 |
| Dutch | 7.8 | 2.6 | 191.7 | 350 |
| Polish | 8.6 | 2.8 | 190.2 | 350 |
| Czech | 12.0 | 3.8 | 214.2 | 350 |
| Slovak | 12.6 | 4.4 | 227.6 | 350 |
| Bulgarian | 12.8 | 4.1 | 195.2 | 350 |
| Croatian | 14.0 | 4.3 | 204.9 | 350 |
| Romanian | 14.4 | 4.7 | 200.4 | 883 |
| Finnish | 14.8 | 3.1 | 222.0 | 918 |
| Swedish | 16.8 | 5.0 | 219.5 | 759 |
| Hungarian | 17.6 | 5.2 | 213.6 | 905 |
| Estonian | 20.1 | 4.2 | 225.3 | 893 |
| Danish | 20.2 | 7.4 | 214.4 | 930 |
| Lithuanian | 25.0 | 6.8 | 202.8 | 986 |
| Maltese | 25.2 | 9.3 | 217.4 | 926 |
| Latvian | 27.1 | 7.5 | 217.8 | 851 |
| Slovenian | 27.4 | 9.2 | 197.1 | 834 |
| Greek | 36.9 | 13.7 | 183.0 | 650 |
| Average | 14.7 | 4.7 | 209.8 | 14,085 |
### LibriSpeech (English)
| Model | Dataset | WER% | CER% | RTFx | Files |
|---|---|---|---|---|---|
| TDT v3 | test-clean | 2.5 | 1.0 | 155.6 | 2,620 |
| TDT v2 | test-clean | 2.1 | 0.7 | 145.8 | 2,620 |
v2 has a lower English WER; use it if you only need English.
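Switching versions is typically a one-line change at model-load time. A minimal sketch; the `version:` parameter and `.v2` case below are illustrative assumptions, not FluidAudio's confirmed API:

```swift
import FluidAudio

// Sketch only: the `version:` parameter and `.v2` case are assumed
// for illustration; check FluidAudio's actual model-loading API.
func makeEnglishAsr() async throws -> AsrManager {
    // v2: 2.1 WER on test-clean, English only.
    // v3: 2.5 WER on test-clean, but 25 European languages.
    let models = try await AsrModels.downloadAndLoad(version: .v2)
    let asr = AsrManager(config: .default)
    try await asr.initialize(models: models)
    return asr
}
```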
## Model Compilation Times
First-load CoreML compile times (ANE compilation):
| Model | iPhone 16 Pro Max (cold) | iPhone 16 Pro Max (warm) | iPhone 13 (cold) |
|---|---|---|---|
| Preprocessor | 9ms | — | 633ms |
| Encoder | 3,361ms | 162ms | 4,396ms |
| Decoder | 88ms | 8ms | 146ms |
| JointDecision | 48ms | 8ms | 72ms |
Cold start = first load after install. Warm = subsequent loads from ANE cache.
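Given the 3-4s cold-start encoder compile, it can pay to trigger the first model load off the critical path (e.g. during onboarding) so real transcriptions hit the warm ANE cache. A sketch, assuming `AsrModels.downloadAndLoad()` is the loader entry point:

```swift
import FluidAudio

// Sketch: pay the cold-start CoreML/ANE compile in the background.
// The loader call is an assumption for illustration.
func prewarmAsrModels() {
    Task.detached(priority: .utility) {
        do {
            _ = try await AsrModels.downloadAndLoad()  // compiles + caches on first run
        } catch {
            // Non-fatal: a later load will simply compile lazily.
            print("ASR prewarm failed: \(error)")
        }
    }
}
```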
## Custom Vocabulary Boosting
Earnings22 benchmark (771 files, earnings call transcripts with domain-specific terms):
| Metric | Value |
|---|---|
| Average WER | 15.0% |
| Vocab Precision | 99.3% (TP=1068, FP=8) |
| Vocab Recall | 85.2% (TP=1068, FN=185) |
| Vocab F-score | 91.7% |
| Dict Pass (Recall) | 99.3% (1299/1308) |
| RTFx | 63.4x |
| Total audio | 11,565s |
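In application code, vocabulary boosting amounts to registering domain terms with the recognizer before decoding. A hedged sketch; the `setCustomVocabulary` hook is an assumed name, not confirmed API:

```swift
import FluidAudio

// Sketch: bias decoding toward domain terms (tickers, product names).
// `setCustomVocabulary` is an assumed method name for illustration.
func transcribeEarningsCall(_ samples: [Float]) async throws -> String {
    let models = try await AsrModels.downloadAndLoad()
    let asr = AsrManager(config: .default)
    try await asr.initialize(models: models)
    asr.setCustomVocabulary(["EBITDA", "ARR", "YoY", "Kubernetes", "Snowflake"])
    return try await asr.transcribe(samples).text
}
```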
## Streaming ASR (Parakeet EOU)
Hardware: Apple M2, 2022, macOS 26.
LibriSpeech test-clean (2,620 files, 5.4h audio):
| Chunk Size | WER (Avg) | RTFx | Total Time |
|---|---|---|---|
| 320ms | 4.87% | 12.48x | 26min |
| 160ms | 8.29% | 4.78x | 68min |
320ms is the recommended default, giving the best accuracy/latency tradeoff.
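At 16kHz, 320ms works out to 5,120 samples per chunk. A sketch of the chunking loop; the `StreamingAsr` type and its `feed`/`finish` methods are illustrative assumptions:

```swift
import FluidAudio

// Sketch: stream 320 ms chunks (the recommended default above).
// `StreamingAsr`, `feed`, and `finish` are assumed names.
func streamTranscribe(_ audio: [Float]) async throws -> String {
    let chunkSize = Int(0.320 * 16_000)  // 320 ms @ 16 kHz = 5,120 samples
    let asr = try await StreamingAsr()
    for start in stride(from: 0, to: audio.count, by: chunkSize) {
        let chunk = Array(audio[start..<min(start + chunkSize, audio.count)])
        try await asr.feed(chunk)  // partial hypotheses arrive as chunks land
    }
    return try await asr.finish()  // final transcript at end of utterance
}
```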
## Voice Activity Detection (Silero VAD v6)
### VOiCES Dataset (25 files, clean speech)
| Metric | Value |
|---|---|
| Accuracy | 96.0% |
| Precision | 100.0% |
| Recall | 95.8% |
| F1-Score | 97.9% |
| RTFx | 1,230.6x |
### MUSAN Full (2,016 files, mixed noise/music/speech)
| Metric | Value |
|---|---|
| Accuracy | 94.2% |
| Precision | 92.6% |
| Recall | 78.9% |
| F1-Score | 85.2% |
| RTFx | 1,220.7x |
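The detection threshold trades recall for precision; the VOiCES run above used 0.85 (see the benchmark command at the end of this page). A sketch of frame-level gating; the `VadManager` usage shown is an assumption:

```swift
import FluidAudio

// Sketch: frame-level speech gating. Silero operates on short frames
// (512 samples ≈ 32 ms @ 16 kHz); `VadManager` and `probability`
// are assumed names for illustration.
func speechMask(for audio: [Float], threshold: Float = 0.85) async throws -> [Bool] {
    let vad = try await VadManager()
    let frame = 512
    var mask: [Bool] = []
    for start in stride(from: 0, to: audio.count, by: frame) {
        let chunk = Array(audio[start..<min(start + frame, audio.count)])
        let p = try await vad.probability(chunk)
        mask.append(p >= threshold)  // true = speech
    }
    return mask
}
```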
## Speaker Diarization
### Offline Pipeline (VBx)
VoxConverse dataset (232 clips):
| Config | DER% | JER% | RTFx |
|---|---|---|---|
| Step ratio 0.2, min duration 1.0s (default) | 15.1 | 39.4 | 122 |
| Step ratio 0.1, min duration 0s (max accuracy) | 13.9 | 42.8 | 65 |
The default is ~2x faster for only ~1.2 points worse DER; use step ratio 0.1 when accuracy is critical.
For reference: pyannote community-1 runs at 1.5-2x RTFx on CPU and 20-25x on MPS; FluidAudio on the ANE runs at 65-122x.
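Choosing between the two configs is a one-line change. A sketch mapping the table onto config values; the `DiarizerConfig` field names and `performCompleteDiarization` call are assumptions:

```swift
import FluidAudio

// Sketch: the two configs from the table above. Type and field
// names are assumed for illustration.
let defaultConfig  = DiarizerConfig(stepRatio: 0.2, minDuration: 1.0)  // ~122x RTFx, 15.1 DER
let accurateConfig = DiarizerConfig(stepRatio: 0.1, minDuration: 0.0)  // ~65x RTFx, 13.9 DER

func diarize(_ samples: [Float], maxAccuracy: Bool = false) async throws -> DiarizationResult {
    let diarizer = DiarizerManager(config: maxAccuracy ? accurateConfig : defaultConfig)
    try await diarizer.initialize()
    return try await diarizer.performCompleteDiarization(samples)
}
```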
### Streaming Pipeline (AMI SDM)
| Chunk | Overlap | Threshold | DER% | RTFx | Best For |
|---|---|---|---|---|---|
| 5s | 0s | 0.8 | 26.2% | 223x | Best accuracy/speed balance |
| 10s | 0s | 0.7 | 33.3% | 392x | Higher throughput |
| 3s | 1s | 0.85 | 49.7% | 51x | Lowest latency |
| 5s | 2s | 0.8 | 43.0% | 69x | — |
5s chunks with 0.8 threshold is the recommended starting point for streaming.
Streaming diarization is 10-15 points worse DER than offline. Use it only when you genuinely need real-time speaker labels; for most apps, offline is more than fast enough.
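If you do need streaming, the recommended starting point maps to three parameters. A sketch config; the type and field names are illustrative assumptions mirroring the CLI's `--chunk-seconds`/`--overlap-seconds`/`--threshold` flags shown at the end of this page:

```swift
import FluidAudio

// Sketch: recommended streaming starting point. Type and field
// names are assumptions mirroring the benchmark CLI flags.
let streamingConfig = StreamingDiarizerConfig(
    chunkSeconds: 5.0,         // best accuracy/speed balance (26.2 DER, 223x)
    overlapSeconds: 0.0,
    clusteringThreshold: 0.8
)
```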
### Sortformer
Hardware: Apple M2, 2022, macOS 26.1.
AMI SDM dataset, NVIDIA high-latency config (30.4s chunks):
| Meeting | DER% | Miss% | FA% | SE% | RTFx |
|---|---|---|---|---|---|
| IS1009b | 16.4 | 10.6 | 0.6 | 5.3 | 127.0 |
| ES2004c | 23.8 | 17.8 | 0.3 | 5.7 | 126.5 |
| ES2004b | 23.9 | 18.7 | 0.2 | 5.0 | 123.9 |
| IS1009a | 26.5 | 16.0 | 1.4 | 9.1 | 134.4 |
| ES2004d | 28.3 | 19.7 | 0.3 | 8.3 | 123.5 |
| IS1009d | 29.1 | 16.5 | 1.0 | 11.6 | 127.9 |
| TS3003b | 31.1 | 27.1 | 0.6 | 3.4 | 125.5 |
| EN2002c | 31.8 | 20.1 | 0.2 | 11.5 | 126.0 |
| ES2004a | 33.7 | 24.6 | 0.1 | 9.0 | 127.2 |
| EN2002b | 34.0 | 20.2 | 0.6 | 13.3 | 127.7 |
| TS3003c | 34.4 | 31.1 | 0.3 | 3.1 | 126.6 |
| EN2002a | 35.6 | 20.0 | 0.4 | 15.2 | 125.4 |
| EN2002d | 37.1 | 20.1 | 0.5 | 16.5 | 125.5 |
| IS1009c | 38.1 | 12.8 | 0.9 | 24.4 | 129.2 |
| TS3003d | 41.0 | 32.0 | 0.1 | 8.8 | 125.6 |
| TS3003a | 41.8 | 36.8 | 0.7 | 4.3 | 125.7 |
| Average | 31.7 | 21.5 | 0.5 | 9.7 | 126.7 |
## Text-to-Speech
Comparison across frameworks generating the same text samples (1s to ~300s of output audio):
### Kokoro 82M
| Framework | Total RTFx | Peak RAM | Notes |
|---|---|---|---|
| PyTorch CPU | 17.0x | 4.85 GB | Known memory leak |
| PyTorch MPS | 10.0x | 1.54 GB | Crashes on long strings |
| MLX | 23.8x | 3.37 GB | — |
| Swift CoreML | 23.2x | 1.50 GB | Lowest memory |
CoreML matches MLX speed with 55% less peak RAM. First run takes ~15s for ANE compilation, subsequent loads ~2s.
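Once models are loaded, synthesis is a single call; budget ~15s for the first-run ANE compile. A sketch; `TtsManager` and `synthesize` are assumed names, not confirmed API:

```swift
import FluidAudio

// Sketch: CoreML Kokoro synthesis. First call pays the ~15 s ANE
// compile; later loads take ~2 s. `TtsManager` and `synthesize`
// are assumed names for illustration.
func speak(_ text: String) async throws -> [Float] {
    let tts = try await TtsManager()       // loads/compiles the CoreML model
    return try await tts.synthesize(text)  // mono Float samples (assumed)
}
```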
## Running Benchmarks

```bash
# Transcription (all languages)
swift run -c release fluidaudio fleurs-benchmark --languages all --samples all

# Transcription (English, LibriSpeech)
swift run -c release fluidaudio asr-benchmark --max-files all

# Custom vocabulary
swift run -c release fluidaudio ctc-earnings-benchmark --auto-download

# Streaming ASR
swift run -c release fluidaudio parakeet-eou --benchmark --chunk-size 320 --use-cache

# VAD
swift run -c release fluidaudio vad-benchmark --dataset voices-subset --all-files --threshold 0.85

# Diarization (offline)
swift run -c release fluidaudio diarization-benchmark --mode offline --auto-download

# Diarization (streaming)
swift run -c release fluidaudio diarization-benchmark --mode streaming \
    --dataset ami-sdm --threshold 0.8 --chunk-seconds 5.0 --overlap-seconds 0.0

# Sortformer
swift run -c release fluidaudio sortformer-benchmark --nvidia-high-latency --hf --auto-download

# TTS
swift run -c release fluidaudio tts --benchmark
```