Hardware: 2024 MacBook Pro, M4 Pro, 48GB RAM, macOS Tahoe 26.0 (unless noted).

## Transcription (Parakeet TDT v3)

25 European languages on FLEURS (RTFx is the inverse real-time factor: audio duration divided by processing time, so the 209.8x average means an hour of audio transcribes in about 17 seconds):
| Language | WER% | CER% | RTFx | Files |
| --- | --- | --- | --- | --- |
| Italian | 4.0 | 1.3 | 236.7 | 350 |
| Spanish | 4.5 | 2.2 | 221.7 | 350 |
| English (US) | 5.4 | 2.5 | 207.4 | 350 |
| French | 5.9 | 2.2 | 199.9 | 350 |
| German | 5.9 | 1.9 | 220.9 | 350 |
| Russian | 7.2 | 2.2 | 209.7 | 350 |
| Ukrainian | 7.2 | 2.5 | 201.9 | 350 |
| Dutch | 7.8 | 2.6 | 191.7 | 350 |
| Polish | 8.6 | 2.8 | 190.2 | 350 |
| Czech | 12.0 | 3.8 | 214.2 | 350 |
| Slovak | 12.6 | 4.4 | 227.6 | 350 |
| Bulgarian | 12.8 | 4.1 | 195.2 | 350 |
| Croatian | 14.0 | 4.3 | 204.9 | 350 |
| Romanian | 14.4 | 4.7 | 200.4 | 883 |
| Finnish | 14.8 | 3.1 | 222.0 | 918 |
| Swedish | 16.8 | 5.0 | 219.5 | 759 |
| Hungarian | 17.6 | 5.2 | 213.6 | 905 |
| Danish | 20.2 | 7.4 | 214.4 | 930 |
| Estonian | 20.1 | 4.2 | 225.3 | 893 |
| Maltese | 25.2 | 9.3 | 217.4 | 926 |
| Lithuanian | 25.0 | 6.8 | 202.8 | 986 |
| Latvian | 27.1 | 7.5 | 217.8 | 851 |
| Slovenian | 27.4 | 9.2 | 197.1 | 834 |
| Greek | 36.9 | 13.7 | 183.0 | 650 |
| Average | 14.7 | 4.7 | 209.8 | 14,085 |

### LibriSpeech (English)

| Model | Dataset | WER% | CER% | RTFx | Files |
| --- | --- | --- | --- | --- | --- |
| TDT v3 | test-clean | 2.5 | 1.0 | 155.6 | 2,620 |
| TDT v2 | test-clean | 2.1 | 0.7 | 145.8 | 2,620 |
v2 has lower English WER — use it if you only need English.
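
For apps that ship both models, the choice reduces to a language check. A minimal sketch; `ParakeetVersion` and `chooseVersion` are illustrative names, not the actual FluidAudio API:

```swift
// Hypothetical sketch: pick the Parakeet TDT version by language needs.
enum ParakeetVersion { case v2, v3 }

func chooseVersion(for languages: Set<String>) -> ParakeetVersion {
    // v2: lower English WER (2.1% vs 2.5% on test-clean), English only.
    // v3: 25 European languages.
    languages == ["en"] ? .v2 : .v3
}
```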

## Model Compilation Times

First-load CoreML compile times (ANE compilation):
| Model | iPhone 16 Pro Max (cold) | iPhone 16 Pro Max (warm) | iPhone 13 (cold) |
| --- | --- | --- | --- |
| Preprocessor | 9ms | — | 633ms |
| Encoder | 3,361ms | 162ms | 4,396ms |
| Decoder | 88ms | 8ms | 146ms |
| JointDecision | 48ms | 8ms | 72ms |
Cold start = first load after install. Warm = subsequent loads from ANE cache.
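
Since the encoder's cold compile dominates (3.4s on an iPhone 16 Pro Max, 4.4s on an iPhone 13), it is worth paying that cost at app launch rather than on first transcription. A generic sketch of the prewarm pattern, with the actual model-loading call left to the caller:

```swift
// Prewarm pattern: run the (possibly slow) first model load off the main
// thread at launch, so later loads hit the ANE cache and take milliseconds.
func prewarm(_ loadModels: @escaping () async throws -> Void) {
    Task(priority: .utility) {
        do {
            try await loadModels()   // cold path: triggers the CoreML/ANE compile
        } catch {
            print("Model prewarm failed: \(error)")   // surface or retry as appropriate
        }
    }
}
```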

## Custom Vocabulary Boosting

Earnings22 benchmark (771 files, earnings call transcripts with domain-specific terms):
| Metric | Value |
| --- | --- |
| Average WER | 15.0% |
| Vocab Precision | 99.3% (TP=1068, FP=8) |
| Vocab Recall | 85.2% (TP=1068, FN=185) |
| Vocab F-score | 91.7% |
| Dict Pass (Recall) | 99.3% (1299/1308) |
| RTFx | 63.4x |
| Total audio | 11,565s |
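
The precision, recall, and F-score rows follow directly from the raw counts in the table; a quick check in Swift:

```swift
// Recomputing the vocab metrics from the counts reported above.
let tp = 1068.0, fp = 8.0, fn = 185.0

let precision = tp / (tp + fp)                                 // 0.993 → 99.3%
let recall    = tp / (tp + fn)                                 // 0.852 → 85.2%
let fScore    = 2 * precision * recall / (precision + recall)  // 0.917 → 91.7%
```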

## Streaming ASR (Parakeet EOU)

Hardware: Apple M2, 2022, macOS 26. LibriSpeech test-clean (2,620 files, 5.4h audio):
| Chunk Size | WER (Avg) | RTFx | Total Time |
| --- | --- | --- | --- |
| 320ms | 4.87% | 12.48x | 26 min |
| 160ms | 8.29% | 4.78x | 68 min |
320ms is the recommended default — best accuracy/latency tradeoff.
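
One driver of the throughput gap is invocation count: halving the chunk size doubles the number of model calls over the same audio. Back-of-envelope, using the 5.4h test set:

```swift
// Rough arithmetic behind the RTFx gap: smaller chunks = more model calls.
let audioSeconds = 5.4 * 3600        // 19,440s of test audio
let calls320 = audioSeconds / 0.320  // ≈ 60,750 invocations at 320ms
let calls160 = audioSeconds / 0.160  // ≈ 121,500 invocations at 160ms
```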

## Voice Activity Detection (Silero VAD v6)

### VOiCES Dataset (25 files, clean speech)

| Metric | Value |
| --- | --- |
| Accuracy | 96.0% |
| Precision | 100.0% |
| Recall | 95.8% |
| F1-Score | 97.9% |
| RTFx | 1,230.6x |

### MUSAN Full (2,016 files, mixed noise/music/speech)

| Metric | Value |
| --- | --- |
| Accuracy | 94.2% |
| Precision | 92.6% |
| Recall | 78.9% |
| F1-Score | 85.2% |
| RTFx | 1,220.7x |
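
The recall drop on MUSAN (78.9% vs 95.8% on clean speech) is the usual threshold tradeoff: a high cutoff, like the 0.85 used in the VAD benchmark command below, favors precision over recall on noisy audio. Purely as illustration (not the Silero or FluidAudio API), thresholding works per frame:

```swift
// Illustration only: converting per-frame speech probabilities into
// voiced/unvoiced decisions. A higher threshold rejects more borderline
// frames, raising precision at the cost of recall.
func speechFrames(probabilities: [Float], threshold: Float = 0.85) -> [Bool] {
    probabilities.map { $0 >= threshold }
}
```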

## Speaker Diarization

### Offline Pipeline (VBx)

VoxConverse dataset (232 clips):
| Config | DER% | JER% | RTFx |
| --- | --- | --- | --- |
| Step ratio 0.2, min duration 1.0s (default) | 15.1 | 39.4 | 122 |
| Step ratio 0.1, min duration 0s (max accuracy) | 13.9 | 42.8 | 65 |
The default is ~2x faster for only ~1.2 points worse DER (15.1% vs 13.9%); use step ratio 0.1 when accuracy is critical. For reference, pyannote community-1 runs at 1.5-2x RTFx on CPU and 20-25x on MPS, while FluidAudio on the ANE runs at 65-122x.
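
A sketch of the two configurations as they might look in code; `stepRatio` and `minDurationSeconds` mirror the table's wording, not necessarily FluidAudio's actual parameter names:

```swift
// Assumed parameter names, mirroring the table above.
struct OfflineDiarizerConfig {
    var stepRatio: Double           // sliding-window step as a fraction of window size
    var minDurationSeconds: Double  // drop speaker segments shorter than this
}

let balanced = OfflineDiarizerConfig(stepRatio: 0.2, minDurationSeconds: 1.0)  // ~122x RTFx
let accurate = OfflineDiarizerConfig(stepRatio: 0.1, minDurationSeconds: 0.0)  // ~65x RTFx
```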

### Streaming Pipeline (AMI SDM)

| Chunk | Overlap | Threshold | DER% | RTFx | Best For |
| --- | --- | --- | --- | --- | --- |
| 5s | 0s | 0.8 | 26.2 | 223 | Best accuracy/speed balance |
| 10s | 0s | 0.7 | 33.3 | 392 | Higher throughput |
| 3s | 1s | 0.85 | 49.7 | 51 | Lowest latency |
| 5s | 2s | 0.8 | 43.0 | 69 | |
5s chunks with a 0.8 threshold is the recommended starting point for streaming.
Streaming diarization scores 10-15 points worse DER than offline (26.2% here vs 15.1% offline). Use it only when you genuinely need real-time speaker labels; for most apps, offline is more than fast enough.
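
At these speeds, compute is not the bottleneck for perceived latency; the chunk length is. For the recommended row:

```swift
// Latency arithmetic for 5s chunks at 223x RTFx: each chunk is processed
// in ~22ms, so speaker labels trail the audio by roughly the chunk length.
let chunkSeconds = 5.0
let computeSeconds = chunkSeconds / 223.0       // ≈ 0.022s per chunk
let labelDelay = chunkSeconds + computeSeconds  // ≈ 5.02s worst case
```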

### Sortformer (End-to-End Streaming)

Hardware: Apple M2, 2022, macOS 26.1. AMI SDM dataset, NVIDIA high-latency config (30.4s chunks):
| Meeting | DER% | Miss% | FA% | SE% | RTFx |
| --- | --- | --- | --- | --- | --- |
| IS1009b | 16.4 | 10.6 | 0.6 | 5.3 | 127.0 |
| ES2004c | 23.8 | 17.8 | 0.3 | 5.7 | 126.5 |
| ES2004b | 23.9 | 18.7 | 0.2 | 5.0 | 123.9 |
| IS1009a | 26.5 | 16.0 | 1.4 | 9.1 | 134.4 |
| ES2004d | 28.3 | 19.7 | 0.3 | 8.3 | 123.5 |
| IS1009d | 29.1 | 16.5 | 1.0 | 11.6 | 127.9 |
| TS3003b | 31.1 | 27.1 | 0.6 | 3.4 | 125.5 |
| EN2002c | 31.8 | 20.1 | 0.2 | 11.5 | 126.0 |
| ES2004a | 33.7 | 24.6 | 0.1 | 9.0 | 127.2 |
| EN2002b | 34.0 | 20.2 | 0.6 | 13.3 | 127.7 |
| TS3003c | 34.4 | 31.1 | 0.3 | 3.1 | 126.6 |
| EN2002a | 35.6 | 20.0 | 0.4 | 15.2 | 125.4 |
| EN2002d | 37.1 | 20.1 | 0.5 | 16.5 | 125.5 |
| IS1009c | 38.1 | 12.8 | 0.9 | 24.4 | 129.2 |
| TS3003d | 41.0 | 32.0 | 0.1 | 8.8 | 125.6 |
| TS3003a | 41.8 | 36.8 | 0.7 | 4.3 | 125.7 |
| Average | 31.7 | 21.5 | 0.5 | 9.7 | 126.7 |

## Text-to-Speech

Comparison across frameworks generating the same text samples (1s to ~300s of output audio):

### Kokoro 82M

| Framework | Total RTFx | Peak RAM | Notes |
| --- | --- | --- | --- |
| PyTorch CPU | 17.0x | 4.85 GB | Known memory leak |
| PyTorch MPS | 10.0x | 1.54 GB | Crashes on long strings |
| MLX | 23.8x | 3.37 GB | |
| Swift CoreML | 23.2x | 1.50 GB | Lowest memory |
CoreML matches MLX speed with 55% less peak RAM. First run takes ~15s for ANE compilation, subsequent loads ~2s.
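
The RTFx figures here are output-audio duration over wall-clock synthesis time. A generic way to measure it (not the benchmark harness itself):

```swift
import Foundation

// RTFx = seconds of audio produced / seconds spent producing it.
// Generic helper, not FluidAudio's benchmark code.
func rtfx(audioDuration: TimeInterval, synthesize: () throws -> Void) rethrows -> Double {
    let start = Date()
    try synthesize()
    return audioDuration / Date().timeIntervalSince(start)
}
```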

## Running Benchmarks

```bash
# Transcription (all languages)
swift run -c release fluidaudio fleurs-benchmark --languages all --samples all

# Transcription (English, LibriSpeech)
swift run -c release fluidaudio asr-benchmark --max-files all

# Custom vocabulary
swift run -c release fluidaudio ctc-earnings-benchmark --auto-download

# Streaming ASR
swift run -c release fluidaudio parakeet-eou --benchmark --chunk-size 320 --use-cache

# VAD
swift run -c release fluidaudio vad-benchmark --dataset voices-subset --all-files --threshold 0.85

# Diarization (offline)
swift run -c release fluidaudio diarization-benchmark --mode offline --auto-download

# Diarization (streaming)
swift run -c release fluidaudio diarization-benchmark --mode streaming \
  --dataset ami-sdm --threshold 0.8 --chunk-seconds 5.0 --overlap-seconds 0.0

# Sortformer
swift run -c release fluidaudio sortformer-benchmark --nvidia-high-latency --hf --auto-download

# TTS
swift run -c release fluidaudio tts --benchmark
```