When to Use
- Post-recording analysis (meetings, interviews): use the Offline pipeline. 15.1% DER at 122x real-time.
- Real-time “who’s speaking now”: use Streaming diarization. 26.2% DER at 5s chunks. Use it only when you need live labels; the Offline pipeline is more accurate and still very fast.
- Simple 2-4 speaker conversations: consider Sortformer. Single model, no clustering, 31.7% DER. It holds up better in noisy environments but supports at most 4 speakers and does not work well with 5+ people or heavy crosstalk.
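The guidance above can be read as a small decision rule. The sketch below is one possible reading, not part of the library: the function name `choose_pipeline` and its parameters are hypothetical, and the rule that Sortformer is picked only for noisy audio with 2 to 4 speakers is an assumption drawn from the list.

```python
def choose_pipeline(needs_realtime: bool, max_speakers: int, noisy: bool = False) -> str:
    """Pick a pipeline name following the 'When to Use' guidance.

    Hypothetical helper for illustration; the returned strings are labels,
    not identifiers from the library's API.
    """
    if max_speakers < 1:
        raise ValueError("max_speakers must be at least 1")
    # Live labels are the only reason to accept the higher streaming DER.
    if needs_realtime:
        return "streaming"
    # Assumption: Sortformer is worth considering only for noisy audio
    # with 2 to 4 speakers, since it is limited to 4 speakers.
    if noisy and 2 <= max_speakers <= 4:
        return "sortformer"
    # Default: the offline pipeline is the most accurate option.
    return "offline"


if __name__ == "__main__":
    print(choose_pipeline(needs_realtime=False, max_speakers=6))
    print(choose_pipeline(needs_realtime=True, max_speakers=3))
    print(choose_pipeline(needs_realtime=False, max_speakers=3, noisy=True))
```

Heavy crosstalk is not modelled here; if it is expected, the list suggests staying with the Offline pipeline even for small groups.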
Quick Start
Configuration
Known Speaker Recognition
Pre-load speaker profiles for identification:
Manual Model Loading
Stage Core ML bundles for offline deployment:
Benchmarks
VoxConverse (232 clips, multi-speaker conversations):

| Pipeline | Window / Chunk Size | DER | RTFx |
|---|---|---|---|
| Offline (default) | 10s windows | 15.1% | 122x |
| Offline (max accuracy) | 10s windows | 13.9% | 65x |
| Streaming | 5s chunks | 26.2% | 223x |
| Sortformer | 30.4s chunks | 31.7% | 127x |
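
For readers new to the metric, DER (diarization error rate) is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below shows that arithmetic only; the function name is hypothetical and the durations are made-up illustrative values, not a breakdown of the benchmark results above.

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """Return DER as a fraction, given error durations in seconds.

    Uses the conventional definition:
    (missed + false alarm + confusion) / total reference speech.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


if __name__ == "__main__":
    # Illustrative numbers only: 100 s of reference speech with
    # 5 s missed, 3 s false alarm, and 7 s attributed to the wrong speaker.
    der = diarization_error_rate(missed=5.0, false_alarm=3.0,
                                 confusion=7.0, total_speech=100.0)
    print(f"DER: {der:.1%}")
```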

| Device | RTFx |
|---|---|
| M2 MacBook Air | 150x |
| M1 iPad Pro | 120x |
| iPhone 14 Pro | 80x |
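
RTFx is the real-time factor expressed as a speedup: seconds of audio processed per second of compute. Dividing the audio duration by RTFx therefore gives an estimate of processing time. The sketch below applies that to the device figures above; the helper name `processing_seconds` is hypothetical, and real timings will vary with audio content and device state.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimate compute time for a clip from its duration and an RTFx figure."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


# RTFx values copied from the device table above.
DEVICE_RTFX = {
    "M2 MacBook Air": 150,
    "M1 iPad Pro": 120,
    "iPhone 14 Pro": 80,
}

if __name__ == "__main__":
    one_hour = 3600.0  # seconds of audio
    for device, rtfx in DEVICE_RTFX.items():
        estimate = processing_seconds(one_hour, rtfx)
        print(f"{device}: about {estimate:.0f} s per hour of audio")
```

By this estimate, even the slowest device listed handles an hour of audio in under a minute.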