Overview
Sortformer is NVIDIA’s end-to-end streaming speaker diarization model, converted to CoreML. Unlike the pyannote pipeline (segmentation + clustering), Sortformer is a single neural network with 4 fixed speaker slots.
Model: FluidInference/diar-streaming-sortformer-coreml
Key Properties
- 4 fixed speaker slots with real-time inference (see the output sketch below)
- No separate segmentation + clustering stages
- Streaming only (no offline mode)
- Best for scenarios with 4 or fewer speakers
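Conceptually, the model's raw output is a matrix of per-frame activity probabilities, one column per speaker slot. Below is a minimal post-processing sketch, not FluidAudio's actual API: the 80 ms frame duration and the 0.5 threshold are assumptions, and `SpeakerSegment` is a hypothetical type for illustration.

```swift
import Foundation

// Hypothetical output type for illustration (not part of FluidAudio).
struct SpeakerSegment {
    let slot: Int           // 0...3, one of the 4 fixed speaker slots
    let start: TimeInterval
    let end: TimeInterval
}

/// Collapse per-frame slot probabilities (frames x 4) into timed segments
/// by thresholding each slot independently.
func segments(from frameProbs: [[Float]],
              frameDuration: TimeInterval = 0.08, // assumed 80 ms frames
              threshold: Float = 0.5) -> [SpeakerSegment] {
    var result: [SpeakerSegment] = []
    var openStart: [Int: TimeInterval] = [:]  // slot -> open segment start

    for (i, probs) in frameProbs.enumerated() {
        let t = TimeInterval(i) * frameDuration
        for slot in 0..<min(probs.count, 4) {
            let active = probs[slot] >= threshold
            if active, openStart[slot] == nil {
                openStart[slot] = t               // segment opens
            } else if !active, let start = openStart.removeValue(forKey: slot) {
                result.append(SpeakerSegment(slot: slot, start: start, end: t))
            }
        }
    }
    // Close any segments still open at the end of the audio.
    let endTime = TimeInterval(frameProbs.count) * frameDuration
    for (slot, start) in openStart {
        result.append(SpeakerSegment(slot: slot, start: start, end: endTime))
    }
    return result.sorted { $0.start < $1.start }
}
```

Because each slot is thresholded independently, overlapping speech simply shows up as segments from multiple slots covering the same time range.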
Sortformer can beat pyannote on benchmarks with certain configs, but benchmark DER does not always reflect production performance. In practice:
| Scenario | Recommendation | Why |
|---|---|---|
| Noisy / background noise | Sortformer | More robust to non-speech audio |
| 4 or fewer speakers | Sortformer | Designed for this: single model, no clustering |
| 5+ speakers | Pyannote offline | Sortformer has only 4 speaker slots and will miss additional speakers |
| Overlapping speech (5+ people) | Pyannote offline | Sortformer breaks down with heavy crosstalk beyond 4 speakers |
| Best overall accuracy | Pyannote offline | ~15% DER vs ~32%; more consistent in production |
| Streaming required, simple meetings | Sortformer | Single model, no clustering overhead |
Pyannote's offline pipeline with aggressive tuning can score a lower DER on AMI, but those configs may not generalize. Sortformer's ~32% DER is more representative of real-world performance on meetings with 4 or fewer speakers.
Benchmarks
AMI SDM (16 meetings, single distant microphone), processed in 30.4s chunks (NVIDIA high-latency config):
| Metric | Value |
|---|---|
| Average DER | 31.7% |
| Average Miss | 21.5% |
| Average FA | 0.5% |
| Average SE | 9.7% |
| Average RTFx | 126.7x |
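DER is the sum of the three error components above: miss (speech labeled as silence), FA (false alarm: non-speech labeled as speech), and SE (speaker error: speech attributed to the wrong speaker). The numbers are consistent: 21.5% + 0.5% + 9.7% = 31.7%. RTFx is the inverse real-time factor (audio duration divided by processing time), so 126.7x means an hour of audio is diarized in roughly 28 seconds.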
See the full benchmarks for a per-meeting breakdown.
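The benchmark streams audio in fixed 30.4s windows. A minimal sketch of that chunking loop follows, assuming 16 kHz mono samples and a generic per-chunk `process` callback; both are assumptions, not the FluidAudio API.

```swift
import Foundation

/// Feed a 16 kHz mono buffer to a streaming model in fixed-size chunks.
/// The 30.4 s chunk length mirrors the NVIDIA high-latency config; the
/// `process` closure stands in for whatever per-chunk inference call is used.
func streamInChunks(samples: [Float],
                    sampleRate: Int = 16_000,      // assumed sample rate
                    chunkSeconds: Double = 30.4,
                    process: ([Float]) -> Void) {
    let chunkSize = Int(chunkSeconds * Double(sampleRate)) // 486,400 samples
    var offset = 0
    while offset < samples.count {
        let end = min(offset + chunkSize, samples.count)
        process(Array(samples[offset..<end]))  // last chunk may be shorter
        offset = end
    }
}
```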
CLI
```bash
swift run fluidaudio sortformer-benchmark \
    --nvidia-high-latency --hf --auto-download
```
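Presumably (an inference from the flag names, not verified against the CLI), `--nvidia-high-latency` selects the 30.4s-chunk configuration used in the benchmark above, while the `--hf` and `--auto-download` flags fetch the converted model from Hugging Face if it is not cached locally.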