Overview

Sortformer is NVIDIA’s end-to-end streaming speaker diarization model, converted to CoreML. Unlike the pyannote pipeline (segmentation + clustering), Sortformer is a single neural network with 4 fixed speaker slots.

Model: FluidInference/diar-streaming-sortformer-coreml

Key Properties

  • 4 fixed speaker slots with real-time inference
  • No separate segmentation + clustering stages
  • Streaming only (no offline mode)
  • Best for scenarios with 4 or fewer speakers
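Because the 4 slots are fixed, each chunk of a stream produces per-frame activity scores for the same slot indices, so speaker identities stay consistent across chunks without a clustering stage. The sketch below illustrates that loop; the `SortformerDiarizer` type and its `process(chunk:)` method are assumed names for illustration, not the actual FluidAudio API:

```swift
// Hypothetical wrapper around the CoreML Sortformer model. The type and
// method names are illustrative only, not the actual FluidAudio API.
struct SortformerDiarizer {
    static let speakerSlots = 4  // fixed: a 5th speaker cannot be tracked

    /// Returns per-frame activity probabilities with shape [frames][4].
    /// Slot indices are stable across chunks, which is what makes the
    /// model streamable without a separate clustering stage.
    func process(chunk: [Float]) -> [[Float]] {
        // CoreML inference elided in this sketch.
        return []
    }
}

let audioChunks: [[Float]] = []  // placeholder for chunks of mono audio samples
let diarizer = SortformerDiarizer()

// Streaming loop: feed fixed-size chunks as audio arrives.
for chunk in audioChunks {
    let activity = diarizer.process(chunk: chunk)
    for frame in activity {
        // A slot counts as active when its probability crosses a threshold.
        let active = frame.enumerated().filter { $0.element > 0.5 }.map(\.offset)
        print("active speaker slots:", active)
    }
}
```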

When to Use Sortformer vs Pyannote

Sortformer can beat pyannote on benchmarks with certain configs, but benchmark DER does not always reflect production performance. In practice:
| Scenario | Recommendation | Why |
| --- | --- | --- |
| Noisy / background noise | Sortformer | More robust to non-speech audio |
| 4 or fewer speakers | Sortformer | Designed for this: single model, no clustering |
| 5+ speakers | Pyannote offline | Sortformer only has 4 speaker slots and will miss speakers |
| Overlapping speech (5+ people) | Pyannote offline | Sortformer breaks down with heavy crosstalk beyond 4 speakers |
| Best overall accuracy | Pyannote offline | 15% DER vs. 32%; more consistent in production |
| Streaming required, simple meetings | Sortformer | Single model, no clustering overhead |
Pyannote’s offline pipeline with aggressive tuning can score lower DER on AMI, but those configurations may not generalize. Sortformer’s 32% DER is more representative of real-world performance on meetings with 4 or fewer speakers.
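The table reduces to a simple rule of thumb: stay at or under Sortformer’s 4 speaker slots when you need streaming, and fall back to pyannote offline when speaker count or accuracy demands it. A minimal sketch of that rule (the `Diarizer` enum and `recommendDiarizer` function are illustrative, not part of FluidAudio):

```swift
enum Diarizer { case sortformer, pyannoteOffline }

/// Hypothetical helper encoding the recommendations in the table above.
func recommendDiarizer(expectedSpeakers: Int, needsStreaming: Bool) -> Diarizer {
    if expectedSpeakers > 4 {
        // Sortformer has only 4 speaker slots and will miss extra speakers.
        return .pyannoteOffline
    }
    if needsStreaming {
        // Sortformer is the streaming option; pyannote here is offline only.
        return .sortformer
    }
    // With 4 or fewer speakers and no streaming requirement, either works;
    // pyannote offline has been more consistent for overall accuracy.
    return .pyannoteOffline
}
```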

Benchmarks

AMI SDM (16 meetings, single distant microphone), processed in 30.4 s chunks (NVIDIA high-latency config):
| Metric | Value |
| --- | --- |
| Average DER | 31.7% |
| Average Miss | 21.5% |
| Average FA (false alarm) | 0.5% |
| Average SE (speaker error) | 9.7% |
| Average RTFx (real-time factor) | 126.7x |
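DER is the sum of the miss, false alarm, and speaker error rates, so the components above can be checked directly:

```swift
// DER = miss + false alarm + speaker error (all as percentages).
let miss = 21.5
let falseAlarm = 0.5
let speakerError = 9.7
let der = miss + falseAlarm + speakerError
print(der)  // 31.7, matching the reported average DER
```

An RTFx of 126.7x means the model processes audio roughly 127 times faster than real time.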
See the full benchmarks for a per-meeting breakdown.

CLI

To reproduce the AMI numbers above:

```bash
swift run fluidaudio sortformer-benchmark \
  --nvidia-high-latency --hf --auto-download
```