When to Use

  • Post-recording analysis (meetings, interviews): use the Offline pipeline. 15.1% DER, 122x real-time.
  • Real-time “who’s speaking now”: use Streaming diarization. 26.2% DER at 5s chunks. Choose this only when you genuinely need live labels; offline is more accurate and still very fast.
  • Simple 2-4 speaker conversations: consider Sortformer. Single model, no clustering, 31.7% DER. It handles noisy environments better, but it supports at most 4 speakers and degrades with 5+ people or heavy crosstalk.

Quick Start

import FluidAudio

// Download the Core ML models on first use; later runs load the cached copies
let models = try await DiarizerModels.downloadIfNeeded()
let diarizer = DiarizerManager()
diarizer.initialize(models: models)

// Decode and resample the file into the format the models expect
let samples = try AudioConverter().resampleAudioFile(
    URL(fileURLWithPath: "meeting.wav")
)
let result = try diarizer.performCompleteDiarization(samples)

for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
}
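
Because result.segments is plain data, post-processing is straightforward. A minimal sketch, using only the fields shown above, that totals talk time per speaker:

var talkTime: [String: Double] = [:]
for segment in result.segments {
    // Accumulate each segment's duration under its speaker label
    talkTime[segment.speakerId, default: 0] += Double(segment.endTimeSeconds - segment.startTimeSeconds)
}
for (speaker, seconds) in talkTime.sorted(by: { $0.value > $1.value }) {
    print("\(speaker): \(seconds)s")
}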

Configuration

let config = DiarizerConfig(
    clusteringThreshold: 0.7,      // Speaker separation (0.0-1.0)
    minSpeechDuration: 1.0,         // Minimum segment duration (seconds)
    minSilenceGap: 0.5,             // Minimum silence between speakers
    minActiveFramesCount: 10.0,     // Minimum active frames
    debugMode: false
)
let diarizer = DiarizerManager(config: config)
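
As in Quick Start, the configured manager must be initialized with models before use:

let models = try await DiarizerModels.downloadIfNeeded()
diarizer.initialize(models: models)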

Known Speaker Recognition

Pre-load speaker profiles for identification:
let aliceAudio = loadAudioFile("alice_sample.wav")  // your own helper returning [Float] samples
let aliceEmbedding = try diarizer.extractEmbedding(aliceAudio)

let bobAudio = loadAudioFile("bob_sample.wav")
let bobEmbedding = try diarizer.extractEmbedding(bobAudio)

let alice = Speaker(id: "Alice", name: "Alice", currentEmbedding: aliceEmbedding)
let bob = Speaker(id: "Bob", name: "Bob", currentEmbedding: bobEmbedding)
diarizer.speakerManager.initializeKnownSpeakers([alice, bob])

// Matched speakers are labeled "Alice" instead of "Speaker_1"
let result = try diarizer.performCompleteDiarization(audioSamples)
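
Known-speaker matching compares embeddings by similarity. As an illustration of the underlying idea, not FluidAudio's internal code, here is a minimal cosine-similarity check, assuming embeddings are [Float] vectors of equal length:

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "Embeddings must have the same dimension")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    // Guard against zero vectors with a small epsilon
    return dot / (normA.squareRoot() * normB.squareRoot() + 1e-9)
}

let score = cosineSimilarity(aliceEmbedding, bobEmbedding)

Scores close to 1 indicate the same voice; dissimilar voices score much lower.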

Manual Model Loading

Stage Core ML bundles for offline deployment:
let basePath = "/opt/models/speaker-diarization-coreml"
let segmentation = URL(fileURLWithPath: basePath)
    .appendingPathComponent("pyannote_segmentation.mlmodelc")
let embedding = URL(fileURLWithPath: basePath)
    .appendingPathComponent("wespeaker_v2.mlmodelc")

let models = try await DiarizerModels.load(
    localSegmentationModel: segmentation,
    localEmbeddingModel: embedding
)
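
The locally loaded models feed into the same initialization path as the downloaded ones:

let diarizer = DiarizerManager()
diarizer.initialize(models: models)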

Benchmarks

VoxConverse (232 clips, multi-speaker conversations):

Pipeline                 Audio Length    DER     RTFx
Offline (default)        10s windows     15.1%   122x
Offline (max accuracy)   10s windows     13.9%   65x
Streaming                5s chunks       26.2%   223x
Sortformer               30.4s chunks    31.7%   127x

Device comparison (offline pipeline, default config):

Device            RTFx
M2 MacBook Air    150x
M1 iPad Pro       120x
iPhone 14 Pro     80x

CLI

swift run fluidaudio process meeting.wav --output results.json --threshold 0.6
swift run fluidaudio diarization-benchmark --auto-download
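
The output schema is not documented here; assuming results.json mirrors the segment fields from the API above (speakerId, startTimeSeconds, endTimeSeconds), a sketch for reading it back in Swift:

import Foundation

// Hypothetical schema: adjust field names to match the actual results.json
struct DiarizationSegment: Decodable {
    let speakerId: String
    let startTimeSeconds: Double
    let endTimeSeconds: Double
}

let data = try Data(contentsOf: URL(fileURLWithPath: "results.json"))
let segments = try JSONDecoder().decode([DiarizationSegment].self, from: data)
for s in segments {
    print("\(s.speakerId): \(s.startTimeSeconds)s - \(s.endTimeSeconds)s")
}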