Overview

OfflineDiarizerManager provides the full pyannote/Core ML exporter pipeline (powerset segmentation + VBx clustering) for highest-accuracy offline diarization. Requires macOS 14 / iOS 17 or later.

Quick Start

import FluidAudio

let config = OfflineDiarizerConfig()
let manager = OfflineDiarizerManager(config: config)
try await manager.prepareModels()

let samples = try AudioConverter().resampleAudioFile(path: "meeting.wav")
let result = try await manager.process(audio: samples)

for segment in result.segments {
    print("\(segment.speakerId) \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
}

File-Based API

For large files, use memory-mapped streaming:
let url = URL(fileURLWithPath: "meeting.wav")
let result = try await manager.process(url)

Pipeline Stages

  1. Segmentation — 10s/160k sample chunks through Core ML segmentation (589 frame-level log probabilities)
  2. Binarization — Log probabilities to soft VAD weights
  3. Weight Interpolation — scipy.ndimage.zoom-compatible half-pixel mapping
  4. Embedding Extraction — FBANK + embedding backend, L2-normalized 256-d embeddings
  5. VBx Clustering — AHC warm start + PLDA + iterative VBx refinement
  6. Timeline Reconstruction — Timestamps with minimum gap/duration constraints
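Stage 3's half-pixel mapping can be sketched in a few lines. This is an illustrative re-implementation of scipy.ndimage.zoom-style grid-mode coordinate mapping with linear interpolation and edge clamping, not the library's actual code; the function name `interpolateWeights` is made up for the example.

```swift
// Illustrative linear interpolation using a scipy.ndimage.zoom-style
// half-pixel ("grid mode") coordinate mapping with edge clamping.
func interpolateWeights(_ input: [Double], to count: Int) -> [Double] {
    guard count > 0, !input.isEmpty else { return [] }
    let zoom = Double(count) / Double(input.count)
    // Clamp out-of-range indices to the nearest edge sample.
    func clamped(_ i: Int) -> Double {
        input[min(max(i, 0), input.count - 1)]
    }
    return (0..<count).map { i in
        // Map the center of output cell i onto input coordinates.
        let x = (Double(i) + 0.5) / zoom - 0.5
        let lo = Int(x.rounded(.down))
        let t = x - Double(lo)
        return clamped(lo) * (1 - t) + clamped(lo + 1) * t
    }
}
```

When the output length equals the input length the mapping is the identity, which is a quick sanity check for the coordinate formula.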

Configuration

OfflineDiarizerConfig groups knobs by pipeline stage:
  • segmentation — Window length (10s), step ratio, min on/off durations
  • embedding — Batch size, overlap handling
  • clustering — VBx warm-start threshold, Fa/Fb priors
  • vbx — Max iterations, convergence tolerance
  • postProcessing — Minimum gap duration
  • export — Optional embeddingsPath for JSON dump
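The postProcessing knobs correspond to the timeline-reconstruction step: same-speaker segments closer than the minimum gap are merged, and segments shorter than the minimum duration are dropped. A minimal sketch of that logic, assuming a hypothetical `TimedSegment` type (not the library's actual segment struct):

```swift
// Hypothetical segment type for illustration only.
struct TimedSegment {
    var speakerId: String
    var start: Double
    var end: Double
}

// Merge same-speaker segments whose gap is below `minGap`, then drop
// segments shorter than `minDuration`.
func applyTimelineConstraints(_ segments: [TimedSegment],
                              minGap: Double,
                              minDuration: Double) -> [TimedSegment] {
    var merged: [TimedSegment] = []
    for seg in segments.sorted(by: { $0.start < $1.start }) {
        if var last = merged.last,
           last.speakerId == seg.speakerId,
           seg.start - last.end < minGap {
            // Close the gap by extending the previous segment.
            last.end = max(last.end, seg.end)
            merged[merged.count - 1] = last
        } else {
            merged.append(seg)
        }
    }
    return merged.filter { $0.end - $0.start >= minDuration }
}
```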

Benchmarks

VoxConverse (232 clips, multi-speaker conversations). Segmentation uses 10s windows:
| Config | Audio Length | DER | JER | RTFx |
| --- | --- | --- | --- | --- |
| Step ratio 0.2, min duration 1.0s (default) | 10s windows | 15.1% | 39.4% | 122x |
| Step ratio 0.1, min duration 0s (max accuracy) | 10s windows | 13.9% | 42.8% | 65x |
The default is ~2x faster at the cost of ~1.2 points of DER (15.1% vs 13.9%); use step ratio 0.1 when accuracy is critical. For reference, pyannote community-1 runs at 1.5-2x RTFx on CPU and 20-25x RTFx on MPS, while FluidAudio on the ANE reaches 65-122x RTFx.
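DER in the table above is the sum of missed speech, false alarm, and speaker confusion durations, normalized by total reference speech time. A minimal sketch of the formula (function name is illustrative):

```swift
// DER = (missed speech + false alarm + speaker confusion) / total
// reference speech, all measured in seconds.
func diarizationErrorRate(missed: Double,
                          falseAlarm: Double,
                          confusion: Double,
                          totalSpeech: Double) -> Double {
    (missed + falseAlarm + confusion) / totalSpeech
}
```

Unlike DER, JER averages the per-speaker error, so the two can move in opposite directions, as in the table above.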

CLI

# Process a single file
swift run fluidaudio process meeting.wav --mode offline --threshold 0.6

# Benchmark on AMI dataset
swift run fluidaudio diarization-benchmark --mode offline \
  --dataset ami-sdm --threshold 0.6 --auto-download

# With ground-truth RTTM
swift run fluidaudio process meeting.wav --mode offline \
  --rttm ground_truth.rttm