
Overview

Streaming diarization processes audio in chunks so that speaker labels are available while transcription is still running. Use it only when you need labels in real time; the offline pipeline is more accurate and is the better choice for most use cases.

Quick Start

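// Download the diarization models if they are not cached yet (async, throwing).
let models = try await DiarizerModels.downloadIfNeeded()
let diarizer = DiarizerManager()
diarizer.initialize(models: models)

var stream = AudioStream(
    chunkDuration: 5.0,
    chunkSkip: 2.0,
    streamStartTime: 0.0,
    chunkingStrategy: .useMostRecent
)

// Called for each completed chunk; `time` places the chunk on the stream timeline.
stream.bind { chunk, time in
    let results = try diarizer.performCompleteDiarization(chunk, atTime: time)
    for segment in results.segments {
        handleSpeakerSegment(segment)  // your app's own handler
    }
}

// `audioStream` stands for your app's source of sample buffers.
for audioSamples in audioStream {
    try stream.write(from: audioSamples)
}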

Chunk Size Considerations

| Chunk Size | Accuracy | Latency |
|---|---|---|
| < 3 seconds | May fail or unreliable | Lowest |
| 3-5 seconds | Minimum viable | Low |
| 10 seconds | Optimal (recommended) | Medium |
| > 10 seconds | Good | Higher |
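
To make the latency trade-off concrete, the sketch below lists the time windows a sliding chunker would produce for a given chunk duration and skip. It is a minimal illustration, not the library's implementation: `ChunkWindow` and `chunkWindows` are hypothetical names, and it assumes a fixed skip between chunk starts, so consecutive chunks overlap by `chunkDuration - chunkSkip` seconds.

```swift
import Foundation

/// Hypothetical helper type for illustration; not part of the diarization library.
struct ChunkWindow: Equatable {
    let start: Double  // seconds from stream start
    let end: Double    // seconds from stream start
}

/// Lists the full windows that fit into `totalDuration`, assuming a new
/// chunk starts every `chunkSkip` seconds (an assumption about the chunker).
func chunkWindows(totalDuration: Double,
                  chunkDuration: Double,
                  chunkSkip: Double) -> [ChunkWindow] {
    precondition(chunkDuration > 0 && chunkSkip > 0, "durations must be positive")
    var windows: [ChunkWindow] = []
    var start = 0.0
    while start + chunkDuration <= totalDuration {
        windows.append(ChunkWindow(start: start, end: start + chunkDuration))
        start += chunkSkip
    }
    return windows
}

// 12 seconds of audio, 5-second chunks, a new chunk every 2 seconds.
let windows = chunkWindows(totalDuration: 12, chunkDuration: 5, chunkSkip: 2)
let overlap = 5.0 - 2.0  // seconds shared by consecutive chunks
for window in windows {
    print("chunk \(window.start)s to \(window.end)s")
}
```

Under these assumptions the first labels cannot arrive before one full chunk duration has been captured, and later updates arrive once per skip interval, which is why shorter chunks lower latency at the cost of less context per chunk.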

Real-time Audio Capture

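import AVFoundation
import FluidAudio  // assumed module providing DiarizerManager, DiarizerModels, and AudioStream

class RealTimeDiarizer {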
    private let audioEngine = AVAudioEngine()
    private let diarizer: DiarizerManager
    private var audioStream: AudioStream

    init() async throws {
        let models = try await DiarizerModels.downloadIfNeeded()
        diarizer = DiarizerManager()
        diarizer.initialize(models: models)
        audioStream = AudioStream(
            chunkDuration: 5.0,
            chunkSkip: 3.0,
            streamStartTime: 0.0,
            chunkingStrategy: .useFixedSkip
        )
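        // Pass the chunk time through so segment timestamps stay on the stream timeline.
        audioStream.bind { [weak self] chunk, time in
            Task {
                do {
                    let result = try self?.diarizer.performCompleteDiarization(chunk, atTime: time)
                    // Handle result?.segments here
                } catch {
                    print("Diarization failed: \(error)")
                }
            }
        }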
    }

    func startCapture() throws {
        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)

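        // The tap delivers buffers in the hardware format (often 44.1 or 48 kHz).
        // The diarizer expects 16 kHz mono Float32, so convert first unless
        // write(from:) already resamples for you.
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { [weak self] buffer, _ in
            // try? silently drops write errors; log them in production code.
            try? self?.audioStream.write(from: buffer)
        }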

        audioEngine.prepare()
        try audioEngine.start()
    }
}
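
The microphone tap usually runs at the hardware sample rate, while the diarizer expects 16 kHz mono Float32 samples. The sketch below shows one way to convert tap buffers with `AVAudioConverter` before writing them to the stream. `MonoResampler` is a hypothetical helper, not part of the diarization library, and it is only needed if the stream's `write(from:)` does not already resample.

```swift
import AVFoundation

/// Hypothetical helper: converts tap buffers to 16 kHz mono Float32 samples.
/// One instance is kept per input so the converter can carry its resampling
/// state from one buffer to the next.
final class MonoResampler {
    private let converter: AVAudioConverter
    private let outputFormat: AVAudioFormat

    init?(inputFormat: AVAudioFormat) {
        guard let output = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                         sampleRate: 16_000,
                                         channels: 1,
                                         interleaved: false),
              let converter = AVAudioConverter(from: inputFormat, to: output) else {
            return nil
        }
        self.outputFormat = output
        self.converter = converter
    }

    /// Returns the converted samples, or an empty array if conversion fails.
    func convert(_ buffer: AVAudioPCMBuffer) -> [Float] {
        let ratio = outputFormat.sampleRate / buffer.format.sampleRate
        let capacity = AVAudioFrameCount((Double(buffer.frameLength) * ratio).rounded(.up)) + 1
        guard let output = AVAudioPCMBuffer(pcmFormat: outputFormat,
                                            frameCapacity: capacity) else { return [] }

        var delivered = false
        var conversionError: NSError?
        let status = converter.convert(to: output, error: &conversionError) { _, inputStatus in
            // Hand the tap buffer to the converter once, then report "no more data for now".
            if delivered {
                inputStatus.pointee = .noDataNow
                return nil
            }
            delivered = true
            inputStatus.pointee = .haveData
            return buffer
        }
        guard status != .error, let channel = output.floatChannelData?[0] else { return [] }
        return Array(UnsafeBufferPointer(start: channel, count: Int(output.frameLength)))
    }
}
```

Inside the tap block you would call `convert(_:)` on each buffer and pass the resulting `[Float]` to the stream, assuming the stream accepts plain sample arrays as in the Quick Start.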

Benchmarks

AMI SDM (meeting recordings, single distant microphone):
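| Audio Length | Overlap | Threshold | DER | RTFx | Best For |
|---|---|---|---|---|---|
| 5s chunks | 0s | 0.8 | 26.2% | 223x | Best accuracy/speed balance |
| 10s chunks | 0s | 0.7 | 33.3% | 392x | Higher throughput |
| 3s chunks | 1s | 0.85 | 49.7% | 51x | Lowest latency |
| 5s chunks | 2s | 0.8 | 43.0% | 69x | |
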
Streaming diarization has a 10-15% worse DER than offline diarization. Use streaming only when you need speaker labels in real time; for most apps, the offline pipeline is more than fast enough.
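
For readers unfamiliar with the two metrics, the sketch below shows how they are commonly defined: RTFx is seconds of audio processed per second of compute, and DER is the fraction of reference speech time that is missed, falsely detected, or attributed to the wrong speaker. The function names are hypothetical, and the DER inputs are made-up illustrative numbers, not measurements from this benchmark.

```swift
import Foundation

/// Compute time implied by an RTFx figure (audio seconds per compute second).
func processingSeconds(audioSeconds: Double, rtfx: Double) -> Double {
    precondition(rtfx > 0, "RTFx must be positive")
    return audioSeconds / rtfx
}

/// Common DER definition: (missed + false alarm + confusion) / total speech time.
func diarizationErrorRate(missed: Double,
                          falseAlarm: Double,
                          confusion: Double,
                          totalSpeech: Double) -> Double {
    precondition(totalSpeech > 0, "total speech time must be positive")
    return (missed + falseAlarm + confusion) / totalSpeech
}

// At 223x, one hour of audio needs roughly 16 seconds of compute.
let hourCost = processingSeconds(audioSeconds: 3_600, rtfx: 223)

// Illustrative values only: 60 s of errors over 600 s of speech.
let der = diarizationErrorRate(missed: 30, falseAlarm: 12, confusion: 18, totalSpeech: 600)
print(hourCost, der)
```

Because every configuration in the table runs far faster than real time, the latency a user experiences in streaming mode is dominated by how long it takes to capture a chunk, not by compute.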

Tips

  • Keep one DiarizerManager per stream for consistent speaker IDs
  • Always rebase per-chunk timestamps onto the stream timeline by adding (chunkStartSample / sampleRate) seconds
  • Provide 16 kHz mono Float32 samples
  • Tune speakerThreshold and embeddingThreshold to trade off ID stability vs. sensitivity
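
The timestamp tip can be sketched as follows. `LocalSegment` and `rebase` are hypothetical names used only for illustration; the library's own segment type will differ, and the step is only needed when the diarizer returns chunk-relative times (that is, when it has not been given the chunk start time).

```swift
import Foundation

/// Hypothetical segment type with times in seconds; for illustration only.
struct LocalSegment {
    let speakerId: String
    let start: Double
    let end: Double
}

/// Shifts chunk-relative times onto the stream timeline by adding
/// chunkStartSample / sampleRate seconds to every segment.
func rebase(_ segments: [LocalSegment],
            chunkStartSample: Int,
            sampleRate: Double = 16_000) -> [LocalSegment] {
    let offset = Double(chunkStartSample) / sampleRate
    return segments.map {
        LocalSegment(speakerId: $0.speakerId,
                     start: $0.start + offset,
                     end: $0.end + offset)
    }
}

// A chunk that starts at sample 48,000 begins 3 seconds into a 16 kHz stream.
let local = [LocalSegment(speakerId: "S1", start: 0.5, end: 2.0)]
let global = rebase(local, chunkStartSample: 48_000)
print(global[0].start, global[0].end)
```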