
When to Use

  • Pre-process audio before ASR: Segment files into speech regions and skip silence. Reduces ASR processing by 30-50%.
  • Real-time speech detection: Trigger recording or UI updates when the user starts or stops speaking.
  • Improve diarization quality: Filter noise before speaker embedding extraction. Reduces false speakers by 20-40%.

Specs

Metric       Value
Model        Silero VAD v6
Window size  256 ms
Memory       Minimal (runs on CPU)
Model: FluidInference/silero-vad-coreml
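
To make the 256 ms window concrete, here is a minimal sketch of the sample arithmetic. It assumes 16 kHz mono input, which this page does not state, and `windowCount` is a hypothetical helper, not part of FluidAudio.

```swift
import Foundation

// Assumption: 16 kHz mono input; only the 256 ms window size comes from the specs above.
let sampleRate = 16_000.0
let windowSeconds = 0.256

// Samples consumed by one VAD window.
let samplesPerWindow = Int((sampleRate * windowSeconds).rounded())

/// Number of windows needed to cover `sampleCount` samples.
/// The last window may be partial; how the library handles it is not covered here.
func windowCount(forSampleCount sampleCount: Int) -> Int {
    (sampleCount + samplesPerWindow - 1) / samplesPerWindow
}

let tenSecondsOfAudio = Int(sampleRate * 10)
print(samplesPerWindow, windowCount(forSampleCount: tenSecondsOfAudio))  // prints "4096 40"
```

Under that assumption, each probability reported by the model covers 4,096 samples, so boundaries derived from raw chunk decisions fall on a 256 ms grid before any padding is applied.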

Offline Segmentation

import FluidAudio

let manager = try await VadManager(
    config: VadConfig(defaultThreshold: 0.75)
)

let samples = try AudioConverter().resampleAudioFile(
    URL(fileURLWithPath: "audio.wav")
)

var segmentation = VadSegmentationConfig.default
segmentation.minSpeechDuration = 0.25
segmentation.minSilenceDuration = 0.4
segmentation.speechPadding = 0.12

let segments = try await manager.segmentSpeech(samples, config: segmentation)
for (index, segment) in segments.enumerated() {
    print(String(format: "Segment %02d: %.2f-%.2fs", index + 1, segment.startTime, segment.endTime))
}
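
The three segmentation settings are easier to reason about with a toy model. The sketch below is not FluidAudio's implementation. It is one plausible reading of the parameters: merge speech chunks separated by less than the minimum silence, drop spans shorter than the minimum speech duration, then pad. `SpeechSpan`, `speechSpans`, and the probability values are hypothetical.

```swift
import Foundation

struct SpeechSpan: Equatable {
    var start: Double
    var end: Double
}

/// Illustrative only: turns per-chunk probabilities into padded speech spans.
func speechSpans(
    probabilities: [Float],
    chunkSeconds: Double,
    threshold: Float,
    minSpeech: Double,
    minSilence: Double,
    padding: Double
) -> [SpeechSpan] {
    let total = Double(probabilities.count) * chunkSeconds
    var merged: [SpeechSpan] = []

    for (index, probability) in probabilities.enumerated() where probability >= threshold {
        let start = Double(index) * chunkSeconds
        let end = start + chunkSeconds
        if let last = merged.last, start - last.end < minSilence {
            // Gap is shorter than the minimum silence: extend the previous span.
            merged[merged.count - 1].end = end
        } else {
            merged.append(SpeechSpan(start: start, end: end))
        }
    }

    return merged
        .filter { $0.end - $0.start >= minSpeech }  // drop very short bursts
        .map {
            // Pad both edges and clamp to the bounds of the audio.
            SpeechSpan(start: max(0, $0.start - padding), end: min(total, $0.end + padding))
        }
}

// Made-up probabilities, using the settings from the example above.
let spans = speechSpans(
    probabilities: [0.1, 0.9, 0.9, 0.2, 0.9, 0.1, 0.1, 0.1, 0.8, 0.1],
    chunkSeconds: 0.256,
    threshold: 0.75,
    minSpeech: 0.25,
    minSilence: 0.4,
    padding: 0.12
)
for span in spans {
    print(String(format: "%.3f-%.3fs", span.start, span.end))
}
```

With these inputs the single low chunk at index 3 is bridged because the gap is under 0.4 s, while the three-chunk gap before index 8 starts a new span, giving two spans in total. The sketch does not guard against padded spans overlapping, which can happen when the padding exceeds half the minimum silence.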

Get Audio Clips

let clips = try await manager.segmentSpeechAudio(samples, config: segmentation)
print("Extracted \(clips.count) buffered segments ready for ASR")
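
If you only have segment times and need the samples yourself, the slicing can be sketched as follows. This assumes 16 kHz mono samples (not stated on this page); `clip` is a hypothetical helper, and the region used is made up.

```swift
import Foundation

// Assumption: 16 kHz mono samples; the sample rate is not stated on this page.
let assumedSampleRate = 16_000.0

/// Copies the samples between `start` and `end` (in seconds), clamped to the buffer.
func clip(
    _ samples: [Float],
    from start: Double,
    to end: Double,
    sampleRate: Double = assumedSampleRate
) -> [Float] {
    let lower = max(0, Int((start * sampleRate).rounded()))
    let upper = min(samples.count, Int((end * sampleRate).rounded()))
    guard lower < upper else { return [] }
    return Array(samples[lower..<upper])
}

// Two seconds of placeholder audio and one hypothetical speech region.
let audio = [Float](repeating: 0, count: 32_000)
let speech = clip(audio, from: 0.5, to: 1.25)

// Fraction of the file that would still be sent to ASR.
let keptFraction = Double(speech.count) / Double(audio.count)
print(speech.count, keptFraction)
```

The kept fraction is a quick way to check how much audio the segmentation actually removes on your own data.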

Chunk-Level Probabilities

let results = try await manager.process(samples)
for (index, chunk) in results.enumerated() {
    print(String(format: "Chunk %02d: prob=%.3f", index, chunk.probability))
}
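
Chunk probabilities can also drive start and stop events for the real-time use case. The sketch below uses two thresholds (hysteresis) so that a probability hovering near a single cutoff does not toggle the state on every chunk. `SpeechGate`, `SpeechEvent`, and the threshold values are hypothetical and are not FluidAudio's streaming API.

```swift
enum SpeechEvent: Equatable {
    case started(chunk: Int)
    case ended(chunk: Int)
}

/// Illustrative two-threshold gate over per-chunk speech probabilities.
struct SpeechGate {
    let onThreshold: Float
    let offThreshold: Float
    private(set) var isSpeaking = false

    init(onThreshold: Float, offThreshold: Float) {
        self.onThreshold = onThreshold
        self.offThreshold = offThreshold
    }

    /// Returns an event only when the state changes.
    mutating func update(probability: Float, chunk: Int) -> SpeechEvent? {
        if !isSpeaking && probability >= onThreshold {
            isSpeaking = true
            return .started(chunk: chunk)
        }
        if isSpeaking && probability < offThreshold {
            isSpeaking = false
            return .ended(chunk: chunk)
        }
        return nil
    }
}

// Made-up probabilities standing in for `chunk.probability` values.
let probabilities: [Float] = [0.1, 0.8, 0.6, 0.4, 0.2, 0.9]
var gate = SpeechGate(onThreshold: 0.75, offThreshold: 0.35)
var events: [SpeechEvent] = []
for (chunk, probability) in probabilities.enumerated() {
    if let event = gate.update(probability: probability, chunk: chunk) {
        events.append(event)
    }
}
print(events)
```

Here speech starts at chunk 1, survives the dips to 0.6 and 0.4, ends at chunk 4, and starts again at chunk 5.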

Manual Model Loading

Stage the Core ML bundle for offline environments:
import CoreML
import FluidAudio

// Path to the compiled Core ML bundle staged on disk.
let modelURL = URL(
    fileURLWithPath: "/opt/models/silero-vad-coreml/silero-vad-unified-256ms-v6.0.0.mlmodelc",
    isDirectory: true
)
// MLModelConfiguration is a class, so `let` is enough even though a property is set.
let configuration = MLModelConfiguration()
configuration.computeUnits = .cpuOnly
let vadModel = try MLModel(contentsOf: modelURL, configuration: configuration)
let manager = VadManager(config: .default, vadModel: vadModel)

Benchmarks

VOiCES (25 files, clean speech):
Metric     Value
Accuracy   96.0%
Precision  100.0%
Recall     95.8%
F1-Score   97.9%
RTFx       1,230x
MUSAN (2,016 files, mixed noise/music/speech):
Metric     Value
Accuracy   94.2%
Precision  92.6%
Recall     78.9%
F1-Score   85.2%
RTFx       1,221x
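
As a sanity check on how these numbers relate, the sketch below recomputes F1 from the reported precision and recall. It also reads RTFx as audio duration divided by processing time, which is the usual convention but is an assumption here, since this page does not define it. The function names are hypothetical.

```swift
import Foundation

/// Harmonic mean of precision and recall, both given as fractions in 0...1.
func f1Score(precision: Double, recall: Double) -> Double {
    guard precision + recall > 0 else { return 0 }
    return 2 * precision * recall / (precision + recall)
}

/// Assumption: RTFx = audio duration / processing time.
func processingSeconds(forAudioSeconds audioSeconds: Double, rtfx: Double) -> Double {
    audioSeconds / rtfx
}

let voicesF1 = f1Score(precision: 1.000, recall: 0.958)
let musanF1 = f1Score(precision: 0.926, recall: 0.789)
print(String(format: "VOiCES F1 %.1f%%, MUSAN F1 %.1f%%", voicesF1 * 100, musanF1 * 100))

// Time to process one hour of audio at the VOiCES speed.
let oneHour = processingSeconds(forAudioSeconds: 3_600, rtfx: 1_230)
print(String(format: "%.2f s per hour of audio", oneHour))
```

Both recomputed F1 values round to the figures in the tables (97.9% and 85.2%). Under the RTFx assumption, an hour of audio takes roughly three seconds to process.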

CLI

# Offline segmentation
swift run fluidaudio vad-analyze audio.wav

# Streaming mode
swift run fluidaudio vad-analyze audio.wav --streaming --min-silence-ms 300

# Both modes
swift run fluidaudio vad-analyze audio.wav --mode both

# Benchmark
swift run fluidaudio vad-benchmark --num-files 50 --threshold 0.3