When to Use
- Pre-process audio before ASR — Segment files into speech regions, skip silence. Reduces ASR processing by 30-50%.
- Real-time speech detection — Trigger recording or UI when user starts/stops speaking.
- Improve diarization quality — Filter noise before speaker embedding extraction. Reduces false speakers by 20-40%.
Specs
| Metric | Value |
|---|---|
| Model | Silero VAD v6 |
| Window size | 256 ms |
| Memory | Minimal (runs on CPU) |
Offline Segmentation
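Conceptually, offline segmentation thresholds the per-chunk speech probability and merges consecutive speech chunks into time ranges. The following is a minimal, library-agnostic sketch in Python, not the library's own API: the function name `probs_to_segments`, the 0.5 threshold, and the assumption of non-overlapping 256 ms chunks are all illustrative.

```python
CHUNK_MS = 256  # window size from the Specs table; chunks assumed non-overlapping


def probs_to_segments(probs, threshold=0.5, chunk_ms=CHUNK_MS):
    """Merge runs of chunks with probability >= threshold into (start_ms, end_ms) ranges."""
    segments = []
    start = None  # index of the first chunk in the current speech run
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # speech run begins
        elif p < threshold and start is not None:
            segments.append((start * chunk_ms, i * chunk_ms))  # speech run ends
            start = None
    if start is not None:
        # Audio ended while still in speech; close the final run.
        segments.append((start * chunk_ms, len(probs) * chunk_ms))
    return segments


# Hypothetical probabilities for five chunks.
segments = probs_to_segments([0.1, 0.9, 0.8, 0.2, 0.7])
```

A production segmenter would typically also enforce minimum speech and silence durations and pad segment edges; those refinements are omitted here.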
Get Audio Clips
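Once speech regions are known, extracting clips amounts to slicing the sample buffer at the segment boundaries. A small sketch under stated assumptions: the 16 kHz sample rate is assumed (it is not stated in this section), and `extract_clips` is a hypothetical helper, not a library function.

```python
SAMPLE_RATE = 16_000  # assumed input sample rate in Hz


def extract_clips(samples, segments_ms, sample_rate=SAMPLE_RATE):
    """Slice `samples` into one clip per (start_ms, end_ms) segment."""
    clips = []
    for start_ms, end_ms in segments_ms:
        start = start_ms * sample_rate // 1000
        # Clamp the end so a segment running past the buffer is truncated.
        end = min(end_ms * sample_rate // 1000, len(samples))
        if start < end:
            clips.append(samples[start:end])
    return clips


samples = [0.0] * 16_000  # one second of stand-in audio
clips = extract_clips(samples, [(256, 768), (900, 1200)])
```

Each clip can then be passed to ASR on its own, which is where the savings from skipping silence come from.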
Chunk-Level Probabilities
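Raw per-chunk probabilities are useful when you want your own decision logic, for example start and stop events for real-time use. The sketch below applies hysteresis (a higher threshold to enter speech than to leave it) so that probabilities hovering near a single threshold do not cause rapid toggling. The thresholds 0.6 and 0.4 and the name `speech_events` are illustrative assumptions, not library defaults.

```python
def speech_events(probs, on_threshold=0.6, off_threshold=0.4):
    """Return (chunk_index, "start" | "end") events using two thresholds."""
    events = []
    speaking = False
    for i, p in enumerate(probs):
        if not speaking and p >= on_threshold:
            speaking = True
            events.append((i, "start"))
        elif speaking and p < off_threshold:
            speaking = False
            events.append((i, "end"))
    return events


# Hypothetical probabilities: 0.5 and 0.45 stay "speaking" because
# they are still above the off threshold.
events = speech_events([0.1, 0.7, 0.5, 0.45, 0.3, 0.65])
```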
Manual Model Loading
Stage the Core ML bundle for offline environments:
Benchmarks
VOiCES (25 files, clean speech):
| Metric | Value |
|---|---|
| Accuracy | 96.0% |
| Precision | 100.0% |
| Recall | 95.8% |
| F1-Score | 97.9% |
| RTFx | 1,230x |
| Metric | Value |
|---|---|
| Accuracy | 94.2% |
| Precision | 92.6% |
| Recall | 78.9% |
| F1-Score | 85.2% |
| RTFx | 1,221x |
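As a sanity check on how these figures relate, F1 is the harmonic mean of precision and recall, and RTFx is commonly defined as audio duration divided by processing time. The short Python sketch below recomputes F1 from the reported precision and recall in both tables; the RTFx definition is the usual one and is assumed to apply here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    return 2 * precision * recall / (precision + recall)


def rtfx(audio_seconds, processing_seconds):
    """Real-time factor: seconds of audio processed per second of compute."""
    return audio_seconds / processing_seconds


# Recompute F1 from the precision and recall reported in each table.
voices_f1 = round(100 * f1_score(1.000, 0.958), 1)  # VOiCES table
second_f1 = round(100 * f1_score(0.926, 0.789), 1)  # second table

# At 1,230x, one hour of audio takes roughly this many seconds to process.
seconds_per_hour_of_audio = 3600 / 1230
```

Both recomputed values match the reported F1 scores (97.9% and 85.2%), and an RTFx of 1,230x corresponds to about three seconds of compute per hour of audio.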