Custom Vocabulary

Custom vocabulary boosting is batch mode only (Parakeet TDT). It is not supported with streaming ASR (Parakeet EOU).

Overview

FluidAudio’s CTC-based custom vocabulary boosting enables accurate recognition of domain-specific terms (company names, technical jargon, proper nouns) without retraining the ASR model. Based on the NVIDIA NeMo paper: CTC-based Word Spotter.

Architecture

The system uses two encoders processing the same audio:

TDT Encoder (Parakeet 0.6B) — Primary high-quality transcription
CTC Encoder (Parakeet 110M) — Keyword spotting with per-frame log-probabilities

Both encoders produce frames at the same rate (~40ms), enabling direct timestamp comparison.

Quick Start

let asrManager = try await AsrManager.shared
let ctcModels = try await CtcModels.downloadAndLoad()
let ctcSpotter = CtcKeywordSpotter(models: ctcModels)

let vocabulary = CustomVocabularyContext(terms: [
    CustomVocabularyTerm(text: "NVIDIA"),
    CustomVocabularyTerm(text: "TensorRT"),
])

let result = try await asrManager.transcribe(
    audioSamples,
    customVocabulary: vocabulary
)
// result.text: "NVIDIA announced TensorRT optimizations"

Aliases

Handle common misspellings or phonetic variations:

let vocabulary = CustomVocabularyContext(terms: [
    CustomVocabularyTerm(
        text: "Hagen-Dazs",
        aliases: ["Haagen-Dazs", "Hagen-Das", "Hagen Daz"]
    ),
    CustomVocabularyTerm(
        text: "macOS",
        aliases: ["Mac OS", "Mac O S", "Macos"]
    ),
])

When a match is found via canonical or alias, the canonical form is used in the output.

Detection Thresholds

Parameter	Default	Description
`defaultMinSpotterScore`	-15.0	Minimum CTC score for keyword spotting
`defaultMinVocabCtcScore`	-12.0	Minimum CTC score for vocabulary matching
`defaultCbw`	3.0	Context-biasing weight boost
`defaultMinSimilarity`	0.52	Minimum string similarity

Vocabulary Size Guidelines

Size	Performance	Notes
1-50 terms	Excellent	Typical use case
50-100 terms	Good	No noticeable latency
100-230 terms	Tested	Validated with domain-specific lists

Memory

Configuration	Peak RAM
TDT encoder only	~66 MB
TDT + CTC encoders	~130 MB

Why Batch Only

Custom vocabulary requires the complete CTC log-probability matrix for accurate scoring. Streaming ASR processes audio in small chunks (160-320ms), which is too short for reliable keyword spotting and rescoring. Keywords spanning chunk boundaries would be missed, and the rescorer cannot look ahead to future frames for optimal alignment.

Benchmarks

Earnings22 (771 files, 3.2h audio — earnings call transcripts with domain-specific terms):

Metric	Value
Average WER	15.0%
Vocab Precision	99.3% (TP=1068, FP=8)
Vocab Recall	85.2% (TP=1068, FN=185)
Vocab F-score	91.7%
Dict Pass (Recall)	99.3% (1299/1308)
RTFx	63.4x

Precision = “of words we output, how many were correct?” Recall = “of words that should appear, how many did we find?” The 63x RTFx is slower than TDT-only (156x) because two encoders run on the same audio. Still well above real-time.

Getting Started

Speech Recognition (ASR)

Speaker Diarization

Voice Activity Detection

Text-to-Speech (TTS)

Guides

Reference

Overview

Architecture

Quick Start

Aliases

Detection Thresholds

Vocabulary Size Guidelines

Memory

Why Batch Only

Benchmarks

​Overview

​Architecture

​Quick Start

​Aliases

​Detection Thresholds

​Vocabulary Size Guidelines

​Memory

​Why Batch Only

​Benchmarks

Overview

Architecture

Quick Start

Aliases

Detection Thresholds

Vocabulary Size Guidelines

Memory

Why Batch Only

Benchmarks