Custom vocabulary boosting is batch mode only (Parakeet TDT). It is not supported with streaming ASR (Parakeet EOU).

Overview

FluidAudio’s CTC-based custom vocabulary boosting enables accurate recognition of domain-specific terms (company names, technical jargon, proper nouns) without retraining the ASR model. The approach is based on the NVIDIA NeMo paper: CTC-based Word Spotter.

Architecture

The system uses two encoders processing the same audio:
  1. TDT Encoder (Parakeet 0.6B) — Primary high-quality transcription
  2. CTC Encoder (Parakeet 110M) — Keyword spotting with per-frame log-probabilities
Both encoders produce frames at the same rate (~40ms), enabling direct timestamp comparison.
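
Because the two encoders share a frame rate, a keyword span reported in CTC frame indices can be lined up against word timings from the primary transcript by converting frames to seconds and testing for overlap. The following is a minimal sketch of that idea, assuming a 40 ms frame duration; `FrameSpan`, `TimedWord`, and `overlappingWords` are hypothetical names for illustration and are not part of the FluidAudio API.

```swift
import Foundation

// Assumed frame duration in seconds (~40 ms, per the text above).
let frameDuration: Double = 0.04

// Hypothetical: a keyword detection in CTC frame indices (start inclusive, end exclusive).
struct FrameSpan {
    let startFrame: Int
    let endFrame: Int

    var startTime: Double { Double(startFrame) * frameDuration }
    var endTime: Double { Double(endFrame) * frameDuration }
}

// Hypothetical: a word from the primary transcript with times in seconds.
struct TimedWord {
    let text: String
    let start: Double
    let end: Double
}

// Returns the transcript words whose time range overlaps the keyword span.
func overlappingWords(_ span: FrameSpan, in words: [TimedWord]) -> [TimedWord] {
    words.filter { $0.start < span.endTime && $0.end > span.startTime }
}

let words = [
    TimedWord(text: "today", start: 0.00, end: 0.16),
    TimedWord(text: "NVIDIA", start: 0.20, end: 0.68),
    TimedWord(text: "announced", start: 0.72, end: 1.30),
]
let span = FrameSpan(startFrame: 5, endFrame: 17) // roughly 0.20 s to 0.68 s
let hits = overlappingWords(span, in: words).map(\.text)
```

With these made-up timings, only the middle word overlaps the span, so it is the one candidate a rescorer would consider replacing.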

Quick Start

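import FluidAudio

// Primary TDT transcription manager.
let asrManager = try await AsrManager.shared
// Download and load the CTC models used for keyword spotting.
let ctcModels = try await CtcModels.downloadAndLoad()
let ctcSpotter = CtcKeywordSpotter(models: ctcModels)

// Terms to boost, written in their canonical spelling.
let vocabulary = CustomVocabularyContext(terms: [
    CustomVocabularyTerm(text: "NVIDIA"),
    CustomVocabularyTerm(text: "TensorRT"),
])

// audioSamples: audio samples loaded elsewhere.
let result = try await asrManager.transcribe(
    audioSamples,
    customVocabulary: vocabulary
)
// result.text: "NVIDIA announced TensorRT optimizations"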

Aliases

Handle common misspellings or phonetic variations:
let vocabulary = CustomVocabularyContext(terms: [
    CustomVocabularyTerm(
        text: "Häagen-Dazs",
        aliases: ["Haagen-Dazs", "Hagen-Das", "Hagen Daz"]
    ),
    CustomVocabularyTerm(
        text: "macOS",
        aliases: ["Mac OS", "Mac O S", "Macos"]
    ),
])
When a match is found through either the canonical text or an alias, the output uses the canonical form.
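
The alias behaviour can be pictured as a lookup from every accepted spelling back to the canonical one. The sketch below is illustrative only: `Term`, `makeCanonicalLookup`, and `canonicalForm` are hypothetical helpers, and the case-insensitive comparison is an assumption rather than documented FluidAudio behaviour.

```swift
import Foundation

// Hypothetical stand-in for a vocabulary term; not the FluidAudio type.
struct Term {
    let text: String
    let aliases: [String]
}

// Maps every canonical text and alias (lowercased) to the canonical text.
func makeCanonicalLookup(_ terms: [Term]) -> [String: String] {
    var lookup: [String: String] = [:]
    for term in terms {
        for form in [term.text] + term.aliases {
            lookup[form.lowercased()] = term.text
        }
    }
    return lookup
}

// Returns the canonical spelling for a matched form, or nil if it is not in the vocabulary.
func canonicalForm(of matched: String, in lookup: [String: String]) -> String? {
    lookup[matched.lowercased()]
}

let lookup = makeCanonicalLookup([
    Term(text: "macOS", aliases: ["Mac OS", "Mac O S", "Macos"]),
])
let resolved = canonicalForm(of: "mac os", in: lookup)
let unknown = canonicalForm(of: "Linux", in: lookup)
```

Here `resolved` is the canonical "macOS" even though the matched form was an alias, and `unknown` is nil because the word is not in the vocabulary.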

Detection Thresholds

| Parameter | Default | Description |
| --- | --- | --- |
| defaultMinSpotterScore | -15.0 | Minimum CTC score for keyword spotting |
| defaultMinVocabCtcScore | -12.0 | Minimum CTC score for vocabulary matching |
| defaultCbw | 3.0 | Context-biasing weight boost |
| defaultMinSimilarity | 0.52 | Minimum string similarity |
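
To make the two "minimum" thresholds concrete, the sketch below gates a candidate replacement on a CTC score and a string similarity. It rests on several assumptions: the similarity metric shown is normalized Levenshtein distance, which may differ from the metric FluidAudio uses; `accepts` is a hypothetical helper; and the way `defaultCbw` enters scoring is not described in this section, so it is left out.

```swift
import Foundation

// Levenshtein edit distance between two strings, computed over Characters.
func editDistance(_ a: String, _ b: String) -> Int {
    let s = Array(a), t = Array(b)
    if s.isEmpty { return t.count }
    if t.isEmpty { return s.count }
    var previous = Array(0...t.count)
    for i in 1...s.count {
        var current = [i] + Array(repeating: 0, count: t.count)
        for j in 1...t.count {
            let substitution = previous[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1)
            current[j] = min(previous[j] + 1, current[j - 1] + 1, substitution)
        }
        previous = current
    }
    return previous[t.count]
}

// Similarity in [0, 1]: 1 minus the edit distance divided by the longer length.
// Assumption: this metric is illustrative, not necessarily the library's.
func similarity(_ a: String, _ b: String) -> Double {
    let longest = max(a.count, b.count)
    guard longest > 0 else { return 1.0 }
    return 1.0 - Double(editDistance(a, b)) / Double(longest)
}

// Default values copied from the thresholds table.
let minVocabCtcScore = -12.0
let minSimilarity = 0.52

// Hypothetical gate: accept a replacement only if both thresholds pass.
func accepts(ctcScore: Double, transcriptWord: String, term: String) -> Bool {
    ctcScore >= minVocabCtcScore
        && similarity(transcriptWord.lowercased(), term.lowercased()) >= minSimilarity
}

let closeAndConfident = accepts(ctcScore: -8.0, transcriptWord: "tensorart", term: "TensorRT")
let closeButLowScore = accepts(ctcScore: -14.0, transcriptWord: "tensorart", term: "TensorRT")
let confidentButUnrelated = accepts(ctcScore: -8.0, transcriptWord: "banana", term: "TensorRT")
```

Only the first candidate passes: the second fails the score threshold and the third fails the similarity threshold. Raising `minSimilarity` trades recall for precision in this sketch.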

Vocabulary Size Guidelines

| Size | Performance | Notes |
| --- | --- | --- |
| 1-50 terms | Excellent | Typical use case |
| 50-100 terms | Good | No noticeable latency |
| 100-230 terms | Tested | Validated with domain-specific lists |

Memory

| Configuration | Peak RAM |
| --- | --- |
| TDT encoder only | ~66 MB |
| TDT + CTC encoders | ~130 MB |

Why Batch Only

Custom vocabulary requires the complete CTC log-probability matrix for accurate scoring. Streaming ASR processes audio in small chunks (160-320ms), which is too short for reliable keyword spotting and rescoring. Keywords spanning chunk boundaries would be missed, and the rescorer cannot look ahead to future frames for optimal alignment.
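
The chunk-boundary problem can be shown with simple frame arithmetic. The sketch below assumes 40 ms frames and fixed-size, non-overlapping chunks, which is a simplification of real streaming pipelines; `fitsInOneChunk` is a hypothetical helper, and the 30-second "batch" window is just a stand-in for a whole utterance.

```swift
// Assumed frame duration in milliseconds (~40 ms, per the Architecture section).
let frameMs = 40

// Returns true when the frame range [startFrame, endFrame) lies inside a single chunk,
// assuming fixed-size, non-overlapping chunks.
func fitsInOneChunk(startFrame: Int, endFrame: Int, chunkMs: Int) -> Bool {
    let framesPerChunk = chunkMs / frameMs
    guard framesPerChunk > 0, endFrame > startFrame else { return false }
    return startFrame / framesPerChunk == (endFrame - 1) / framesPerChunk
}

// A keyword occupying frames 6..<18 (240 ms to 720 ms, so 480 ms long).
let inStreamingChunk = fitsInOneChunk(startFrame: 6, endFrame: 18, chunkMs: 320)
let inBatchWindow = fitsInOneChunk(startFrame: 6, endFrame: 18, chunkMs: 30_000)
```

With 320 ms chunks (8 frames each), the keyword starts in the first chunk and ends in the third, so no single chunk sees it whole. Over the full window it is contained entirely, which is the situation batch mode provides.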

Benchmarks

Earnings22 (771 files, 3.2h audio — earnings call transcripts with domain-specific terms):
| Metric | Value |
| --- | --- |
| Average WER | 15.0% |
| Vocab Precision | 99.3% (TP=1068, FP=8) |
| Vocab Recall | 85.2% (TP=1068, FN=185) |
| Vocab F-score | 91.7% |
| Dict Pass (Recall) | 99.3% (1299/1308) |
| RTFx | 63.4x |
Precision answers “of the words we output, how many were correct?” Recall answers “of the words that should appear, how many did we find?”

The 63.4x RTFx is lower than TDT-only (156x) because two encoders run on the same audio, but it is still well above real time.
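
The reported precision, recall, and F-score can be reproduced from the TP, FP, and FN counts in the table. The sketch below does that arithmetic and also converts RTFx into approximate processing time, using the usual definition of RTFx as audio duration divided by processing time; the variable names are illustrative, and the 3.2 h figure is the rounded value quoted above.

```swift
import Foundation

// Counts reported in the benchmark table.
let truePositives = 1068.0
let falsePositives = 8.0
let falseNegatives = 185.0

// Precision: share of emitted vocabulary words that were correct.
let precision = truePositives / (truePositives + falsePositives)
// Recall: share of expected vocabulary words that were found.
let recall = truePositives / (truePositives + falseNegatives)
// F-score as the harmonic mean of precision and recall.
let fScore = 2 * precision * recall / (precision + recall)

// RTFx = audio duration / processing time, so processing time = audio duration / RTFx.
let audioSeconds = 3.2 * 3600
let boostedSeconds = audioSeconds / 63.4
let tdtOnlySeconds = audioSeconds / 156

// Formats a fraction as a percentage with one decimal place.
func percent(_ x: Double) -> String { String(format: "%.1f%%", x * 100) }

print(percent(precision), percent(recall), percent(fScore))
// prints "99.3% 85.2% 91.7%"
```

On these numbers, the full dataset takes roughly 182 seconds with vocabulary boosting versus roughly 74 seconds for TDT alone.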