When to Use
- Best quality, full generation — Kokoro generates all frames at once. Use when you can wait for complete audio before playback.
- Need streaming/immediate playback — Use PocketTTS instead (~80ms to first audio).
Specs
| Metric | Value |
|---|---|
| Parameters | 82M |
| Voices | 48 |
| Speed | 23x RTFx |
| Peak RAM | 1.5 GB |
| Architecture | Flow matching + Vocos vocoder |
| Phonemization | eSpeak-NG (GPL-3.0) |
Quick Start
CLI
Swift
Chunk Metadata
Pipeline
Pronunciation Control
Kokoro supports three ways to override pronunciation:
- SSML tags: `<phoneme>`, `<sub>`, `<say-as>`. See SSML documentation.
- Custom lexicon: word → IPA mapping files loaded via `setCustomLexicon()`. See Custom Pronunciation.
- Markdown syntax: inline `[word](/ipa/)` overrides in the input text.
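
To make the inline Markdown form concrete, here is a minimal sketch of how `[word](/ipa/)` spans could be located in input text. It is not Kokoro's actual parser: `PronunciationOverride` and `findOverrides(in:)` are hypothetical names introduced for illustration, and the IPA string in the sample is only an example value.

```swift
import Foundation

/// One inline pronunciation override found in the input text.
/// Hypothetical type for illustration; not part of the Kokoro API.
struct PronunciationOverride: Equatable {
    let word: String
    let ipa: String
}

/// Finds every `[word](/ipa/)` span in `text`.
/// Hypothetical helper; the real parser may accept a different grammar.
func findOverrides(in text: String) -> [PronunciationOverride] {
    // `[word]` followed directly by `(/ipa/)`.
    // Capture group 1 is the word, capture group 2 is the IPA string.
    let pattern = #"\[([^\]]+)\]\(/([^/]+)/\)"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }

    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: fullRange).compactMap { match -> PronunciationOverride? in
        guard let wordRange = Range(match.range(at: 1), in: text),
              let ipaRange = Range(match.range(at: 2), in: text) else { return nil }
        return PronunciationOverride(word: String(text[wordRange]),
                                     ipa: String(text[ipaRange]))
    }
}

// Illustrative input; the IPA value is an example, not a verified transcription.
let overrides = findOverrides(in: "I like [tomato](/təˈmɑːtoʊ/) soup.")
for item in overrides {
    print("\(item.word) -> \(item.ipa)")
}
```

The sketch only extracts the word and IPA pair. How the override is then applied during phonemization is up to the library.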
Kokoro vs PocketTTS
| | Kokoro | PocketTTS |
|---|---|---|
| Pipeline | text → espeak G2P → IPA → model | text → SentencePiece → model |
| Voice conditioning | Style embedding vector | 125 audio prompt tokens |
| Generation | All frames at once | Frame-by-frame autoregressive |
| Latency to first audio | Must wait for full generation | ~80ms after prefill |
| SSML support | Yes (<phoneme>, <sub>, <say-as>) | No |
| Custom lexicon | Yes (word → IPA) | No |
| Pronunciation control | Full (phoneme-level) | None (model decides internally) |
| Text preprocessing | Full (numbers, dates, currencies) | Minimal (whitespace, punctuation) |
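
The latency row can be turned into rough numbers. The sketch below assumes that RTFx means seconds of audio produced per second of wall-clock time, so that Kokoro's wait before playback is roughly the audio duration divided by RTFx. It uses the 23x figure from the Specs table and ignores model loading and phonemization. `kokoroWaitSeconds` is a hypothetical helper, not a library function, and the PocketTTS figure is the ~80ms "after prefill" value from the table, so it excludes prefill time.

```swift
import Foundation

/// Rough wait before Kokoro playback can start, in seconds.
/// Assumption: generation time = audio duration / RTFx.
/// Model load and phonemization time are ignored.
func kokoroWaitSeconds(audioSeconds: Double, rtfx: Double = 23.0) -> Double {
    return audioSeconds / rtfx
}

// ~80 ms to first audio after prefill, taken from the comparison table.
let pocketFirstAudioSeconds = 0.080

for duration in [1.0, 10.0, 60.0, 300.0] {
    let wait = kokoroWaitSeconds(audioSeconds: duration)
    print(String(format: "%5.0f s audio: Kokoro ~%.2f s, PocketTTS ~%.2f s to first audio",
                 duration, wait, pocketFirstAudioSeconds))
}
```

Under these assumptions the gap widens with length: the Kokoro wait scales with the duration of the requested audio, while the PocketTTS figure stays constant.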
Benchmarks
Same text samples, generating 1 s to ~300 s of output audio, on an M4 Pro:

| Framework | RTFx | Peak RAM | Notes |
|---|---|---|---|
| Swift CoreML | 23.2x | 1.50 GB | Lowest memory |
| MLX | 23.8x | 3.37 GB | — |
| PyTorch CPU | 17.0x | 4.85 GB | Known memory leak |
| PyTorch MPS | 10.0x | 1.54 GB | Crashes on long strings |