## Overview

PocketTTS (~155M params) is an autoregressive TTS backend that generates audio frame by frame. It has no espeak dependency; text is tokenized directly with SentencePiece. Audio starts streaming ~80 ms after prefill. Model: `FluidInference/pocket-tts-coreml`.

## Quick Start
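A minimal usage sketch. The `PocketTtsManager` type and its initializer are assumptions; only `synthesize()` is referenced elsewhere in this document.

```swift
import FluidAudio  // assumed module name

// Minimal sketch, not the definitive API: `PocketTtsManager` is a placeholder;
// only `synthesize()` appears elsewhere in this document.
@main
struct QuickStart {
    static func main() async throws {
        let tts = try await PocketTtsManager()                    // loads and compiles the 4 CoreML models on first run
        let audio = try await tts.synthesize("Hello from PocketTTS.")
        print("Generated \(audio.count) samples")                 // streaming starts ~80 ms after prefill
    }
}
```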
## Architecture
### Key State
#### KV Cache

- 6 cache tensors of shape `[2, 1, 512, 16, 64]`, plus 6 position counters
- Reset per chunk
#### Mimi State

- 23 tensors for convolution history, attention caches, and overlap-add buffers
- Carried continuously across chunks to keep the audio seamless (sketched below)
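A sketch of how this state split might be modeled; the type and helper names are illustrative, not the actual implementation.

```swift
import CoreML

// Illustrative model of the state split described above; names are assumptions.
struct GenerationState {
    // Transformer KV cache: 6 tensors of shape [2, 1, 512, 16, 64],
    // plus 6 position counters. Re-created for every text chunk.
    var kvCache: [MLMultiArray]
    var positions: [Int]

    // Mimi decoder state: 23 tensors (convolution history, attention caches,
    // overlap-add buffers). Persists across chunks so the audio stays seamless.
    var mimiState: [MLMultiArray]

    // Called at chunk boundaries: only the transformer side is cleared.
    mutating func resetForNextChunk() throws {
        positions = Array(repeating: 0, count: positions.count)
        kvCache = try kvCache.map { try zeroed(like: $0) }
    }
}

// Hypothetical helper: a zero-filled array with the same shape and type.
func zeroed(like array: MLMultiArray) throws -> MLMultiArray {
    let fresh = try MLMultiArray(shape: array.shape, dataType: array.dataType)
    for i in 0..<fresh.count { fresh[i] = 0 }
    return fresh
}
```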
### Text Chunking

Long text is split into chunks of at most 50 tokens, preferring, in order (see the sketch after this list):

- Sentence boundaries (`.`, `!`, `?`)
- Clause boundaries (`,`, `;`, `:`)
- Word boundaries (fallback)
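A minimal sketch of that splitting order, assuming a `tokenCount` closure that wraps the SentencePiece tokenizer; the real splitter presumably preserves the boundary punctuation, which `components(separatedBy:)` drops here.

```swift
import Foundation

// Illustrative sketch of the splitting order above; `tokenCount` stands in for
// the SentencePiece tokenizer, and punctuation handling is simplified.
func chunk(_ text: String, maxTokens: Int = 50, tokenCount: (String) -> Int) -> [String] {
    if tokenCount(text) <= maxTokens { return [text] }

    // Try progressively weaker split points: sentence, clause, then word.
    let boundaries: [CharacterSet] = [
        CharacterSet(charactersIn: ".!?"),
        CharacterSet(charactersIn: ",;:"),
        .whitespaces,
    ]

    for boundary in boundaries {
        let pieces = text.components(separatedBy: boundary)
            .map { $0.trimmingCharacters(in: .whitespaces) }
            .filter { !$0.isEmpty }

        // Greedily repack pieces while staying under the token budget.
        var chunks: [String] = []
        var current = ""
        for piece in pieces {
            let candidate = current.isEmpty ? piece : current + " " + piece
            if tokenCount(candidate) <= maxTokens {
                current = candidate
            } else {
                if !current.isEmpty { chunks.append(current) }
                current = piece
            }
        }
        if !current.isEmpty { chunks.append(current) }

        // Accept this boundary level only if every chunk fits the budget.
        if chunks.allSatisfy({ tokenCount($0) <= maxTokens }) { return chunks }
    }
    return [text]  // fall back to a single over-budget chunk
}
```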
## Pipeline
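The flow below is a rough sketch assembled from the surrounding sections; the closure parameters stand in for internal components and are not an actual API.

```swift
// Rough per-utterance flow, sketched from this document; the closures are
// placeholders for internal stages, not public API.
func runPipeline(
    text: String,
    tokenize: (String) -> [Int32],                      // SentencePiece, no espeak / IPA stage
    chunk: ([Int32]) -> [[Int32]],                      // <= 50 tokens per chunk
    resetTransformerState: () -> Void,                  // KV cache + position counters, per chunk
    generateFrames: ([Int32]) -> AnyIterator<[Float]>,  // prefill, then frame-by-frame generation
    decodeWithMimi: ([Float]) -> [Float],               // Mimi state persists across chunks
    emit: ([Float]) -> Void                             // stream PCM to the caller (~80 ms to first audio)
) {
    let tokens = tokenize(text)
    for chunkTokens in chunk(tokens) {
        resetTransformerState()
        for frame in generateFrames(chunkTokens) {
            emit(decodeWithMimi(frame))
        }
    }
}
```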
## Pronunciation Control

| Feature | Supported | Why |
|---|---|---|
| SSML `<phoneme>` | No | No IPA layer; the model has no phoneme vocabulary |
| Custom lexicon (word → IPA) | No | No phoneme stage to apply mappings |
| Markdown `[word](/ipa/)` | No | Same; no phoneme input |
| SSML `<sub>` (text substitution) | Planned | Text-level, can run before the tokenizer |
| Text preprocessing (numbers, dates) | Planned | Text-level, can run before the tokenizer |
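Both planned rows are plain text rewrites that run before the tokenizer. A rough sketch, with the function name, substitution format, and number handling all assumed:

```swift
import Foundation

// Illustrative text-level pre-pass; runs before SentencePiece tokenization.
// The dictionary-based substitution mirrors what SSML <sub> would provide.
func preprocess(_ text: String, substitutions: [String: String] = [:]) -> String {
    var result = text

    // SSML <sub>-style replacement, driven by a plain dictionary here
    // (e.g. "Dr." -> "Doctor", "km/h" -> "kilometers per hour").
    for (alias, spoken) in substitutions {
        result = result.replacingOccurrences(of: alias, with: spoken)
    }

    // Minimal number normalization: spell out standalone integers.
    // A production version would also handle dates, ordinals, currencies, etc.
    let formatter = NumberFormatter()
    formatter.numberStyle = .spellOut
    let words = result.split(separator: " ").map { token -> String in
        if let value = Int(token), let spelled = formatter.string(from: NSNumber(value: value)) {
            return spelled
        }
        return String(token)
    }
    return words.joined(separator: " ")
}

// preprocess("Meet Dr. Chen at 3", substitutions: ["Dr.": "Doctor"])
// -> "Meet Doctor Chen at three"
```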
## CoreML Details

- All 4 models are loaded with `.cpuAndGPU` (ANE float16 causes artifacts in the Mimi state); see the sketch after this list
- Compiled from `.mlpackage` to `.mlmodelc` on first load, cached on disk
- Thread-safe via the actor pattern
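A sketch of that load path. `MLModelConfiguration`, `MLComputeUnits.cpuAndGPU`, and `MLModel.compileModel(at:)` are standard Core ML APIs; the cache layout and function name are assumptions.

```swift
import CoreML
import Foundation

// Sketch of the load path described above; paths and names are placeholders.
func loadModel(packageURL: URL, cacheDirectory: URL) throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndGPU   // ANE float16 introduces artifacts in Mimi state

    // Compile .mlpackage -> .mlmodelc once, then reuse the cached copy.
    let cachedURL = cacheDirectory
        .appendingPathComponent(packageURL.deletingPathExtension().lastPathComponent)
        .appendingPathExtension("mlmodelc")

    if !FileManager.default.fileExists(atPath: cachedURL.path) {
        try FileManager.default.createDirectory(at: cacheDirectory, withIntermediateDirectories: true)
        let compiledURL = try MLModel.compileModel(at: packageURL)   // temporary location
        try FileManager.default.copyItem(at: compiledURL, to: cachedURL)
    }
    return try MLModel(contentsOf: cachedURL, configuration: config)
}
```

An actor owning the returned `MLModel` instances (plus the mutable KV cache and Mimi state) would provide the thread safety mentioned above.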
## Benchmarks

Benchmarks are in progress. Methodology follows Kyutai's evaluation and their `tts_longeval` toolkit.

### Upstream (Kyutai, CPU)

LibriSpeech test-clean, WER via Whisper large-v3:

| Metric | PocketTTS (100M) | F5-TTS | DSM (313M) |
|---|---|---|---|
| WER | 1.84% | 2.21% | 1.84% |
| Audio Quality (ELO) | 2016 | — | — |
| Speaker Similarity (ELO) | 1898 | — | — |
| Runs on CPU | Yes (6x real-time) | No | No |
### FluidAudio CoreML (planned)

We will benchmark the CoreML port against the upstream PyTorch CPU baseline using the same methodology:

| Metric | How | Dataset |
|---|---|---|
| WER | Transcribe TTS output with Whisper large-v3, compare to input text | LibriSpeech test-clean |
| Speaker Similarity | WavLM cosine similarity between prompt audio and generated audio | LibriSpeech test-clean |
| RTFx | Audio duration / wall-clock generation time | Variable length (1s to 300s) |
| Time to First Audio | Time from the `synthesize()` call to the first audio frame | Single sentence |
| Peak RAM | Instruments / os_proc_memory during generation | Variable length |
Evaluation datasets:

- NTREX — monologue sentences from a news translation corpus
- Synthetic Dialogs — daily life, technical, and number-heavy scripts
- SEED English — adapted from ByteDance's SEED TTS Eval
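The RTFx and time-to-first-audio rows could be measured with a small harness like the one below; the `synthesizeStreaming` callback shape and the 24 kHz output rate are assumptions.

```swift
import Foundation

// Illustrative timing harness for the RTFx and time-to-first-audio metrics.
// `synthesizeStreaming` and its callback shape are assumptions, not the real API.
func measure(
    text: String,
    sampleRate: Double = 24_000,  // assumed Mimi output rate
    synthesizeStreaming: (String, ([Float]) -> Void) -> Void
) -> (rtfx: Double, timeToFirstAudio: Double) {
    var samples = 0
    var firstAudioAt: Double?

    let start = Date()
    synthesizeStreaming(text) { frame in
        if firstAudioAt == nil {
            firstAudioAt = Date().timeIntervalSince(start)   // time to first audio frame
        }
        samples += frame.count
    }
    let wallClock = Date().timeIntervalSince(start)

    let audioDuration = Double(samples) / sampleRate
    return (rtfx: audioDuration / wallClock,                 // > 1 means faster than real time
            timeToFirstAudio: firstAudioAt ?? wallClock)
}
```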