Documentation Index
Fetch the complete documentation index at: https://docs.fluidinference.com/llms.txt
Use this file to discover all available pages before exploring further.
Overview
PocketTTS (~155M params) is an autoregressive TTS backend that generates audio frame-by-frame. No espeak dependency — uses SentencePiece tokenization directly. Audio starts streaming ~80ms after prefill.
Model: FluidInference/pocket-tts-coreml
Quick Start
```swift
import FluidAudioTTS

let manager = PocketTtsManager()
try await manager.initialize()

let audioData = try await manager.synthesize(text: "Hello, world!")

try await manager.synthesizeToFile(
    text: "Hello, world!",
    outputURL: URL(fileURLWithPath: "/tmp/output.wav")
)
```
Architecture
```
PocketTtsManager.synthesize(text:)
  → chunkText()             — split into chunks of at most 50 tokens
  → loadMimiInitialState()  — 23 streaming state tensors
  → FOR EACH CHUNK:
      → tokenizer.encode()  — SentencePiece
      → embedTokens()       — table lookup
      → prefillKVCache()    — 125 voice + N text tokens
      → GENERATE LOOP:
          → runFlowLMStep() — transformer_out + eos_logit
          → flowDecode()    — 8 Euler steps → 32-dim latent
          → denormalize() → quantize() → runMimiDecoder()
          → 1920 audio samples per frame
  → concatenate + postprocess
  → WAV output (24kHz mono)
```
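The `flowDecode()` step above integrates a learned velocity field for 8 fixed Euler steps, starting from noise and ending at a 32-dim latent. A minimal sketch of that integration, with a toy `velocity` closure standing in for the real CoreML flow model (function names and signatures here are illustrative, not the actual FluidAudioTTS API):

```swift
import Foundation

/// Flow-matching decode sketch: start from Gaussian-ish noise and
/// integrate the velocity field with fixed-step Euler updates.
/// In PocketTTS the result is one 32-dim latent per audio frame.
func flowDecode(
    dim: Int = 32,
    steps: Int = 8,
    velocity: ([Double], Double) -> [Double]
) -> [Double] {
    var x = (0..<dim).map { _ in Double.random(in: -1...1) }  // noise init
    let dt = 1.0 / Double(steps)
    for step in 0..<steps {
        let t = Double(step) * dt
        let v = velocity(x, t)             // model call in the real pipeline
        for i in 0..<dim { x[i] += dt * v[i] }  // Euler update: x += v·dt
    }
    return x
}

// Toy velocity field that pulls every coordinate toward 0.5.
let latent = flowDecode { x, _ in x.map { 0.5 - $0 } }
print(latent.count)  // → 32
```

Eight steps is a common trade-off for flow decoders: few enough calls to keep the per-frame latency budget, enough that the Euler integration error stays inaudible.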
Key State
KV Cache
- 6 cache tensors `[2, 1, 512, 16, 64]` + 6 position counters
- Reset per chunk
Mimi State
- 23 tensors for convolution history, attention caches, overlap-add buffers
- Continuous across chunks — keeps audio seamless
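The different lifetimes of the two state sets can be sketched as follows (the types and function names are hypothetical stand-ins, not the actual FluidAudioTTS API):

```swift
/// Hypothetical stand-in for the transformer KV cache described above.
struct KVCache {
    var positions = [Int](repeating: 0, count: 6)   // 6 position counters
    mutating func reset() { positions = [Int](repeating: 0, count: 6) }
}

/// Hypothetical stand-in for the persistent Mimi decoder state.
struct MimiState {
    var tensors: [[Float]]   // 23 streaming tensors in the real model
}

/// Stand-in for prefill + generation; the real code reads and writes
/// both states through the CoreML models.
func decodeChunk(_ chunk: String, kv: inout KVCache, mimi: inout MimiState) {}

func synthesize(chunks: [String]) {
    var kv = KVCache()
    var mimi = MimiState(tensors: Array(repeating: [], count: 23))
    for chunk in chunks {
        kv.reset()                                  // fresh context per chunk
        decodeChunk(chunk, kv: &kv, mimi: &mimi)    // Mimi state carries over
    }
}
```

Resetting the KV cache bounds attention cost per chunk, while the persistent Mimi state (convolution history, overlap-add buffers) is what prevents audible seams at chunk joins.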
Text Chunking
Long text is split into chunks of at most 50 tokens, breaking preferentially at:
- Sentence boundaries (`.!?`)
- Clause boundaries (`,;:`)
- Word boundaries (fallback)
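The boundary-preference order above can be sketched as a greedy chunker. This is an illustration, not the shipped `chunkText()` implementation, and it uses whitespace word count as a stand-in for the real SentencePiece token count:

```swift
import Foundation

/// Split `text` into chunks of at most `maxTokens` tokens, trying
/// sentence breaks first, then clause breaks, then word breaks.
func chunkText(_ text: String, maxTokens: Int = 50) -> [String] {
    func tokenCount(_ s: String) -> Int {
        s.split(whereSeparator: { $0.isWhitespace }).count
    }
    // Split `s` after each occurrence of any separator character.
    func split(_ s: String, on separators: Set<Character>) -> [String] {
        var parts: [String] = []
        var current = ""
        for ch in s {
            current.append(ch)
            if separators.contains(ch) { parts.append(current); current = "" }
        }
        if !current.isEmpty { parts.append(current) }
        return parts
    }
    // Try coarser boundaries first; fall back to finer ones.
    for seps: Set<Character> in [[".", "!", "?"], [",", ";", ":"], [" "]] {
        let pieces = split(text, on: seps)
        guard pieces.allSatisfy({ tokenCount($0) <= maxTokens }) else { continue }
        // Re-merge adjacent pieces while the combined size still fits.
        var chunks: [String] = []
        for piece in pieces {
            if let last = chunks.last, tokenCount(last + piece) <= maxTokens {
                chunks[chunks.count - 1] = last + piece
            } else {
                chunks.append(piece)
            }
        }
        return chunks
    }
    return [text]  // nothing fit; return one oversized chunk
}
```

For example, `chunkText("One two. Three four.", maxTokens: 3)` splits at the sentence boundary into two chunks, while text that already fits comes back as a single chunk.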
Pipeline
```
text → SentencePiece tokenizer → subword tokens → PocketTTS model → audio
                                                        ↑
                                          pronunciation decisions happen
                                          inside model weights
                                          (no external control)
```
Unlike Kokoro which uses espeak to convert text to IPA phonemes before the model, PocketTTS feeds raw text tokens directly into the neural network. The model learned text→pronunciation mappings during training — there is no phoneme stage to intercept.
Pronunciation Control
| Feature | Supported | Why |
|---|---|---|
| SSML `<phoneme>` | No | No IPA layer — model has no phoneme vocabulary |
| Custom lexicon (word → IPA) | No | No phoneme stage to apply mappings |
| Markdown `[word](/ipa/)` | No | Same — no phoneme input |
| SSML `<sub>` (text substitution) | Planned | Text-level, can run before tokenizer |
| Text preprocessing (numbers, dates) | Planned | Text-level, can run before tokenizer |
What can be added — anything that operates on text before the SentencePiece tokenizer: number/date/currency expansion, text substitution, abbreviation expansion.
What cannot be added without retraining — anything that requires phoneme-level control. The model decides pronunciation from text tokens alone. See Kokoro if you need pronunciation control.
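Because the only available lever is the text itself, any pronunciation influence has to be a rewrite pass before tokenization. A sketch of such a pass, with example rules that are purely illustrative (this is not a shipped feature):

```swift
import Foundation

/// Text-level preprocessing sketch: rewrite the input before the
/// SentencePiece tokenizer ever sees it. Rules here are examples only.
func preprocess(_ text: String) -> String {
    var result = text
    // Abbreviation / substitution rules (written form → spoken form).
    let substitutions = [
        ("Dr.", "Doctor"),
        ("km/h", "kilometers per hour"),
        ("&", "and"),
    ]
    for (pattern, spoken) in substitutions {
        result = result.replacingOccurrences(of: pattern, with: spoken)
    }
    // Naive digit-by-digit expansion; a real pass would handle full
    // numbers, dates, and currency.
    let digits = ["0": "zero", "1": "one", "2": "two", "3": "three",
                  "4": "four", "5": "five", "6": "six", "7": "seven",
                  "8": "eight", "9": "nine"]
    for (d, word) in digits {
        result = result.replacingOccurrences(of: d, with: " \(word) ")
    }
    return result.replacingOccurrences(of: "  ", with: " ")
        .trimmingCharacters(in: .whitespaces)
}

print(preprocess("Dr. Lee drove at 90 km/h"))
// → "Doctor Lee drove at nine zero kilometers per hour"
```

This is exactly the shape of the "Planned" rows in the table above: substitutions and expansions that run on plain text, upstream of the model.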
CoreML Details
- All 4 models loaded with `.cpuAndGPU` (ANE float16 causes artifacts in Mimi state)
- Compiled from `.mlpackage` → `.mlmodelc` on first load, cached on disk
- Thread-safe via actor pattern
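The compile-and-cache pattern described above looks roughly like the following. This is a sketch of the general Core ML loading pattern, not the actual FluidAudio implementation; the function name and paths are illustrative:

```swift
import CoreML
import Foundation

/// Compile the .mlpackage to .mlmodelc once, keep the compiled bundle
/// on disk, and load with CPU+GPU compute units so the ANE float16
/// path never touches the Mimi state.
func loadModel(packageURL: URL, cacheDir: URL) throws -> MLModel {
    let compiledURL = cacheDir
        .appendingPathComponent(packageURL.deletingPathExtension().lastPathComponent)
        .appendingPathExtension("mlmodelc")
    if !FileManager.default.fileExists(atPath: compiledURL.path) {
        // First load: Core ML compiles into a temporary location;
        // move the result into our cache directory.
        let tempURL = try MLModel.compileModel(at: packageURL)
        try FileManager.default.moveItem(at: tempURL, to: compiledURL)
    }
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndGPU   // avoid ANE float16 artifacts
    return try MLModel(contentsOf: compiledURL, configuration: config)
}
```

Caching the `.mlmodelc` matters because `MLModel.compileModel(at:)` can take seconds per model; with four models, recompiling on every launch would dominate cold-start time.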
Benchmarks
Benchmarks in progress. Methodology follows Kyutai’s evaluation and their tts_longeval toolkit.
Upstream (Kyutai, CPU)
LibriSpeech test-clean, WER via Whisper large-v3:
| Metric | PocketTTS (100M) | F5-TTS | DSM (313M) |
|---|---|---|---|
| WER | 1.84% | 2.21% | 1.84% |
| Audio Quality (ELO) | 2016 | — | — |
| Speaker Similarity (ELO) | 1898 | — | — |
| Runs on CPU | Yes (6x real-time) | No | No |
ELO from human pairwise evaluation (50 raters, 50 samples). Tested on Apple M3 and Intel Core Ultra 7.
FluidAudio CoreML (planned)
We will benchmark the CoreML port against the upstream PyTorch CPU baseline using the same methodology:
| Metric | How | Dataset |
|---|---|---|
| WER | Transcribe TTS output with Whisper large-v3, compare to input text | LibriSpeech test-clean |
| Speaker Similarity | WavLM cosine similarity between prompt audio and generated audio | LibriSpeech test-clean |
| RTFx | Audio duration / wall-clock generation time (higher is faster) | Variable length (1s to 300s) |
| Time to First Audio | Time from synthesize() call to first audio frame | Single sentence |
| Peak RAM | Instruments / os_proc_memory during generation | Variable length |
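The WER row in the table is a word-level edit distance between the input text and the Whisper transcript of the synthesized audio. A self-contained sketch of that metric (normalization here is just lowercasing; a real harness would also strip punctuation):

```swift
import Foundation

/// Word error rate: (substitutions + insertions + deletions) between
/// reference and hypothesis word sequences, divided by reference length.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.lowercased().split(separator: " ").map(String.init)
    let hyp = hypothesis.lowercased().split(separator: " ").map(String.init)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    guard !hyp.isEmpty else { return 1 }
    // dp[i][j] = edit distance between ref[..<i] and hyp[..<j]
    var dp = Array(repeating: Array(repeating: 0, count: hyp.count + 1),
                   count: ref.count + 1)
    for i in 0...ref.count { dp[i][0] = i }
    for j in 0...hyp.count { dp[0][j] = j }
    for i in 1...ref.count {
        for j in 1...hyp.count {
            let sub = ref[i - 1] == hyp[j - 1] ? 0 : 1
            dp[i][j] = min(dp[i - 1][j] + 1,        // deletion
                           dp[i][j - 1] + 1,        // insertion
                           dp[i - 1][j - 1] + sub)  // substitution / match
        }
    }
    return Double(dp[ref.count][hyp.count]) / Double(ref.count)
}

print(wordErrorRate(reference: "hello brave new world",
                    hypothesis: "hello new world"))  // → 0.25 (one deletion)
```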
Additional datasets from tts_longeval:
- NTREX — monologue sentences from news translation corpus
- Synthetic Dialogs — daily life, technical, and number-heavy scripts
- SEED English — adapted from ByteDance’s SEED TTS Eval
Key comparisons: CoreML ANE vs PyTorch CPU (upstream), CoreML vs Kokoro CoreML (FluidAudio internal).
License
CC-BY-4.0, inherited from kyutai/pocket-tts.