When to Use

  • Best quality, full generation — Kokoro generates all frames at once. Use when you can wait for complete audio before playback.
  • Need streaming/immediate playback — Use PocketTTS instead (~80ms to first audio).

Specs

Metric           Value
Parameters       82M
Voices           48
Speed            23x RTFx
Peak RAM         1.5 GB
Architecture     Flow matching + Vocos vocoder
Phonemization    eSpeak-NG (GPL-3.0)

Model: FluidInference/kokoro-82m-coreml

Quick Start

CLI

swift run fluidaudio tts "Welcome to FluidAudio text to speech" \
  --output ~/Desktop/demo.wav \
  --voice af_heart

Swift

import FluidAudioTTS

let manager = TtsManager()
try await manager.initialize()

let audioData = try await manager.synthesize(text: "Hello from FluidAudio!")
try audioData.write(to: URL(fileURLWithPath: "/tmp/demo.wav"))
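
The CLI's --voice flag has a Swift-side counterpart for choosing among the 48 voices. A minimal sketch; the voice: parameter name below is an assumption mirroring the CLI flag, so check TtsManager's synthesize signature:

// Sketch only: the `voice:` parameter name is an assumption mirroring --voice.
let narrated = try await manager.synthesize(
    text: "Hello from FluidAudio!",
    voice: "af_heart"  // one of the 48 bundled voices
)
try narrated.write(to: URL(fileURLWithPath: "/tmp/af_heart-demo.wav"))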

Chunk Metadata

let detailed = try await manager.synthesizeDetailed(
    text: "FluidAudio can report chunk splits for you.",
    variantPreference: .fifteenSecond
)

for chunk in detailed.chunks {
    print("Chunk #\(chunk.index) -> variant: \(chunk.variant), tokens: \(chunk.tokenCount)")
    print("  text: \(chunk.text)")
}

Pipeline

text → espeak G2P → IPA phonemes → Kokoro model → audio
         ↑                ↑
   custom lexicon    SSML <phoneme>
   overrides here    overrides here

Because espeak runs outside the model as a preprocessing step, you can intercept and edit phonemes before they reach the neural network. This is what enables SSML, custom lexicon, and markdown pronunciation control.

Pronunciation Control

Kokoro supports three ways to override pronunciation, combined in the sketch after this list:
  • SSML tags — <phoneme>, <sub>, <say-as>. See SSML documentation.
  • Custom lexicon — word → IPA mapping files loaded via setCustomLexicon(). See Custom Pronunciation.
  • Markdown syntax — inline [word](/ipa/) overrides in the input text.
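
A sketch of all three together, assuming SSML markup is passed inline through synthesize and that setCustomLexicon takes a lexicon file URL; see the linked pages for the confirmed shapes:

// Sketch only: setCustomLexicon's argument shape and inline-SSML handling
// are assumptions; consult the linked docs for the confirmed API.
let manager = TtsManager()
try await manager.initialize()

// 1. Custom lexicon: word → IPA mappings applied to every synthesis.
manager.setCustomLexicon(URL(fileURLWithPath: "/path/to/lexicon.json"))

// 2. Markdown syntax: inline [word](/ipa/) override in the input text.
let british = try await manager.synthesize(
    text: "Say [tomato](/təˈmɑːtəʊ/) the British way."
)

// 3. SSML: <phoneme> pins the pronunciation of a single occurrence.
let ssml = try await manager.synthesize(
    text: #"<speak><phoneme ph="ˈkoʊkəɹoʊ">Kokoro</phoneme> says hi.</speak>"#
)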

Kokoro vs PocketTTS

                         Kokoro                              PocketTTS
Pipeline                 text → espeak G2P → IPA → model     text → SentencePiece → model
Voice conditioning       Style embedding vector              125 audio prompt tokens
Generation               All frames at once                  Frame-by-frame autoregressive
Latency to first audio   Must wait for full generation       ~80ms after prefill
SSML support             Yes (<phoneme>, <sub>, <say-as>)    No
Custom lexicon           Yes (word → IPA)                    No
Pronunciation control    Full (phoneme-level)                None (model decides internally)
Text preprocessing       Full (numbers, dates, currencies)   Minimal (whitespace, punctuation)

Benchmarks

Same text samples, generating 1 s to ~300 s of output audio, measured on an M4 Pro:
Framework      RTFx    Peak RAM   Notes
Swift CoreML   23.2x   1.50 GB    Lowest memory
MLX            23.8x   3.37 GB
PyTorch CPU    17.0x   4.85 GB    Known memory leak
PyTorch MPS    10.0x   1.54 GB    Crashes on long strings
CoreML matches MLX speed with 55% less peak RAM. PocketTTS benchmarks coming soon.
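
To read the RTFx column: it is throughput relative to real time, so generation time is audio duration divided by RTFx. A quick worked example using the table's figures:

// RTFx = seconds of audio generated per second of wall-clock compute.
let rtfx = 23.2                      // Swift CoreML backend, M4 Pro
let audioSeconds = 300.0             // the longest benchmark clip
let wallClock = audioSeconds / rtfx  // ≈ 12.9 s to synthesize
print("300 s of audio in \(wallClock) s")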

Enable in Your Project

Package.swift

dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.7.7"),
],
targets: [
    .target(
        name: "YourTarget",
        dependencies: [
            .product(name: "FluidAudioWithTTS", package: "FluidAudio")
        ]
    )
]

Import

import FluidAudio       // Core (ASR, diarization, VAD)
import FluidAudioTTS    // TTS features