When to Use

  • Best quality, full generation — Kokoro generates all frames at once. Use when you can wait for complete audio before playback.
  • Need streaming/immediate playback — Use PocketTTS instead (~80ms to first audio).

Specs

Metric           Value
Parameters       82M
Voices           48
Speed            23x RTFx
Peak RAM         1.5 GB
Architecture     Flow matching + Vocos vocoder
Phonemization    eSpeak-NG (GPL-3.0)

Model: FluidInference/kokoro-82m-coreml

Quick Start

CLI

swift run fluidaudio tts "Welcome to FluidAudio text to speech" \
  --output ~/Desktop/demo.wav \
  --voice af_heart

Swift

import FluidAudioTTS

let manager = TtsManager()
try await manager.initialize()

let audioData = try await manager.synthesize(text: "Hello from FluidAudio!")
try audioData.write(to: URL(fileURLWithPath: "/tmp/demo.wav"))
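
The CLI's --voice flag has a Swift-side counterpart for choosing among the 48 voices. A minimal sketch; the voice: parameter name below is an assumption mirroring the CLI flag, so check TtsManager's synthesize signature:

// Sketch only: the `voice:` parameter name is an assumption mirroring --voice.
let narrated = try await manager.synthesize(
    text: "Hello from FluidAudio!",
    voice: "af_heart"  // one of the 48 bundled voices
)
try narrated.write(to: URL(fileURLWithPath: "/tmp/af_heart-demo.wav"))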

Chunk Metadata

let detailed = try await manager.synthesizeDetailed(
    text: "FluidAudio can report chunk splits for you.",
    variantPreference: .fifteenSecond
)

for chunk in detailed.chunks {
    print("Chunk #\(chunk.index) -> variant: \(chunk.variant), tokens: \(chunk.tokenCount)")
    print("  text: \(chunk.text)")
}

Pipeline

text → espeak G2P → IPA phonemes → Kokoro model → audio
         ↑                ↑
   custom lexicon    SSML <phoneme>
   overrides here    overrides here

Because espeak runs outside the model as a preprocessing step, you can intercept and edit phonemes before they reach the neural network. This is what enables SSML, custom lexicon, and markdown pronunciation control.

Pronunciation Control

Kokoro supports three ways to override pronunciation, combined in the sketch after this list:
  • SSML tags — <phoneme>, <sub>, <say-as>. See SSML documentation.
  • Custom lexicon — word → IPA mapping files loaded via setCustomLexicon(). See Custom Pronunciation.
  • Markdown syntax — inline [word](/ipa/) overrides in the input text.
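
A sketch of all three together, assuming SSML markup is passed inline through synthesize and that setCustomLexicon takes a lexicon file URL; see the linked pages for the confirmed shapes:

// Sketch only: setCustomLexicon's argument shape and inline-SSML handling
// are assumptions; consult the linked docs for the confirmed API.
let manager = TtsManager()
try await manager.initialize()

// 1. Custom lexicon: word → IPA mappings applied to every synthesis.
manager.setCustomLexicon(URL(fileURLWithPath: "/path/to/lexicon.json"))

// 2. Markdown syntax: inline [word](/ipa/) override in the input text.
let british = try await manager.synthesize(
    text: "Say [tomato](/təˈmɑːtəʊ/) the British way."
)

// 3. SSML: <phoneme> pins the pronunciation of a single occurrence.
let ssml = try await manager.synthesize(
    text: #"<speak><phoneme ph="ˈkoʊkəɹoʊ">Kokoro</phoneme> says hi.</speak>"#
)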

Kokoro vs PocketTTS

                         Kokoro                              PocketTTS
Pipeline                 text → espeak G2P → IPA → model     text → SentencePiece → model
Voice conditioning       Style embedding vector              125 audio prompt tokens
Generation               All frames at once                  Frame-by-frame autoregressive
Latency to first audio   Must wait for full generation       ~80ms after prefill
SSML support             Yes (<phoneme>, <sub>, <say-as>)    No
Custom lexicon           Yes (word → IPA)                    No
Pronunciation control    Full (phoneme-level)                None (model decides internally)
Text preprocessing       Full (numbers, dates, currencies)   Minimal (whitespace, punctuation)

Benchmarks

Same text samples, generating 1 s to ~300 s of output audio, measured on an M4 Pro:
Framework      RTFx    Peak RAM   Notes
Swift CoreML   23.2x   1.50 GB    Lowest memory
MLX            23.8x   3.37 GB
PyTorch CPU    17.0x   4.85 GB    Known memory leak
PyTorch MPS    10.0x   1.54 GB    Crashes on long strings
CoreML matches MLX speed with 55% less peak RAM. PocketTTS benchmarks coming soon.
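
To read the RTFx column: it is throughput relative to real time, so generation time is audio duration divided by RTFx. A quick worked example using the table's figures:

// RTFx = seconds of audio generated per second of wall-clock compute.
let rtfx = 23.2                      // Swift CoreML backend, M4 Pro
let audioSeconds = 300.0             // the longest benchmark clip
let wallClock = audioSeconds / rtfx  // ≈ 12.9 s to synthesize
print("300 s of audio in \(wallClock) s")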

Enable in Your Project

Package.swift

dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.7.7"),
],
targets: [
    .target(
        name: "YourTarget",
        dependencies: [
            .product(name: "FluidAudioWithTTS", package: "FluidAudio")
        ]
    )
]

Import

import FluidAudio       // Core (ASR, diarization, VAD)
import FluidAudioTTS    // TTS features