Converting Models - Fluid Inference

Each model conversion in möbius follows a three-step workflow: export, validate, and (optionally) quantize. The scripts and dependencies for each model are self-contained in that model's directory.

Workflow

PyTorch model
    │
    ▼
convert-*.py          Export to .mlpackage (CoreML)
    │
    ▼
compare-*.py          Validate numerical parity + measure latency
    │
    ▼
quantize_coreml.py    Sweep quantization variants (optional)
    │
    ▼
.mlmodelc / .mlpackage   Ready for FluidAudio or direct CoreML usage

Step 1: Export

Each model directory contains a conversion script (e.g., convert-parakeet.py, convert-coreml.py). The script:

Loads the original PyTorch / NeMo / ONNX model
Traces or scripts the model with fixed input shapes
Converts to CoreML using coremltools
Saves .mlpackage files

cd models/stt/parakeet-tdt-v3-0.6b/coreml
uv sync
uv run python convert-parakeet.py convert \
  --nemo-path /path/to/model.nemo \
  --output-dir ./build
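
The four steps listed above might look roughly like this in Python. This is a minimal sketch under stated assumptions, not the repository's actual script: `export_to_coreml`, `package_path`, the input name `audio_signal`, and the default `(1, 240000)` shape are hypothetical names picked for illustration. The library calls themselves (`torch.jit.trace`, `ct.convert`, `ct.TensorType`, `MLModel.save`) are real. Heavy imports are deferred into the function so the file can be loaded without torch or coremltools installed.

```python
from pathlib import Path


def package_path(output_dir: str, component: str) -> Path:
    """Return the .mlpackage path for a component (hypothetical naming scheme)."""
    return Path(output_dir) / f"{component}.mlpackage"


def export_to_coreml(model, output_dir: str, component: str, num_samples: int = 240_000):
    """Trace a PyTorch module with a fixed input shape and save it as an ML Program."""
    # Imported lazily so this module loads even where the dependencies are missing.
    import coremltools as ct
    import numpy as np
    import torch

    model.eval()
    # Fixed-shape dummy input: one batch of `num_samples` audio samples.
    example = torch.zeros(1, num_samples, dtype=torch.float32)

    with torch.no_grad():
        traced = torch.jit.trace(model, example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="audio_signal", shape=example.shape, dtype=np.float32)],
        convert_to="mlprogram",
        minimum_deployment_target=ct.target.iOS17,
        compute_units=ct.ComputeUnit.CPU_ONLY,
    )

    out = package_path(output_dir, component)
    out.parent.mkdir(parents=True, exist_ok=True)
    mlmodel.save(str(out))
    return out
```

Calling `export_to_coreml(my_module, "./build", "encoder")` would, under these assumptions, write `./build/encoder.mlpackage`.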

Fixed Input Shapes

Models are exported with static input shapes, so each one defines its input contract:

Model	Input Shape	Duration
Parakeet TDT v3	240,000 samples	15s at 16kHz
Parakeet EOU	5,120 samples	320ms at 16kHz
Silero VAD	576 samples	36ms at 16kHz
Silero VAD (256ms)	4,160 samples	260ms at 16kHz (256ms window + 64 samples)
Kokoro (5s variant)	Variable tokens	~5s output
Kokoro (15s variant)	Variable tokens	~15s output
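
Because the audio models accept a fixed number of samples, callers typically have to trim or pad audio before inference. The sketch below shows the arithmetic behind the table and one way to fit a buffer to a fixed length. `samples_for` and `fit_to_length` are hypothetical helpers, and zero-padding short inputs is an assumption; the source does not say how each model handles short audio.

```python
def samples_for(duration_ms: int, sample_rate_hz: int = 16_000) -> int:
    """Number of samples covering `duration_ms` at the given sample rate."""
    return duration_ms * sample_rate_hz // 1000


def fit_to_length(samples: list, target_len: int) -> list:
    """Trim or zero-pad `samples` so the result has exactly `target_len` entries."""
    if len(samples) >= target_len:
        return samples[:target_len]
    # Assumption: short inputs are padded with silence at the end.
    return samples + [0.0] * (target_len - len(samples))


parakeet_len = samples_for(15_000)                  # 15 s window at 16 kHz
chunk = fit_to_length([0.1] * 1000, parakeet_len)   # short clip padded to full length
print(parakeet_len, len(chunk))                     # prints: 240000 240000
```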

Step 2: Validate

Parity scripts run the PyTorch and CoreML models side-by-side on identical inputs, comparing outputs numerically and measuring latency.

uv run python compare-components.py compare \
  --output-dir ./build \
  --model-id nvidia/parakeet-tdt-0.6b-v3 \
  --runs 10 --warmup 3

This produces:

Numerical diff — max absolute error, max relative error, match/no-match per component
Latency comparison — Torch CPU vs CoreML (CPU+ANE) with speedup ratios
Plots — visual comparisons saved to plots/ directory
metadata.json — structured results for CI and reporting
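
The numerical diff can be sketched in a few lines of plain Python. This is an illustration only: `compare_outputs` is a hypothetical helper, the default tolerances are placeholders, and the match rule (the same form as `numpy.allclose`) is an assumption, since the source does not state how match/no-match is decided.

```python
def compare_outputs(reference, candidate, atol=1e-2, rtol=1e-2, eps=1e-12):
    """Compare two equal-length flat sequences of floats.

    Returns the max absolute error, the max relative error, and a match flag.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")

    max_abs = 0.0
    max_rel = 0.0
    match = True
    for ref, cand in zip(reference, candidate):
        abs_err = abs(ref - cand)
        max_abs = max(max_abs, abs_err)
        # eps avoids division by zero when the reference value is 0.
        max_rel = max(max_rel, abs_err / (abs(ref) + eps))
        # Assumed criterion: |ref - cand| <= atol + rtol * |ref|
        if abs_err > atol + rtol * abs(ref):
            match = False

    return {"max_abs_error": max_abs, "max_rel_error": max_rel, "match": match}


torch_out = [0.50, -1.20, 3.00]    # stand-in for flattened PyTorch output
coreml_out = [0.50, -1.21, 3.00]   # stand-in for flattened CoreML output
print(compare_outputs(torch_out, coreml_out))
```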

Example Parity Results (Parakeet TDT v3)

Component	Max Abs Error	Match	Torch CPU	CoreML ANE	Speedup
Encoder	0.005	Yes	1030ms	25ms	40x
Preprocessor	0.484	Yes	2.0ms	1.2ms	1.7x
Decoder	within tolerance	Yes	7.5ms	4.3ms	1.7x
Joint	0.099	Yes	28ms	23ms	1.3x
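
The latency columns and the speedup ratio can be reproduced with a small timing harness. The sketch below mirrors the `--runs` and `--warmup` flags from the command above; `benchmark_ms` and `speedup` are hypothetical helpers, and averaging with the mean is an assumption (the actual script may report a median or another statistic).

```python
import time
from statistics import mean


def benchmark_ms(fn, runs=10, warmup=3):
    """Mean wall-clock latency of fn() in milliseconds, measured after warmup calls."""
    for _ in range(warmup):
        fn()  # warmup calls are executed but not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


latency = benchmark_ms(lambda: sum(range(10_000)), runs=10, warmup=3)
print(f"mean latency: {latency:.3f} ms")
print(f"speedup: {speedup(2.0, 1.2):.1f}x")  # Preprocessor row: prints "speedup: 1.7x"
```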

Step 3: Quantize (Optional)

Quantization reduces model size and can improve latency on ANE. The sweep evaluates multiple strategies and reports the trade-offs.

uv run python quantize_coreml.py \
  --input-dir ./build \
  --output-root ./build_quantized \
  --compute-units ALL --runs 10

Quantization Strategies

Strategy	Size Reduction	Quality Impact	Best For
INT8 per-channel	~2x smaller	Minimal loss	General deployment
INT8 per-tensor	~2x smaller	Significant loss on large models	Small models only
6-bit palettization	~2.5x smaller	Varies by model	Size-constrained devices

Results are saved to quantization_summary.json with per-component quality scores (1.0 = identical to baseline).
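
One way to see why a single shared scale can lose more precision than per-channel scales is to quantize a tiny weight matrix by hand. The sketch below uses generic symmetric linear quantization in plain Python; it is not what `quantize_coreml.py` does internally, and all function names are hypothetical.

```python
def _scale(values):
    """Symmetric INT8 scale: the largest magnitude maps to 127."""
    peak = max(abs(v) for v in values)
    return peak / 127 if peak > 0 else 1.0


def quantize_dequantize(values, scale):
    """Round to signed 8-bit integers with `scale`, then map back to floats."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def int8_per_tensor(weights):
    """One scale shared by every row of the weight matrix."""
    scale = _scale([w for row in weights for w in row])
    return [quantize_dequantize(row, scale) for row in weights]


def int8_per_channel(weights):
    """One scale per output channel (one per row here)."""
    return [quantize_dequantize(row, _scale(row)) for row in weights]


def row_max_errors(weights, restored):
    """Max absolute reconstruction error for each row."""
    return [max(abs(w - r) for w, r in zip(row_w, row_r))
            for row_w, row_r in zip(weights, restored)]


# Row 0 has tiny weights, row 1 has large ones.
weights = [[0.01, -0.02, 0.015], [10.0, -8.0, 5.0]]
print("per-tensor :", row_max_errors(weights, int8_per_tensor(weights)))
print("per-channel:", row_max_errors(weights, int8_per_channel(weights)))
```

With the shared scale, every weight in the small-magnitude row rounds to zero, so that row's error equals its largest weight. With per-channel scales the same row keeps its relative precision.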

Common CoreML Modifications

PyTorch models often need modifications before they can be traced and converted to CoreML. Common patterns:

PyTorch Feature	CoreML Fix
`pack_padded_sequence`	Explicit LSTM states + masking
Dynamic shapes / loops	Fixed shapes, broadcasting
In-place operations	Pure functional transforms
Random generation	Deterministic inputs passed externally
Complex number ops	Real/imaginary split
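
As an illustration of the last row, a complex multiplication can be rewritten using only real arithmetic by carrying the real and imaginary parts separately. The sketch below uses scalars for brevity; in a model the same identity would be applied element-wise to two real tensors. `complex_mul_split` is a hypothetical helper.

```python
def complex_mul_split(a_re, a_im, b_re, b_im):
    """Multiply (a_re + i*a_im) by (b_re + i*b_im) using only real arithmetic.

    (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    """
    return a_re * b_re - a_im * b_im, a_re * b_im + a_im * b_re


# (1 + 2i) * (3 - 1i) = 5 + 5i
re, im = complex_mul_split(1.0, 2.0, 3.0, -1.0)
print(re, im)  # prints: 5.0 5.0
```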

Adding a New Model

Create the directory: models/{class}/{name}/coreml/
Add pyproject.toml with dependencies
Write convert-*.py — export script
Write compare-*.py — validation script (optional but recommended)
Add README.md documenting the conversion
Push converted weights to Hugging Face

mkdir -p models/stt/my-new-model/coreml
cd models/stt/my-new-model/coreml

# Initialize with uv
uv init
uv add coremltools torch

# Write your conversion script
# ... convert-my-model.py

Deployment Targets

Minimum: iOS 17 / macOS 14
Format: MLProgram (.mlpackage for development, .mlmodelc for compiled)
Compute units: Models are traced with CPU_ONLY for determinism; the runtime compute units are chosen when the model is loaded (.cpuAndNeuralEngine, .cpuAndGPU, .all)
