Each model conversion in möbius follows a three-step workflow: export, validate, quantize. The scripts and dependencies are self-contained per model directory.

Workflow

PyTorch model
  ↓ convert-*.py: export to .mlpackage (CoreML)
  ↓ compare-*.py: validate numerical parity + measure latency
  ↓ quantize_coreml.py: sweep quantization variants (optional)
.mlmodelc / .mlpackage: ready for FluidAudio or direct CoreML usage

Step 1: Export

Each model directory contains a conversion script (e.g., convert-parakeet.py, convert-coreml.py). The script:
  1. Loads the original PyTorch / NeMo / ONNX model
  2. Traces or scripts the model with fixed input shapes
  3. Converts to CoreML using coremltools
  4. Saves .mlpackage files
cd models/stt/parakeet-tdt-v3-0.6b/coreml
uv sync
uv run python convert-parakeet.py convert \
  --nemo-path /path/to/model.nemo \
  --output-dir ./build

Fixed Input Shapes

CoreML requires static shapes at export time. Each model defines its input contract:
| Model | Input Shape | Duration |
| --- | --- | --- |
| Parakeet TDT v3 | 240,000 samples | 15s at 16kHz |
| Parakeet EOU | 5,120 samples | 320ms at 16kHz |
| Silero VAD | 576 samples | 36ms at 16kHz |
| Silero VAD (256ms) | 4,160 samples | 256ms at 16kHz |
| Kokoro (5s variant) | Variable tokens | ~5s output |
| Kokoro (15s variant) | Variable tokens | ~15s output |

Step 2: Validate

Parity scripts run the PyTorch and CoreML models side-by-side on identical inputs, comparing outputs numerically and measuring latency.
uv run python compare-components.py compare \
  --output-dir ./build \
  --model-id nvidia/parakeet-tdt-0.6b-v3 \
  --runs 10 --warmup 3
This produces:
  • Numerical diff — max absolute error, max relative error, match/no-match per component
  • Latency comparison — Torch CPU vs CoreML (CPU+ANE) with speedup ratios
  • Plots — visual comparisons saved to plots/ directory
  • metadata.json — structured results for CI and reporting
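The numerical-diff half of that report reduces to a small computation over paired outputs. A sketch (the helper name and tolerance are illustrative, not the script's actual API):

```python
import numpy as np

def parity_report(torch_out: np.ndarray, coreml_out: np.ndarray,
                  atol: float = 1e-2) -> dict:
    """Compare two model outputs numerically, per component."""
    diff = np.abs(torch_out - coreml_out)
    rel = diff / (np.abs(torch_out) + 1e-8)  # epsilon guards division by zero
    return {
        "max_abs_error": float(diff.max()),
        "max_rel_error": float(rel.max()),
        "match": bool(diff.max() <= atol),
    }

# Nearly identical outputs should match within tolerance.
a = np.linspace(0, 1, 8, dtype=np.float32)
report = parity_report(a, a + 1e-4)
```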

Example Parity Results (Parakeet TDT v3)

| Component | Max Abs Error | Match | Torch CPU | CoreML ANE | Speedup |
| --- | --- | --- | --- | --- | --- |
| Encoder | 0.005 | Yes | 1030ms | 25ms | 40x |
| Preprocessor | 0.484 | Yes | 2.0ms | 1.2ms | 1.7x |
| Decoder | within tolerance | Yes | 7.5ms | 4.3ms | 1.7x |
| Joint | 0.099 | Yes | 28ms | 23ms | 1.3x |

Step 3: Quantize (Optional)

Quantization reduces model size and can improve latency on ANE. The sweep evaluates multiple strategies and reports the trade-offs.
uv run python quantize_coreml.py \
  --input-dir ./build \
  --output-root ./build_quantized \
  --compute-units ALL --runs 10

Quantization Strategies

| Strategy | Size Reduction | Quality Impact | Best For |
| --- | --- | --- | --- |
| INT8 per-channel | ~2x smaller | Minimal loss | General deployment |
| INT8 per-tensor | ~2x smaller | Significant loss on large models | Small models only |
| 6-bit palettization | ~2.5x smaller | Varies by model | Size-constrained devices |
Results are saved to quantization_summary.json with per-component quality scores (1.0 = identical to baseline).

Common CoreML Modifications

PyTorch models often need modifications for CoreML tracing. Common patterns:
| PyTorch Feature | CoreML Fix |
| --- | --- |
| pack_padded_sequence | Explicit LSTM states + masking |
| Dynamic shapes / loops | Fixed shapes, broadcasting |
| In-place operations | Pure functional transforms |
| Random generation | Deterministic inputs passed externally |
| Complex number ops | Real/imaginary split |
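The in-place row is the most mechanical of these fixes. A minimal before/after sketch in PyTorch (a generic masking example, not taken from the actual scripts):

```python
import torch

class Before(torch.nn.Module):
    # In-place indexed assignment often fails or mis-traces under CoreML export.
    def forward(self, x, mask):
        x[mask] = 0.0
        return x

class After(torch.nn.Module):
    # Pure functional equivalent: broadcasted multiply, no tensor mutation.
    def forward(self, x, mask):
        return x * (~mask).to(x.dtype)

x = torch.ones(2, 3)
mask = torch.tensor([[True, False, False], [False, True, False]])
assert torch.equal(Before()(x.clone(), mask), After()(x.clone(), mask))
```

Both zero out the masked positions, but only the functional version traces cleanly with fixed shapes.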

Adding a New Model

  1. Create the directory: models/{class}/{name}/coreml/
  2. Add pyproject.toml with dependencies
  3. Write convert-*.py — export script
  4. Write compare-*.py — validation script (optional but recommended)
  5. Add README.md documenting the conversion
  6. Push converted weights to Hugging Face
mkdir -p models/stt/my-new-model/coreml
cd models/stt/my-new-model/coreml

# Initialize with uv
uv init
uv add coremltools torch

# Write your conversion script
# ... convert-my-model.py

Deployment Targets

  • Minimum: iOS 17 / macOS 14
  • Format: MLProgram (.mlpackage for development, .mlmodelc for compiled)
  • Compute units: Models traced with CPU_ONLY for determinism; runtime compute units set when loading (.cpuAndNeuralEngine, .cpuAndGPU, .all)