Add Apple Silicon (MPS) inference support [claude] #358

Open
mparrett wants to merge 4 commits into yl4579:main from mparrett:mps-support

Conversation


@mparrett mparrett commented Feb 26, 2026

Summary

  • Plumb use_fp16 through Decoder → Generator → SourceModuleHnNSF to support fp16 inference on MPS (fixes dtype mismatch when decoder runs in half precision)
  • Add Demo/inference_mps.py — clean MPS inference script with -p flag for one-shot benchmarking
  • Fix `ref_texts` → `texts` variable name bug in LibriTTS demo notebook
  • Add pyproject.toml + lockfile for reproducible uv sync setup
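The dtype fix in the first bullet can be sketched roughly like this. This is a simplified stand-in for the real `SourceModuleHnNSF` (the constructor signature and `tanh` merge here are illustrative, not the exact StyleTTS 2 code):

```python
import torch

class SourceModuleHnNSF(torch.nn.Module):
    """Simplified stand-in for the harmonic-plus-noise source module.

    The real module generates fp32 sine waves and merges them with a
    linear layer; when the decoder runs in half precision the layer's
    weights are fp16, so the sines must be cast before the matmul.
    """

    def __init__(self, harmonic_num=8, use_fp16=False):
        super().__init__()
        self.dtype = torch.float16 if use_fp16 else torch.float32
        self.l_linear = torch.nn.Linear(harmonic_num + 1, 1).to(self.dtype)

    def forward(self, sine_wavs):
        # the fix: cast the fp32 sine waves to the configured dtype
        # so the linear layer never sees mismatched operand dtypes
        return torch.tanh(self.l_linear(sine_wavs.to(self.dtype)))
```

With `use_fp16=False` (the default) the cast is a no-op, which is why the CUDA path is unchanged.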

MPS notes

TextEncoder must stay on CPU because MPS doesn't support pack_padded_sequence; all other modules run on MPS. After text encoding on CPU, tensors are transferred to the MPS device.

Scope is inference only: the training path has a hardcoded .to('cuda') in the Decoder forward method, which is a separate fix.
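The CPU/MPS split described above boils down to this pattern (the GRU and linear layers are hypothetical stand-ins for TextEncoder and the downstream modules; the device selection falls back to CPU when MPS is unavailable):

```python
import torch

# TextEncoder stays on CPU because MPS has no pack_padded_sequence kernel;
# everything downstream runs on the MPS device when available.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

text_encoder = torch.nn.GRU(64, 64, batch_first=True)  # stand-in, kept on CPU
decoder = torch.nn.Linear(64, 80).to(device)           # stand-in for the MPS-side modules

tokens = torch.randn(1, 32, 64)     # inputs start on CPU
encoded, _ = text_encoder(tokens)   # text encoding runs on CPU
mel = decoder(encoded.to(device))   # transfer once, then stay on-device
```

The single `.to(device)` after encoding keeps the CPU↔GPU traffic to one transfer per utterance.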

Usage

```shell
uv sync
USE_MPS=1 uv run python Demo/inference_mps.py -p "Hello world"

# With fp16 decoder (faster on some hardware)
USE_MPS=1 USE_FP16=1 uv run python Demo/inference_mps.py -p "Hello world"
```

Benchmarks (Apple MacBook Air M2, ~60 word passage → 17.7s audio)

| Config     | Inference | RTF  | vs CPU      |
| ---------- | --------- | ---- | ----------- |
| CPU        | 6.1s      | 0.36 | baseline    |
| MPS        | 4.4s      | 0.25 | 1.4x faster |
| MPS + FP16 | 2.9s      | 0.16 | 2.1x faster |

RTF = real-time factor (lower is better). MPS + FP16 sustains roughly 6x real-time synthesis (17.7s of audio in 2.9s).
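For reference, the RTF column is just synthesis wall time divided by generated audio duration (the helper function name here is illustrative):

```python
def rtf(inference_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: wall-clock synthesis time / audio duration.

    Values below 1.0 mean faster than real time; lower is better.
    """
    return inference_seconds / audio_seconds

# figures from the table above (17.7 s of generated audio)
print(round(rtf(2.9, 17.7), 2))  # MPS + FP16 -> 0.16
print(round(rtf(4.4, 17.7), 2))  # MPS        -> 0.25
```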

If the supply of fruit is greater than the family needs, it may be made a source of income by sending the fresh fruit to the market if there is one
near enough, or by preserving, canning, and making jelly for sale. To make such an enterprise a success the fruit and work must be first class.


Test plan

  • Synthesized audio on Apple Silicon M1 with USE_MPS=1
  • Verified fp16 path with USE_MPS=1 USE_FP16=1
  • Verified CUDA path still works (no regressions — default use_fp16=False leaves all paths unchanged)
  • Verified istftnet decoder path is unaffected (no changes to that module)

🤖 Generated with Claude Code

mparrett and others added 4 commits February 25, 2026 22:07
Plumb use_fp16 parameter through Decoder → Generator → SourceModuleHnNSF
to support fp16 inference on MPS. Cast sine_wavs to the configured dtype
before the linear layer to prevent dtype mismatch when the decoder runs
in half precision.

Also remove unused f0_buf allocation in SineGen.forward.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Clean script for running StyleTTS 2 inference on Apple Silicon.
TextEncoder stays on CPU (MPS lacks pack_padded_sequence support), all
other modules run on MPS. Supports optional fp16 decoder via USE_FP16
env var.

Features:
- -p/--prompt flag for one-shot synthesis (useful for benchmarking)
- -r/--reference flag to specify reference audio
- Interactive text input loop when no prompt given
- RTF (real-time factor) timing on each synthesis

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Rename ref_texts to texts in the Style Transfer section to be consistent
with the variable name used in every other section of the notebook.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Mirrors the existing requirements.txt with minimum version pins for
torch, torchaudio, and transformers. Adds phonemizer and scipy which
were missing from requirements.txt but needed at import time.

Enables reproducible setup via: uv sync

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mparrett mparrett changed the title Add Apple Silicon (MPS) inference support Add Apple Silicon (MPS) inference support [claude] Feb 26, 2026
@mparrett mparrett mentioned this pull request Feb 26, 2026