groxaxo/vibevoice-realtimeFASTAPI
πŸŽ™οΈ VibeVoice Realtime Runner


A high-performance local runner for Microsoft's VibeVoice Realtime text-to-speech model. Now with OpenAI-compatible API endpoints!

Features • Quick Start • API Documentation • Credits


🚀 Features

  • Local & Private: Runs entirely on your machine (CUDA/MPS/CPU).
  • Realtime Streaming: Low-latency text-to-speech generation.
• LavaSR Super-Resolution: Neural audio upsampling (24kHz → 48kHz) at 300-500x realtime, enabled by default. Surpasses 6GB diffusion models in quality.
  • OpenAI API Compatible: Drop-in replacement for OpenAI's TTS API.
  • Multiple Audio Formats: Supports Opus (default), WAV, and MP3 output.
  • Web Interface: Built-in interactive demo UI.
  • Multi-Platform: Optimized for Ubuntu (CUDA) and macOS (Apple Silicon).
  • Easy Setup: Powered by uv for fast, reliable dependency management.

⚡ Quick Start

Prerequisites

  • uv installed: curl -LsSf https://astral.sh/uv/install.sh | sh
  • Git
  • Hugging Face Account (for model download)

Installation

  1. Bootstrap the environment:

    ./scripts/bootstrap_uv.sh
  2. Download the model:

    uv run python scripts/download_model.py
  3. Run the server:

    uv run python scripts/run_realtime_demo.py --port 8000

🌐 Frontpage Controls (Web UI)

The frontpage at /web (also available at /) is fully connected to backend endpoints and exposes:

  • Model selection (tts-1, tts-1-hd) for OpenAI-compatible requests
  • Voice selection from GET /config and GET /v1/audio/voices
  • Temperature control (temp): set 0 to disable sampling
  • Generate Audio action that calls POST /v1/audio/speech
  • Download Audio action to export the generated file in the selected format (opus/wav/mp3)

📖 API Documentation

This runner provides OpenAI-compatible endpoints for easy integration with existing tools and libraries.

πŸ—£οΈ Speech Generation

Endpoint: POST /v1/audio/speech

Generates audio from text with LavaSR super-resolution enabled by default (24kHz → 48kHz). This is also the endpoint used by the frontpage "Generate Audio" button.

curl http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, this is VibeVoice running locally!",
    "voice": "en-Carter_man",
    "response_format": "opus"
  }' \
  --output speech.opus
| Parameter | Type | Description |
| --- | --- | --- |
| model | string | Model identifier (e.g., tts-1). Ignored but required for compatibility. |
| input | string | The text to generate audio for. |
| voice | string | The voice ID to use (see /v1/audio/voices). |
| response_format | string | Output format: opus (default, 48kHz), wav, or mp3. |
| temp | float | Sampling temperature. When provided (>0), enables sampling with the given temperature. |
| speed | float | Speed of generation (currently ignored). |
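The curl request above can also be issued from Python using only the standard library. This sketch assumes the server is running locally on port 8000; the helper names (build_speech_payload, synthesize) are illustrative, not part of this project:

```python
import json
import urllib.request

def build_speech_payload(text, voice="en-Carter_man", fmt="opus", temp=None):
    """Build a request body for POST /v1/audio/speech using the documented fields."""
    payload = {"model": "tts-1", "input": text, "voice": voice, "response_format": fmt}
    if temp is not None and temp > 0:
        payload["temp"] = temp  # temp > 0 enables sampling
    return payload

def synthesize(text, url="http://127.0.0.1:8000/v1/audio/speech", **kwargs):
    """POST the payload and return the raw audio bytes."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_speech_payload(text, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

With a running server, `open("speech.opus", "wb").write(synthesize("Hello!"))` reproduces the curl example.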

🎤 List Voices

Endpoint: GET /v1/audio/voices

Returns a list of available voices.

curl http://127.0.0.1:8000/v1/audio/voices

Response:

{
  "voices": [
    {
      "id": "en-Carter_man",
      "name": "en-Carter_man",
      "object": "voice",
      "category": "vibe_voice",
      ...
    },
    ...
  ]
}
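To pick a voice programmatically, the response can be reduced to a list of IDs. A minimal sketch, assuming only the id field shown above; fields beyond those in the sample are truncated in this README:

```python
import json

def voice_ids(response_body):
    """Extract voice IDs from a GET /v1/audio/voices response body (JSON string)."""
    return [v["id"] for v in json.loads(response_body).get("voices", [])]

# Abbreviated sample matching the response shape above.
sample = '{"voices": [{"id": "en-Carter_man", "name": "en-Carter_man", "object": "voice", "category": "vibe_voice"}]}'
```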

❤️ Health Check

Endpoint: GET /health

Returns basic service readiness information, including whether lazy loading is enabled and whether the model has already been initialized.

βš™οΈ Configuration

Device Selection

The runner automatically detects the best available device:

  • CUDA: NVIDIA GPUs (Linux)
  • MPS: Apple Silicon (macOS)
  • CPU: Fallback

To force a specific device:

uv run python scripts/run_realtime_demo.py --device cpu

Inference Steps

Specify the number of DDPM inference steps. Lower values reduce latency and improve realtime responsiveness. The default is 5 (official realtime profile).

uv run python scripts/run_realtime_demo.py --inference-steps 5

Custom Model Path

uv run python scripts/run_realtime_demo.py --model-path /path/to/model

LavaSR Audio Super-Resolution

LavaSR is enabled by default to upsample audio from 24kHz to 48kHz using neural network bandwidth extension. This provides studio-quality 48kHz audio output with minimal performance impact (~2ms per chunk).

To disable LavaSR (output will be 24kHz):

export ENABLE_LAVASR=false
uv run python scripts/run_realtime_demo.py

Or enable it explicitly:

export ENABLE_LAVASR=true
uv run python scripts/run_realtime_demo.py
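Boolean environment flags like ENABLE_LAVASR are typically parsed along these lines; this is a sketch of common truthy/falsy handling, and the server's exact parsing rules may differ:

```python
import os

def env_flag(name, default=True):
    """Interpret an environment variable such as ENABLE_LAVASR as a boolean.
    Unset -> default; common falsy strings -> False; anything else -> True."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in ("0", "false", "no", "off", "")
```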

Benefits of LavaSR:

  • Neural network super-resolution (not simple interpolation)
  • 300-500x realtime speed (~2ms latency per chunk)
  • Higher quality 48kHz audio output
• Direct 24kHz → 48kHz upsampling (no quality loss)
  • Quality surpasses 6GB diffusion models (best LSD scores)
  • Compatible with Opus format for optimal compression

Benchmark Results (RTX 3090):

| Chunk Duration | Upsampling Time | Speed |
| --- | --- | --- |
| 0.25s | 1.9ms | 128x realtime |
| 0.50s | 1.9ms | 263x realtime |
| 1.00s | 1.9ms | 523x realtime |
| 2.00s | 2.1ms | 961x realtime |
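The speed column is just audio duration divided by processing time. A quick sanity check (the measured table values include some overhead, so they sit slightly below this idealized ratio):

```python
def realtime_factor(chunk_s, upsample_ms):
    """Realtime speed multiple = audio duration / processing time."""
    return chunk_s / (upsample_ms / 1000.0)

# e.g. a 0.50s chunk upsampled in 1.9ms -> roughly 263x realtime
```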

📊 Realtime Benchmarking (/stream)

Use the websocket benchmark script to measure TTFA, chunk pacing, and RTF with reproducible settings.

uv run python scripts/benchmark_stream_endpoint.py \
  --ws-url ws://127.0.0.1:8000/stream \
  --voice en-Carter_man \
  --runs 10 \
  --temp 0 \
  --steps 5

The script writes a JSON report to /tmp by default and can compare against a prior run using --baseline-json.
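For reference, the metrics the script reports follow their standard definitions (the script's internals are not shown here; these helpers are illustrative):

```python
def ttfa(request_ts, first_chunk_ts):
    """Time to first audio: seconds between sending the text and
    receiving the first audio chunk over the websocket."""
    return first_chunk_ts - request_ts

def rtf(generation_s, audio_s):
    """Real-time factor: wall-clock generation time per second of audio.
    RTF < 1.0 means generation is faster than realtime."""
    return generation_s / audio_s
```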

🚀 Production Deployment

Important: This is a TTS-only service. Whisper transcription is not automatically launched. Whisper endpoints (if needed for validation) must be run separately.

Starting the Server

# Recommended: Use the provided script with GPU selection
CUDA_VISIBLE_DEVICES=2 uv run python scripts/run_realtime_demo.py --port 8000

# Or run the demo directly
CUDA_VISIBLE_DEVICES=2 uv run python third_party/VibeVoice/demo/vibevoice_realtime_demo.py \
  --port 8000 \
  --model_path models/VibeVoice-Realtime-0.5B \
  --device cuda \
  --inference_steps 5

Note: Replace CUDA_VISIBLE_DEVICES=2 with your available GPU. Check GPU availability with nvidia-smi.

Boot Autostart with Lazy Load

To install a systemd service that binds on all interfaces, listens on port 8881, and defers model initialization until the first speech request:

CUDA_VISIBLE_DEVICES=3 HOST=0.0.0.0 PORT=8881 ./scripts/install_systemd_service.sh
sudo systemctl start vibevoice-realtime.service

This exposes the UI at http://<your-host>:8881/web, keeps the OpenAI-compatible API under /v1/..., and sets ENABLE_LAZY_LOAD=true with ENABLE_STARTUP_WARMUP=false for fast boot-time startup. Adjust CUDA_VISIBLE_DEVICES if you want to pin the service to a different GPU.

Restarting with New Code

After pulling updates, restart the server to apply changes:

# Find and kill existing process
ps aux | grep vibevoice_realtime_demo
kill <PID>

# Restart with new code
CUDA_VISIBLE_DEVICES=2 uv run python scripts/run_realtime_demo.py --port 8000

🔧 Recommended Concurrency

Based on end-to-end benchmarks (TTS + Whisper transcription), the recommended default concurrency is 2 concurrent requests.

Benchmark Results (RTX 3090, 5 inference steps):

| Concurrency | TTS avg/p95 (s) | Whisper avg/p95 (s) | E2E avg (s) | Throughput (req/s) |
| --- | --- | --- | --- | --- |
| 2 | 5.57 / 9.11 | 0.39 / 0.66 | 5.96 | 0.333 |
| 4 | 11.15 / 14.55 | 0.43 / 0.82 | 11.58 | 0.324 |
| 8 | 20.86 / 27.11 | 0.43 / 0.81 | 21.29 | 0.322 |

Key Findings:

  • TTS is the bottleneck; Whisper adds minimal latency (~0.3-0.4s) regardless of concurrency
  • Throughput plateaus at ~0.32-0.33 req/s beyond 2 concurrent requests
  • Latency increases significantly with higher concurrency due to TTS queueing
  • Single-stream RTF: ~0.39 (2.6x faster than realtime)
  • Recommended max concurrent requests: 2 for optimal latency/throughput balance
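Client-side, the recommended cap can be enforced with a semaphore. A minimal asyncio sketch; the synthesize callable is a placeholder for whatever async TTS client you use, and limited_tts / MAX_CONCURRENT are illustrative names, not part of this project:

```python
import asyncio

MAX_CONCURRENT = 2  # recommended cap from the benchmarks above

async def limited_tts(texts, synthesize):
    """Run TTS requests with at most MAX_CONCURRENT in flight at once."""
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def one(text):
        async with sem:  # blocks while 2 requests are already running
            return await synthesize(text)

    return await asyncio.gather(*(one(t) for t in texts))
```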

🎧 Demos

All examples generated using 15 inference steps with text in the voice's native language.

English

| Voice | Audio Example (MP3) |
| --- | --- |
| en-Carter_man | |
| en-Davis_man | |
| en-Emma_woman | |
| en-Frank_man | |
| en-Grace_woman | |
| en-Mike_man | |
| in-Samuel_man | |

Other Languages

| Language | Voice | Audio Example (MP3) |
| --- | --- | --- |
| German | de-Spk0_man | |
| German | de-Spk1_woman | |
| Spanish | sp-Spk0_woman | |
| Spanish | sp-Spk1_man | |
| French | fr-Spk0_man | |
| French | fr-Spk1_woman | |
| Italian | it-Spk0_woman | |
| Italian | it-Spk1_man | |
| Japanese | jp-Spk0_man | |
| Japanese | jp-Spk1_woman | |
| Korean | kr-Spk0_woman | |
| Korean | kr-Spk1_man | |
| Dutch | nl-Spk0_man | |
| Dutch | nl-Spk1_woman | |
| Polish | pl-Spk0_man | |
| Polish | pl-Spk1_woman | |
| Portuguese | pt-Spk0_woman | |
| Portuguese | pt-Spk1_man | |

πŸ† Credits & Acknowledgements

This project stands on the shoulders of giants. Huge thanks to:

  • Microsoft: For releasing the incredible VibeVoice model and the original codebase.
  • ysharma3501/LavaSR: For the high-quality neural audio super-resolution model.
  • groxaxo: For the original repository and initial setup.
  • Kokoro FastAPI Creators: For inspiration on the FastAPI implementation and structure.
  • Open Source Community: For all the tools and libraries that make this possible.

Made with ❤️ for the AI Community

About

Local runner for Microsoft's VibeVoice Realtime TTS. Fully compatible with Open WebUI, plug and play, with an OpenAI-compatible API endpoint. Run the Colab notebook experience locally with uv package management.
