A Python application that controls your VTube Studio model's expressions using real-time voice recognition. It listens to your microphone, transcribes your speech, and triggers facial expressions in VTube Studio when specific keywords are detected.
This document provides a detailed overview of the project's architecture, data flow, and technologies, intended to help developers understand the system's inner workings.
- Real-time Voice Recognition: Listens to your microphone and transcribes speech locally.
- Keyword-Based Expression Triggering: Triggers expressions in VTube Studio when specific keywords are detected.
- Automatic Expression Sync: On startup, automatically discovers expressions from your VTube Studio model and updates the configuration file.
- Flexible Keyword System: Supports both custom keywords from the config file and the expression names from VTube Studio.
- Spam Prevention: Prevents the same expression from being spammed by enforcing a cooldown if triggered twice consecutively.
- GPU Acceleration: Can leverage a CUDA-enabled GPU for faster transcription if `onnxruntime-gpu` is installed (see the provider check sketched after this list).
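As a quick way to confirm that GPU acceleration is actually available, the following sketch inspects ONNX Runtime's execution providers. This is an illustrative check, not code from the project:

```python
# Check whether ONNX Runtime can see a CUDA device. If onnxruntime-gpu is
# installed correctly, "CUDAExecutionProvider" appears in this list;
# otherwise transcription will fall back to the CPU.
import onnxruntime

providers = onnxruntime.get_available_providers()
use_gpu = "CUDAExecutionProvider" in providers
print(f"Available providers: {providers}")
print(f"GPU acceleration: {'available' if use_gpu else 'not available, using CPU'}")
```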
This project uses an asynchronous, event-driven architecture. The main components and the technologies they use are outlined below.
| Component | Technology / Library | Key File(s) | Description |
|---|---|---|---|
| VTS Communication | `pyvts` | `vts_client.py` | Handles all communication with the VTube Studio API, including connection, authentication, and hotkey requests. Uses `asyncio.Lock` to prevent API request race conditions. |
| Voice Recognition (ASR) | `sherpa-onnx` | `voice_engine/recognizer.py` | A wrapper for the sherpa-onnx real-time speech-to-text engine. It processes audio data and returns transcribed text. |
| Audio Input | `sounddevice` | `vts_main.py` | Captures live audio from the default microphone into a buffer for processing. |
| Configuration | `pyyaml` | `vts_config.yaml`, `vts_main.py` | Manages application settings, including VTS connection details and keyword-to-expression mappings. |
| Asynchronous Execution | `asyncio` | `vts_main.py`, `vts_client.py` | The foundation of the application, managing concurrent operations like audio processing and API calls without blocking. |
| Logging | `loguru` | `vts_main.py` | Provides robust and configurable logging to a file (`vts_controller.log`). |
| Model Management | `requests`, `tqdm` | `voice_engine/utils.py` | Downloads and extracts the required sherpa-onnx ASR models on the first run. |
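To make the Configuration row concrete, here is a minimal loading sketch using `pyyaml`. The schema shown (a `vts` section plus an `expressions` map from expression name to trigger keyword) is an assumption for illustration, not the file's confirmed layout:

```python
# Minimal config-loading sketch; the key names below are assumptions.
import yaml
from pathlib import Path

CONFIG_PATH = Path("vts_config.yaml")

DEFAULT_CONFIG = {
    "vts": {"host": "localhost", "port": 8001, "token_file": "vts_token.txt"},
    # Assumed layout: VTS expression name -> trigger keyword.
    "expressions": {"Smile": "happy", "AngryFace": "angry"},
}

def load_config() -> dict:
    """Load vts_config.yaml, writing a default file if none exists."""
    if not CONFIG_PATH.exists():
        CONFIG_PATH.write_text(yaml.safe_dump(DEFAULT_CONFIG), encoding="utf-8")
        return dict(DEFAULT_CONFIG)
    with CONFIG_PATH.open(encoding="utf-8") as f:
        return yaml.safe_load(f)
```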
The application follows a clear, sequential flow from initialization to real-time operation.
- The `main` function in `vts_main.py` is the entry point.
- It loads settings from `vts_config.yaml`. If the file doesn't exist, a default one is created.
- `loguru` is configured to write logs to `vts_controller.log`.
- The application checks if the sherpa-onnx model is present in the `models/` directory.
- If not, `voice_engine.utils.ensure_model_downloaded_and_extracted` downloads and extracts the model files (a download helper is sketched after this list).
- An instance of `VoiceRecognition` is created from `voice_engine/recognizer.py`. This class loads the ONNX model and prepares the `sherpa_onnx.OnlineRecognizer` for transcription.
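The download step can be pictured roughly as below. This is a hedged sketch in the spirit of `ensure_model_downloaded_and_extracted`; the URL handling, archive format, and skip logic are assumptions, not the actual contents of `voice_engine/utils.py`:

```python
# Hypothetical sketch of a model bootstrap helper using requests + tqdm.
import tarfile
from pathlib import Path

import requests
from tqdm import tqdm

def ensure_model(url: str, models_dir: Path = Path("models")) -> None:
    """Download and extract an ASR model archive unless one is present."""
    models_dir.mkdir(exist_ok=True)
    if any(models_dir.rglob("*.onnx")):
        return  # a model is already extracted; nothing to do
    archive = models_dir / url.rsplit("/", 1)[-1]
    resp = requests.get(url, stream=True, timeout=60)
    resp.raise_for_status()
    total = int(resp.headers.get("content-length", 0))
    with archive.open("wb") as f, tqdm(total=total, unit="B", unit_scale=True) as bar:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
            bar.update(len(chunk))
    with tarfile.open(archive) as tar:
        tar.extractall(models_dir)  # sherpa-onnx models ship as tar archives
```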
Once the voice engine is ready, the application connects to VTube Studio and synchronizes expressions:

- An instance of `VTSClient` is created.
- It connects to the VTube Studio API via WebSocket and performs authentication. The authentication token is stored in `vts_token.txt` by default.
- The application sends a request to VTS to get a list of all available hotkeys for the current model.
- It filters this list to find hotkeys of type `ToggleExpression`.
- It then compares this list with the expressions in `vts_config.yaml`:
  - If an expression from VTS is not in the config file, it's added with a placeholder keyword (e.g., `NEW_KEYWORD_MyExpression`).
  - The `vts_config.yaml` file is updated on disk.
- Finally, an in-memory `expression_map` is built. This dictionary maps both the user-defined keywords and the VTS expression names to their corresponding `hotkeyID`, allowing flexible keyword detection (a sketch of this sync step follows the list).
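The sync-and-map step might look roughly like the following. The shape of the hotkey entries (`name`, `type`, `hotkeyID` keys) mirrors the VTS hotkey list response, but the config layout and helper shape are assumptions:

```python
# Hypothetical sketch of expression sync. Assumes hotkeys arrive as dicts
# with "name", "type", and "hotkeyID" keys, and that the config maps
# expression names to trigger keywords.
def build_expression_map(hotkeys: list[dict], config: dict) -> dict[str, str]:
    """Return a lookup from lowercase keyword/expression name to hotkeyID."""
    expressions = config.setdefault("expressions", {})
    toggles = {hk["name"]: hk["hotkeyID"]
               for hk in hotkeys if hk.get("type") == "ToggleExpression"}
    # Add any VTS expression the config file doesn't know about yet.
    for name in toggles:
        if name not in expressions:
            expressions[name] = f"NEW_KEYWORD_{name}"  # placeholder keyword
    # Both the custom keyword and the expression name trigger the hotkey.
    expression_map: dict[str, str] = {}
    for name, keyword in expressions.items():
        if name in toggles:
            expression_map[keyword.lower()] = toggles[name]
            expression_map[name.lower()] = toggles[name]
    return expression_map
```

After this, the updated `config` would be written back to disk (e.g., with `yaml.safe_dump`), matching the sync behavior described above.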
This is the core real-time loop of the application.
- Audio Capture: An `sd.InputStream` from the `sounddevice` library is opened. It continuously captures audio from the microphone in small chunks.
- Buffering: A callback function (`audio_callback`) is triggered for each audio chunk. This function appends the incoming audio data (a NumPy array) to a global `audio_buffer`. Access to this buffer is managed by an `asyncio.Lock` to prevent race conditions.
- Periodic Processing: A separate asynchronous task, `process_audio_buffer_periodically`, runs every 0.5 seconds.
- Transcription: This task takes the current `audio_buffer`, sends it to the `asr_engine.transcribe_np` method, and then clears the buffer. The `transcribe_np` method in `voice_engine/recognizer.py` feeds the audio into the sherpa-onnx engine, which returns the transcribed text if speech is detected. The pipeline is sketched just after this list.
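Assuming the engine expects 16 kHz mono float32 audio and that `transcribe_np` accepts a single NumPy array, the pipeline could be sketched like this. Note that the `sounddevice` callback runs on an audio thread, so this sketch hands chunks to the event loop thread-safely rather than mutating the buffer directly:

```python
# Sketch of the capture -> buffer -> transcribe pipeline (assumptions noted above).
import asyncio

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16_000  # assumed; typical for sherpa-onnx streaming models

async def run_pipeline(asr_engine, on_text) -> None:
    loop = asyncio.get_running_loop()
    audio_buffer: list[np.ndarray] = []
    buffer_lock = asyncio.Lock()  # mirrors the project's lock around the buffer

    def audio_callback(indata, frames, time, status) -> None:
        # Called on sounddevice's audio thread for every captured chunk;
        # schedule the append on the event loop instead of touching the
        # buffer from this thread.
        loop.call_soon_threadsafe(audio_buffer.append, indata[:, 0].copy())

    async def process_audio_buffer_periodically() -> None:
        while True:
            await asyncio.sleep(0.5)  # the 0.5 s cadence described above
            async with buffer_lock:
                if not audio_buffer:
                    continue
                samples = np.concatenate(audio_buffer)
                audio_buffer.clear()
            text = asr_engine.transcribe_np(samples)  # signature assumed
            if text:
                on_text(text)

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                        dtype="float32", callback=audio_callback):
        await process_audio_buffer_periodically()
```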
Once text is available, keyword handling takes over:

- Callback: The transcribed text is passed to the `asr_callback` function.
- Keyword Search: The function converts the text to lowercase and iterates through the `expression_map` to see if any keyword is present.
- Spam Prevention: If a keyword is found, the system checks whether the expression is on cooldown. An expression is put on a 60-second cooldown if it's triggered twice in a row.
- Trigger: If the expression is not on cooldown, `vts_client.trigger_expression` is called with the `hotkeyID`. This sends the final request to the VTube Studio API to activate the expression. See the sketch after this list.
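One plausible reading of the cooldown rule (the second consecutive trigger still fires but starts the 60-second cooldown) is sketched below; the bookkeeping details are assumptions, not the project's exact logic:

```python
# Hypothetical asr_callback with keyword lookup and two-strikes cooldown.
import time

COOLDOWN_SECONDS = 60.0
cooldown_until: dict[str, float] = {}  # hotkeyID -> monotonic expiry time
previous_hotkey = None                 # last hotkeyID that was triggered

async def asr_callback(text, expression_map, vts_client):
    global previous_hotkey
    lowered = text.lower()
    for keyword, hotkey_id in expression_map.items():
        if keyword not in lowered:
            continue
        now = time.monotonic()
        if cooldown_until.get(hotkey_id, 0.0) > now:
            continue  # still on cooldown; ignore this match
        if hotkey_id == previous_hotkey:
            # Triggered twice in a row: start the 60 s cooldown.
            cooldown_until[hotkey_id] = now + COOLDOWN_SECONDS
        previous_hotkey = hotkey_id
        await vts_client.trigger_expression(hotkey_id)
        break  # assume at most one expression per transcription
```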
```
VTS_Voice_Controller/
├── vts_main.py          # Main application entry point, orchestrates all components.
├── vts_client.py        # Client for all VTube Studio API interactions.
├── vts_config.yaml      # Configuration file for VTS settings and expression keywords.
├── voice_engine/
│   ├── recognizer.py    # Wrapper for the sherpa-onnx ASR engine.
│   └── utils.py         # Utilities for downloading and managing ASR models.
├── models/              # Directory where ASR models are stored (created automatically).
├── requirements.txt     # Python dependencies.
├── run.bat / run.sh     # Convenience scripts for running the application.
└── README.md            # This file.
```
1. Clone the Repository

   ```
   git clone <your-repository-url>
   cd VTS_Voice_Controller
   ```

2. Create a Python Virtual Environment

   ```
   python -m venv .venv
   ```

3. Activate the Environment

   - On Windows: `.\.venv\Scripts\activate`
   - On macOS/Linux: `source .venv/bin/activate`

4. Install Dependencies

   ```
   pip install -r requirements.txt
   ```

   (Note: for GPU support, you may need to manually install a specific version of `onnxruntime-gpu`.)
This project includes convenience scripts to run the application.

- On Windows (Command Prompt/PowerShell): `.\run.bat`
- On macOS/Linux (or Git Bash on Windows): `./run.sh` (you may need to make the `.sh` script executable first: `chmod +x run.sh`)
Alternatively, you can run the program manually after activating the virtual environment:
```
python vts_main.py
```

On the first run, you will need to allow the plugin's authentication request inside VTube Studio. The ASR model will also be downloaded, which may take a few minutes.