VTS Voice Controller

A Python application that controls your VTube Studio model's expressions using real-time voice recognition. It listens to your microphone, transcribes your speech, and triggers facial expressions in VTube Studio when specific keywords are detected.

This document provides a detailed overview of the project's architecture, data flow, and technologies, intended to help developers understand the system's inner workings.

Features

  • Real-time Voice Recognition: Listens to your microphone and transcribes speech locally.
  • Keyword-Based Expression Triggering: Triggers expressions in VTube Studio when specific keywords are detected.
  • Automatic Expression Sync: On startup, automatically discovers expressions from your VTube Studio model and updates the configuration file.
  • Flexible Keyword System: Supports both custom keywords from the config file and the expression names from VTube Studio.
  • Spam Prevention: Applies a cooldown when the same expression is triggered twice in a row, so a repeated keyword can't spam your model.
  • GPU Acceleration: Can leverage a CUDA-enabled GPU for faster transcription if onnxruntime-gpu is installed.

Architecture & Core Technologies

This project uses an asynchronous, event-driven architecture. The main components and the technologies they use are outlined below.

  • VTS Communication (pyvts; vts_client.py): Handles all communication with the VTube Studio API, including connection, authentication, and hotkey requests. Uses an asyncio.Lock to prevent API request race conditions.
  • Voice Recognition / ASR (sherpa-onnx; voice_engine/recognizer.py): Wraps the sherpa-onnx real-time speech-to-text engine; processes audio data and returns transcribed text.
  • Audio Input (sounddevice; vts_main.py): Captures live audio from the default microphone into a buffer for processing.
  • Configuration (pyyaml; vts_config.yaml, vts_main.py): Manages application settings, including VTS connection details and keyword-to-expression mappings.
  • Asynchronous Execution (asyncio; vts_main.py, vts_client.py): The foundation of the application; manages concurrent operations like audio processing and API calls without blocking.
  • Logging (loguru; vts_main.py): Provides robust, configurable logging to a file (vts_controller.log).
  • Model Management (requests, tqdm; voice_engine/utils.py): Downloads and extracts the required sherpa-onnx ASR models on the first run.
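
For orientation, vts_config.yaml might look roughly like the sketch below. The key names and the direction of the keyword mapping are assumptions for illustration; the application generates and maintains the real file.

    # Illustrative shape only; the app creates and updates the real file.
    vts:
      host: localhost
      port: 8001                      # VTube Studio API's default port
      token_file: vts_token.txt
    expressions:
      smile: MySmileExpression        # keyword -> VTS expression name (assumed direction)
      NEW_KEYWORD_Wink: Wink          # placeholder keyword added by auto-sync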

Detailed Application Flow

The application follows a clear, sequential flow from initialization to real-time operation.

1. Initialization (vts_main.py)

  • The main function in vts_main.py is the entry point.
  • It loads settings from vts_config.yaml. If the file doesn't exist, a default one is created.
  • loguru is configured to write logs to vts_controller.log.
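
A minimal sketch of this startup logic, assuming pyyaml and loguru and the illustrative config shape shown earlier:

    import os
    import yaml
    from loguru import logger

    CONFIG_PATH = "vts_config.yaml"
    DEFAULT_CONFIG = {  # assumed default structure, as illustrated above
        "vts": {"host": "localhost", "port": 8001, "token_file": "vts_token.txt"},
        "expressions": {},
    }

    def load_config(path: str = CONFIG_PATH) -> dict:
        """Load settings, writing a default config file if none exists."""
        if not os.path.exists(path):
            with open(path, "w", encoding="utf-8") as f:
                yaml.safe_dump(DEFAULT_CONFIG, f, sort_keys=False)
            return dict(DEFAULT_CONFIG)
        with open(path, "r", encoding="utf-8") as f:
            return yaml.safe_load(f)

    logger.add("vts_controller.log", level="INFO")  # log sink named in this README
    config = load_config()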

2. ASR Engine Setup (vts_main.py -> voice_engine/)

  • The application checks if the sherpa-onnx model is present in the models/ directory.
  • If not, voice_engine.utils.ensure_model_downloaded_and_extracted downloads and extracts the model files.
  • An instance of VoiceRecognition is created from voice_engine/recognizer.py. This class loads the ONNX model and prepares the sherpa_onnx.OnlineRecognizer for transcription.
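
In outline, this step looks something like the following (the VoiceRecognition constructor arguments and the exact signature of the download helper are assumptions):

    from pathlib import Path

    from voice_engine.recognizer import VoiceRecognition
    from voice_engine.utils import ensure_model_downloaded_and_extracted

    MODEL_DIR = Path("models")

    # First run: fetch and unpack the sherpa-onnx model (argument assumed).
    if not MODEL_DIR.exists() or not any(MODEL_DIR.iterdir()):
        ensure_model_downloaded_and_extracted(MODEL_DIR)

    # Wraps sherpa_onnx.OnlineRecognizer; keyword argument assumed.
    asr_engine = VoiceRecognition(model_dir=MODEL_DIR)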

3. VTube Studio Connection (vts_main.py -> vts_client.py)

  • An instance of VTSClient is created.
  • It connects to the VTube Studio API via WebSocket and performs authentication. The authentication token is stored in vts_token.txt by default.
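
This follows the usual pyvts connect-then-authenticate pattern; a minimal sketch (the plugin metadata is illustrative):

    import asyncio
    import pyvts

    plugin_info = {
        "plugin_name": "VTS Voice Controller",   # illustrative metadata
        "developer": "Jimmynycu",
        "authentication_token_path": "./vts_token.txt",
    }

    async def connect_to_vts() -> pyvts.vts:
        vts = pyvts.vts(plugin_info=plugin_info)
        await vts.connect()
        # The first run pops an approval dialog inside VTube Studio;
        # the granted token is persisted to vts_token.txt.
        await vts.request_authenticate_token()
        await vts.request_authenticate()
        return vts

    vts = asyncio.run(connect_to_vts())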

4. Expression Synchronization (vts_main.py)

  • The application sends a request to VTS to get a list of all available hotkeys for the current model.
  • It filters this list to find hotkeys of type ToggleExpression.
  • It then compares this list with the expressions in vts_config.yaml.
    • If an expression from VTS is not in the config file, it's added with a placeholder keyword (e.g., NEW_KEYWORD_MyExpression).
    • The vts_config.yaml file is updated on disk.
  • Finally, an in-memory expression_map is built. This dictionary maps both the user-defined keywords and the VTS expression names to their corresponding hotkeyID. This allows for flexible keyword detection.
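
A sketch of this sync step, run inside an async context. The response shape follows the VTube Studio HotkeysInCurrentModelRequest; the pyvts helper requestHotKeyList comes from pyvts's examples and should be treated as an assumption:

    # Assumes `vts` is an authenticated pyvts.vts instance and `config`
    # is the loaded YAML dict with a keyword -> expression-name mapping.
    response = await vts.request(vts.vts_request.requestHotKeyList())
    hotkeys = response["data"]["availableHotkeys"]

    expression_map: dict[str, str] = {}
    for hk in hotkeys:
        if hk["type"] != "ToggleExpression":
            continue
        name, hotkey_id = hk["name"], hk["hotkeyID"]
        if name not in config["expressions"].values():
            # Unknown expression: add a placeholder keyword for the user to edit.
            config["expressions"][f"NEW_KEYWORD_{name}"] = name
        # Map the VTS expression name and every matching user keyword to the ID.
        expression_map[name.lower()] = hotkey_id
        for keyword, expr_name in config["expressions"].items():
            if expr_name == name:
                expression_map[keyword.lower()] = hotkey_id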

5. Audio Processing and Transcription (vts_main.py)

This is the core real-time loop of the application.

  1. Audio Capture: An sd.InputStream from the sounddevice library is opened. It continuously captures audio from the microphone in small chunks.
  2. Buffering: A callback function (audio_callback) is triggered for each audio chunk. This function appends the incoming audio data (a NumPy array) to a global audio_buffer. Access to this buffer is managed by an asyncio.Lock to prevent race conditions.
  3. Periodic Processing: A separate asynchronous task, process_audio_buffer_periodically, runs every 0.5 seconds.
  4. Transcription: This task takes the current audio_buffer, sends it to the asr_engine.transcribe_np method, and then clears the buffer. The transcribe_np method in voice_engine/recognizer.py feeds the audio into the sherpa-onnx engine, which returns the transcribed text if speech is detected.
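
A simplified sketch of this loop. The sample rate is an assumption (sherpa-onnx streaming models commonly expect 16 kHz), and the callback here relies on list.append being atomic under the GIL rather than reproducing the project's exact locking:

    import asyncio
    import numpy as np
    import sounddevice as sd

    SAMPLE_RATE = 16000          # assumed model sample rate
    audio_buffer: list[np.ndarray] = []
    buffer_lock = asyncio.Lock()

    def audio_callback(indata, frames, time, status):
        # Runs on the audio thread; copy the chunk out of the driver buffer.
        audio_buffer.append(indata[:, 0].copy())

    async def process_audio_buffer_periodically(asr_engine, on_text):
        while True:
            await asyncio.sleep(0.5)
            async with buffer_lock:
                if not audio_buffer:
                    continue
                samples = np.concatenate(audio_buffer)
                audio_buffer.clear()
            text = asr_engine.transcribe_np(samples)  # project method; args assumed
            if text:
                await on_text(text)

    async def run(asr_engine, on_text):
        with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                            dtype="float32", callback=audio_callback):
            await process_audio_buffer_periodically(asr_engine, on_text)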

6. Keyword Matching & Expression Triggering (vts_main.py)

  1. Callback: The transcribed text is passed to the asr_callback function.
  2. Keyword Search: The function converts the text to lowercase and iterates through the expression_map to see if any keyword is present.
  3. Spam Prevention: If a keyword is found, the system checks if the expression is on cooldown. An expression is put on a 60-second cooldown if it's triggered twice in a row.
  4. Trigger: If the expression is not on cooldown, vts_client.trigger_expression is called with the hotkeyID. This sends the final request to the VTube Studio API to activate the expression.
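
Putting the matching and cooldown rules together, roughly (the project's exact cooldown bookkeeping may differ):

    import time

    COOLDOWN_SECONDS = 60
    last_triggered: str | None = None        # hotkeyID of the previous trigger
    cooldown_until: dict[str, float] = {}    # hotkeyID -> monotonic deadline

    async def asr_callback(text: str, vts_client, expression_map: dict[str, str]):
        global last_triggered
        lowered = text.lower()
        for keyword, hotkey_id in expression_map.items():
            if keyword not in lowered:
                continue
            now = time.monotonic()
            if cooldown_until.get(hotkey_id, 0.0) > now:
                return  # expression is cooling down; ignore this match
            if hotkey_id == last_triggered:
                # Second consecutive trigger: start the 60-second cooldown.
                cooldown_until[hotkey_id] = now + COOLDOWN_SECONDS
            last_triggered = hotkey_id
            await vts_client.trigger_expression(hotkey_id)
            return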

Project Structure

VTS_Voice_Controller/
├── vts_main.py             # Main application entry point, orchestrates all components.
├── vts_client.py           # Client for all VTube Studio API interactions.
├── vts_config.yaml         # Configuration file for VTS settings and expression keywords.
├── voice_engine/
│   ├── recognizer.py       # Wrapper for the sherpa-onnx ASR engine.
│   └── utils.py            # Utilities for downloading and managing ASR models.
├── models/                 # Directory where ASR models are stored (created automatically).
├── requirements.txt        # Python dependencies.
├── run.bat / run.sh        # Convenience scripts for running the application.
└── README.md               # This file.

Setup and Installation

  1. Clone the Repository

    git clone <your-repository-url>
    cd VTS_Voice_Controller
  2. Create a Python Virtual Environment

    python -m venv .venv
  3. Activate the Environment

    • On Windows: .\.venv\Scripts\activate
    • On macOS/Linux: source .venv/bin/activate
  4. Install Dependencies

    pip install -r requirements.txt

    (Note: For GPU support, you may need to manually install a specific version of onnxruntime-gpu.)

Running the Application

This project includes convenient scripts to run the application.

  • On Windows (Command Prompt/PowerShell):
    .\run.bat
  • On macOS/Linux (or Git Bash on Windows):
    ./run.sh
    (You may need to make the .sh script executable first: chmod +x run.sh)

Alternatively, you can run the program manually after activating the virtual environment:

python vts_main.py

On the first run, you will need to allow the plugin's authentication request inside VTube Studio. The ASR model will also be downloaded, which may take a few minutes.
