VRAM-Relay is a high-performance infrastructure that virtualizes GPU memory (VRAM), offloading it from mobile devices to remote compute servers.
Built for the era of local-first AI, it enables Android devices to run heavyweight models (LLMs, Stable Diffusion, vision models) by transparently leveraging the raw GPU power of a nearby PC or server.
The server node is responsible for GPU memory management and AI execution.

- **VRAM Manager**: Dynamically allocates and releases GPU buffers based on client sessions and model requirements.
- **Socket Engine**: An asynchronous, high-throughput TCP server capable of handling multiple concurrent inference streams.
- **Discovery Provider**: A lightweight UDP broadcast service that lets mobile devices on the local network detect relay nodes instantly (see the sketch after this list).
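For a feel of how discovery works on the server side, here is a minimal sketch of a responder loop. It assumes a probe/reply exchange on UDP port 8766; the message strings are illustrative placeholders, not the actual wire format.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.nio.charset.StandardCharsets;

// Minimal discovery-responder sketch. Port 8766 matches the Docker
// mapping below; the probe/reply strings are illustrative placeholders.
public class DiscoveryResponder {
    public static void main(String[] args) throws Exception {
        try (DatagramSocket socket = new DatagramSocket(8766)) {
            byte[] buf = new byte[256];
            while (true) {
                DatagramPacket probe = new DatagramPacket(buf, buf.length);
                socket.receive(probe); // block until a client probe arrives
                String msg = new String(probe.getData(), 0, probe.getLength(),
                        StandardCharsets.UTF_8);
                if (!msg.startsWith("VRAM_RELAY_PROBE")) continue;
                byte[] reply = "VRAM_RELAY_HERE:8765".getBytes(StandardCharsets.UTF_8);
                socket.send(new DatagramPacket(reply, reply.length,
                        probe.getAddress(), probe.getPort()));
            }
        }
    }
}
```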
The Android client is optimized for minimal latency and seamless integration.

- **Native Bridge**: An ultra-fast C/C++ (JNI) layer that reduces serialization and network overhead.
- **Discovery Service**: A background Android Service that scans the local Wi-Fi network for available relay nodes (sketched below).
- **Transparent API**: A simplified Java/Kotlin interface for loading models, sending prompts, and retrieving tensors or text outputs.
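The client side of that exchange, conceptually what the Discovery Service does under the hood, broadcasts a probe and collects replies until a timeout expires. The port, the message strings, and the probe/reply shape are assumptions carried over from the responder sketch above.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Client-side scan sketch: broadcast one probe, then gather replies
// until the window closes. Details mirror the (assumed) responder above.
public class DiscoveryScan {
    public static List<InetAddress> scan(int timeoutMs) throws Exception {
        List<InetAddress> nodes = new ArrayList<>();
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setBroadcast(true);
            socket.setSoTimeout(timeoutMs);
            byte[] probe = "VRAM_RELAY_PROBE".getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(probe, probe.length,
                    InetAddress.getByName("255.255.255.255"), 8766));
            byte[] buf = new byte[256];
            while (true) {
                try {
                    DatagramPacket reply = new DatagramPacket(buf, buf.length);
                    socket.receive(reply); // each reply identifies one relay node
                    nodes.add(reply.getAddress());
                } catch (SocketTimeoutException done) {
                    return nodes; // scan window closed
                }
            }
        }
    }
}
```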
Client and server communicate over a compact custom wire protocol.

- **Pure Binary Protocol**: No JSON, no HTTP; minimal overhead with deterministic performance.
- **Fixed 16-byte Header**: Includes packet type, payload size, sequencing, and a CRC checksum (a sketch of one possible layout follows this list).
- **Network Optimization**: Optional LZ4 compression for large tensor transfers and model weights.
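As a rough illustration, the sketch below packs the header as four 4-byte fields. The actual field order, widths, and endianness are not documented here, so treat this layout as an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;

// Hypothetical layout: type (4 B) | payload size (4 B) | sequence (4 B) | CRC-32 (4 B).
// The real field order, widths, and endianness may differ.
final class RelayHeader {
    static final int SIZE = 16;

    static byte[] encode(int type, int sequence, byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);                  // checksum the payload
        return ByteBuffer.allocate(SIZE)
                .order(ByteOrder.BIG_ENDIAN)  // network byte order
                .putInt(type)                 // packet type
                .putInt(payload.length)       // payload size in bytes
                .putInt(sequence)             // per-connection sequence number
                .putInt((int) crc.getValue()) // CRC-32 of the payload
                .array();
    }
}
```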
- **OS**: Linux (Ubuntu 22.04+ recommended)
- **Hardware**: NVIDIA GPU (Pascal architecture or newer)
- **Software**: Docker & NVIDIA Container Toolkit
```bash
# Clone the repository
git clone https://github.com/your-repo/vram-relay.git
cd vram-relay

# Build the optimized Docker image
sudo docker build -t vram-relay-node -f docker/Dockerfile .

# Run the relay node
# Ports:
#   - 8765/tcp: data & inference
#   - 8766/udp: service discovery
sudo docker run --gpus all \
  -p 8765:8765/tcp \
  -p 8766:8766/udp \
  --restart unless-stopped \
  --name vram-relay \
  vram-relay-node
```

- Open the `android-client/` directory in Android Studio.
- Ensure the network permissions are present in `AndroidManifest.xml` (already included).
- Build the project:

```bash
./gradlew assembleDebug
```

The protocol is optimized for near real-time responsiveness.
| Operation | Latency | Effective Throughput |
|---|---|---|
| UDP Discovery | < 100 ms | N/A |
| RTT (Ping/Pong) | 2–8 ms | < 1 KB payload |
| Model Loading | 200–1200 ms | Up to 120 MB/s |
| LLM Inference | 50–250 ms | GPU-dependent |
Typical client usage:

```java
VRAMRelayClient client = new VRAMRelayClient(context);
client.initialize();

// 1. Automatic server discovery
client.discoverServers(3000, servers -> {
    if (!servers.isEmpty()) {
        // 2. Connect to the first available GPU node
        client.connect(servers.get(0));

        // 3. Load a model into remote VRAM
        client.loadModel("llama-3-8b-q4", true);
    }
});
```
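Prompt submission is not shown above. Continuing the snippet, it might look like the fragment below, where the `generate` method and its callback shape are hypothetical placeholders rather than the confirmed API:

```java
// Hypothetical continuation: `generate` and its callback shape are
// placeholders for the prompt-sending part of the Transparent API.
client.generate("Summarize this article in two sentences.", output -> {
    System.out.println("Remote GPU replied: " + output);
});
```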
⚠️ **Important**

- **Local Network Only**: The protocol is unencrypted by default to maximize throughput. For access over the Internet, always use a secure VPN tunnel (WireGuard or Tailscale).
- **Process Isolation**: Docker confines the relay process and limits access to the host file system.
- **Timeout Handling**: Idle connections automatically release allocated VRAM after 5 minutes of inactivity (a conceptual sketch follows this list).
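Conceptually, the idle timeout is a periodic sweep over per-session last-activity timestamps, for example as below. The session bookkeeping and the `freeVram` hook are hypothetical, not the relay's actual internals.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative idle sweep: every 30 s, drop sessions quiet for 5+ minutes.
// The session map and freeVram() hook are hypothetical.
class IdleReaper {
    static final long IDLE_LIMIT_MS = 5 * 60 * 1000;
    final Map<String, Long> lastActivity = new ConcurrentHashMap<>();

    void start(ScheduledExecutorService scheduler) {
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            lastActivity.entrySet().removeIf(e -> {
                if (now - e.getValue() < IDLE_LIMIT_MS) return false;
                freeVram(e.getKey()); // release the session's GPU buffers
                return true;
            });
        }, 30, 30, TimeUnit.SECONDS);
    }

    void freeVram(String sessionId) { /* hypothetical VRAM release hook */ }
}
```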
- Multi-GPU support (NVLink / SLI)
- Token-by-token streaming inference
- Adaptive compression based on Wi-Fi signal quality
- Native plugins for PyTorch and Hugging Face
This project is distributed under the MIT License.
Contributions are welcome via Pull Requests. Bug reports, performance improvements, and protocol extensions are highly encouraged.
VRAM-Relay aims to break the VRAM barrier and bring truly powerful AI inference to mobile and constrained devices — without compromise.