🚀 VRAM-Relay Node

VRAM-Relay is a high-performance infrastructure layer that virtualizes remote GPU memory (VRAM) and offloads AI workloads from mobile devices to remote compute servers.

Built for the era of local-first AI, it enables Android devices to run heavyweight models (LLMs, Stable Diffusion, vision models) by transparently leveraging the raw GPU power of a nearby PC or server.


🏗️ System Architecture

1. Relay Server (Linux / CUDA)

The server node is responsible for GPU memory management and AI execution.

  • VRAM Manager: dynamically allocates and releases GPU buffers based on client sessions and model requirements.

  • Socket Engine: an asynchronous, high-throughput TCP server capable of handling multiple concurrent inference streams.

  • Discovery Provider: a lightweight UDP broadcast service allowing instant detection of relay nodes by mobile devices on the local network (see the sketch after this list).
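
As an illustration of how the Discovery Provider could answer client broadcasts, here is a minimal UDP responder sketch in Java. The discovery port (8766) comes from the deployment section below; the request marker VRAM_DISCOVER and the reply payload are hypothetical placeholders, not the project's actual wire format.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.nio.charset.StandardCharsets;

public class DiscoveryResponder {
    public static void main(String[] args) throws Exception {
        // Listen for discovery broadcasts on the UDP port exposed by the relay node (8766).
        try (DatagramSocket socket = new DatagramSocket(8766)) {
            byte[] buffer = new byte[512];
            while (true) {
                DatagramPacket request = new DatagramPacket(buffer, buffer.length);
                socket.receive(request);

                String message = new String(request.getData(), 0, request.getLength(), StandardCharsets.UTF_8);
                if (!message.startsWith("VRAM_DISCOVER")) {
                    continue; // Ignore unrelated broadcast traffic.
                }

                // Reply directly to the sender with the TCP port clients should connect to.
                byte[] reply = "VRAM_NODE 8765".getBytes(StandardCharsets.UTF_8);
                socket.send(new DatagramPacket(reply, reply.length,
                        request.getAddress(), request.getPort()));
            }
        }
    }
}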


2. Android Client (JNI / C++ / Java)

The Android client is optimized for minimal latency and seamless integration.

  • Native Bridge: an ultra-fast C/C++ layer (JNI) that reduces serialization and network overhead (see the sketch after this list).

  • Discovery Service: a background Android Service that scans the local Wi-Fi network for available relay nodes.

  • Transparent API: a simplified Java/Kotlin interface for loading models, sending prompts, and retrieving tensors or text outputs.
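
To show how a JNI bridge like this is typically wired up, here is a minimal sketch. The class, the native method names, and the library name vram_relay_jni are hypothetical placeholders for whatever the actual native layer exposes.

public final class NativeBridge {
    static {
        // Load the compiled C/C++ relay library bundled with the APK.
        // "vram_relay_jni" is a placeholder for the project's real .so name.
        System.loadLibrary("vram_relay_jni");
    }

    // Open a binary VRAM-Link connection to a relay node; returns an opaque native handle.
    public static native long nativeConnect(String host, int port);

    // Send one request frame and block until the response payload arrives.
    public static native byte[] nativeSendFrame(long handle, byte[] frame);

    // Close the connection and free all native resources tied to the handle.
    public static native void nativeClose(long handle);

    private NativeBridge() {}
}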


3. "VRAM-Link" Protocol

  • Pure Binary Protocol: no JSON, no HTTP; minimal overhead with deterministic performance.

  • Fixed 16-byte Header: includes packet type, payload size, sequencing, and a CRC checksum (see the sketch after this list).

  • Network Optimization: optional LZ4 compression for large tensor transfers and model weights.
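
For illustration, here is a minimal sketch of how a fixed 16-byte header along these lines could be packed in Java. Only the field list (packet type, payload size, sequencing, CRC) is taken from the description above; the exact field order, widths, byte order, and what the CRC covers are assumptions.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;

public final class VramLinkHeader {
    // Assumed layout (16 bytes total):
    //   0..1   packet type        (u16)
    //   2..5   payload size       (u32)
    //   6..9   sequence id        (u32)
    //  10..13  CRC32 of payload   (u32)
    //  14..15  reserved           (u16)
    public static byte[] encode(int packetType, int sequenceId, byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);

        ByteBuffer header = ByteBuffer.allocate(16).order(ByteOrder.BIG_ENDIAN);
        header.putShort((short) packetType);
        header.putInt(payload.length);
        header.putInt(sequenceId);
        header.putInt((int) crc.getValue());
        header.putShort((short) 0); // reserved
        return header.array();
    }
}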


🛠️ Installation & Deployment

Server Requirements

  • OS: Linux (Ubuntu 22.04+ recommended)
  • Hardware: NVIDIA GPU (Pascal architecture or newer)
  • Software: Docker & NVIDIA Container Toolkit

1. Docker Deployment

# Clone the repository
git clone https://github.com/your-repo/vram-relay.git
cd vram-relay

# Build the optimized Docker image
sudo docker build -t vram-relay-node -f docker/Dockerfile .

# Run the relay node
# Ports:
#  - 8765 TCP: data & inference
#  - 8766 UDP: service discovery
sudo docker run --gpus all \
  -p 8765:8765/tcp \
  -p 8766:8766/udp \
  --restart unless-stopped \
  --name vram-relay \
  vram-relay-node

2. Android Client Setup

  1. Open the android-client/ directory in Android Studio.
  2. Ensure network permissions are present in AndroidManifest.xml (already included).
  3. Build the project:
./gradlew assembleDebug

📊 Performance & Latency (Wi-Fi 6 / LAN)

The protocol is optimized for near real-time responsiveness.

Operation          | Latency        | Effective Throughput
UDP Discovery      | < 100 ms       | N/A
RTT (Ping / Pong)  | 2 – 8 ms       | < 1 KB
Model Loading      | 200 – 1200 ms  | Up to 120 MB/s
LLM Inference      | 50 – 250 ms    | GPU-dependent

💻 Integration Example (Java)

VRAMRelayClient client = new VRAMRelayClient(context);
client.initialize();

// 1. Automatic server discovery
client.discoverServers(3000, servers -> {
    if (!servers.isEmpty()) {
        // 2. Connect to the first available GPU node
        client.connect(servers.get(0));

        // 3. Load a model into remote VRAM
        client.loadModel("llama-3-8b-q4", true);
    }
});
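
Building on the example above, a prompt round-trip might look like the following. The sendPrompt and disconnect calls shown here are assumptions about the Transparent API, not its confirmed surface; adapt them to whatever the client class actually exposes.

// Hypothetical continuation: send a prompt and consume the remote text output.
client.sendPrompt("Summarize the VRAM-Link protocol in one sentence.", response -> {
    Log.d("VRAMRelay", "Remote inference result: " + response);
});

// Release the remote session (and its VRAM) when you are done.
client.disconnect();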

🛡️ Security & Best Practices

⚠️ Important

  • Local Network Only: the protocol is unencrypted by default to maximize throughput. For Internet access, always use a secure VPN tunnel (WireGuard or Tailscale).

  • Process Isolation: Docker confines the relay process and limits access to the host file system.

  • Timeout Handling: idle connections automatically release allocated VRAM after 5 minutes of inactivity (see the sketch after this list).
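
As an illustration of that idle-timeout behavior (not the project's actual implementation), a server-side session reaper could look like the sketch below; the session registry and the releaseVram placeholder are hypothetical.

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class SessionReaper {
    private static final Duration IDLE_LIMIT = Duration.ofMinutes(5);

    // Hypothetical session registry: session id -> timestamp of last activity.
    private final Map<String, Instant> lastActivity = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // Every 30 seconds, drop sessions idle for more than 5 minutes so their VRAM is freed.
        scheduler.scheduleAtFixedRate(this::reap, 30, 30, TimeUnit.SECONDS);
    }

    public void touch(String sessionId) {
        lastActivity.put(sessionId, Instant.now());
    }

    private void reap() {
        Instant cutoff = Instant.now().minus(IDLE_LIMIT);
        lastActivity.entrySet().removeIf(entry -> {
            if (entry.getValue().isBefore(cutoff)) {
                releaseVram(entry.getKey());
                return true;
            }
            return false;
        });
    }

    private void releaseVram(String sessionId) {
        // Placeholder: the real server would free the GPU buffers owned by this session here.
        System.out.println("Releasing VRAM for idle session " + sessionId);
    }
}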


🗺️ Roadmap

  • Multi-GPU support (NVLink / SLI)
  • Token-by-token streaming inference
  • Adaptive compression based on Wi-Fi signal quality
  • Native plugins for PyTorch and Hugging Face

📄 License & Contributions

This project is distributed under the MIT License.

Contributions are welcome via Pull Requests. Bug reports, performance improvements, and protocol extensions are highly encouraged.


VRAM-Relay aims to break the VRAM barrier and bring truly powerful AI inference to mobile and constrained devices — without compromise.
