
πŸš€ @ruvector/ruvllm-wasm v2.0.0 β€” Browser-native LLM inference is here #242

@ruvnet

@ruvector/ruvllm-wasm v2.0.0 β€” Browser-Native AI Runtime

The first functional release of @ruvector/ruvllm-wasm is live on npm. It replaces the deprecated v0.1.0 placeholder with a fully compiled 435 KB WebAssembly binary β€” semantic routing, adaptive learning, KV cache management, and chat template formatting, all running entirely client-side.

No server. No API keys. No network latency.

Full documentation: crates/ruvllm-wasm/README.md


Install

```bash
npm install @ruvector/ruvllm-wasm
```

Quick start

```js
import init, { ChatTemplateWasm, ChatMessageWasm, HnswRouterWasm, healthCheck } from '@ruvector/ruvllm-wasm';

await init();
console.log(healthCheck()); // true

// Format conversations for any model family
const template = ChatTemplateWasm.detectFromModelId("meta-llama/Llama-3-8B");
const prompt = template.format([
  ChatMessageWasm.system("You are a helpful assistant."),
  ChatMessageWasm.user("What can you do?"),
]);

// Route queries to the best agent in <1ms using HNSW semantic search.
// `embedding`, `embedding2`, and `queryEmbedding` are 384-dim Float32Arrays
// produced by your embedding model of choice.
const router = new HnswRouterWasm(384, 1000);
router.addPattern(embedding, "code-agent", "handles code generation");
router.addPattern(embedding2, "research-agent", "handles research queries");
const result = router.route(queryEmbedding);
console.log(result.name, result.score); // "code-agent" 0.95

// Persist and restore router state
const snapshot = router.toJson();
const restored = HnswRouterWasm.fromJson(snapshot);
```
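For intuition, route() returns the stored pattern whose embedding is most similar to the query under cosine similarity. Below is a self-contained sketch of that scoring as a plain linear scan β€” the very thing the HNSW index exists to avoid β€” with hypothetical agent names; it is a conceptual illustration, not the library's implementation:

```typescript
// Conceptual sketch of cosine-similarity routing (NOT the HNSW index itself,
// which skips the O(n) scan via a navigable small-world graph).
type Pattern = { name: string; embedding: Float32Array };

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function routeLinear(patterns: Pattern[], query: Float32Array) {
  let best = { name: "", score: -Infinity };
  for (const p of patterns) {
    const score = cosine(p.embedding, query);
    if (score > best.score) best = { name: p.name, score };
  }
  return best;
}
```

The claimed 150x speedup comes from replacing this linear scan with graph traversal, which visits only a logarithmic fraction of stored patterns.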

What ships in this release

| Feature | Type | Description |
| --- | --- | --- |
| HNSW Semantic Router | `HnswRouterWasm` | Bidirectional graph with cosine similarity β€” 150x faster than a linear scan. Add patterns, route queries, serialize to JSON. |
| KV Cache | `KvCacheWasm` | Two-tier token cache (FP32 tail for recent tokens + u8-quantized store for history). Configurable tail length, head count, and head dimension. |
| Chat Templates | `ChatTemplateWasm` | Formats prompts for 7 model families: Llama3, Llama2, Mistral, Qwen, ChatML, Phi, Gemma. Auto-detects from the model ID. |
| MicroLoRA | `MicroLoraWasm` | Per-request LoRA adaptation (rank 1-4). Forward pass, gradient accumulation, weight updates β€” all in <1ms. Serialize/restore adapter state. |
| SONA Instant | `SonaInstantWasm` | EMA quality tracking, adaptive rank adjustment, EWC-lite weight importance, pattern buffer with cosine-similarity suggestions. |
| Memory Pool | `InferenceArenaWasm` | O(1) bump allocator for inference temporaries. Pre-size for model dimensions. Tracks the high-water mark. |
| Buffer Pool | `BufferPoolWasm` | Pre-allocated buffers in size classes (1 KB-256 KB). Configurable max per class. Pool hit-rate tracking. |
| Web Workers | `ParallelInference` | Async parallel matmul, attention, and layerNorm across multiple workers. SharedArrayBuffer detection, capability-level reporting. |
| Feature Detection | Free functions | `feature_summary()`, `optimal_worker_count()`, `supports_parallel_inference()`, `is_simd_available()`, `cross_origin_isolated()` |
| TypeScript | `.d.ts` | Complete type definitions for all 20+ exported types and free functions |
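To see why rank 1-4 adaptation fits in a sub-millisecond budget, consider the arithmetic: a LoRA delta W = B·A at rank r adds only r·(dIn + dOut) multiply-adds on top of the base matmul. A toy rank-1 forward pass (a hypothetical sketch, not the MicroLoraWasm internals):

```typescript
// Sketch of a rank-1 LoRA forward pass: y = W x + alpha * B (A x).
// Wx is the base output, precomputed by the ordinary dense layer.
function loraForward(
  Wx: Float32Array,   // base output W x (length dOut)
  A: Float32Array,    // rank-1 down-projection row (length dIn)
  B: Float32Array,    // rank-1 up-projection column (length dOut)
  x: Float32Array,    // input activations (length dIn)
  alpha: number,      // adapter scaling factor
): Float32Array {
  let ax = 0;
  for (let i = 0; i < x.length; i++) ax += A[i] * x[i];      // A x: dIn mults
  const y = new Float32Array(Wx.length);
  for (let j = 0; j < y.length; j++) {
    y[j] = Wx[j] + alpha * B[j] * ax;                        // + B(Ax): dOut mults
  }
  return y;
}
```

The adapter cost is dIn + dOut multiplies per rank regardless of the base layer's dIn·dOut weight count, which is what keeps per-request adaptation cheap.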

Package stats

| Metric | Value |
| --- | --- |
| WASM binary | 435 KB (178 KB gzipped) |
| JS glue | 128 KB |
| TypeScript defs | 45 KB |
| Exported types | 20+ |
| Browser support | Chrome 57+, Firefox 52+, Safari 11+, Edge 79+ |

Architecture

JavaScript/TypeScript App
        β”‚
        β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ ruvllm_wasm β”‚  ← 435KB compiled WASM
  β”‚   .js glue  β”‚  ← wasm-bindgen generated
  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
         β”‚
    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                             β”‚
    β–Ό                             β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Core     β”‚  β”‚ Intelligent Features β”‚
β”‚          β”‚  β”‚                      β”‚
β”‚ KvCache  β”‚  β”‚ HnswRouter (150x)    β”‚
β”‚ Arena    β”‚  β”‚ MicroLoRA (<1ms)     β”‚
β”‚ BufPool  β”‚  β”‚ SONA Instant         β”‚
β”‚ Chat TPL β”‚  β”‚ Web Workers          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
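The Web Workers box depends on runtime capability checks, surfaced through the free functions listed above. A sketch of a worker-count heuristic β€” an assumption for illustration, not necessarily `optimal_worker_count()`'s actual policy β€” is to leave one core for the main thread and fall back to a single worker when parallel inference is unavailable:

```typescript
// Hypothetical worker-count heuristic: reserve one core for the main thread;
// degrade to 1 worker when parallel inference isn't supported (e.g. no
// cross-origin-isolated SharedArrayBuffer).
function pickWorkerCount(
  hardwareConcurrency: number | undefined,
  parallelOk: boolean,
): number {
  if (!parallelOk) return 1;
  return Math.max(1, (hardwareConcurrency ?? 4) - 1); // assume 4 cores if unknown
}
```

In the browser this would be driven by `navigator.hardwareConcurrency` and the package's `supports_parallel_inference()` check.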

Build from source

```bash
# Requires: rustup target add wasm32-unknown-unknown && cargo install wasm-pack

# Release build (workaround for Rust 1.91 codegen bug)
CARGO_PROFILE_RELEASE_CODEGEN_UNITS=256 CARGO_PROFILE_RELEASE_LTO=off \
  wasm-pack build crates/ruvllm-wasm --target web --scope ruvector --release

# Dev build (no workaround needed)
wasm-pack build crates/ruvllm-wasm --target web --scope ruvector --dev
```

See ADR-084 for build details and known limitations.
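Note that SharedArrayBuffer (used by the ParallelInference worker path) is only exposed when the page is cross-origin isolated, which requires two response headers from whatever serves the build output. A minimal sketch using Node's built-in `node:http` module β€” the port and file handling are placeholders, not part of the package:

```typescript
import * as http from "node:http";

// COOP/COEP headers required for crossOriginIsolated === true, which gates
// SharedArrayBuffer availability in the browser.
const isolationHeaders: Record<string, string> = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

const server = http.createServer((req, res) => {
  for (const [k, v] of Object.entries(isolationHeaders)) res.setHeader(k, v);
  if (req.url?.endsWith(".wasm")) {
    // WebAssembly.instantiateStreaming requires this MIME type.
    res.setHeader("Content-Type", "application/wasm");
  }
  res.end("placeholder: stream the requested file from pkg/ here");
});
// server.listen(8080);
```

Without these headers, `cross_origin_isolated()` reports false and the runtime falls back to single-threaded execution.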

What's next

  • IntelligentLLM β€” Combined router + LoRA + SONA in one unified API
  • WebGPU attention β€” GPU-accelerated attention (matmul GPU path already works)
  • GGUF streaming loader β€” Load quantized models directly in the browser
  • Worker completion signals β€” Replace setTimeout polling with message-based coordination

Related packages

| Package | Version | Description |
| --- | --- | --- |
| @ruvector/ruvllm | 2.5.2 | Node.js LLM orchestration with SONA learning |
| ruvector | 0.2.11 | Full CLI with 48 commands + 91 MCP tools |
| @ruvector/rvf | β€” | Cognitive container runtime |

PR: #241 | ADR: ADR-084 | Docs: crates/ruvllm-wasm/README.md
