Description
I've tried to run two Ralph sessions overnight and they've both been killed by out of memory by my VPS.
Have you had this experience?
Ralph was killed after using 10GB of memory. The OpenCode server had about 2.4GB.
10GB memory for an app that's basically a loop on another app seems excessive.
I've recently set up this VPS, so it's possible I'm doing something wrong. I'm considering a beefier machine with 32 GB next time, but that feels unnecessary given that my Mac is also ARM and works fine with 16 GB. I also don't have a swap file on the VPS, which I'm going to add, though swap files don't really solve memory leaks.
I got Claude to look at both the server and the source code, interrogated it a bit about its reasoning, and it thinks there might be a memory leak. At this point I'm planning to experiment with a plain bash loop, so I'll leave you with Claude's results below. I hope this isn't falling foul of your recent blog post, but I thought I'd share nonetheless.
Memory Leak Investigation: Ralph Agent Process
Incident Context
What Happened
The ralph agent process was killed by the Linux OOM (Out of Memory) killer after consuming excessive memory during a long-running session.
Key Facts:
- Process: ralph (agent process)
- Runtime: 2 hours 13 minutes before OOM kill
- Memory at death: 10.3 GB physical RAM, 102 GB virtual memory
- System: Hetzner CAX33 (16 GB RAM, ARM64 architecture, no swap)
- Pattern: Second OOM kill - previous one on Jan 7 consumed 6 GB after unknown runtime
Memory Growth Pattern
The ralph process started at normal baseline and grew from ~100 MB to 10.3 GB over approximately 2 hours. The massive discrepancy between physical (10.3 GB) and virtual (102 GB) memory suggests significant memory fragmentation and allocation churn.
System Context
- Long-running agent session working overnight
- Multiple other processes running concurrently:
- OpenCode server instances (~2.4 GB combined)
- Additional opencode instances (~1.2 GB each)
- next-server (2.2 GB)
- Agent operating in tmux session
- No swap space available, so OOM killer acted immediately when RAM exhausted
Critical Evidence: The Leak is Client-Side
The evidence clearly points to ralph's client-side memory accumulation, not the OpenCode server:
- 10.3 GB in ralph process - Memory attributed directly to ralph's process space
- OpenCode servers only 2.4 GB combined - If sessions were accumulating in the server, these processes would be much larger
- 102 GB virtual memory in ralph - Massive virtual allocation suggests ralph itself is doing heavy buffering/allocation with significant fragmentation
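As a quick cross-check (not part of the original report), the resident vs virtual split per process can be read directly from `ps`; a large gap between the two columns, like ralph's 102 GB virtual vs 10.3 GB resident, is what suggests allocation churn rather than plain data growth:

```shell
# Show the top memory consumers with both resident (RSS) and virtual (VSZ)
# sizes, in KB. Assumes procps ps, standard on most Linux distributions.
ps -eo pid,rss,vsz,comm --sort=-rss | head -n 6
```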
Root Cause Analysis
🔴 PRIMARY: SSE Event Stream Buffering with Full Tool Outputs
Location: src/loop.ts:200-268
Every SSE event from OpenCode contains the complete tool output data:
```typescript
export type ToolStateCompleted = {
  status: "completed";
  input: { [key: string]: unknown };
  output: string; // ⚠️ FULL OUTPUT (can be megabytes per tool call)
  title: string;
  attachments?: Array<FilePart>; // ⚠️ FILE ATTACHMENTS (can be huge)
  metadata: { [key: string]: unknown };
  time: { start: number; end: number; compacted?: number };
};
```

The Problem:
1. Ralph subscribes to the global event stream:

   ```typescript
   const events = await client.event.subscribe();
   ```

2. Ralph consumes events but only extracts minimal data for display:

   ```typescript
   for await (const event of events.stream) {
     const part = event.properties.part;
     // Ralph only stores toolName and title for UI display
     callbacks.onEvent({
       icon: toolName,
       text: part.state.title || JSON.stringify(part.state.input), // ⚠️ NOT using part.state.output
     });
   }
   ```

3. However: the full `event` objects remain in memory because:
   - The AsyncGenerator maintains references to yielded values
   - The stream is never explicitly closed
   - Each event contains the full tool output (file reads, grep results, web fetches)
   - Ralph subscribes to ALL events globally, not just for its session
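A minimal, self-contained sketch (not Ralph's actual code) of why explicit closure matters: an async generator's `finally` block, where buffered references would be released, only runs once the generator is closed via `return()` or fully exhausted. A consumer that stops iterating without closing leaves the generator suspended:

```typescript
let closed = false;

// Stand-in for an SDK event stream: yields events, cleans up in `finally`.
async function* eventStream(events: string[]): AsyncGenerator<string> {
  try {
    for (const e of events) yield e;
  } finally {
    closed = true; // cleanup point: drop buffered references so GC can reclaim them
  }
}

async function demo(): Promise<boolean> {
  const stream = eventStream(["tool.start", "tool.completed"]);
  await stream.next(); // consume one event, then stop iterating
  // Without this call the generator stays suspended and `finally` never runs:
  await stream.return(undefined);
  return closed;
}
```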
Memory Math:
- 50 iterations over 2 hours
- ~20 tool calls per iteration = 1,000 tool calls total
- Average tool output: 10 MB (large file reads, grep results, API responses)
- Total: 10 GB ✅
The 102 GB virtual memory comes from JavaScript string immutability causing allocation churn and fragmentation.
🟡 SECONDARY: Sessions Never Deleted
Location: src/loop.ts:189-196
Each iteration creates a new OpenCode session but never cleans it up:
```typescript
const sessionResult = await client.session.create();
const sessionId = sessionResult.data.id;
// ... iteration runs ...
// ❌ Session is never deleted
```

While the primary leak (10.3 GB) is in ralph's client-side code, this contributes to the 2.4 GB accumulated in OpenCode server processes. The SDK provides `client.session.delete()` but it's never called.
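To illustrate the intended lifecycle, here is a runnable sketch with a stand-in client (`SessionClient` and `mockClient` are illustrative, not the real OpenCode SDK); the point is the `try`/`finally`, which guarantees deletion even when an iteration throws:

```typescript
interface SessionClient {
  create(): Promise<{ data: { id: string } }>;
  delete(args: { path: { id: string } }): Promise<void>;
}

// Stand-in client so the sketch is self-contained; tracks live sessions.
function mockClient(): SessionClient & { live: Set<string> } {
  const live = new Set<string>();
  let n = 0;
  return {
    live,
    async create() {
      const id = `ses_${++n}`;
      live.add(id);
      return { data: { id } };
    },
    async delete({ path }) {
      live.delete(path.id);
    },
  };
}

async function runIteration(
  client: SessionClient,
  work: (sessionId: string) => Promise<void>,
): Promise<void> {
  const { data: { id } } = await client.create();
  try {
    await work(id);
  } finally {
    await client.delete({ path: { id } }); // always clean up, even on errors
  }
}
```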
🟢 LOW PRIORITY: Other Potential Accumulation Points
Event Array Management
Location: src/state.ts:32-54, src/index.ts:409
Events are properly limited to MAX_EVENTS = 200 via trimEventsInPlace(), preventing unbounded growth. However:
- The array is mutated in place with `splice()` and `push()`, which can cause minor fragmentation over time
- Each `ToolEvent` is a small object (~200 bytes)
Impact: Low - the 200 event limit prevents this from being significant.
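For reference, a bounded in-place trim of this shape can be sketched as follows (the function body is a guess at the described behavior, not the actual src/state.ts code; only the name and `MAX_EVENTS = 200` come from the analysis above):

```typescript
const MAX_EVENTS = 200;

interface ToolEvent {
  icon: string;
  text: string;
}

// Drop the oldest entries so at most MAX_EVENTS remain, mutating in place
// (the analysis describes splice()/push() mutation rather than reallocation).
function trimEventsInPlace(events: ToolEvent[]): void {
  if (events.length > MAX_EVENTS) {
    events.splice(0, events.length - MAX_EVENTS);
  }
}
```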
Batch State Updater
Location: src/index.ts:46-116
The pendingUpdates array accumulates state update closures, but is properly cleared on flush (line 65). The closures capture LoopState including the events array, which could cause temporary memory spikes if batches grow large, but this is flushed every 100ms.
Impact: Low - proper cleanup occurs on flush.
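The pattern described, accumulating update closures and clearing the array on flush, can be sketched like this (names are illustrative; the real src/index.ts implementation differs):

```typescript
type Update<S> = (state: S) => S;

// Collects state-update closures and applies them in one batch. Clearing
// `pending` on flush is what stops the closures (and anything they capture,
// such as the events array) from accumulating between flushes.
class BatchUpdater<S> {
  private pending: Update<S>[] = [];
  constructor(private state: S) {}

  queue(update: Update<S>): void {
    this.pending.push(update);
  }

  flush(): S {
    for (const update of this.pending) this.state = update(this.state);
    this.pending.length = 0; // release captured references
    return this.state;
  }
}
```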
Recommended Fixes
Fix #1: Close SSE Stream Properly ⭐ CRITICAL
Location: src/loop.ts after line 268 (after breaking from event loop)
```typescript
// After breaking from the event loop
if (events.stream.return) {
  await events.stream.return();
  log("loop", "Event stream closed");
}
```

Why: Explicitly closing the AsyncGenerator allows garbage collection of accumulated event objects.
Fix #2: Delete Sessions After Each Iteration
Location: src/loop.ts around line 300 (in finally block or after iteration completes)
```typescript
// Iteration cleanup
try {
  await client.session.delete({ path: { id: sessionId } });
  log("loop", "Session cleaned up", { sessionId });
} catch (error) {
  log("loop", "Failed to delete session", { sessionId, error: String(error) });
}
```

Why: Prevents session accumulation in the OpenCode server (contributes to the 2.4 GB server-side growth).
Fix #3: Add Memory Monitoring
Location: src/index.ts after loop starts (around line 378)
```typescript
import { startMemoryLogging } from "./util/log.js";

// After runLoop() is called
startMemoryLogging(30000); // Log memory every 30 seconds
```

Why: Provides visibility into memory growth patterns for debugging and verification. The `logMemory()` function already exists in src/util/log.ts:95-104 but isn't being called.
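If `startMemoryLogging` needs to be written from scratch, a minimal version on top of `process.memoryUsage()` (available in both Node and Bun) might look like this. The helper names and bodies here are assumptions; only `logMemory`'s existence in src/util/log.ts comes from the analysis above:

```typescript
// Format current process memory in MB for periodic logging.
function formatMemory(): string {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  const mb = (n: number) => (n / 1024 / 1024).toFixed(1);
  return `rss=${mb(rss)}MB heap=${mb(heapUsed)}/${mb(heapTotal)}MB`;
}

// Log memory on an interval; returns a function that stops the logging.
function startMemoryLogging(intervalMs: number): () => void {
  const timer = setInterval(() => console.log("[mem]", formatMemory()), intervalMs);
  return () => clearInterval(timer);
}
```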
Fix #4: Force Garbage Collection Between Iterations (Optional)
Location: src/loop.ts after session cleanup, before starting next iteration
```typescript
// After session deletion, before next iteration
if (global.gc) {
  global.gc();
  log("loop", "Forced GC after iteration", { iteration });
}
```

Run ralph with: `bun --expose-gc src/index.ts`
Why: Proactively releases memory between iterations rather than waiting for automatic GC.
Verification Strategy
Confirm the Fix Works
- Add memory monitoring (Fix #3) and observe memory growth every 30 seconds
- Implement the stream closure (Fix #1) and verify memory stabilizes
- Add event counter logging to track how many events are processed per iteration
- Run for 2+ hours and verify memory stays under 500 MB
Detailed Profiling (If Issues Persist)
1. Check if OpenCode supports event filtering by session
   - Currently subscribing to the global event stream
   - May include events from other sessions/clients

2. Investigate OpenCode's compaction feature
   - `ToolStateCompleted` has a `time.compacted` field
   - Suggests OpenCode can compact large tool outputs
   - May need to be enabled or triggered manually

3. Use Bun heap snapshots
   - `bun --heap-snapshot src/index.ts`
   - Take snapshots at the start, middle, and end of long runs to see exact allocation sites

4. Monitor the OpenCode server independently
   - Watch OpenCode server process memory growth
   - Verify session deletion actually frees memory server-side
Questions for Investigation
1. Does the OpenCode SDK's event subscription support filtering by `sessionID` to avoid receiving events from other sessions?
2. Can we subscribe to events per session rather than globally to avoid the accumulation?
3. Does OpenCode support automatic compaction of large tool outputs, and if so, how do we enable it?
4. Are there any known issues with AsyncGenerator memory retention in Bun's JavaScript engine?
5. Should ralph maintain its own cache of recent tool outputs, or rely entirely on fetching from OpenCode on-demand?
Impact Assessment
Severity: High - Causes process termination after ~2 hours in production environments
Affected Users: Anyone running ralph for extended periods (multi-hour sessions)
Workaround: Manually restart ralph every 1-2 hours, or increase system RAM (not sustainable)
Risk of Regression: Medium - Fix #1 (stream closure) is a simple addition, but need to verify AsyncGenerator cleanup behavior
Timeline
- Jan 7, 2025: First OOM kill observed (6 GB consumption, unknown runtime)
- Jan 8, 2025: Second OOM kill (10.3 GB physical, 102 GB virtual, 2h 13m runtime)
- Jan 8, 2025: Root cause analysis identified SSE stream accumulation
Additional Context
System Specifications
- Cloud Provider: Hetzner Cloud
- Instance Type: CAX33
- RAM: 16 GB
- Architecture: ARM64 (aarch64)
- Swap: None configured
Ralph Configuration
- Model: `opencode/claude-opus-4-5`
- Plan File: `plan.md`
- OpenCode Server: Shared instance (localhost:4190)
- Session Type: Long-running autonomous agent