
Potential memory leak: 10GB ralph process #12

@connorads

I've tried running two Ralph sessions overnight and both were killed by the out-of-memory (OOM) killer on my VPS.
Have you had this experience?
Ralph was killed after using 10GB of memory. The OpenCode server had about 2.4GB.
10GB memory for an app that's basically a loop on another app seems excessive.

I've recently set up this VPS, so it's possible I'm doing something wrong. I'm considering a beefier machine with 32GB next time, but that feels a bit unnecessary seeing as my Mac is also ARM and works fine on 16GB. I also don't have a swap file on the VPS, which I'm going to add, but swap files don't really solve memory leaks.

I got Claude to have a look at both the server and the source code and interrogated it a bit, questioning its reasoning etc., and it thinks there might be a memory leak. At this point I'm thinking I'll experiment with just doing a bash loop, so I'll leave you with the results from Claude below. I hope this isn't falling foul of your recent blog post, but I thought I'd share nonetheless.

Memory Leak Investigation: Ralph Agent Process

Incident Context

What Happened

The ralph agent process was killed by the Linux OOM (Out of Memory) killer after consuming excessive memory during a long-running session.

Key Facts:

  • Process: ralph (agent process)
  • Runtime: 2 hours 13 minutes before OOM kill
  • Memory at death: 10.3 GB physical RAM, 102 GB virtual memory
  • System: Hetzner CAX33 (16 GB RAM, ARM64 architecture, no swap)
  • Pattern: Second OOM kill - previous one on Jan 7 consumed 6 GB after unknown runtime

Memory Growth Pattern

The ralph process started at normal baseline and grew from ~100 MB to 10.3 GB over approximately 2 hours. The massive discrepancy between physical (10.3 GB) and virtual (102 GB) memory suggests significant memory fragmentation and allocation churn.

System Context

  • Long-running agent session working overnight
  • Multiple other processes running concurrently:
    • OpenCode server instances (~2.4 GB combined)
    • Additional opencode instances (~1.2 GB each)
    • next-server (2.2 GB)
  • Agent operating in tmux session
  • No swap space available, so OOM killer acted immediately when RAM exhausted

Critical Evidence: The Leak is Client-Side

The evidence clearly points to ralph's client-side memory accumulation, not the OpenCode server:

  1. 10.3 GB in ralph process - Memory attributed directly to ralph's process space
  2. OpenCode servers only 2.4 GB combined - If sessions were accumulating in the server, these processes would be much larger
  3. 102 GB virtual memory in ralph - Massive virtual allocation suggests ralph itself is doing heavy buffering/allocation with significant fragmentation

Root Cause Analysis

🔴 PRIMARY: SSE Event Stream Buffering with Full Tool Outputs

Location: src/loop.ts:200-268

Every SSE event from OpenCode contains the complete tool output data:

export type ToolStateCompleted = {
    status: "completed";
    input: { [key: string]: unknown };
    output: string;              // ⚠️ FULL OUTPUT (can be megabytes per tool call)
    title: string;
    attachments?: Array<FilePart>; // ⚠️ FILE ATTACHMENTS (can be huge)
    metadata: { [key: string]: unknown };
    time: { start: number; end: number; compacted?: number };
};

The Problem:

  1. Ralph subscribes to the global event stream:

    const events = await client.event.subscribe();
  2. Ralph consumes events but only extracts minimal data for display:

    for await (const event of events.stream) {
      const part = event.properties.part;
      // Ralph only stores toolName and title for UI display
      callbacks.onEvent({
        icon: toolName,
        text: part.state.title || JSON.stringify(part.state.input),
        // ⚠️ NOT using part.state.output
      });
    }
  3. However: The full event objects remain in memory because:

    • The AsyncGenerator maintains references to yielded values
    • The stream is never explicitly closed
    • Each event contains the full tool output (file reads, grep results, web fetches)
    • Ralph subscribes to ALL events globally, not just for its session
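The failure mode can be sketched in isolation. This is illustrative only (the `ToolEvent` shape here is hypothetical, not the SDK's actual type): the safe pattern is to copy out the few bytes the UI needs and never hold a reference to the full event object:

```typescript
// Hypothetical minimal event shape; real SDK events carry much more data.
type ToolEvent = { sessionID: string; title: string; output: string };

// Keep only tiny summaries for the UI; never store `output`, so the
// multi-megabyte payloads are not retained and can be collected.
function summarize(events: Iterable<ToolEvent>, sessionID: string): string[] {
  const summaries: string[] = [];
  for (const ev of events) {
    if (ev.sessionID !== sessionID) continue; // drop other sessions' events
    summaries.push(ev.title);                 // ~bytes per event, not MB
  }
  return summaries;
}
```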

Memory Math:

  • 50 iterations over 2 hours
  • ~20 tool calls per iteration = 1,000 tool calls total
  • Average tool output: 10 MB (large file reads, grep results, API responses)
  • Total: 10 GB

The 102 GB virtual memory comes from JavaScript string immutability causing allocation churn and fragmentation.


🟡 SECONDARY: Sessions Never Deleted

Location: src/loop.ts:189-196

Each iteration creates a new OpenCode session but never cleans it up:

const sessionResult = await client.session.create();
const sessionId = sessionResult.data.id;
// ... iteration runs ...
// ❌ Session is never deleted

While the primary leak (10.3 GB) is in ralph's client-side code, this contributes to the 2.4 GB accumulated in OpenCode server processes. The SDK provides client.session.delete() but it's never called.


🟢 LOW PRIORITY: Other Potential Accumulation Points

Event Array Management

Location: src/state.ts:32-54, src/index.ts:409

Events are properly limited to MAX_EVENTS = 200 via trimEventsInPlace(), preventing unbounded growth. However:

  • Array is mutated in-place with splice() and push()
  • Can cause minor fragmentation over time
  • Each ToolEvent is a small object (~200 bytes)

Impact: Low - the 200 event limit prevents this from being significant.

Batch State Updater

Location: src/index.ts:46-116

The pendingUpdates array accumulates state update closures, but is properly cleared on flush (line 65). The closures capture LoopState including the events array, which could cause temporary memory spikes if batches grow large, but this is flushed every 100ms.

Impact: Low - proper cleanup occurs on flush.
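For reference, the batching pattern described here is roughly the following (a sketch with hypothetical names, not the actual src/index.ts code). Swapping in a fresh array on flush is what lets the applied closures be garbage-collected:

```typescript
type Update<S> = (state: S) => S;

// Queue state-update closures and apply them in batches; replacing the
// pending array wholesale on flush drops references to applied closures.
function createBatcher<S>(apply: (batch: Update<S>[]) => void, intervalMs: number) {
  let pending: Update<S>[] = [];
  const timer = setInterval(() => {
    if (pending.length === 0) return;
    const batch = pending;
    pending = []; // old closures become collectable here
    apply(batch);
  }, intervalMs);
  return {
    queue(update: Update<S>) { pending.push(update); },
    stop() { clearInterval(timer); },
  };
}
```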


Recommended Fixes

Fix #1: Close SSE Stream Properly ⭐ CRITICAL

Location: src/loop.ts after line 268 (after breaking from event loop)

// After breaking from the event loop
if (events.stream.return) {
  await events.stream.return();
  log("loop", "Event stream closed");
}

Why: Explicitly closing the AsyncGenerator allows garbage collection of accumulated event objects.
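A more defensive variant is to wrap the consumption loop in try/finally so the generator is closed on every exit path, including exceptions (a sketch; `handle` returning false stands in for the break condition, and `return()` is a no-op on an already-finished generator):

```typescript
// Consume an AsyncGenerator and guarantee it is closed on every exit path.
async function consume<T>(
  stream: AsyncGenerator<T>,
  handle: (item: T) => boolean, // return false to stop consuming
): Promise<void> {
  try {
    for await (const item of stream) {
      if (!handle(item)) break;
    }
  } finally {
    // Closing runs the generator's own cleanup and releases retained values.
    await stream.return(undefined);
  }
}
```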


Fix #2: Delete Sessions After Each Iteration

Location: src/loop.ts around line 300 (in finally block or after iteration completes)

// Iteration cleanup
try {
  await client.session.delete({ path: { id: sessionId } });
  log("loop", "Session cleaned up", { sessionId });
} catch (error) {
  log("loop", "Failed to delete session", { sessionId, error: String(error) });
}

Why: Prevents session accumulation in OpenCode server (contributes to 2.4 GB server-side growth).


Fix #3: Add Memory Monitoring

Location: src/index.ts after loop starts (around line 378)

import { startMemoryLogging } from "./util/log.js";

// After runLoop() is called
startMemoryLogging(30000); // Log memory every 30 seconds

Why: Provides visibility into memory growth patterns for debugging and verification. The logMemory() function already exists in src/util/log.ts:95-104 but isn't being called.
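A minimal version of such a logger, assuming a Node/Bun-style process.memoryUsage() (the real logMemory() in src/util/log.ts may differ):

```typescript
// Format the current process memory usage as a short log line.
function formatMemory(): string {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  const mb = (n: number) => (n / 1024 / 1024).toFixed(1);
  return `rss=${mb(rss)}MB heap=${mb(heapUsed)}/${mb(heapTotal)}MB`;
}

// Log memory every intervalMs; returns a function that stops the logging.
function startMemoryLoggingSketch(intervalMs: number): () => void {
  const timer = setInterval(() => console.log("[mem]", formatMemory()), intervalMs);
  return () => clearInterval(timer);
}
```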


Fix #4: Force Garbage Collection Between Iterations (Optional)

Location: src/loop.ts after session cleanup, before starting next iteration

// After session deletion, before next iteration
if (global.gc) {
  global.gc();
  log("loop", "Forced GC after iteration", { iteration });
}

Run ralph with: bun --expose-gc src/index.ts

Why: Proactively releases memory between iterations rather than waiting for automatic GC.
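Note that --expose-gc is the Node-style route; Bun also documents Bun.gc(force) directly. A runtime-agnostic helper could be sketched as:

```typescript
// Trigger garbage collection on whichever runtime exposes it; returns
// false when no GC hook is available (e.g. plain node without --expose-gc).
function forceGc(): boolean {
  const g = globalThis as any;
  if (typeof g.Bun?.gc === "function") { g.Bun.gc(true); return true; } // Bun
  if (typeof g.gc === "function")      { g.gc();         return true; } // --expose-gc
  return false;
}
```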


Verification Strategy

Confirm the Fix Works

  1. Add memory monitoring (Fix #3) and observe memory growth every 30 seconds
  2. Implement the stream closure (Fix #1) and verify memory stabilizes
  3. Add event counter logging to track how many events are processed per iteration
  4. Run for 2+ hours and verify memory stays under 500 MB

Detailed Profiling (If Issues Persist)

  1. Check if OpenCode supports event filtering by session

    • Currently subscribing to global event stream
    • May include events from other sessions/clients
  2. Investigate OpenCode's compaction feature

    • ToolStateCompleted has time.compacted field
    • Suggests OpenCode can compact large tool outputs
    • May need to enable or trigger manually
  3. Use Bun heap snapshots

    bun --heap-snapshot src/index.ts

    Take snapshots at start, middle, and end of long runs to see exact allocation sites.

  4. Monitor OpenCode server independently

    • Watch OpenCode server process memory growth
    • Verify session deletion actually frees memory server-side
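On the snapshot point above: snapshots can also be taken programmatically, since Bun documents Bun.generateHeapSnapshot(). A sketch (the output path is illustrative, and this is a no-op outside Bun):

```typescript
// Write a heap snapshot to disk under Bun; returns false on other runtimes.
async function dumpHeapSnapshot(path: string): Promise<boolean> {
  const bun = (globalThis as any).Bun;
  if (!bun?.generateHeapSnapshot) return false; // not running under Bun
  const snapshot = bun.generateHeapSnapshot();
  await bun.write(path, JSON.stringify(snapshot));
  return true;
}
```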

Questions for Investigation

  1. Does the OpenCode SDK's event subscription support filtering by sessionID to avoid receiving events from other sessions?

  2. Can we subscribe to events per session rather than globally to avoid the accumulation?

  3. Does OpenCode support automatic compaction of large tool outputs, and if so, how do we enable it?

  4. Are there any known issues with AsyncGenerator memory retention in Bun's JavaScript engine?

  5. Should ralph maintain its own cache of recent tool outputs, or rely entirely on fetching from OpenCode on-demand?


Impact Assessment

Severity: High - Causes process termination after ~2 hours in production environments

Affected Users: Anyone running ralph for extended periods (multi-hour sessions)

Workaround: Manually restart ralph every 1-2 hours, or increase system RAM (not sustainable)

Risk of Regression: Medium - Fix #1 (stream closure) is a simple addition, but need to verify AsyncGenerator cleanup behavior


Timeline

  • Jan 7, 2025: First OOM kill observed (6 GB consumption, unknown runtime)
  • Jan 8, 2025: Second OOM kill (10.3 GB physical, 102 GB virtual, 2h 13m runtime)
  • Jan 8, 2025: Root cause analysis identified SSE stream accumulation

Additional Context

System Specifications

  • Cloud Provider: Hetzner Cloud
  • Instance Type: CAX33
  • RAM: 16 GB
  • Architecture: ARM64 (aarch64)
  • Swap: None configured

Ralph Configuration

  • Model: opencode/claude-opus-4-5
  • Plan File: plan.md
  • OpenCode Server: Shared instance (localhost:4190)
  • Session Type: Long-running autonomous agent
