@sebsto commented Jan 15, 2026

Fix test hangs caused by Pool cancellation race conditions

Summary

This PR fixes two related race conditions in Lambda+LocalServer+Pool.swift that were causing the test suite to hang approximately 10% of the time.

Problem

The intermittent hangs (~10% of runs) came from two bugs in the Pool implementation:

  1. Individual task cancellation bug: When one task waiting for a specific requestId was cancelled, the cancellation handler would incorrectly cancel ALL waiting tasks instead of just the cancelled one.

  2. Server shutdown hang: When the server shut down, waiting continuations in the pools were never cancelled, causing handlers to wait indefinitely for responses that would never arrive.

Root Causes

Root Cause #1: Cancellation Handler Removes ALL Continuations

The onCancel handler in Pool._next() was removing all continuations from the waitingForSpecific dictionary when any single task was cancelled:

onCancel: {
    // BUG: Removes ALL continuations, not just the cancelled task's
    for continuation in state.waitingForSpecific.values {
        toCancel.append(continuation)
    }
    state.waitingForSpecific.removeAll()
}

This caused unrelated concurrent invocations to fail with CancellationError when one client cancelled their request.
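To make the blast radius concrete, here is a toy model rather than the real Pool code: placeholder strings stand in for continuations, and the function names `removeAllOnCancel` and `removeOneOnCancel` are hypothetical. It only illustrates which waiters are affected when the handler clears the whole dictionary versus a single key.

```swift
// Toy model: request IDs map to placeholder strings instead of real continuations.

// Models the buggy handler: every waiter is removed, whoever was cancelled.
func removeAllOnCancel(_ waiting: inout [String: String]) -> [String] {
    let affected = waiting.keys.sorted()
    waiting.removeAll()
    return affected
}

// Models per-request removal: only the cancelled request ID is touched.
func removeOneOnCancel(_ waiting: inout [String: String], requestId: String) -> [String] {
    waiting.removeValue(forKey: requestId) == nil ? [] : [requestId]
}

var buggy = ["req-1": "waiter 1", "req-2": "waiter 2", "req-3": "waiter 3"]
var scoped = buggy

let affectedByBug = removeAllOnCancel(&buggy)
let affectedScoped = removeOneOnCancel(&scoped, requestId: "req-2")

print(affectedByBug)         // ["req-1", "req-2", "req-3"]
print(affectedScoped)        // ["req-2"]
print(scoped.keys.sorted())  // ["req-1", "req-3"]
```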

Root Cause #2: No Pool Cleanup During Server Shutdown

When the server shut down (e.g., when a test completed), the task group was cancelled but the pools' waiting continuations were never notified. The /invoke endpoint handlers would continue waiting for responses that would never arrive because the Lambda function had stopped.

Solution

Fix #1: Only Remove Specific Continuation on Cancellation

Modified the cancellation handler to only remove the continuation for the specific cancelled task:

onCancel: {
    // Only remove THIS task's continuation
    let continuationToCancel = self.lock.withLock { state -> CheckedContinuation<T, any Error>? in
        if let requestId = requestId {
            return state.waitingForSpecific.removeValue(forKey: requestId)
        } else {
            let cont = state.waitingForAny
            state.waitingForAny = nil
            return cont
        }
    }
    
    continuationToCancel?.resume(throwing: CancellationError())
}
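The same idea can be tried in isolation with a minimal, self-contained sketch. `MiniPool`, its `push`/`next(for:)` methods, the `Int` payload, and the buffering of early values are all assumptions made for illustration; this is not the Pool type from the PR. It also assumes a recent toolchain where `NSLock.withLock` and top-level `await` are available. The sketch additionally checks `Task.isCancelled` under the lock, so a task cancelled before it registers does not park a continuation nobody will resume.

```swift
import Foundation

// Hypothetical stand-in for the pool: one waiter per request ID.
final class MiniPool: @unchecked Sendable {
    private enum Action { case deliver(Int), cancel, wait }

    private let lock = NSLock()
    private var waiting: [String: CheckedContinuation<Int, any Error>] = [:]
    private var buffered: [String: Int] = [:]

    // Hands a value to the waiter for `id`, or buffers it if nobody waits yet.
    func push(_ value: Int, for id: String) {
        let continuation = self.lock.withLock { () -> CheckedContinuation<Int, any Error>? in
            let waiter = self.waiting.removeValue(forKey: id)
            if waiter == nil { self.buffered[id] = value }
            return waiter
        }
        continuation?.resume(returning: value)
    }

    func next(for id: String) async throws -> Int {
        try await withTaskCancellationHandler {
            try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<Int, any Error>) in
                let action = self.lock.withLock { () -> Action in
                    if let value = self.buffered.removeValue(forKey: id) { return .deliver(value) }
                    if Task.isCancelled { return .cancel }  // cancelled before registering
                    self.waiting[id] = continuation
                    return .wait
                }
                // Resume outside the lock.
                switch action {
                case .deliver(let value): continuation.resume(returning: value)
                case .cancel: continuation.resume(throwing: CancellationError())
                case .wait: break
                }
            }
        } onCancel: {
            // Only remove THIS request's continuation; other waiters stay registered.
            let continuation = self.lock.withLock { self.waiting.removeValue(forKey: id) }
            continuation?.resume(throwing: CancellationError())
        }
    }
}

let pool = MiniPool()
let taskA = Task { try await pool.next(for: "a") }
let taskB = Task { try await pool.next(for: "b") }

taskA.cancel()           // cancel only the waiter for "a"
pool.push(42, for: "b")  // the waiter for "b" must still receive its value

let resultA = await taskA.result
let valueB = try await taskB.value

var aWasCancelled = false
if case .failure(let error) = resultA, error is CancellationError { aWasCancelled = true }
print(aWasCancelled, valueB)  // true 42
```

Cancelling `taskA` leaves the entry for `"b"` untouched, which is the behavior the per-request removal is meant to guarantee.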

Fix #2: Add Pool Cleanup During Server Shutdown

Added a cancelAll() method to the Pool class and called it during server shutdown:

func cancelAll() {
    let continuationsToCancel = self.lock.withLock { state -> [CheckedContinuation<T, any Error>] in
        var toCancel: [CheckedContinuation<T, any Error>] = []
        
        if let continuation = state.waitingForAny {
            toCancel.append(continuation)
            state.waitingForAny = nil
        }
        
        for continuation in state.waitingForSpecific.values {
            toCancel.append(continuation)
        }
        state.waitingForSpecific.removeAll()
        
        return toCancel
    }
    
    for continuation in continuationsToCancel {
        continuation.resume(throwing: CancellationError())
    }
}

Called during server shutdown:

let serverOrHandlerResult1 = await group.next()!
group.cancelAll()

// Cancel all waiting continuations in the pools to prevent hangs
server.invocationPool.cancelAll()
server.responsePool.cancelAll()
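The drain-then-resume pattern can be exercised on its own with a small sketch. `ShutdownPool` and its `waiterCount` property are hypothetical names introduced here, not part of the PR, and the example assumes a toolchain with `NSLock.withLock` and top-level `await`. It parks two waiters, calls `cancelAll()`, and checks that both wake up with `CancellationError` instead of hanging.

```swift
import Foundation

// Hypothetical pool that only supports waiting and bulk cancellation.
final class ShutdownPool: @unchecked Sendable {
    private let lock = NSLock()
    private var waiting: [String: CheckedContinuation<Int, any Error>] = [:]

    // Number of currently parked waiters (used here only to sequence the demo).
    var waiterCount: Int { self.lock.withLock { self.waiting.count } }

    func next(for id: String) async throws -> Int {
        try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<Int, any Error>) in
            self.lock.withLock { self.waiting[id] = continuation }
        }
    }

    func cancelAll() {
        // Collect and clear under the lock, then resume outside it.
        let toCancel = self.lock.withLock { () -> [CheckedContinuation<Int, any Error>] in
            let all = Array(self.waiting.values)
            self.waiting.removeAll()
            return all
        }
        for continuation in toCancel {
            continuation.resume(throwing: CancellationError())
        }
    }
}

let pool = ShutdownPool()
let waiterA = Task { try await pool.next(for: "a") }
let waiterB = Task { try await pool.next(for: "b") }

// Wait until both tasks have parked their continuations before "shutting down".
while pool.waiterCount < 2 { await Task.yield() }
pool.cancelAll()

var cancelledCount = 0
for waiter in [waiterA, waiterB] {
    let result = await waiter.result
    if case .failure(let error) = result, error is CancellationError { cancelledCount += 1 }
}
print(cancelledCount, pool.waiterCount)  // 2 0
```

Without the `cancelAll()` call, both `await waiter.result` lines would never return, which mirrors the shutdown hang described above.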

Changes

Modified Files

  • Sources/AWSLambdaRuntime/HTTPServer/Lambda+LocalServer+Pool.swift

    • Fixed cancellation handler in _next() to only remove specific continuation
    • Added cancelAll() method for server shutdown cleanup
  • Sources/AWSLambdaRuntime/HTTPServer/Lambda+LocalServer.swift

    • Call cancelAll() on both pools during server shutdown

New Files

  • Tests/AWSLambdaRuntimeTests/LocalServerPoolCancellationTests.swift
    • Added comprehensive test suite with 3 tests
    • testCancellationOnlyAffectsOwnTask: Verifies only the cancelled task receives CancellationError
    • testConcurrentInvocationsWithCancellation: Tests real-world scenario with 5 concurrent invocations
    • testFIFOModeCancellation: Ensures FIFO mode cancellation works correctly

Testing

Before Fix

  • Test suite hung ~10% of the time
  • When 1 task was cancelled, all 5 concurrent tasks received CancellationError
  • Streaming tests would occasionally hang during shutdown

After Fix

  • All 91 tests pass consistently without hangs
  • When 1 task is cancelled, only that specific task receives CancellationError
  • Other tasks continue waiting normally
  • Server shutdown properly cleans up all waiting continuations
  • Multiple consecutive test runs confirm stability

Test Coverage

The new test suite reproduces both bugs and verifies the fixes:

  1. testCancellationOnlyAffectsOwnTask: Creates 3 tasks waiting for different requestIds, cancels only one, and verifies the others are not affected
  2. testConcurrentInvocationsWithCancellation: Simulates 5 concurrent invocations with one cancellation
  3. testFIFOModeCancellation: Tests FIFO mode to ensure it still works correctly

Sebastien Stormacq added 2 commits January 15, 2026 20:45
@sebsto changed the title from "Fix Test hanfs in Lambda+LocalServer (#630)" to "Fix Test hangs in Lambda+LocalServer (#630)" Jan 15, 2026
@sebsto self-assigned this Jan 15, 2026
@sebsto added the 🔨 semver/patch label (No public API change.) Jan 15, 2026
@sebsto merged commit 34e89b4 into main Jan 15, 2026
43 of 44 checks passed
@sebsto deleted the sebsto/fix_test_hangs branch January 15, 2026 19:58