Skip to content

fix: add 30s timeout to WebRTC connection_ready.wait() to prevent indefinite hang#1446

Open
hculap wants to merge 2 commits intodimensionalOS:devfrom
hculap:fix/webrtc-timeout-dev
Open

fix: add 30s timeout to WebRTC connection_ready.wait() to prevent indefinite hang#1446
hculap wants to merge 2 commits intodimensionalOS:devfrom
hculap:fix/webrtc-timeout-dev

Conversation

@hculap
Copy link

@hculap hculap commented Mar 6, 2026

Problem

Blueprint.build() and direct UnitreeWebRTCConnection(ip) hang indefinitely when the WebRTC SDP exchange fails. The root cause is connection_ready.wait() in connect() has no timeout — if the connection never establishes, the event is never set and the thread blocks forever with no error.

Common trigger: Go2 only allows one WebRTC client at a time. If the Unitree mobile app is open, the SDP exchange returns "sdp": "reject" and the event never fires. The process appears to hang with no indication of what went wrong.

Fix

Add a 30-second timeout with thread cleanup and a descriptive TimeoutError listing common causes:

CONNECT_TIMEOUT = 30  # seconds

if not self.connection_ready.wait(timeout=CONNECT_TIMEOUT):
    if self.loop.is_running():
        self.loop.call_soon_threadsafe(self.loop.stop)
    if self.thread.is_alive():
        self.thread.join(timeout=2.0)
    raise TimeoutError(
        f"WebRTC connection to {self.ip} timed out after {CONNECT_TIMEOUT}s. "
        "Common causes:\n"
        "  - Another WebRTC client is connected (close the Unitree mobile app)\n"
        "  - Robot is unreachable on the network\n"
        "  - Port 9991 (encrypted SDP) is not responding\n"
        "Tip: only one WebRTC client can connect to the Go2 at a time."
    )

Testing

Tested on: Go2 Pro, stock firmware, STA mode (home WiFi), DimOS v0.0.10.post2, macOS Apple Silicon.

  • Normal connection: connects as before, no behavior change
  • Mobile app connected: raises TimeoutError after 30s with helpful message
  • Robot unreachable: raises TimeoutError after 30s

Closes #1438 (previous PR targeting main — same fix, retargeted to dev as requested).

…efinite hang

Blueprint.build() and direct UnitreeWebRTCConnection() hang forever when
the WebRTC SDP exchange fails (port closed, another client connected,
robot unreachable). connection_ready.wait() has no timeout, blocking the
subprocess/main thread indefinitely with no error message.

Add a 30-second timeout with thread cleanup and a descriptive TimeoutError
listing common causes and troubleshooting steps.

Common trigger: Go2 only allows one WebRTC client at a time. If the Unitree
mobile app is connected, the SDP exchange returns 'reject' and the event
never fires. Error now surfaces immediately instead of hanging forever.

Tested: Go2 Pro, stock firmware, STA mode (home WiFi), DimOS v0.0.10.post2
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 6, 2026

Greptile Summary

This PR fixes a real and well-documented hang in UnitreeWebRTCConnection.connect() by adding a 30-second timeout to the blocking threading.Event.wait() call, along with basic thread cleanup and a descriptive TimeoutError. The core fix is correct and the happy-path behaviour is unchanged.

Key findings:

  • Missing conn.disconnect() and task.cancel() in the timeout path — unlike the existing stop() and disconnect() methods, the timeout handler does not cancel the pending async_connect() task or schedule await self.conn.disconnect(). This can leave the robot-side WebRTC session open, blocking an immediate retry — the very scenario the fix is meant to address.
  • CONNECT_TIMEOUT defined as a method-local constant — moving this to a class attribute (e.g. CONNECT_TIMEOUT: int = 30) would allow callers on slower/faster networks to customise it without modifying source.
  • Silent thread survival after join(timeout=2.0) deadline — if the background thread is still alive after the join completes, no warning is emitted; a quick retry could race with the lingering thread.

Confidence Score: 3/5

  • Safe to merge as a net improvement over infinite hang, but the incomplete cleanup in the timeout path may cause issues on retry.
  • The fix resolves the reported infinite-hang correctly and normal connection flow is unaffected. The score is reduced because the timeout cleanup path omits task.cancel() and conn.disconnect(), which are present in every other cleanup code path and their absence could cause the robot-side connection to remain open, recreating the original problem on a retry.
  • dimos/robot/unitree/connection.py — specifically lines 120–132 (timeout cleanup block)

Important Files Changed

Filename Overview
dimos/robot/unitree/connection.py Adds a 30 s timeout to connection_ready.wait() with partial cleanup; however, the timeout path does not cancel the pending asyncio task or call conn.disconnect(), potentially leaving the robot-side WebRTC session open for retry attempts.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant connect()
    participant BackgroundThread
    participant asyncio Loop
    participant Robot (WebRTC)

    Caller->>connect(): call connect()
    connect()->>asyncio Loop: new_event_loop()
    connect()->>BackgroundThread: thread.start()
    BackgroundThread->>asyncio Loop: create_task(async_connect())
    BackgroundThread->>asyncio Loop: loop.run_forever()

    asyncio Loop->>Robot (WebRTC): conn.connect() (SDP exchange)

    alt Connection succeeds (happy path)
        Robot (WebRTC)-->>asyncio Loop: SDP accepted
        asyncio Loop->>connect(): connection_ready.set()
        connect()->>Caller: returns normally
    else Timeout after 30 s (e.g. SDP rejected / unreachable)
        connect()->>asyncio Loop: call_soon_threadsafe(loop.stop)
        Note over connect(): ⚠ task NOT cancelled<br/>⚠ conn.disconnect() NOT called
        connect()->>BackgroundThread: thread.join(timeout=2.0)
        Note over connect(): thread may still be alive if join times out
        connect()->>Caller: raises TimeoutError
        Note over Robot (WebRTC): WebRTC session may still be<br/>open on the robot side
    end
Loading

Last reviewed commit: 18f76d5

Comment on lines +120 to +132
if not self.connection_ready.wait(timeout=CONNECT_TIMEOUT):
if self.loop.is_running():
self.loop.call_soon_threadsafe(self.loop.stop)
if self.thread.is_alive():
self.thread.join(timeout=2.0)
raise TimeoutError(
f"WebRTC connection to {self.ip} timed out after {CONNECT_TIMEOUT}s. "
"Common causes:\n"
" - Another WebRTC client is connected (close the Unitree mobile app)\n"
" - Robot is unreachable on the network\n"
" - Port 9991 (encrypted SDP) is not responding\n"
"Tip: only one WebRTC client can connect to the Go2 at a time."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing conn.disconnect() and task.cancel() in timeout cleanup

When the timeout fires, the cleanup stops the event loop but never cancels the pending async_connect() task or calls self.conn.disconnect(). Compare this with the existing stop() and disconnect() methods, which always cancel the task and schedule await self.conn.disconnect() before stopping the loop.

Without calling self.conn.disconnect(), the robot-side WebRTC peer connection may remain open or in a "connecting" state, potentially blocking the very retry attempt the user will make after seeing the TimeoutError — the same root cause as the original bug. It can also emit a Python runtime warning: Task was destroyed but it is pending!.

The cleanup block should mirror the pattern used in stop():

if not self.connection_ready.wait(timeout=CONNECT_TIMEOUT):
    if self.loop.is_running():
        if self.task is not None:
            self.loop.call_soon_threadsafe(self.task.cancel)

        async def async_disconnect() -> None:
            try:
                await self.conn.disconnect()
            except Exception:
                pass

        asyncio.run_coroutine_threadsafe(async_disconnect(), self.loop)
        self.loop.call_soon_threadsafe(self.loop.stop)

    if self.thread.is_alive():
        self.thread.join(timeout=2.0)

    raise TimeoutError(...)

self.task = self.loop.create_task(async_connect())
self.loop.run_forever()

CONNECT_TIMEOUT = 30 # seconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONNECT_TIMEOUT should be a class-level attribute

Defining the timeout as a method-local constant makes it invisible at the class API level and impossible for callers to customise without patching the source. Users on slower or congested networks might need a longer window; users in CI want a shorter one.

A class attribute is both more discoverable and overridable:

Suggested change
CONNECT_TIMEOUT = 30 # seconds
CONNECT_TIMEOUT = 30 # seconds — override on the class or instance if needed

Or better, promote it to the class body:

class UnitreeWebRTCConnection(Resource):
    CONNECT_TIMEOUT: int = 30  # seconds
    ...

and reference it as self.CONNECT_TIMEOUT inside connect().

Comment on lines +123 to +124
if self.thread.is_alive():
self.thread.join(timeout=2.0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Background thread may still be alive when TimeoutError is raised

thread.join(timeout=2.0) is non-blocking past its deadline — if the thread has not stopped within 2 s (e.g. a blocking C-extension call inside conn.disconnect()), execution falls through silently and TimeoutError is raised while the daemon thread is still running.

If a caller catches the TimeoutError and immediately tries to reconnect, the still-alive thread sharing self.loop / self.conn could cause a race condition. Consider logging a warning when the join times out:

if self.thread.is_alive():
    self.thread.join(timeout=2.0)
    if self.thread.is_alive():
        import warnings
        warnings.warn(
            "Background WebRTC thread did not stop within 2s after timeout cleanup.",
            RuntimeWarning,
            stacklevel=2,
        )

Address review feedback: the timeout cleanup path was missing task.cancel()
and conn.disconnect(), which could leave a half-open WebRTC session on the
robot side. A subsequent connect() call would then time out again for the
same reason, recreating the original infinite-hang problem.

Cleanup sequence on timeout:
1. Cancel the async_connect task (CancelledError handled)
2. Call conn.disconnect() if conn was partially initialised
3. Stop the event loop
4. Join the background thread (2s timeout)
5. Raise TimeoutError with diagnostic message
@hculap
Copy link
Author

hculap commented Mar 6, 2026

Thanks for the review! Good catch — pushed a fixup commit (2aefe0c) that adds task.cancel() + conn.disconnect() to the timeout cleanup path. The cleanup now runs via run_coroutine_threadsafe so it executes in the WebRTC event loop, then the loop is stopped and thread joined. This should prevent leaving a half-open session on the robot side on timeout/retry.

@hculap
Copy link
Author

hculap commented Mar 13, 2026

Hey @lesh-dimensionalOS 👋 — gentle ping on this one. You confirmed the fix resolves the robot motion issue back in early March. Any chance this can get reviewed and merged into dev? Happy to rebase or add tests if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant