Add instrumentation and make system robust for actor restarts#128

Merged
dongwang218 merged 14 commits into main from add_instrumentation on Feb 3, 2026

@dongwang218 dongwang218 commented Feb 3, 2026

Why?

Collect more instrumentation metrics and make the system robust to actor restarts.

How?

  • Instrumentation: add dequeue latency.
  • When ENABLE_INSTRUMENTATION is set to true, the orchestrator records the instrumentation metrics itself. The Prometheus scrape interval is about 1 second, which is not detailed enough. With these records, we can run collect_instrumentation_metrics.py to calculate detailed latency, throughput, and networking metrics.
  • Add dummy coral agents to study system bottlenecks when the agent is fast.
  • Use a background task to refresh the list of active actors. When an agent actor restarts, we proactively refresh the Ray actors to avoid stale actors. Add local caching: the up-to-date mapping is kept in Sink, and calling Sink to obtain the list does not scale.
  • Add dead_orchestrator_tracking to proactively time out long-running tasks in case they are dead. Newly arrived tasks are used to time out old ones; for late arrivals, change the timeout window to make it harder to time out.
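
For readers unfamiliar with the pattern, the cached actor list with background refresh can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `CachedActorList`, `fetch_active_actors`, and `refresh_interval_s` are hypothetical names, and the fetch callable merely stands in for whatever call returns the current mapping from Sink. A plain thread is used here for simplicity; the actual background task may be implemented differently.

```python
import threading
from typing import Callable


class CachedActorList:
    """Local snapshot of the active-actor list, refreshed by a background thread.

    Hypothetical sketch: `fetch_active_actors` stands in for the (expensive)
    call that asks the authoritative source for the current list.
    """

    def __init__(
        self,
        fetch_active_actors: Callable[[], list[str]],
        refresh_interval_s: float = 5.0,
    ) -> None:
        self._fetch = fetch_active_actors
        self._interval = refresh_interval_s
        self._lock = threading.Lock()
        self._actors: list[str] = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        # Populate the cache once before serving reads, then refresh periodically.
        self.refresh()
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def refresh(self) -> None:
        """Fetch a fresh list; keep the previous snapshot if the fetch fails."""
        try:
            actors = list(self._fetch())
        except Exception:
            return
        with self._lock:
            self._actors = actors

    def get(self) -> list[str]:
        """Return the cached list without contacting the authoritative source."""
        with self._lock:
            return list(self._actors)

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            self.refresh()
```

Readers call `get()` on the hot path and never hit the authoritative source directly; a restarted actor shows up after at most one refresh interval, or immediately if the caller invokes `refresh()` when it detects a failure.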

Test plan

Tested all commands, same as #115.

@meta-cla meta-cla bot added the CLA Signed label Feb 3, 2026
@dongwang218 dongwang218 changed the title from "Add instrumentation" to "Add instrumentation and make system robust for actor restarts" Feb 3, 2026
import pandas as pd


def load_jsonl(filepath: str) -> list[dict[str, Any]]:

nit: same as the one in collect_agent_metrics.py?

@dongwang218 dongwang218 merged commit c2a5eb5 into main Feb 3, 2026
8 checks passed
@dongwang218 dongwang218 deleted the add_instrumentation branch February 3, 2026 02:13
