feat: Minimal agent for AuthBridge OTEL (Approach A, zero custom observability)#122
feat: Minimal agent for AuthBridge OTEL (Approach A, zero custom observability)#122Ladas wants to merge 10 commits intokagenti:mainfrom
Conversation
Remove all OpenTelemetry code, auto-instrumentation, and tracing middleware from the weather agent. All observability is now handled externally by the AuthBridge ext_proc sidecar which: - Creates root spans from A2A request/response - Creates nested child spans from SSE stream events - Sets all MLflow/OpenInference/GenAI attributes Removed: - observability.py (TracerProvider, middleware, span helpers) - OTEL dependencies (opentelemetry-*, openinference-*) - Tracing middleware from Starlette app - All span creation/enrichment in agent code The agent is now a plain A2A agent with zero observability overhead. Refs kagenti/kagenti#667 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ladislav Smola <lsmola@redhat.com>
4c05994 to
e9642b2
Compare
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Wrap execute() with asyncio.shield() so LangGraph continues running even when the SSE client disconnects. Event emission errors from the closed connection are caught silently — the agent completes the task and stores the result in the task store regardless. This enables the ext_proc to recover the full trace via tasks/resubscribe after the original client disconnects. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Add _cancelled flag to distinguish between SSE disconnect (continue execution) and explicit tasks/cancel (propagate cancel to task). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ladislav Smola <lsmola@redhat.com>
When the final event emission fails due to client disconnect, save the completed task with artifacts directly to the InMemoryTaskStore. This allows the ext_proc's tasks/resubscribe to find the completed result and capture the output for the trace. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Serialize LangChain messages via model_dump() and json.dumps() instead of Python str(). This produces valid JSON that the ext_proc can parse to extract GenAI semantic convention attributes (token counts, model name, tool names) without regex. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ladislav Smola <lsmola@redhat.com>
The emit_event(final=True) enqueues to EventQueue successfully even after SSE disconnect, but the consumer is gone so the task store never gets the artifact. Save the completed task with artifact directly to the store after every execution, not just on failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Artifact constructor requires artifactId field. Generate a UUID. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Support three instrumentation modes via OTEL_INSTRUMENT env var: - none: no agent-side instrumentation (ext_proc sidecar only) - openinference: LangChainInstrumentor (OpenInference conventions) - openai: OpenAIInstrumentor (gen_ai.* conventions) Auto-instrumented spans become children of the ext_proc root span via W3C traceparent propagation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Without ASGI middleware, the LangChain/OpenAI auto-instrumentors create orphaned traces (separate from the ext_proc root span). Add OpenTelemetryMiddleware that extracts the traceparent header injected by the ext_proc sidecar and sets it as the active trace context. Auto-instrumented spans now become proper children of the ext_proc root span. The ASGI middleware is only enabled when OTEL_INSTRUMENT != none. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ladislav Smola <lsmola@redhat.com>
pdettori
left a comment
There was a problem hiding this comment.
Big-picture review
The observability simplification itself looks good — moving root span ownership to the ext_proc sidecar is the right direction. Four concerns to address before merging, details in inline comments.
|
|
||
| from a2a.server.agent_execution import AgentExecutor, RequestContext | ||
| from a2a.server.apps import A2AStarletteApplication | ||
| from a2a.server.events.event_queue import EventQueue |
There was a problem hiding this comment.
🟡 PR description should reflect all changes in this PR
Beyond the observability simplification, this PR also introduces:
asyncio.shield()to survive SSE client disconnects (new resilience pattern)- Direct task-store saving as a fallback when SSE consumer is gone (new data-flow path)
_serialize_event()for JSON-serialized LangGraph events (new wire format for ext_proc parsing)- A working
cancel()implementation (previouslyraise Exception)
These are all fine to include here, but the PR summary should list them so reviewers (and future git log readers) know what shipped.
| @@ -91,7 +111,36 @@ class WeatherExecutor(AgentExecutor): | |||
| """ | |||
| A class to handle weather assistant execution for A2A Agent. | |||
There was a problem hiding this comment.
🔴 Concurrency bug: _cancelled is shared across concurrent requests
WeatherExecutor is instantiated once in run() and shared across all concurrent requests. The _cancelled flag is reset to False at the top of every execute() call:
self._cancelled = False
task = asyncio.ensure_future(self._do_execute(context, event_queue))
If two requests are in flight:
- Request A is executing, user sends
tasks/cancel→self._cancelled = True - Request B starts
execute()→self._cancelled = False(clobbers A's cancel) - Request A's SSE disconnects →
CancelledErroris caught, but_cancelledis nowFalse, so the cancel is treated as an SSE disconnect instead of an explicit cancel
This needs to be per-request state, e.g. a dict[str, bool] keyed by task ID or context ID.
|
|
||
| # Initialize observability before importing agent | ||
| setup_observability() | ||
| All tracing and observability is handled externally by the AuthBridge |
There was a problem hiding this comment.
🟡 Docstring contradicts pyproject.toml
This says "No OTEL dependencies needed in the agent" but pyproject.toml adds five OTEL packages (opentelemetry-sdk, opentelemetry-exporter-otlp-proto-http, openinference-instrumentation-langchain, opentelemetry-instrumentation-openai, opentelemetry-instrumentation-asgi) and observability.py is still ~110 lines of active setup code.
This is "minimal" instrumentation, not "zero". The docstring (and PR title) should be corrected to avoid confusion when comparing Approach A vs. alternatives.
| async def _do_execute(self, context: RequestContext, event_queue: EventQueue): | ||
| """ | ||
| The agent allows to retrieve weather info through a natural language conversational interface | ||
| """ |
There was a problem hiding this comment.
🟡 Silent except Exception: pass hides real bugs
This pattern appears in several places in this PR (here, line 165, line 184). While the intent — survive SSE client disconnects — is valid, swallowing all exceptions silently (including serialization bugs, type errors, SDK contract violations) will make production debugging very difficult.
At minimum, log at warning level instead of pass or debug, so these are visible in production logs without enabling debug verbosity.
Summary
Weather agent with zero custom observability code for Approach A (kagenti/kagenti#667).
Based on the pre-PR-114 weather agent, with only these additions:
The AuthBridge ext_proc creates the root span with all MLflow/OpenInference/GenAI attributes. Agent's LangChain and OpenAI auto-instrumented spans become children via
traceparentheader injection.What the agent does NOT have (by design)
observability.pymodulemlflow.*attribute settingopeninference.span.kindsettingAll of the above is handled by the AuthBridge ext_proc (kagenti/kagenti#668).
Related
Test plan
AGENT_OBSERVABILITY_VARIANT=authbridge🤖 Generated with Claude Code