Skip to content

feat: Minimal agent for AuthBridge OTEL (Approach A, zero custom observability)#122

Draft
Ladas wants to merge 10 commits intokagenti:mainfrom
Ladas:feat/otel-authbridge-minimal-agent-667
Draft

feat: Minimal agent for AuthBridge OTEL (Approach A, zero custom observability)#122
Ladas wants to merge 10 commits intokagenti:mainfrom
Ladas:feat/otel-authbridge-minimal-agent-667

Conversation

@Ladas
Copy link
Contributor

@Ladas Ladas commented Feb 13, 2026

Summary

Weather agent with zero custom observability code for Approach A (kagenti/kagenti#667).

Based on the pre-PR-114 weather agent, with only these additions:

  • W3C Trace Context propagation (so agent spans inherit AuthBridge root span)
  • OpenAI auto-instrumentation (for GenAI token metrics)
  • Dockerfile fix (Docker Hub base image)

The AuthBridge ext_proc creates the root span with all MLflow/OpenInference/GenAI attributes. Agent's LangChain and OpenAI auto-instrumented spans become children via traceparent header injection.

What the agent does NOT have (by design)

  • No observability.py module
  • No custom middleware
  • No mlflow.* attribute setting
  • No openinference.span.kind setting
  • No root span creation
  • No input/output capture

All of the above is handled by the AuthBridge ext_proc (kagenti/kagenti#668).

Related

Test plan

  • Deploy on HyperShift cluster with AuthBridge ext_proc
  • Run MLflow E2E tests with AGENT_OBSERVABILITY_VARIANT=authbridge
  • Verify traces appear in MLflow with correct root span attributes

🤖 Generated with Claude Code

Remove all OpenTelemetry code, auto-instrumentation, and tracing
middleware from the weather agent. All observability is now handled
externally by the AuthBridge ext_proc sidecar which:
- Creates root spans from A2A request/response
- Creates nested child spans from SSE stream events
- Sets all MLflow/OpenInference/GenAI attributes

Removed:
- observability.py (TracerProvider, middleware, span helpers)
- OTEL dependencies (opentelemetry-*, openinference-*)
- Tracing middleware from Starlette app
- All span creation/enrichment in agent code

The agent is now a plain A2A agent with zero observability overhead.

Refs kagenti/kagenti#667

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
@Ladas Ladas force-pushed the feat/otel-authbridge-minimal-agent-667 branch from 4c05994 to e9642b2 Compare February 17, 2026 12:27
Ladas and others added 7 commits February 17, 2026 14:10
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Wrap execute() with asyncio.shield() so LangGraph continues running
even when the SSE client disconnects. Event emission errors from the
closed connection are caught silently — the agent completes the task
and stores the result in the task store regardless.

This enables the ext_proc to recover the full trace via
tasks/resubscribe after the original client disconnects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Add _cancelled flag to distinguish between SSE disconnect (continue
execution) and explicit tasks/cancel (propagate cancel to task).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
When the final event emission fails due to client disconnect, save
the completed task with artifacts directly to the InMemoryTaskStore.
This allows the ext_proc's tasks/resubscribe to find the completed
result and capture the output for the trace.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Serialize LangChain messages via model_dump() and json.dumps() instead
of Python str(). This produces valid JSON that the ext_proc can parse
to extract GenAI semantic convention attributes (token counts, model
name, tool names) without regex.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
The emit_event(final=True) enqueues to EventQueue successfully even
after SSE disconnect, but the consumer is gone so the task store
never gets the artifact. Save the completed task with artifact
directly to the store after every execution, not just on failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Artifact constructor requires artifactId field. Generate a UUID.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Ladas and others added 2 commits February 18, 2026 09:08
Support three instrumentation modes via OTEL_INSTRUMENT env var:
- none: no agent-side instrumentation (ext_proc sidecar only)
- openinference: LangChainInstrumentor (OpenInference conventions)
- openai: OpenAIInstrumentor (gen_ai.* conventions)

Auto-instrumented spans become children of the ext_proc root span
via W3C traceparent propagation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Without ASGI middleware, the LangChain/OpenAI auto-instrumentors
create orphaned traces (separate from the ext_proc root span).

Add OpenTelemetryMiddleware that extracts the traceparent header
injected by the ext_proc sidecar and sets it as the active trace
context. Auto-instrumented spans now become proper children of
the ext_proc root span.

The ASGI middleware is only enabled when OTEL_INSTRUMENT != none.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Copy link
Contributor

@pdettori pdettori left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big-picture review

The observability simplification itself looks good — moving root span ownership to the ext_proc sidecar is the right direction. Four concerns to address before merging, details in inline comments.


from a2a.server.agent_execution import AgentExecutor, RequestContext
from a2a.server.apps import A2AStarletteApplication
from a2a.server.events.event_queue import EventQueue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 PR description should reflect all changes in this PR

Beyond the observability simplification, this PR also introduces:

  • asyncio.shield() to survive SSE client disconnects (new resilience pattern)
  • Direct task-store saving as a fallback when SSE consumer is gone (new data-flow path)
  • _serialize_event() for JSON-serialized LangGraph events (new wire format for ext_proc parsing)
  • A working cancel() implementation (previously raise Exception)

These are all fine to include here, but the PR summary should list them so reviewers (and future git log readers) know what shipped.

@@ -91,7 +111,36 @@ class WeatherExecutor(AgentExecutor):
"""
A class to handle weather assistant execution for A2A Agent.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Concurrency bug: _cancelled is shared across concurrent requests

WeatherExecutor is instantiated once in run() and shared across all concurrent requests. The _cancelled flag is reset to False at the top of every execute() call:

self._cancelled = False
task = asyncio.ensure_future(self._do_execute(context, event_queue))

If two requests are in flight:

  • Request A is executing, user sends tasks/cancelself._cancelled = True
  • Request B starts execute()self._cancelled = False (clobbers A's cancel)
  • Request A's SSE disconnects → CancelledError is caught, but _cancelled is now False, so the cancel is treated as an SSE disconnect instead of an explicit cancel

This needs to be per-request state, e.g. a dict[str, bool] keyed by task ID or context ID.


# Initialize observability before importing agent
setup_observability()
All tracing and observability is handled externally by the AuthBridge
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Docstring contradicts pyproject.toml

This says "No OTEL dependencies needed in the agent" but pyproject.toml adds five OTEL packages (opentelemetry-sdk, opentelemetry-exporter-otlp-proto-http, openinference-instrumentation-langchain, opentelemetry-instrumentation-openai, opentelemetry-instrumentation-asgi) and observability.py is still ~110 lines of active setup code.

This is "minimal" instrumentation, not "zero". The docstring (and PR title) should be corrected to avoid confusion when comparing Approach A vs. alternatives.

async def _do_execute(self, context: RequestContext, event_queue: EventQueue):
"""
The agent allows to retrieve weather info through a natural language conversational interface
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Silent except Exception: pass hides real bugs

This pattern appears in several places in this PR (here, line 165, line 184). While the intent — survive SSE client disconnects — is valid, swallowing all exceptions silently (including serialization bugs, type errors, SDK contract violations) will make production debugging very difficult.

At minimum, log at warning level instead of pass or debug, so these are visible in production logs without enabling debug verbosity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

2 participants