-
Notifications
You must be signed in to change notification settings - Fork 19
Add process metrics sidecar with /info endpoint #277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add a separate HTTP server for process metrics and worker metadata, started by run.go after spawning SDK workers. This works for all language workers (Go/Python/Java/TS/.NET). The sidecar serves: - /metrics: Process metrics (CPU/memory) for the worker PID - /info: Worker metadata (sdk_version, build_id) The /info endpoint only returns fields that run-scenario doesn't already know. During metrics export, /info is fetched to populate additional fields in the parquet output.
28bf7ba to
bd291f5
Compare
- Add language field to /info endpoint and MetricLine parquet export - Add --prom-export-process-job flag (default: omes-worker-process) to query process metrics from a separate Prometheus job than SDK metrics - Pass language from runner to StartProcessMetricsSidecar
bd291f5 to
0ec4989
Compare
- Add --auth-header argument to .NET, TypeScript, Python, and Java workers - Fix TLS flag parsing in TypeScript and Python to accept string values - Add --build-id argument to non-Go workers
0ec4989 to
94c9c9a
Compare
This flag allows specifying the SDK version/ref to report in metrics separately from the --version flag (used for build-time SDK selection). The flag is: - Used by the sidecar's /info endpoint as sdk_version - NOT passed through to the actual worker process - Falls back to --version value if not specified This enables reporting the actual git ref (e.g., "master") in metrics when building from a local SDK directory (--version ./repo).
689aff8 to
237999d
Compare
The sdk_version from the sidecar's /info endpoint is now included in the exported parquet metrics file, making it available for analysis and dashboards.
1193946 to
f6e7c49
Compare
Sushisource
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mostly makes sense to me - an update to the README describing this architecture would be a good idea
cmd/clioptions/metrics.go
Outdated
| }) | ||
| }) | ||
|
|
||
| // Start server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Boring comment (so are the other ones inside the func, really)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
| // MetricsVersionTag is the SDK version/ref to report in metrics. | ||
| // This is used by the sidecar's /info endpoint and is NOT passed to the worker. | ||
| // If empty, falls back to the --version flag value. | ||
| MetricsVersionTag string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we want this to be different from version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--version can be a path to a local repo (which is what we use in our cloud worker testing), we'd prefer a git commit hash or something that actually identifies the version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. I see.
cmd/clioptions/metrics_test.go
Outdated
| // Give time for shutdown | ||
| time.Sleep(100 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should instead just use an eventually kind of thing on the subsequent assertion (also more boring comments)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaced with a util called eventually (assertion with a timeout)
|
updated README to include some info on the metrics sidecar |
What was changed
Add a separate HTTP server for process metrics and worker metadata, started when spawning SDK workers. This works for all language workers (Go/Python/Java/TS/.NET).
The sidecar serves:
The /info endpoint only returns fields that run-scenario doesn't already know. During metrics export, /info is fetched to populate additional fields in the parquet output.
Why?
We opt for a Prom server sidecar for process metrics because core-based SDKs have their existing Prom server started by core, making it difficult to:
Using a sidecar server to capture process metrics is simple.
We add an
/infoendpoint to this sidecar server so that the scenario runner can fetch additional worker metadata for metrics export. This is useful in scenarios where the scenario runner and the worker are running on different machines (as is the case in cloud worker testing).Added some basic integration tests