@Shourya742
Contributor

This PR adds basic profiling support to our long-running runtime servers so we can actually see what they’re doing while they’re running. We’re using hotpath to profile both latency and memory usage of roles. It provides a nice UI where you can inspect flamegraphs, memory consumption, thread activity, async task polling, and even channel state (note: async_channel and broadcast channels aren’t supported yet). For roles, this helps spot which functions take most of the time, where performance regresses over time, and how memory behaves under load.

Profiling is behind feature flags, so there’s no overhead unless you turn it on:

  • hotpath → time / CPU profiling
  • hotpath-alloc → memory profiling (only works in a single-threaded context)
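
For illustration, here is a minimal sketch of how a function can be gated behind the flag. It assumes hotpath's `measure` and `main` attribute macros (as I understand them from the hotpath README) and an optional `hotpath` dependency wired to a Cargo feature of the same name. `validate_share` is a hypothetical function, not code from this PR.

```rust
// Hypothetical role function; the name and logic are illustrative only.
// With the `hotpath` feature off, `cfg_attr` drops the attribute entirely,
// so the function compiles exactly as if it were never instrumented.
#[cfg_attr(feature = "hotpath", hotpath::measure)]
fn validate_share(nonce: u32) -> bool {
    // Stand-in for real validation work.
    nonce % 2 == 0
}

// Assumption: `hotpath::main` sets up the profiler for the lifetime of `main`.
// Like the attribute above, it is only applied when the feature is enabled.
#[cfg_attr(feature = "hotpath", hotpath::main)]
fn main() {
    let accepted = (0..10u32).filter(|n| validate_share(*n)).count();
    println!("accepted {accepted} shares");
}
```

Because the attributes sit behind `cfg_attr`, a default build does not reference the hotpath crate at all, which is what keeps the overhead at zero unless the feature is turned on.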

To run:

cargo install hotpath --features='tui'
hotpath console

Once the role is running with the feature enabled, hotpath console will show a live view of what’s going on.

The UI looks something like this:

Screenshot from 2025-12-30 15-07-51

@plebhash
Member

plebhash commented Jan 3, 2026

when I first saw this PR, my gut reaction was skeptical:

do we really need to add all this instrumentation overhead and keep maintaining it for the long run?
why not add it temporarily, investigate what needs to be investigated (e.g.: tProxy deadlock), and move on?

but after playing around with hotpath-rs and reflecting further, I changed my mind a bit

most of the refactors are still relatively fresh, and there's still a lot of room for things like deadlocks, memory leaks and undefined states

so I feel it makes sense to add this instrumentation to the codebase and keep it... perhaps not forever, maybe someday we can make the conscious decision to remove it, when we feel enough confidence in the code and there's no need to observe performance in the long run

but for now, I'd keep the code instrumented and ready to be analyzed whenever we feel there are abnormalities manifesting (instead of having to take the extra time to instrument every time we want to make this kind of analysis)

so from a high-level perspective, it's a concept ACK for me

but I still need to go through the code changes to better understand how we're achieving this
