Monitoring module to expose channels' data through HTTP APIs #129
Conversation
pool-apps/pool/config-examples/mainnet/pool-config-bitcoin-core-ipc-example.toml
...apps/jd-client/config-examples/mainnet/jdc-config-bitcoin-core-ipc-hosted-infra-example.toml
# Monitoring HTTP server address for exposing channel data
monitoring_address = "0.0.0.0:9091"
I may be overthinking it, but I’d suggest binding this to 127.0.0.1 by default. Since we don’t have rate limiting or authentication in place yet, exposing it more broadly could make some endpoints vulnerable to abuse.
JDC is expected to run on "trusted" LAN environments, but I guess Pool is indeed somewhat vulnerable to this.
I made the Pool's config examples use 127.0.0.1 in 3a030ff
Turned this PR into a draft, since I noticed the apps hang when I try to Ctrl+C. Furthermore, I'm considering using Axum as a web framework instead of the manual approach I initially took.
@lucasbalieiro @gimballock this PR is ready for review again. I refactored the HTTP server to use Axum, and I added support for Prometheus by exporting metrics at the /metrics endpoint.
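For readers unfamiliar with Axum, a minimal sketch of what such a wiring typically looks like follows (assuming axum 0.7-style APIs); the function name spawn_monitoring_server and the listen_addr parameter are illustrative, not taken from the PR, while ServerState and handle_prometheus_metrics are the names that appear in the diff below.

    use axum::{routing::get, Router};

    // Illustrative wiring only: ServerState and handle_prometheus_metrics are
    // assumed to be defined elsewhere in the PR; everything else is a sketch.
    async fn spawn_monitoring_server(state: ServerState, listen_addr: &str) -> std::io::Result<()> {
        let app = Router::new()
            // Prometheus text-format metrics, as added in this PR
            .route("/metrics", get(handle_prometheus_metrics))
            .with_state(state);

        let listener = tokio::net::TcpListener::bind(listen_addr).await?;
        axum::serve(listener, app).await
    }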
/// Handler for Prometheus metrics endpoint
///
/// Collects fresh data from all monitoring sources and exports as Prometheus metrics
async fn handle_prometheus_metrics(State(state): State<ServerState>) -> Response {
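The body of this handler is not shown in the excerpt above; a generic handler built on the prometheus crate typically looks roughly like the following sketch. The state.registry field and the "refresh" step are assumptions about the PR's code, not its actual implementation.

    use axum::{extract::State, http::{header, StatusCode}, response::{IntoResponse, Response}};
    use prometheus::{Encoder, TextEncoder};

    async fn handle_prometheus_metrics(State(state): State<ServerState>) -> Response {
        // Refresh gauges/counters from the monitoring sources (PR-specific logic, elided).
        // state.update_metrics();

        // Encode everything currently registered into the Prometheus text exposition format.
        let metric_families = state.registry.gather();
        let mut buffer = Vec::new();
        if TextEncoder::new().encode(&metric_families, &mut buffer).is_err() {
            return (StatusCode::INTERNAL_SERVER_ERROR, "failed to encode metrics").into_response();
        }

        ([(header::CONTENT_TYPE, "text/plain; version=0.0.4")], buffer).into_response()
    }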
I’m not sure whether this is intentional, as I’m not fully familiar with how Prometheus handles this data. However, this endpoint does not appear to be cleaned up properly: it continues to retain and serve data for disconnected channels.
This behavior also makes the endpoint susceptible to abuse.
These metrics retain per-channel data (id, user) that should be cleaned up with metric.remove_label_values() when a channel is closed:
- sv2_client_channel_hashrate
- sv2_client_shares_accepted_total
- sv2_client_channel_shares_per_minute
You could collect that info in ChannelManager.remove_downstream(), but how do you pass it to the monitoring layer?
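A sketch of the cleanup being suggested here, assuming a hypothetical ChannelMetrics grouping of the three metrics above and a (user, channel_id) label order; none of these names come from the PR itself.

    use prometheus::{GaugeVec, IntCounterVec};

    // Hypothetical grouping of the three metrics named above.
    struct ChannelMetrics {
        channel_hashrate: GaugeVec,           // sv2_client_channel_hashrate
        shares_accepted_total: IntCounterVec, // sv2_client_shares_accepted_total
        channel_shares_per_minute: GaugeVec,  // sv2_client_channel_shares_per_minute
    }

    // When a channel closes, drop its labelled series so Prometheus stops serving it.
    fn on_channel_closed(metrics: &ChannelMetrics, user: &str, channel_id: u32) {
        let id = channel_id.to_string();
        let labels = [user, id.as_str()];

        // An Err here just means the series was never created; safe to ignore.
        let _ = metrics.channel_hashrate.remove_label_values(&labels);
        let _ = metrics.shares_accepted_total.remove_label_values(&labels);
        let _ = metrics.channel_shares_per_minute.remove_label_values(&labels);
    }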
I pushed improvements for this.
I basically added a reset of the three mentioned metrics at the beginning of handle_prometheus_metrics().
This way, the Prometheus endpoint also only ever shows data for the currently open channels.
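Roughly the reset-then-repopulate approach described above, reusing the hypothetical ChannelMetrics grouping from the earlier sketch; the function name is illustrative.

    fn reset_channel_metrics(metrics: &ChannelMetrics) {
        // Drop every labelled series; only channels that are still open get
        // re-registered when the handler repopulates the metrics afterwards.
        metrics.channel_hashrate.reset();
        metrics.shares_accepted_total.reset();
        metrics.channel_shares_per_minute.reset();
    }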
pool-apps/pool/src/lib/monitoring.rs
let extended = self.get_client_extended_channels();
let standard = self.get_client_standard_channels();
Both of these functions loop over downstream clients to count channels and hashrate.
You could combine them to collect these stats in a single pass.
Additionally, if you collect all the clients in one pass and then gather the stats in a second pass, you can avoid nested locks as well.
I think adding further helper functions called internally by these two could become too convoluted for this purpose.
If you have a concrete suggestion to improve this, I'm open to it!
The idea was to replace those helpers:
fn get_client_summary(&self) -> ChannelsWithClientSummary {
    // Collect client references
    let clients: Vec<_> = self
        .channel_manager_data
        .safe_lock(|d| {
            d.downstream
                .values()
                .map(|client| client.downstream_data.clone())
                .collect()
        })
        .unwrap_or_default();

    // Count channels and hashrate
    let mut extended_channels = 0usize;
    let mut standard_channels = 0usize;
    let mut total_hashrate = 0.0f32;

    for downstream_data in clients {
        downstream_data
            .safe_lock(|dd| {
                for (_channel_id, extended_channel) in dd.extended_channels.iter() {
                    extended_channels += 1;
                    total_hashrate += extended_channel.get_nominal_hashrate();
                }
                for (_channel_id, standard_channel) in dd.standard_channels.iter() {
                    standard_channels += 1;
                    total_hashrate += standard_channel.get_nominal_hashrate();
                }
            })
            .unwrap_or_default();
    }

    ChannelsWithClientSummary {
        total_channels: extended_channels + standard_channels,
        extended_channels,
        standard_channels,
        total_hashrate,
    }
}
If the problem is the nested lock, we should probably remove it also from the other two method implementations (?)
I refactored the trait, and now it has only one method to implement, called get_clients().
There is also another method, get_clients_summary(), with a provided implementation that internally calls get_clients(), so the per-client iteration isn't duplicated.
Let me know your thoughts about this new structure.
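An illustrative shape of the refactor described above; all type, field, and trait names here are placeholders, not the PR's actual definitions.

    // Placeholder types; the real PR uses its own structures.
    pub struct ClientChannelInfo {
        pub extended_channels: usize,
        pub standard_channels: usize,
        pub hashrate: f32,
    }

    #[derive(Default)]
    pub struct ClientsSummary {
        pub total_channels: usize,
        pub extended_channels: usize,
        pub standard_channels: usize,
        pub total_hashrate: f32,
    }

    pub trait Monitoring {
        /// The only method implementors must provide: a snapshot of every
        /// connected client's channels.
        fn get_clients(&self) -> Vec<ClientChannelInfo>;

        /// Provided method: derives the summary from get_clients(), so the
        /// per-client iteration is written only once.
        fn get_clients_summary(&self) -> ClientsSummary {
            let mut summary = ClientsSummary::default();
            for client in self.get_clients() {
                summary.extended_channels += client.extended_channels;
                summary.standard_channels += client.standard_channels;
                summary.total_hashrate += client.hashrate;
            }
            summary.total_channels = summary.extended_channels + summary.standard_channels;
            summary
        }
    }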
Would it make sense to add some integration tests to assert the correctness of these new monitoring APIs? Deploy some apps, do HTTP requests against their endpoints, then parse the responses and assert against the data. I'd say this suggestion is not necessarily a blocker for this PR, since I'm doing some manual testing, but it might be a good idea for a follow-up: we might unintentionally break these new APIs in the future, so having some CI coverage feels like a good long-term strategy.
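A hypothetical follow-up test along those lines; it assumes an app is already listening on 127.0.0.1:9091 (spinning it up is elided) and that the metric names match those discussed earlier in this review.

    // Sketch only: relies on reqwest and tokio as dev-dependencies.
    #[tokio::test]
    async fn metrics_endpoint_exposes_channel_data() {
        let body = reqwest::get("http://127.0.0.1:9091/metrics")
            .await
            .expect("monitoring endpoint should be reachable")
            .text()
            .await
            .expect("response body should be valid text");

        // Prometheus text format: one line per labelled series.
        assert!(body.contains("sv2_client_channel_hashrate"));
    }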
Opened an issue for it: #161
@lucasbalieiro I adapted the docker configs in c26abb5. Please check that everything is correct there.
This PR introduces a new module called monitoring to the stratum-apps crate.

The new module defines the structures used to represent every channel's data already retrievable through the channels_sv2 getter methods. It also defines traits with methods to serve these data through HTTP APIs, which our applications implement.

It also adds support for Prometheus by exporting metrics at the /metrics endpoint.

Closes #105