-
Notifications
You must be signed in to change notification settings - Fork 18
Add configurable timeout for health check recency queries #964
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
a8658fc to
d4758a0
Compare
When the datastore is slow, health check requests can take a long time
to complete. Since Envoy doesn't cache in-flight health check results,
it sends additional concurrent requests while waiting, which compounds
the load on an already struggling datastore.
This adds a `timeout_ms` configuration option (default: 5000ms) that
sets a monotonic clock deadline on recency queries. Requests will fail
fast when the datastore is unresponsive, preventing the cascade of
concurrent health check requests.
Configuration example:
```yaml
health_check:
timeout_ms: 3000
data_recency_checks:
Widget:
timestamp_field: created_at
expected_max_recency_seconds: 30
d4758a0 to
d2fae4e
Compare
myronmarston
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since Envoy doesn't cache in-flight health check results, it sends additional concurrent requests while waiting, which compounds the load on an already struggling datastore.
A timeout is one way to improve things here but I think there are some other ways to improve things:
- IIRC, there's a way to setup caching at the API gateway layer so that the envoy doesn't send compounding requests. IIRC, there was a SEV related to this in 2024 or 2025 and I thought another team using ElasticGraph figured out a solution related to API gateway configuration. (I'm fuzzy on the details so maybe it wasn't an API gateway solution, but I know a team ran into this).
- I think it could make sense for
HealthChecker#check_healthto do its own caching internally with its own cache expiration logic. e.g. if performing adata_recency_checkwithexpected_max_recency_secondsof30and the most recent record is 2 seconds old, we should be able to respond for 28 more seconds without actually querying for a recent record again. Likewise, we could cache the cluster health checks for some period of time.
|
|
||
| # The deadline should be now_in_ms (1000) + timeout_ms (3000) = 4000 | ||
| # We verify this by checking the query was built with the timeout header | ||
| expect(datastore_query_body_by_type).to have_key("Widget") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how the presence of Widget in the query body demonstrates that the configured timeout_ms was passed as a monotonic_clock_deadline.
In fact, when I applied this change:
diff --git a/elasticgraph-health_check/lib/elastic_graph/health_check/health_checker.rb b/elasticgraph-health_check/lib/elastic_graph/health_check/health_checker.rb
index 9289966..38e2d78 100644
--- a/elasticgraph-health_check/lib/elastic_graph/health_check/health_checker.rb
+++ b/elasticgraph-health_check/lib/elastic_graph/health_check/health_checker.rb
@@ -92,7 +92,7 @@ module ElasticGraph
requested_fields: ["id", recency_config.timestamp_field],
document_pagination: {first: 1},
sort: [{recency_config.timestamp_field => {"order" => "desc"}}],
- monotonic_clock_deadline: monotonic_clock_deadline
+ # monotonic_clock_deadline: monotonic_clock_deadline
)
end
...nothing in this spec file failed.
| expect(datastore_query_body_by_type).to have_key("Widget") | ||
| end | ||
|
|
||
| it "uses the default timeout_ms of 5000 when not specified" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem like this test does what the description says.
| expect(datastore_query_body_by_type).to have_key("Widget") | ||
| end | ||
|
|
||
| it "allows configuring a custom timeout_ms value" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem like this test validates that the custom timeout value is used.
| end | ||
|
|
||
| describe "error handling" do | ||
| it "logs a warning and returns empty results when the datastore raises a GraphQL::ExecutionError" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what this has to do with the rest of this PR. You haven't implemented anything in this PR that rescues GraphQL::ExecutionError, logs a warning and returns an empty result in that case....so why test it?
Are you trying to validate what happens when the timeout is exceeded? If so, I think it would be better to test that in elasticgraph-health_check/spec/acceptance/health_checker_spec.rb. To deterministically cause a timeout you could have a test that sets the timeout_ms to 0 or 1, and then demonstrate the expected http response.
I think we can cache unhealthy values, too. We won't want to cache them for as long (e.g. 28 seconds in the example I had above for the healthy case) but I think it would be reasonable to cache them for a configurable length of time. Even caching them for 1 second could be useful--if you're hammering the cluster with hundreds of thousands of searches per second as your graph shows, caching an unhealthy value with a timeout of 1 second could make a big difference, with very little impact on how much longer the endpoint reports "unhealthy" (up to a second). |

When the datastore is slow, health check requests can take a long time
to complete. Since Envoy doesn't cache in-flight health check results,
it sends additional concurrent requests while waiting, which compounds
the load on an already struggling datastore.
This adds a
timeout_msconfiguration option (default: 5000ms) thatsets a monotonic clock deadline on recency queries. Requests will fail
fast when the datastore is unresponsive, preventing the cascade of
concurrent health check requests.
Configuration example: