
Conversation

@jwils (Collaborator) commented Dec 30, 2025

When the datastore is slow, health check requests can take a long time
to complete. Since Envoy doesn't cache in-flight health check results,
it sends additional concurrent requests while waiting, which compounds
the load on an already struggling datastore.

This adds a `timeout_ms` configuration option (default: 5000ms) that
sets a monotonic clock deadline on recency queries. Requests will fail
fast when the datastore is unresponsive, preventing the cascade of
concurrent health check requests.

Configuration example:

```yaml
health_check:
  timeout_ms: 3000
  data_recency_checks:
    Widget:
      timestamp_field: created_at
      expected_max_recency_seconds: 30
```
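
For illustration, a minimal sketch of how such a deadline could be derived from `timeout_ms` (the method name is illustrative, not the PR's actual implementation):

```ruby
# Hedged sketch: the deadline is the monotonic clock reading at request time plus
# the configured timeout. The 5000ms default comes from this PR's description.
def monotonic_clock_deadline(timeout_ms: 5000)
  now_in_ms = Process.clock_gettime(Process::CLOCK_MONOTONIC, :millisecond)
  now_in_ms + timeout_ms
end

monotonic_clock_deadline(timeout_ms: 3000) # => current monotonic ms + 3000
```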


@myronmarston (Collaborator) left a comment


Since Envoy doesn't cache in-flight health check results, it sends additional concurrent requests while waiting, which compounds the load on an already struggling datastore.

A timeout is one way to improve things here, but I think there are a few other options worth considering:

  • IIRC, there's a way to set up caching at the API gateway layer so that Envoy doesn't send compounding requests. IIRC, there was a SEV related to this in 2024 or 2025, and I thought another team using ElasticGraph figured out a solution related to API gateway configuration. (I'm fuzzy on the details, so maybe it wasn't an API gateway solution, but I know a team ran into this.)
  • I think it could make sense for HealthChecker#check_health to do its own caching internally with its own cache expiration logic. e.g. if performing a data_recency_check with expected_max_recency_seconds of 30 and the most recent record is 2 seconds old, we should be able to respond for 28 more seconds without actually querying for a recent record again. Likewise, we could cache the cluster health checks for some period of time.
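
For illustration, a rough sketch of that internal caching idea (the wrapper class, its collaborator, and the return shape are assumptions, not existing ElasticGraph APIs):

```ruby
# Hedged sketch: cache a recency-check result for as long as it cannot change.
# If the newest record is 2s old and expected_max_recency_seconds is 30, the
# check stays healthy for another 28s without querying the datastore again.
class CachedRecencyCheck
  # run_check: a callable returning [healthy, newest_record_age_seconds]
  def initialize(run_check)
    @run_check = run_check
    @cached_healthy = nil
    @expires_at = 0.0
  end

  def healthy?(expected_max_recency_seconds)
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    return @cached_healthy unless @cached_healthy.nil? || now >= @expires_at

    healthy, age_seconds = @run_check.call
    ttl = healthy ? [expected_max_recency_seconds - age_seconds, 0].max : 0
    @expires_at = now + ttl
    @cached_healthy = healthy
  end
end
```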


# The deadline should be now_in_ms (1000) + timeout_ms (3000) = 4000
# We verify this by checking the query was built with the timeout header
expect(datastore_query_body_by_type).to have_key("Widget")

I don't understand how the presence of `Widget` in the query body demonstrates that the configured `timeout_ms` was passed as a `monotonic_clock_deadline`.

In fact, when I applied this change:

```diff
diff --git a/elasticgraph-health_check/lib/elastic_graph/health_check/health_checker.rb b/elasticgraph-health_check/lib/elastic_graph/health_check/health_checker.rb
index 9289966..38e2d78 100644
--- a/elasticgraph-health_check/lib/elastic_graph/health_check/health_checker.rb
+++ b/elasticgraph-health_check/lib/elastic_graph/health_check/health_checker.rb
@@ -92,7 +92,7 @@ module ElasticGraph
           requested_fields: ["id", recency_config.timestamp_field],
           document_pagination: {first: 1},
           sort: [{recency_config.timestamp_field => {"order" => "desc"}}],
-          monotonic_clock_deadline: monotonic_clock_deadline
+          # monotonic_clock_deadline: monotonic_clock_deadline
         )
       end
```

...nothing in this spec file failed.
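
One way to make the spec catch that regression, sketched under the assumption that the test can spy on whatever builds the datastore queries (`query_factory`, `new_query`, and the 1000 + 3000 values are assumptions; the sum mirrors the now_in_ms + timeout_ms comment in the quoted test):

```ruby
# Hedged sketch: spy on the (assumed) query factory so the expectation fails if
# monotonic_clock_deadline is dropped from the built query.
allow(query_factory).to receive(:new_query).and_call_original

health_checker.check_health

expect(query_factory).to have_received(:new_query)
  .with(hash_including(monotonic_clock_deadline: 1000 + 3000))
```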

expect(datastore_query_body_by_type).to have_key("Widget")
end

it "uses the default timeout_ms of 5000 when not specified" do

It doesn't seem like this test does what the description says.

expect(datastore_query_body_by_type).to have_key("Widget")
end

it "allows configuring a custom timeout_ms value" do

It doesn't seem like this test validates that the custom timeout value is used.

end

describe "error handling" do
it "logs a warning and returns empty results when the datastore raises a GraphQL::ExecutionError" do

I don't understand what this has to do with the rest of this PR. You haven't implemented anything in this PR that rescues `GraphQL::ExecutionError`, logs a warning, and returns an empty result in that case... so why test it?

Are you trying to validate what happens when the timeout is exceeded? If so, I think it would be better to test that in elasticgraph-health_check/spec/acceptance/health_checker_spec.rb. To deterministically cause a timeout, you could have a test that sets `timeout_ms` to 0 or 1, and then demonstrate the expected HTTP response.
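
For example, a hedged sketch of such an acceptance test (the config-override and request helpers are assumptions, not the existing spec's API):

```ruby
# Hedged sketch: force the deadline to be exceeded and assert on the HTTP response.
it "reports unhealthy when the datastore cannot respond within timeout_ms" do
  with_health_check_config(timeout_ms: 0) do # hypothetical helper overriding the config
    response = perform_health_check_request  # hypothetical helper issuing the HTTP request
    expect(response.status_code).to eq(500)  # or whichever status the endpoint uses for "unhealthy"
  end
end
```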

@jwils (Collaborator, Author) commented Dec 30, 2025

IIRC, there's a way to set up caching at the API gateway layer so that Envoy doesn't send compounding requests. IIRC, there was a SEV related to this in 2024 or 2025, and I thought another team using ElasticGraph figured out a solution related to API gateway configuration. (I'm fuzzy on the details, so maybe it wasn't an API gateway solution, but I know a team ran into this.)

I believe this code just caches the response at the API gateway layer. That helps significantly, but I don't think it solves the issue of compounding requests: when the cache is invalid, multiple simultaneous requests can be sent until there is a new cached value. Looking into this even more, I think it doesn't work at all when the cluster is unhealthy, since it doesn't appear to cache non-2xx responses.

I think it could make sense for HealthChecker#check_health to do its own caching internally with its own cache expiration logic. e.g. if performing a data_recency_check with expected_max_recency_seconds of 30 and the most recent record is 2 seconds old, we should be able to respond for 28 more seconds without actually querying for a recent record again. Likewise, we could cache the cluster health checks for some period of time.

This seems like a potentially good solution, although I guess it doesn't solve the unhealthy case either, since we don't want to cache unhealthy values.

Regarding the tests: apologies there. I didn't review them after generating them. I'll update them more carefully once I have a plan that might actually work.

I'm not sure this is actually an issue, since the cluster became unhealthy for another reason, but I see a metric where the number of requests drops way down while the search rate stays significantly elevated. I think that elevation is from the health checks.

[Screenshot 2025-12-30 at 9:04 AM: metrics graph showing request volume dropping while the search rate stays elevated]

My thought process is that this may slow down recovery.

@myronmarston (Collaborator) commented

This seems like a potentially good solution, although I guess it doesn't solve the unhealthy case either, since we don't want to cache unhealthy values.

I think we can cache unhealthy values, too. We won't want to cache them for as long (e.g. 28 seconds in the example I had above for the healthy case) but I think it would be reasonable to cache them for a configurable length of time. Even caching them for 1 second could be useful--if you're hammering the cluster with hundreds of thousands of searches per second as your graph shows, caching an unhealthy value with a timeout of 1 second could make a big difference, with very little impact on how much longer the endpoint reports "unhealthy" (up to a second).
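
For illustration, a tiny sketch of that TTL rule (the names and the 1-second unhealthy default are assumptions):

```ruby
# Hedged sketch: cache healthy results until they could become stale; cache unhealthy
# results for a short, configurable window so repeated checks don't pile on the cluster.
def cache_ttl_seconds(healthy:, seconds_until_stale:, unhealthy_ttl_seconds: 1)
  healthy ? [seconds_until_stale, 0].max : unhealthy_ttl_seconds
end

cache_ttl_seconds(healthy: true, seconds_until_stale: 28)  # => 28
cache_ttl_seconds(healthy: false, seconds_until_stale: 0)  # => 1
```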
