
Conversation

@jwils (Collaborator) commented Dec 30, 2025

When the datastore is slow, health check requests can take a long time
to complete. Since Envoy doesn't cache in-flight health check results,
it sends additional concurrent requests while waiting, which compounds
the load on an already struggling datastore.

This adds a `timeout_ms` configuration option (default: 5000ms) that
sets a monotonic clock deadline on recency queries. Requests will fail
fast when the datastore is unresponsive, preventing the cascade of
concurrent health check requests.

Configuration example:

```yaml
health_check:
  timeout_ms: 3000
  data_recency_checks:
    Widget:
      timestamp_field: created_at
      expected_max_recency_seconds: 30
```
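
For illustration, a minimal sketch of how such a deadline could be derived from `timeout_ms` (the method name is illustrative, not the PR's actual implementation):

```ruby
# Hedged sketch: the deadline is the monotonic clock reading at request time plus
# the configured timeout. The 5000ms default comes from this PR's description.
def monotonic_clock_deadline(timeout_ms: 5000)
  now_in_ms = Process.clock_gettime(Process::CLOCK_MONOTONIC, :millisecond)
  now_in_ms + timeout_ms
end

monotonic_clock_deadline(timeout_ms: 3000) # => current monotonic ms + 3000
```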


@myronmarston (Collaborator) left a comment


Since Envoy doesn't cache in-flight health check results, it sends additional concurrent requests while waiting, which compounds the load on an already struggling datastore.

A timeout is one way to improve things here, but I think there are a few other options worth considering:

  • IIRC, there's a way to set up caching at the API gateway layer so that Envoy doesn't send compounding requests. IIRC, there was a SEV related to this in 2024 or 2025, and I thought another team using ElasticGraph figured out a solution related to API gateway configuration. (I'm fuzzy on the details, so maybe it wasn't an API gateway solution, but I know a team ran into this.)
  • I think it could make sense for HealthChecker#check_health to do its own caching internally with its own cache expiration logic. e.g. if performing a data_recency_check with expected_max_recency_seconds of 30 and the most recent record is 2 seconds old, we should be able to respond for 28 more seconds without actually querying for a recent record again. Likewise, we could cache the cluster health checks for some period of time.
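
For illustration, a rough sketch of that internal caching idea (the wrapper class, its collaborator, and the return shape are assumptions, not existing ElasticGraph APIs):

```ruby
# Hedged sketch: cache a recency-check result for as long as it cannot change.
# If the newest record is 2s old and expected_max_recency_seconds is 30, the
# check stays healthy for another 28s without querying the datastore again.
class CachedRecencyCheck
  # run_check: a callable returning [healthy, newest_record_age_seconds]
  def initialize(run_check)
    @run_check = run_check
    @cached_healthy = nil
    @expires_at = 0.0
  end

  def healthy?(expected_max_recency_seconds)
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    return @cached_healthy unless @cached_healthy.nil? || now >= @expires_at

    healthy, age_seconds = @run_check.call
    ttl = healthy ? [expected_max_recency_seconds - age_seconds, 0].max : 0
    @expires_at = now + ttl
    @cached_healthy = healthy
  end
end
```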


# The deadline should be now_in_ms (1000) + timeout_ms (3000) = 4000
# We verify this by checking the query was built with the timeout header
expect(datastore_query_body_by_type).to have_key("Widget")

I don't understand how the presence of `Widget` in the query body demonstrates that the configured `timeout_ms` was passed as a `monotonic_clock_deadline`.

In fact, when I applied this change:

```diff
diff --git a/elasticgraph-health_check/lib/elastic_graph/health_check/health_checker.rb b/elasticgraph-health_check/lib/elastic_graph/health_check/health_checker.rb
index 9289966..38e2d78 100644
--- a/elasticgraph-health_check/lib/elastic_graph/health_check/health_checker.rb
+++ b/elasticgraph-health_check/lib/elastic_graph/health_check/health_checker.rb
@@ -92,7 +92,7 @@ module ElasticGraph
           requested_fields: ["id", recency_config.timestamp_field],
           document_pagination: {first: 1},
           sort: [{recency_config.timestamp_field => {"order" => "desc"}}],
-          monotonic_clock_deadline: monotonic_clock_deadline
+          # monotonic_clock_deadline: monotonic_clock_deadline
         )
       end
```

...nothing in this spec file failed.
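
One way to make the spec catch that regression, sketched under the assumption that the test can spy on whatever builds the datastore queries (`query_factory`, `new_query`, and the 1000 + 3000 values are assumptions; the sum mirrors the now_in_ms + timeout_ms comment in the quoted test):

```ruby
# Hedged sketch: spy on the (assumed) query factory so the expectation fails if
# monotonic_clock_deadline is dropped from the built query.
allow(query_factory).to receive(:new_query).and_call_original

health_checker.check_health

expect(query_factory).to have_received(:new_query)
  .with(hash_including(monotonic_clock_deadline: 1000 + 3000))
```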

expect(datastore_query_body_by_type).to have_key("Widget")
end

it "uses the default timeout_ms of 5000 when not specified" do

It doesn't seem like this test does what the description says.

expect(datastore_query_body_by_type).to have_key("Widget")
end

it "allows configuring a custom timeout_ms value" do

It doesn't seem like this test validates that the custom timeout value is used.

end

describe "error handling" do
it "logs a warning and returns empty results when the datastore raises a GraphQL::ExecutionError" do

I don't understand what this has to do with the rest of this PR. You haven't implemented anything in this PR that rescues `GraphQL::ExecutionError`, logs a warning, and returns an empty result in that case... so why test it?

Are you trying to validate what happens when the timeout is exceeded? If so, I think it would be better to test that in elasticgraph-health_check/spec/acceptance/health_checker_spec.rb. To deterministically cause a timeout, you could have a test that sets `timeout_ms` to 0 or 1, and then demonstrate the expected HTTP response.
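
For example, a hedged sketch of such an acceptance test (the config-override and request helpers are assumptions, not the existing spec's API):

```ruby
# Hedged sketch: force the deadline to be exceeded and assert on the HTTP response.
it "reports unhealthy when the datastore cannot respond within timeout_ms" do
  with_health_check_config(timeout_ms: 0) do # hypothetical helper overriding the config
    response = perform_health_check_request  # hypothetical helper issuing the HTTP request
    expect(response.status_code).to eq(500)  # or whichever status the endpoint uses for "unhealthy"
  end
end
```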

@jwils (Collaborator, Author) commented Dec 30, 2025

IIRC, there's a way to set up caching at the API gateway layer so that Envoy doesn't send compounding requests. IIRC, there was a SEV related to this in 2024 or 2025, and I thought another team using ElasticGraph figured out a solution related to API gateway configuration. (I'm fuzzy on the details, so maybe it wasn't an API gateway solution, but I know a team ran into this.)

I believe this code just caches the response at the API gateway layer. That helps significantly, but I don't think it solves the issue of compounding requests: when the cache is invalid, multiple simultaneous requests can be sent until there is a new cached value. Looking into this even more, I think it doesn't work at all when the cluster is unhealthy, since it doesn't appear to cache non-2xx responses.

I think it could make sense for HealthChecker#check_health to do its own caching internally with its own cache expiration logic. e.g. if performing a data_recency_check with expected_max_recency_seconds of 30 and the most recent record is 2 seconds old, we should be able to respond for 28 more seconds without actually querying for a recent record again. Likewise, we could cache the cluster health checks for some period of time.

This seems like a potentially good solution, although I guess it doesn't solve the unhealthy case either, since we don't want to cache unhealthy values.

Regarding the tests: apologies there. I didn't review them after generating them. I'll update them more carefully once I have a plan that might actually work.

I'm not sure this is actually an issue, since the cluster became unhealthy for another reason, but I see a metric where the number of requests drops way down while the search rate stays significantly elevated. I think that elevation is from the health checks.

[Screenshot 2025-12-30 at 9:04 AM: metrics graph showing request volume dropping while the search rate stays elevated]

My thought process is that this may slow down recovery.

@myronmarston (Collaborator) commented

This seems like a potentially good solution, although I guess it doesn't solve the unhealthy case either, since we don't want to cache unhealthy values.

I think we can cache unhealthy values, too. We won't want to cache them for as long (e.g. 28 seconds in the example I had above for the healthy case) but I think it would be reasonable to cache them for a configurable length of time. Even caching them for 1 second could be useful--if you're hammering the cluster with hundreds of thousands of searches per second as your graph shows, caching an unhealthy value with a timeout of 1 second could make a big difference, with very little impact on how much longer the endpoint reports "unhealthy" (up to a second).
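
For illustration, a tiny sketch of that TTL rule (the names and the 1-second unhealthy default are assumptions):

```ruby
# Hedged sketch: cache healthy results until they could become stale; cache unhealthy
# results for a short, configurable window so repeated checks don't pile on the cluster.
def cache_ttl_seconds(healthy:, seconds_until_stale:, unhealthy_ttl_seconds: 1)
  healthy ? [seconds_until_stale, 0].max : unhealthy_ttl_seconds
end

cache_ttl_seconds(healthy: true, seconds_until_stale: 28)  # => 28
cache_ttl_seconds(healthy: false, seconds_until_stale: 0)  # => 1
```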
