
Connection timeouts #37

@ThomWright


Bug description

Symptoms

  1. gRPC requests take too long to time out when Kubernetes network policies are misconfigured.
  2. The connection_timeout_is_not_fatal test takes ~75 seconds to finish.

Well-configured timeouts are important for system stability. Requests which take too long can hog resources and block other work from happening.

Causes

I can see two separate timeout problems:

  1. DNS resolution – when ResolutionStrategy::Lazy is used, there is currently no way to apply a timeout just for DNS resolution. If DNS never resolves, requests never complete.
  2. TCP connection – there is currently no way to set a connection timeout. On my machine, the socket seems to time out after ~75s. Even when we do set a connection timeout, tonic doesn't use it!

Even though we're setting our own fairly short timeouts around the overall request, I've seen some strange behaviour where requests are hanging for a long time. I think there's still something else going on that I don't understand, but I expect addressing the two points above will be generally helpful anyway.
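For reference, this is roughly the workaround we have today: a blanket deadline around each call (a minimal sketch; `GreeterClient`, `say_hello` and the message types are placeholders standing in for any tonic-generated client):

```rust
use std::time::Duration;
use tokio::time::timeout;
use tonic::{Request, Status};

// `GreeterClient`, `HelloRequest` and `HelloReply` stand in for any
// tonic-generated client and its message types.
async fn call_with_deadline(
    client: &mut GreeterClient<tonic::transport::Channel>,
) -> Result<HelloReply, Status> {
    // Wrap the whole call, including DNS resolution and TCP connect on a
    // cold channel, in one deadline so a hung connection can't block forever.
    timeout(
        Duration::from_secs(5),
        client.say_hello(Request::new(HelloRequest::default())),
    )
    .await
    .map_err(|_| Status::deadline_exceeded("request timed out"))?
    .map(|response| response.into_inner())
}
```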

To Reproduce

For the TCP connection timeout, just run the tests. I'll supply a test for lazy DNS resolution timeouts in a separate PR.

Expected behavior

Ability to control timeouts for TCP connections and DNS resolution.

Environment

  • OS: macOS
  • Rust version: rustc 1.65.0 (897e37553 2022-11-02)

Additional context

Solutions

The TCP connection timeout is simpler to solve (though I will admit it took me a long time to find): we just need to set connect_timeout in the right places. First, tonic doesn't respect connect_timeout, which will be fixed by hyperium/tonic#1215. Once that is merged, we can build our own connect_timeout option on top of it in #38.
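For illustration, this is where the setting lives on tonic's side (a minimal sketch; the endpoint address is a placeholder, and connect_timeout only takes effect once hyperium/tonic#1215 is in):

```rust
use std::time::Duration;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), tonic::transport::Error> {
    // connect_timeout bounds TCP connection establishment; today tonic
    // silently ignores it, which hyperium/tonic#1215 fixes.
    let _channel = Endpoint::from_static("http://my-service:50051")
        .connect_timeout(Duration::from_secs(2))
        .timeout(Duration::from_secs(5)) // per-request deadline, a separate knob
        .connect()
        .await?;
    Ok(())
}
```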

DNS resolution is harder. There are currently two options:

  1. Lazy resolution seems to be the default, but it also seems very hard to put a timeout around (at least, I can't think of a way!). It's possible to put an overall timeout around every request in the application, but that is both overly broad and error-prone. It's very easy to forget, or simply not know that it's needed ("why don't the other timeouts take care of it?").
  2. Eager resolution has a timeout option. In practice, using this will probably mean doing DNS resolution on service startup, when the LoadBalancedChannel is created (see the sketch after this list). This might be a good thing, preventing services from successfully starting when DNS would never resolve.

Of the two, I wonder if we should favour Eager resolution, and consider changing the default to this.
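For illustration, eager resolution would look something like this (a minimal sketch; the hostname and port are placeholders, and I'm assuming ginepro's ResolutionStrategy::Eager variant carries the timeout option mentioned above):

```rust
use std::time::Duration;
use ginepro::{LoadBalancedChannel, ResolutionStrategy};

#[tokio::main]
async fn main() {
    // Resolve DNS up front when the channel is built. If resolution doesn't
    // complete within the timeout, startup fails fast instead of requests
    // hanging later.
    let _channel = LoadBalancedChannel::builder(("my-service", 50051_u16))
        .resolution_strategy(ResolutionStrategy::Eager {
            timeout: Duration::from_secs(5),
        })
        .channel()
        .await
        .expect("DNS for my-service did not resolve at startup");
}
```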

However, we might want a third option: Active lazy resolution (for want of a better name). Lazy resolution is currently passive, as in it happens in the background on a schedule. It is never actively called in the request flow, which is why it's hard to put a timeout around. Instead, could we implement something which actively calls probe_once() (with a timeout!) as part of the first request (or alternatively when GrpcServiceProbe.endpoints is empty)? This could give us lazy DNS resolution, but with timeouts.
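Something like this, roughly (a hypothetical sketch; the types below are stand-ins for ginepro's internals, and the real fields and signatures will differ):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Stand-ins for ginepro's internal probe; real fields and signatures differ.
struct GrpcServiceProbe {
    endpoints: Vec<std::net::SocketAddr>,
}

#[derive(Debug)]
enum ResolutionError {
    Timeout,
}

impl GrpcServiceProbe {
    async fn probe_once(&mut self) -> Result<(), ResolutionError> {
        // ... resolve DNS and update self.endpoints ...
        Ok(())
    }
}

// The "active lazy" idea: if no endpoints are known yet, run one probe with
// a deadline as part of the first request, instead of waiting for the
// background schedule to populate the endpoint set.
async fn ensure_endpoints(probe: &mut GrpcServiceProbe) -> Result<(), ResolutionError> {
    if probe.endpoints.is_empty() {
        timeout(Duration::from_secs(5), probe.probe_once())
            .await
            .map_err(|_| ResolutionError::Timeout)??;
    }
    Ok(())
}
```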

Scratch that, I took a different approach: tower-rs/tower#715. EDIT: Nope, that hasn't worked out. Back to the drawing board.
