
Connection timeouts #37

@ThomWright


Bug description

Symptoms

  1. gRPC requests take too long to time out when Kubernetes network policies are misconfigured.
  2. The connection_timeout_is_not_fatal test takes ~75 seconds to finish.

Well-configured timeouts are important for system stability. Requests which take too long can hog resources and block other work from happening.

Causes

I can see two separate timeout problems:

  1. DNS resolution – when ResolutionStrategy::Lazy is used, there is currently no way to apply a timeout just for DNS resolution. If DNS never resolves, requests never complete.
  2. TCP connection – there is currently no way to set a connection timeout. On my machine, the socket seems to time out after ~75s. Even when we do set a connection timeout, tonic doesn't use it!

Even though we're setting our own fairly short timeouts around the overall request, I've seen some strange behaviour where requests are hanging for a long time. I think there's still something else going on that I don't understand, but I expect addressing the two points above will be generally helpful anyway.
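For reference, this is roughly the workaround we have today: a blanket deadline around each call (a minimal sketch; `GreeterClient`, `say_hello` and the message types are placeholders standing in for any tonic-generated client):

```rust
use std::time::Duration;
use tokio::time::timeout;
use tonic::{Request, Status};

// `GreeterClient`, `HelloRequest` and `HelloReply` stand in for any
// tonic-generated client and its message types.
async fn call_with_deadline(
    client: &mut GreeterClient<tonic::transport::Channel>,
) -> Result<HelloReply, Status> {
    // Wrap the whole call, including DNS resolution and TCP connect on a
    // cold channel, in one deadline so a hung connection can't block forever.
    timeout(
        Duration::from_secs(5),
        client.say_hello(Request::new(HelloRequest::default())),
    )
    .await
    .map_err(|_| Status::deadline_exceeded("request timed out"))?
    .map(|response| response.into_inner())
}
```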

To Reproduce

For the TCP connection timeout, just run the tests. I'll supply a test for lazy DNS resolution timeouts in a separate PR.

Expected behavior

Ability to control timeouts for TCP connections and DNS resolution.

Environment

  • OS: macOS
  • Rust version: rustc 1.65.0 (897e37553 2022-11-02)

Additional context

Solutions

The TCP connection timeout is simpler to solve (though I will admit it took me a long time to find): we just need to set connect_timeout in the right places. First, tonic doesn't respect connect_timeout, which will be fixed by hyperium/tonic#1215. Once that is merged, we can build our own connect_timeout option on top of it in #38.
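For illustration, this is where the setting lives on tonic's side (a minimal sketch; the endpoint address is a placeholder, and connect_timeout only takes effect once hyperium/tonic#1215 is in):

```rust
use std::time::Duration;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), tonic::transport::Error> {
    // connect_timeout bounds TCP connection establishment; today tonic
    // silently ignores it, which hyperium/tonic#1215 fixes.
    let _channel = Endpoint::from_static("http://my-service:50051")
        .connect_timeout(Duration::from_secs(2))
        .timeout(Duration::from_secs(5)) // per-request deadline, a separate knob
        .connect()
        .await?;
    Ok(())
}
```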

DNS resolution is harder. There are currently two options:

  1. Lazy resolution seems to be the default, but it also seems very hard to put a timeout around (at least, I can't think of a way!). It's possible to put an overall timeout around every request in the application, but that is both overly broad and error-prone. It's very easy to forget, or simply not know that it's needed ("why don't the other timeouts take care of it?").
  2. Eager resolution has a timeout option. In practice, using this will probably mean doing DNS resolution on service startup, when the LoadBalancedChannel is created (see the sketch after this list). This might be a good thing, preventing services from successfully starting when DNS would never resolve.

Of the two, I wonder if we should favour Eager resolution, and consider changing the default to this.
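For illustration, eager resolution would look something like this (a minimal sketch; the hostname and port are placeholders, and I'm assuming ginepro's ResolutionStrategy::Eager variant carries the timeout option mentioned above):

```rust
use std::time::Duration;
use ginepro::{LoadBalancedChannel, ResolutionStrategy};

#[tokio::main]
async fn main() {
    // Resolve DNS up front when the channel is built. If resolution doesn't
    // complete within the timeout, startup fails fast instead of requests
    // hanging later.
    let _channel = LoadBalancedChannel::builder(("my-service", 50051_u16))
        .resolution_strategy(ResolutionStrategy::Eager {
            timeout: Duration::from_secs(5),
        })
        .channel()
        .await
        .expect("DNS for my-service did not resolve at startup");
}
```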

However, we might want a third option: Active lazy resolution (for want of a better name). Lazy resolution is currently passive, as in it happens in the background on a schedule. It is never actively called in the request flow, which is why it's hard to put a timeout around. Instead, could we implement something which actively calls probe_once() (with a timeout!) as part of the first request (or alternatively when GrpcServiceProbe.endpoints is empty)? This could give us lazy DNS resolution, but with timeouts.
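Something like this, roughly (a hypothetical sketch; the types below are stand-ins for ginepro's internals, and the real fields and signatures will differ):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Stand-ins for ginepro's internal probe; real fields and signatures differ.
struct GrpcServiceProbe {
    endpoints: Vec<std::net::SocketAddr>,
}

#[derive(Debug)]
enum ResolutionError {
    Timeout,
}

impl GrpcServiceProbe {
    async fn probe_once(&mut self) -> Result<(), ResolutionError> {
        // ... resolve DNS and update self.endpoints ...
        Ok(())
    }
}

// The "active lazy" idea: if no endpoints are known yet, run one probe with
// a deadline as part of the first request, instead of waiting for the
// background schedule to populate the endpoint set.
async fn ensure_endpoints(probe: &mut GrpcServiceProbe) -> Result<(), ResolutionError> {
    if probe.endpoints.is_empty() {
        timeout(Duration::from_secs(5), probe.probe_once())
            .await
            .map_err(|_| ResolutionError::Timeout)??;
    }
    Ok(())
}
```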

Scratch that, I took a different approach: tower-rs/tower#715. EDIT: Nope, that hasn't worked out. Back to the drawing board.
