
Conversation


@rae-306 rae-306 commented Aug 18, 2025

Ingestion failed for us because of our bumpy network.

It turns out that the retry strategy in stream_http_lines is quite basic: retry 3 times and sleep 1 second in between.

This PR improves the retry strategy of the stream_http_lines function by defining a global retry strategy (exponential backoff) at the level of the httpx_session client defined in yente.data.util.
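
The gist of the change, as a minimal sketch rather than the PR's actual code (the PR wires in the httpx-retries package; the hand-rolled transport below just illustrates the same idea):

```python
# Sketch only: a transport wrapper that retries failed requests with
# exponential backoff, installed once on the shared client so every
# caller of httpx_session benefits.
import asyncio
import httpx

class RetryingTransport(httpx.AsyncBaseTransport):
    def __init__(self, inner: httpx.AsyncBaseTransport, retries: int = 5, base_delay: float = 0.5):
        self.inner = inner
        self.retries = retries
        self.base_delay = base_delay

    async def handle_async_request(self, request: httpx.Request) -> httpx.Response:
        for attempt in range(self.retries + 1):
            try:
                return await self.inner.handle_async_request(request)
            except httpx.TransportError:
                if attempt == self.retries:
                    raise
                # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
                await asyncio.sleep(self.base_delay * (2 ** attempt))
        raise AssertionError("unreachable")

def make_session() -> httpx.AsyncClient:
    transport = RetryingTransport(httpx.AsyncHTTPTransport())
    return httpx.AsyncClient(transport=transport, timeout=httpx.Timeout(30.0))
```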

@rae-306 rae-306 marked this pull request as ready for review August 18, 2025 12:38
@rae-306 rae-306 marked this pull request as draft August 18, 2025 12:48
@rae-306 rae-306 marked this pull request as ready for review August 18, 2025 16:50
@rae-306 rae-306 changed the title from "Fix for retry strategy in httpx transport" to "Improved retry strategy in httpx transport" Aug 18, 2025
@leonhandreke
Contributor

Thanks for bringing this up! I have a few thoughts on this:

First, I think the retry semantics on the streaming HTTP helpers are currently broken. I don't think the retry does any Range-request magic, so we'll just get the same source lines yielded multiple times. That doesn't matter too much for us since the indexing is (hopefully) idempotent, but still. I think the right thing to do here is to bubble the error up to a higher level and do something more useful there: start streaming again from a clean slate or fall back to non-streaming data retrieval.
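
For illustration, the kind of Range-request resumption I mean would look roughly like this (a hypothetical sketch, not code from yente, and it only works if the server honours Range headers on the dataset file):

```python
# Hypothetical sketch: resume a byte stream from a known offset using an
# HTTP Range header. Requires server-side support for range requests.
import httpx

async def stream_from(client: httpx.AsyncClient, url: str, offset: int = 0):
    """Yield (next_offset, chunk) pairs so the caller knows where to resume."""
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    async with client.stream("GET", url, headers=headers) as resp:
        resp.raise_for_status()
        async for chunk in resp.aiter_bytes():
            offset += len(chunk)
            yield offset, chunk
```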

Second, I'm a bit hesitant about introducing a dependency on httpx-retries. I realize there is little rocket science going on inside that package, but the GitHub repo only has tens of stars, so it seems to be a bit of a niche package. The HTTPX docs recommend the tenacity package instead.
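
For comparison, the tenacity version of an exponential-backoff retry on transport errors is just a decorator (a sketch, not tied to yente's code):

```python
# Sketch: retry on httpx transport errors with exponential backoff,
# using tenacity as suggested by the HTTPX documentation.
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(httpx.TransportError),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=30),
)
async def fetch(client: httpx.AsyncClient, url: str) -> bytes:
    resp = await client.get(url)
    resp.raise_for_status()
    return resp.content
```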

While we're working on this, if you're on a bumpy connection, have you considered setting YENTE_STREAM_LOAD to false? It might just solve your problem the easy way for the moment.

Looking forward to hearing your thoughts!

@rae-306
Author

rae-306 commented Aug 19, 2025

Hi Leon, thanks for your response. After hours of debugging, we finally managed to get it working. Our bumpy network would cause a TransportError every now and then while we were loading the "default" dataset with 4.1M entries.
We discovered two main issues in the code, for which I intend to submit a PR. Perhaps it is better to close this one and start fresh. Anyhow, the two bugs are:

  1. stream_http_lines should retry 3 times. However, if client.stream() throws a TransportError the first time but works correctly the second time, the retry counter is not reset. This means that if a TransportError is thrown more than three times in total, not necessarily consecutively (for instance, if there is a small hiccup in the network every 5 minutes), the retry gives up completely.

  2. As you mention, there is no "range-request" magic in the dataset JSON. So if a TransportError is thrown, the stream starts again from the beginning of the JSON. Of course, the process is idempotent: the doc count in ES stays constant and, until it catches up with where it left off, every document it inserts simply replaces one that is already there. However, there are two problems with this:
    a) Your counter from yente.search.indexer ("Index: ... entities...") is not reset and just continues to grow.
    b) In our case, if the network is bumpy enough, this process goes on and on. Suppose there is a small hiccup in the network every few minutes; then loading the full JSON would take forever.
    What we did was change the stream_http_lines function as follows. The function keeps track of the "last_id", and on a retry it reads the JSON until it finds the last_id and only yields from that point on (see the sketch below). Since reading JSON is way faster than inserting into Elasticsearch, this speeds things up enormously. The counter also shows the correct count now :)
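
A rough sketch of that approach (hypothetical code, not our actual patch; it assumes each JSON line is an entity object with an "id" field):

```python
# Sketch: remember the last entity id that was yielded and, after a
# TransportError, fast-forward to it before yielding again. The retry
# budget is reset as soon as data flows again.
import asyncio
import json
import httpx

MAX_RETRIES = 3

async def stream_http_lines(client: httpx.AsyncClient, url: str):
    last_id: str | None = None
    retries = 0
    while True:
        skipping = last_id is not None
        try:
            async with client.stream("GET", url) as resp:
                resp.raise_for_status()
                async for line in resp.aiter_lines():
                    if not line:
                        continue
                    entity_id = json.loads(line).get("id")
                    if skipping:
                        # Skip lines that were already yielded before the failure.
                        if entity_id == last_id:
                            skipping = False
                        continue
                    retries = 0  # data is flowing again, reset the retry budget
                    last_id = entity_id
                    yield line
            return
        except httpx.TransportError:
            retries += 1
            if retries > MAX_RETRIES:
                raise
            await asyncio.sleep(2 ** retries)  # simple exponential backoff
```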

@leonhandreke
Contributor

leonhandreke commented Aug 25, 2025

  1. stream_http_lines should retry 3 times. However, if client.stream() throws a TransportError the first time but works correctly the second time, the retry counter is not reset. This means that if a TransportError is thrown more than three times in total, not necessarily consecutively (for instance, if there is a small hiccup in the network every 5 minutes), the retry gives up completely.

IIUC, you would like the retry counter to be reset if at least one byte flows in again after a TransportError was thrown. While I see why that helps in your case, I also feel like this isn't really a bug. If max retries is 3, I'd expect the thing to be retried at most three times, even if it fails at different points.

Does httpx-retries do that? I'd be a bit surprised...

  2. As you mention, there is no "range-request" magic in the dataset JSON. So if a TransportError is thrown, the stream starts again from the beginning of the JSON. Of course, the process is idempotent: the doc count in ES stays constant and, until it catches up with where it left off, every document it inserts simply replaces one that is already there. However, there are two problems with this:
    a) Your counter from yente.search.indexer ("Index: ... entities...") is not reset and just continues to grow.
    b) In our case, if the network is bumpy enough, this process goes on and on. Suppose there is a small hiccup in the network every few minutes; then loading the full JSON would take forever.
    What we did was change the stream_http_lines function as follows. The function keeps track of the "last_id", and on a retry it reads the JSON until it finds the last_id and only yields from that point on. Since reading JSON is way faster than inserting into Elasticsearch, this speeds things up enormously. The counter also shows the correct count now :)

Yeah, I guess the right thing to do here would be to reset the counter and retry in the indexer, but that's a lot of work touching a few layers for something that hasn't really been that much of an issue so far.

How do you feel about #857? It tries to stream only once and then falls back to fetching. Does that solve your issue? Related, have you tried setting YENTE_STREAM_LOAD to false? Maybe that's the best fix, given your difficult networking situation.
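
For illustration, the general shape of that fallback, as a sketch rather than the actual code in #857: try the streaming path once, and on a transport error fetch the whole file and iterate over it locally (duplicated lines are tolerable because indexing is idempotent).

```python
# Sketch of a stream-once-then-fetch fallback; not the code in #857.
# For brevity this buffers the whole body on the fallback path.
import httpx

async def load_lines(client: httpx.AsyncClient, url: str):
    try:
        async with client.stream("GET", url) as resp:
            resp.raise_for_status()
            async for line in resp.aiter_lines():
                yield line
    except httpx.TransportError:
        # Streaming failed part-way; start over from a full local copy.
        resp = await client.get(url)
        resp.raise_for_status()
        for line in resp.text.splitlines():
            yield line
```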
