
Conversation


@rae-306 rae-306 commented Aug 18, 2025

Ingestion failed for us because of our bumpy network.

It turns out that the retry strategy in stream_http_lines is quite basic: retry 3 times and sleep 1 second in between.

This PR improves the retry strategy of the stream_http_lines function by defining a global retry strategy (exponential backoff) at the level of the httpx_session client defined in yente.data.util.
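
The gist of the change, as a minimal sketch rather than the PR's actual code (the PR wires in the httpx-retries package; the hand-rolled transport below just illustrates the same idea):

```python
# Sketch only: a transport wrapper that retries failed requests with
# exponential backoff, installed once on the shared client so every
# caller of httpx_session benefits.
import asyncio
import httpx

class RetryingTransport(httpx.AsyncBaseTransport):
    def __init__(self, inner: httpx.AsyncBaseTransport, retries: int = 5, base_delay: float = 0.5):
        self.inner = inner
        self.retries = retries
        self.base_delay = base_delay

    async def handle_async_request(self, request: httpx.Request) -> httpx.Response:
        for attempt in range(self.retries + 1):
            try:
                return await self.inner.handle_async_request(request)
            except httpx.TransportError:
                if attempt == self.retries:
                    raise
                # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
                await asyncio.sleep(self.base_delay * (2 ** attempt))
        raise AssertionError("unreachable")

def make_session() -> httpx.AsyncClient:
    transport = RetryingTransport(httpx.AsyncHTTPTransport())
    return httpx.AsyncClient(transport=transport, timeout=httpx.Timeout(30.0))
```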

@rae-306 rae-306 marked this pull request as ready for review August 18, 2025 12:38
@rae-306 rae-306 marked this pull request as draft August 18, 2025 12:48
@rae-306 rae-306 marked this pull request as ready for review August 18, 2025 16:50
@rae-306 rae-306 changed the title from "Fix for retry strategy in httpx transport" to "Improved retry strategy in httpx transport" Aug 18, 2025
@leonhandreke
Contributor

Thanks for bringing this up! I have a few thoughts on this:

First, I think the retry semantics on the streaming HTTP helpers are currently broken. I don't think the retry does any Range-request magic, so we'll just get the same source lines yielded multiple times. That doesn't matter too much for us since the indexing is (hopefully) idempotent, but still. I think the right thing to do here is to bubble the error up to a higher level and do something more useful there: start streaming again from a clean slate or fall back to non-streaming data retrieval.
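
For illustration, the kind of Range-request resumption I mean would look roughly like this (a hypothetical sketch, not code from yente, and it only works if the server honours Range headers on the dataset file):

```python
# Hypothetical sketch: resume a byte stream from a known offset using an
# HTTP Range header. Requires server-side support for range requests.
import httpx

async def stream_from(client: httpx.AsyncClient, url: str, offset: int = 0):
    """Yield (next_offset, chunk) pairs so the caller knows where to resume."""
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    async with client.stream("GET", url, headers=headers) as resp:
        resp.raise_for_status()
        async for chunk in resp.aiter_bytes():
            offset += len(chunk)
            yield offset, chunk
```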

Second, I'm a bit hesitant about introducing a dependency on httpx-retries. I realize there is little rocket science going on inside that package, but the GitHub repo only has tens of stars, so it seems to be a bit of a niche package. The HTTPX docs recommend the tenacity package instead.
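
For comparison, the tenacity version of an exponential-backoff retry on transport errors is just a decorator (a sketch, not tied to yente's code):

```python
# Sketch: retry on httpx transport errors with exponential backoff,
# using tenacity as suggested by the HTTPX documentation.
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(httpx.TransportError),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=30),
)
async def fetch(client: httpx.AsyncClient, url: str) -> bytes:
    resp = await client.get(url)
    resp.raise_for_status()
    return resp.content
```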

While we're working on this, if you're on a bumpy connection, have you considered setting YENTE_STREAM_LOAD to false? It might just solve your problem the easy way for the moment.

Looking forward to hearing your thoughts!

@rae-306
Author

rae-306 commented Aug 19, 2025

Hi Leon, thanks for your response. After hours of debugging, we finally managed to get it working. Our bumpy network would cause a TransportError every now and then while we were loading the "default" dataset with 4.1M entries.
We discovered two main issues in the code, for which I intend to submit a PR. Perhaps it is better to close this one and start fresh. Anyhow, the two bugs are:

  1. stream_http_lines should retry 3 times. However, if client.stream() throws a TransportError the first time but works correctly the second time, the retry counter is not reset. This means that if a TransportError is thrown more than three times in total, not necessarily consecutively (for instance, if there is a small hiccup in the network every 5 minutes), the retry gives up completely.

  2. As you mention, there is no "range-request" magic in the dataset JSON. So if a TransportError is thrown, the stream starts again from the beginning of the JSON. Of course, the process is idempotent: the doc count in ES stays constant and, until it catches up with where it left off, every document it inserts simply replaces one that is already there. However, there are two problems with this:
    a) Your counter from yente.search.indexer ("Index: ... entities...") is not reset and just continues to grow.
    b) In our case, if the network is bumpy enough, this process goes on and on. Suppose there is a small hiccup in the network every few minutes; then loading the full JSON would take forever.
    What we did was change the stream_http_lines function as follows. The function keeps track of the "last_id", and on a retry it reads the JSON until it finds the last_id and only yields from that point on (see the sketch below). Since reading JSON is way faster than inserting into Elasticsearch, this speeds things up enormously. The counter also shows the correct count now :)
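
A rough sketch of that approach (hypothetical code, not our actual patch; it assumes each JSON line is an entity object with an "id" field):

```python
# Sketch: remember the last entity id that was yielded and, after a
# TransportError, fast-forward to it before yielding again. The retry
# budget is reset as soon as data flows again.
import asyncio
import json
import httpx

MAX_RETRIES = 3

async def stream_http_lines(client: httpx.AsyncClient, url: str):
    last_id: str | None = None
    retries = 0
    while True:
        skipping = last_id is not None
        try:
            async with client.stream("GET", url) as resp:
                resp.raise_for_status()
                async for line in resp.aiter_lines():
                    if not line:
                        continue
                    entity_id = json.loads(line).get("id")
                    if skipping:
                        # Skip lines that were already yielded before the failure.
                        if entity_id == last_id:
                            skipping = False
                        continue
                    retries = 0  # data is flowing again, reset the retry budget
                    last_id = entity_id
                    yield line
            return
        except httpx.TransportError:
            retries += 1
            if retries > MAX_RETRIES:
                raise
            await asyncio.sleep(2 ** retries)  # simple exponential backoff
```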

@leonhandreke
Contributor

leonhandreke commented Aug 25, 2025

  1. stream_http_lines should retry 3 times. However, if client.stream() throws a TransportError the first time but works correctly the second time, the retry counter is not reset. This means that if a TransportError is thrown more than three times in total, not necessarily consecutively (for instance, if there is a small hiccup in the network every 5 minutes), the retry gives up completely.

IIUC, you would like the retry counter to be reset if at least one byte flows in again after a TransportError was thrown. While I see why that helps in your case, I also feel like this isn't really a bug. If max retries is 3, I'd expect the thing to be retried at most three times, even if it fails at different points.

Does httpx-retries do that? I'd be a bit surprised...

  2. As you mention, there is no "range-request" magic in the dataset JSON. So if a TransportError is thrown, the stream starts again from the beginning of the JSON. Of course, the process is idempotent: the doc count in ES stays constant and, until it catches up with where it left off, every document it inserts simply replaces one that is already there. However, there are two problems with this:
    a) Your counter from yente.search.indexer ("Index: ... entities...") is not reset and just continues to grow.
    b) In our case, if the network is bumpy enough, this process goes on and on. Suppose there is a small hiccup in the network every few minutes; then loading the full JSON would take forever.
    What we did was change the stream_http_lines function as follows. The function keeps track of the "last_id", and on a retry it reads the JSON until it finds the last_id and only yields from that point on. Since reading JSON is way faster than inserting into Elasticsearch, this speeds things up enormously. The counter also shows the correct count now :)

Yeah, I guess the right thing to do here would be to reset the counter and retry in the indexer, but that's a lot of work touching a few layers for something that hasn't really been that much of an issue so far.

How do you feel about #857? It tries to stream only once and then falls back to fetching. Does that solve your issue? Related, have you tried setting YENTE_STREAM_LOAD to false? Maybe that's the best fix, given your difficult networking situation.
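
For illustration, the general shape of that fallback, as a sketch rather than the actual code in #857: try the streaming path once, and on a transport error fetch the whole file and iterate over it locally (duplicated lines are tolerable because indexing is idempotent).

```python
# Sketch of a stream-once-then-fetch fallback; not the code in #857.
# For brevity this buffers the whole body on the fallback path.
import httpx

async def load_lines(client: httpx.AsyncClient, url: str):
    try:
        async with client.stream("GET", url) as resp:
            resp.raise_for_status()
            async for line in resp.aiter_lines():
                yield line
    except httpx.TransportError:
        # Streaming failed part-way; start over from a full local copy.
        resp = await client.get(url)
        resp.raise_for_status()
        for line in resp.text.splitlines():
            yield line
```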
