Track SP failures across uploads for smarter provider selection #627

@rvagg

(This is a discussion issue, not a concrete feature plan yet, arising out of discussion with @timfong888)

This assumes the functionality that now exists in #593.

Currently, provider selection is stateless across upload() calls. If an SP fails during store or commit, the error is reported to the caller with providerId, and the caller can retry with excludeProviderIds. Ping failures are already handled internally (auto-excluded, invisible to caller).

This works but puts the retry burden entirely on the developer. The SDK could track recent failures and factor them into selection.
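For reference, the retry pattern the developer currently has to build might look roughly like the sketch below. It is only an illustration: `UploadFn`, `UploadOptions`, `failedProviderId` and `uploadWithExclusions` are hypothetical names, and the only things taken from the description above are that the error carries a providerId and that the caller can pass excludeProviderIds. The numeric ID type is also an assumption.

```typescript
// Hypothetical stand-ins for the SDK's upload call and its options.
interface UploadOptions {
  excludeProviderIds?: number[];
}

type UploadFn<T> = (data: Uint8Array, options: UploadOptions) => Promise<T>;

// Extract a providerId from an error, if the error carries one.
function failedProviderId(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "providerId" in err) {
    const id = (err as { providerId: unknown }).providerId;
    return typeof id === "number" ? id : undefined;
  }
  return undefined;
}

// Retry the upload, excluding every provider that has failed so far.
async function uploadWithExclusions<T>(
  upload: UploadFn<T>,
  data: Uint8Array,
  maxAttempts: number = 3,
): Promise<T> {
  const excluded: number[] = [];
  let lastError: unknown = new Error("no upload attempts were made");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await upload(data, { excludeProviderIds: [...excluded] });
    } catch (err) {
      lastError = err;
      const id = failedProviderId(err);
      // Errors not attributable to a provider are not retried here.
      if (id === undefined) throw err;
      excluded.push(id);
    }
  }
  throw lastError;
}
```

Note that this loop treats store and commit failures the same way. A blind retry after a commit failure can leave an extra copy on-chain if the original commit lands late.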

What this would look like:

  • SDK maintains a short-lived record of SP failures (provider ID + timestamp + failure type)
  • selectProviders / smartSelect deprioritises (not excludes) recently-failed SPs
  • Failure memory decays over time (e.g. 10-30 minutes)
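A minimal sketch of what such a failure record could look like, assuming an in-memory tracker with a linear decay. `FailureTracker`, its methods, the 20-minute default and the numeric provider IDs are all hypothetical choices for discussion, not existing SDK API.

```typescript
// Hypothetical failure memory: provider ID + timestamp + failure type.
type FailureType = "store" | "commit";

interface FailureRecord {
  providerId: number;
  timestamp: number; // milliseconds
  type: FailureType;
}

class FailureTracker {
  private records: FailureRecord[] = [];
  private readonly ttlMs: number;
  private readonly now: () => number;

  // The clock is injectable so the decay can be tested deterministically.
  constructor(ttlMs: number = 20 * 60 * 1000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  record(providerId: number, type: FailureType): void {
    this.records.push({ providerId, timestamp: this.now(), type });
  }

  // Each failure counts 1 when fresh and decays linearly to 0 at the TTL.
  penalty(providerId: number): number {
    const now = this.now();
    // Forget failures older than the TTL.
    this.records = this.records.filter((r) => now - r.timestamp < this.ttlMs);
    let total = 0;
    for (const r of this.records) {
      if (r.providerId === providerId) {
        total += 1 - (now - r.timestamp) / this.ttlMs;
      }
    }
    return total;
  }

  // Reorder candidates so recently-failed providers come last.
  // Nothing is removed: this deprioritises, it does not exclude.
  deprioritise(providerIds: number[]): number[] {
    const penalties = new Map<number, number>();
    for (const id of providerIds) penalties.set(id, this.penalty(id));
    return [...providerIds].sort(
      (a, b) => (penalties.get(a) ?? 0) - (penalties.get(b) ?? 0),
    );
  }
}
```

Because the sort is stable, providers with equal penalties keep whatever order the existing selection logic gave them.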

Open design questions:

  • How to balance failure history against dataset-matching preference: if SP-A has your dataset but failed 5 minutes ago, do we skip it for a healthy SP that requires a new dataset + payment rail?
  • When all endorsed SPs have recent failures, they're effectively equal again: is time-decayed deprioritisation enough, or do we need a distinct "all failed" fallback?
  • Scope: store failures, commit failures, or both? Commit failures may be chain-related (gas, nonce) rather than SP health.
  • Where does the state live? In-memory on the Synapse instance is simplest (it's the stateful bit after all) but it is lost on restart. That is probably fine: this is a session-level optimization, not durable state.
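To make the first two questions concrete, here is one possible way to fold failure history into ranking. It is a sketch only: the `Candidate` shape, `DATASET_BONUS` and its value of 1.5 are invented for illustration, and `failurePenalty` is assumed to be a time-decayed number where 0 means no recent failures.

```typescript
// Hypothetical candidate shape for ranking.
interface Candidate {
  providerId: number;
  hasExistingDataset: boolean; // reuse avoids a new dataset + payment rail
  failurePenalty: number; // time-decayed; 0 means no recent failures
}

// Assumed weight: how much recent-failure penalty an existing dataset outweighs.
const DATASET_BONUS = 1.5;

function score(c: Candidate): number {
  return (c.hasExistingDataset ? DATASET_BONUS : 0) - c.failurePenalty;
}

// Highest score first; ties keep their input order.
function rankCandidates(candidates: Candidate[]): Candidate[] {
  return [...candidates].sort((a, b) => score(b) - score(a));
}
```

Under this scheme a single, partly decayed failure does not outweigh an existing dataset, but repeated failures do. When every candidate carries the same penalty the ranking reduces to dataset preference, so a distinct "all failed" fallback may not be needed. Whether that holds for unequal penalties is part of the open question.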

--- Update by @timfong888 Feb 24 2026
The core trade-off: Simpler SDK = more work for developers. Smarter SDK = more edge cases we own.

| Flow | Ping | Request from SP | Store | Commit | Flags |
| --- | --- | --- | --- | --- | --- |
| Primary 0 | Success: Store. Fail: ping other SP (All Endorsed - SP failed) randomly | N/A | Success: Commit. Fail: currently throw | Success: go to Secondary. Fail: throw | Developer has burden of checking message, retrying |
| Primary 1 (diff) | | | | Fail: retry before payload GC'd (24 hour period); if that fails then throw | Limbo state: what happens if it fails to commit after 24 hours? Means 24 hours till we have a success on Endorsed. |
| Primary 2 (diff) | Fail: if all Endorsed = SP failed then throw; else try (All Endorsed - SP failed) | | Fail: retry from Ping; exclude failed ID | Fail: retry from Ping; exclude failed ID; no retry of commit() | New SP provider, additional floor price. If chainstate slowness delayed commit, may commit original SP AND the successful retry SP. Now 3 copies. Retry store: client uses additional bandwidth because it needs to re-upload to the new SP. |
| Secondary | Success: Request. Fail: ping Approved SPs randomly | Success: Store. Fail: fails | Success: Commit. Fail: failure message | Success: success. Fail: failure message | |


| Option | What It Does | Complexity | Edge Case Risk | Developer Burden |
| --- | --- | --- | --- | --- |
| A. Keep Primary 0 | Throw on failure, no retry, no state | Low | Low: developer handles all retries | High: they build retry, failover, partial-commit handling |
| B. Add Primary 1-diff (stateful retry) | SDK tracks failed SPs, retries excluding them, returns detailed result objects | Medium-High | High: see edge cases below | Low: SDK handles most failure paths |
| C. Hybrid: simple SDK + documented patterns | Keep SDK at Primary 0, ship retry recipes and error-handling guides | Low | Low (same as A) | Medium: guided but still developer-owned |


| Edge Case | What Happens | User Impact |
| --- | --- | --- |
| Partial commit | Primary commits on-chain, secondary fails | User pays for 1 copy, thinks they have 2 (or thinks it all failed) |
| Stale exclusion | SP recovers mid-retry but is still in the failed set | Unnecessarily narrow SP pool, may exhaust all options |
| Duplicate commit on retry | Developer retries after partial success → 3 copies on-chain | User overpays |
| Concurrent calls sharing state | Two parallel store calls mutate the same failed-SP list | Unpredictable routing within TTL window |
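
The partial-commit and duplicate-commit cases both come down to the caller not knowing which copies actually landed. As a sketch of what "detailed result objects" could mean, assuming a per-copy outcome list (all type and function names below are hypothetical, not current SDK types):

```typescript
// Hypothetical per-copy outcome, one entry per SP the upload touched.
type CopyRole = "primary" | "secondary";

type CopyOutcome =
  | { role: CopyRole; providerId: number; status: "committed" }
  | {
      role: CopyRole;
      providerId: number;
      status: "failed";
      stage: "store" | "commit";
      reason: string;
    };

interface UploadResult {
  copies: CopyOutcome[];
}

interface UploadSummary {
  committed: number;
  failedProviderIds: number[];
  partial: boolean; // some copies committed, some did not
}

// Summarise the result so a caller can retry only what is missing.
function summariseUpload(result: UploadResult): UploadSummary {
  const committed = result.copies.filter((c) => c.status === "committed").length;
  const failedProviderIds = result.copies
    .filter((c) => c.status === "failed")
    .map((c) => c.providerId);
  return {
    committed,
    failedProviderIds,
    partial: committed > 0 && failedProviderIds.length > 0,
  };
}
```

With a result like this, a caller who sees `partial: true` can retry only the missing copy instead of re-running the whole upload, which is what leads to the third on-chain copy.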
