docs: harvest provider fallback chain RFC

Harvested from PR #2581 by @idling11.

Co-authored-by: idling11 <8055620+idling11@users.noreply.github.com>
2026-06-03 19:55:14 -07:00
parent 44ceabd606
commit 5dc1a63cd4
3 changed files with 171 additions and 2 deletions
@@ -551,6 +551,7 @@ without recreating skills the user deliberately deleted.
 | [LOCALIZATION.md](docs/LOCALIZATION.md) | UI locale matrix & switching |
 | [OPERATIONS_RUNBOOK.md](docs/OPERATIONS_RUNBOOK.md) | Ops & recovery |
 | [V0_9_0_EXECUTION_MAP.md](docs/V0_9_0_EXECUTION_MAP.md) | v0.9.0 issue lanes, PR harvest state, and release gates |
+| [2574-provider-fallback-chain.md](docs/rfcs/2574-provider-fallback-chain.md) | Provider fallback chain RFC |

 Full Changelog: [CHANGELOG.md](CHANGELOG.md).

@@ -42,6 +42,7 @@ harvest/stewardship commits:
 | #2636 project-context mtime cache | Defer direct merge; harvest only after cache key/signature is widened. | Must include constitution changes, auto-generated context deletion, canonical path equivalence, and overwrite detection before landing. |
 | #2634 HarmonyOS port | Defer direct merge; draft has broad platform and TLS/runtime blast radius. | Harvest at most the unused `rustyline` cleanup after local verification; full port needs OHOS target checks and sandbox/security review. |
 | #2687 append-only mode/approval prompt | Defer direct merge; draft has compile failures and Plan-mode prompt correctness risks. | Any future harvest must keep stable `message[0]` genuinely mode-agnostic, preserve mode/approval suffixes after capacity replans, and distinguish external overrides from persisted generated prompts. |
+| #2581 provider fallback chain design doc | Manually harvested as `docs/rfcs/2574-provider-fallback-chain.md` because the current PR head has no net file changes. | Keep issue #2574 open for implementation; close/comment on #2581 after the integration branch is public, crediting @idling11 and reporter @hsdbeebou. |

 ## PR Harvest Queue

@@ -85,7 +86,7 @@ harvest/stewardship commits:
 | #2576 PrefixCacheChange events | Mergeable | Review after current prefix-cache commits. |
 | #2578 turn_end observer hook | Conflicting | Defer to hook lifecycle lane. |
 | #2579 AppendLog session messages | Conflicting | Defer; large architectural change. |
-| #2581 provider fallback chain design doc | Mergeable | Docs-only; review for current provider direction. |
+| #2581 provider fallback chain design doc | Mergeable / empty diff | Manually harvested into `docs/rfcs/2574-provider-fallback-chain.md`; close original PR after branch is public, keep #2574 open for implementation. |
 | #2623 plan prompt modal scroll support | Mergeable | Already harvested into the 22-commit stack. Comment/close original after integration branch is public. |
 | #2627 Xiaomi MiMo Token Plan mode | Conflicting | Partially harvested; leave original open or comment with remaining mode/env scope once branch is public. |
 | #2631 estimated_input_tokens cache | Mergeable | Already harvested into the 22-commit stack. |
@@ -120,7 +121,7 @@ Issue count should drop through evidence-backed consolidation, not bulk closing.

 ## Immediate Next Actions

-1. Review #2048, #2502, #2509, #2513, #2530, #2576, and #2581 as the next small
+1. Review #2048, #2502, #2509, #2513, #2530, and #2576 as the next small
   mergeable candidates.
 2. Prepare public comments for #2708, #2627, #2634, #2636, #2687, and already-harvested performance
   PRs once this integration branch has a remote review surface.
@@ -0,0 +1,167 @@
+# RFC: Provider Fallback Chain
+
+**Issue:** #2574
+**Reporter:** @hsdbeebou
+**Design source:** #2581 by @idling11
+**Status:** Draft for the v0.9 provider-routing lane
+**Date:** 2026-06-04
+
+## Problem
+
+CodeWhale can store credentials and defaults for several providers, but a
+running session uses one active provider route at a time. When that provider
+hits a rate limit, temporary outage, or transport failure, the user must notice
+the failure, run `/provider`, choose another route, and resubmit the turn.
+
+That manual switch is especially disruptive during long-running agentic work.
+A provider fallback chain can keep work moving, but it also changes billing
+source, model behavior, tool support, context-window limits, and vendor
+expectations. The design must make that switch explicit and capability-aware.
+
+## Principles
+
+- Fallback is opt-in. No provider switch happens unless the user configured a
+  fallback chain.
+- Billing and vendor changes are visible in the transcript and status UI.
+- Normal retry policy runs before fallback.
+- Fallback is allowed only before assistant content or tool calls have started
+  streaming for the failing request.
+- Fallback candidates must support the request shape for the current turn.
+- Authentication, authorization, malformed request, and model-not-found errors
+  do not silently switch providers by default.
+
+## Proposed Config Shape
+
+Keep the existing root `provider = "..."` setting as the primary route. Add an
+ordered fallback list and a small policy section:
+
+```toml
+provider = "nvidia-nim"
+fallback_providers = ["deepseek", "openrouter"]
+
+[provider_fallback]
+enabled = true
+reset_on_new_session = true
+```
+
+Rules:
+
+- `fallback_providers` is ordered and contains provider IDs already accepted by
+  the provider parser.
+- The primary provider is not repeated in the fallback list.
+- Duplicate fallback providers are rejected.
+- Missing credentials produce a startup warning and make that fallback entry
+  inactive until credentials appear.
+- If `provider_fallback.enabled` is absent, the presence of a non-empty
+  `fallback_providers` list enables fallback.
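
The rules above can be sketched as a small validation pass. This is a minimal sketch, not CodeWhale's implementation: `FallbackConfig`, `ConfigError`, `ValidatedChain`, `validate`, and the `has_credentials` callback are hypothetical names introduced for illustration, and the config is assumed to be already parsed from TOML.

```rust
use std::collections::HashSet;

/// Hypothetical, already-parsed view of the fallback settings.
#[derive(Debug, Clone, PartialEq)]
pub struct FallbackConfig {
    pub primary: String,
    pub fallback_providers: Vec<String>,
    /// `None` when `provider_fallback.enabled` is absent from the config.
    pub enabled: Option<bool>,
}

/// Hard validation errors (assumed names).
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    UnknownProvider(String),
    PrimaryRepeated(String),
    DuplicateFallback(String),
}

#[derive(Debug, PartialEq)]
pub struct ValidatedChain {
    pub enabled: bool,
    /// Fallback entries that currently have credentials, in configured order.
    pub active: Vec<String>,
    /// Startup warnings for entries that are inactive.
    pub warnings: Vec<String>,
}

pub fn validate(
    cfg: &FallbackConfig,
    known: &[&str],
    has_credentials: impl Fn(&str) -> bool,
) -> Result<ValidatedChain, ConfigError> {
    let mut seen = HashSet::new();
    let mut active = Vec::new();
    let mut warnings = Vec::new();

    for id in &cfg.fallback_providers {
        // Only provider IDs the provider parser already accepts are allowed.
        if !known.contains(&id.as_str()) {
            return Err(ConfigError::UnknownProvider(id.clone()));
        }
        // The primary provider must not be repeated in the fallback list.
        if *id == cfg.primary {
            return Err(ConfigError::PrimaryRepeated(id.clone()));
        }
        // Duplicate fallback providers are rejected.
        if !seen.insert(id.as_str()) {
            return Err(ConfigError::DuplicateFallback(id.clone()));
        }
        // Missing credentials only warn and leave the entry inactive.
        if has_credentials(id.as_str()) {
            active.push(id.clone());
        } else {
            warnings.push(format!(
                "fallback provider `{}` has no credentials; entry inactive",
                id
            ));
        }
    }

    // An absent `enabled` flag defaults to "on" when the list is non-empty.
    let enabled = cfg.enabled.unwrap_or(!cfg.fallback_providers.is_empty());
    Ok(ValidatedChain { enabled, active, warnings })
}
```

In this sketch, unknown IDs, a repeated primary, and duplicates are hard errors, while missing credentials only produce a warning, mirroring the distinction the rules draw.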
+
+## Fallback Eligibility
+
+| Failure | Fallback by default? | Notes |
+| --- | --- | --- |
+| HTTP 429 | Yes | Rate limit or quota exhaustion on the active route. |
+| HTTP 502, 503, 504 | Yes | Temporary upstream failure after normal retries. |
+| Connect timeout / DNS failure | Yes | Transport path failed before content streamed. |
+| HTTP 401 / 403 | No | Usually bad credentials or account permissions. |
+| HTTP 400 | No | Usually client request shape or model parameter issue. |
+| Model not found | No | Avoid silently switching model families unless a future policy explicitly opts in. |
+| Stream interrupted after content | No | The transcript may already contain partial assistant content or tool-call deltas. |
+
+The first implementation should classify errors centrally and expose tests for
+each case before any fallback execution is wired into the turn loop.
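
One way the central classifier might look is a single function that mirrors the eligibility table. The `Failure` and `FallbackReason` enums and the `classify` function are hypothetical names, not existing CodeWhale types. Treating every HTTP status the table does not list as ineligible is an assumption of this sketch.

```rust
/// Hypothetical failure kinds, observed after the normal retry policy has run.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Failure {
    Http(u16),
    ConnectTimeout,
    Dns,
    ModelNotFound,
    StreamInterruptedAfterContent,
}

/// Why a fallback is allowed (assumed names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FallbackReason {
    RateLimit,
    UpstreamUnavailable,
    Transport,
}

/// Returns `Some(reason)` when the failure is fallback-eligible by default.
pub fn classify(failure: Failure) -> Option<FallbackReason> {
    match failure {
        // Rate limit or quota exhaustion on the active route.
        Failure::Http(429) => Some(FallbackReason::RateLimit),
        // Temporary upstream failures.
        Failure::Http(502 | 503 | 504) => Some(FallbackReason::UpstreamUnavailable),
        // The transport path failed before any content streamed.
        Failure::ConnectTimeout | Failure::Dns => Some(FallbackReason::Transport),
        // 400, 401, 403 and (by assumption) every other status: no fallback.
        Failure::Http(_) => None,
        // Never switch model families or resume a partial stream silently.
        Failure::ModelNotFound | Failure::StreamInterruptedAfterContent => None,
    }
}
```

Keeping the table in one `match` makes it straightforward to write one unit test per row before any fallback execution exists.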
+
+## Capability Gate
+
+Before switching to a fallback provider/model, CodeWhale checks that the
+candidate can support the current request shape:
+
+| Requirement | Gate |
+| --- | --- |
+| Tool calls | Candidate provider/model must support tool calling. |
+| Reasoning effort | Candidate must support the requested thinking mode, or the switch is blocked. |
+| Context size | Candidate context window must fit the estimated current request. |
+| Image inputs | Candidate must support vision if the turn includes images. |
+| Provider-specific headers | Candidate request must be rebuilt from that provider's own auth/base-url/header rules. |
+
+If no fallback candidate passes the gate, CodeWhale surfaces the original
+provider error with a clear "fallback chain exhausted or incompatible" note.
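
The gate can be sketched as a pure check over a request shape and a candidate's capabilities. All names here (`RequestShape`, `Capabilities`, `GateFailure`, `check`, `first_compatible`) are hypothetical, and the thinking mode is simplified to a boolean. Provider-specific headers are not modelled, since the table treats them as part of rebuilding the request rather than as a capability.

```rust
/// Hypothetical summary of what the current request needs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RequestShape {
    pub uses_tools: bool,
    pub needs_thinking: bool,
    pub has_images: bool,
    pub estimated_tokens: u64,
}

/// Hypothetical capability record for a candidate provider/model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Capabilities {
    pub tools: bool,
    pub thinking: bool,
    pub vision: bool,
    pub context_window: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GateFailure {
    ToolCalls,
    ThinkingMode,
    ContextWindow,
    Vision,
}

/// Checks one candidate against the request; the first unmet requirement wins.
pub fn check(req: &RequestShape, cand: &Capabilities) -> Result<(), GateFailure> {
    if req.uses_tools && !cand.tools {
        return Err(GateFailure::ToolCalls);
    }
    if req.needs_thinking && !cand.thinking {
        return Err(GateFailure::ThinkingMode);
    }
    if req.estimated_tokens > cand.context_window {
        return Err(GateFailure::ContextWindow);
    }
    if req.has_images && !cand.vision {
        return Err(GateFailure::Vision);
    }
    Ok(())
}

/// Returns the first candidate, in chain order, that passes the gate.
pub fn first_compatible<'a>(
    req: &RequestShape,
    candidates: &[(&'a str, Capabilities)],
) -> Option<&'a str> {
    for (name, caps) in candidates {
        if check(req, caps).is_ok() {
            return Some(*name);
        }
    }
    None
}
```

When `first_compatible` returns `None`, the caller would surface the original provider error with the "fallback chain exhausted or incompatible" note.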
+
+## Runtime Behavior
+
+1. Build the request for the active provider.
+2. Run existing retry policy for that provider.
+3. If retries are exhausted on a fallback-eligible failure and no assistant
+   content has streamed, evaluate the next fallback provider.
+4. Rebuild the request with the fallback provider's model, base URL, auth, and
+   provider-specific headers.
+5. Add a visible transcript marker and status event before the fallback request
+   starts.
+6. Continue through the chain until a provider succeeds, the chain is
+   exhausted, or a non-eligible failure occurs.
+
+Suggested transcript marker:
+
+```text
+[provider fallback: nvidia-nim -> deepseek, reason: rate_limit]
+```
+
+Suggested status text:
+
+```text
+NVIDIA NIM unavailable; switched to DeepSeek fallback
+```
+
+For multi-request turns, such as tool-call result follow-ups, a later request
+may fall back only if it has not yet started streaming assistant content. The
+transcript marker must show that the turn changed provider between requests.
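
The numbered steps for a single request can be sketched as a loop over the chain. This is an illustration under stated assumptions: `Outcome`, `TurnResult`, and `run_chain` are hypothetical names, and the `send` closure stands in for "build the request for this provider and run its retry policy". The capability gate is omitted to keep the sketch short.

```rust
/// Hypothetical result of one provider attempt, after its retries have run.
#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    Success,
    Failed {
        /// `Some(reason)` only for fallback-eligible failures.
        reason: Option<&'static str>,
        /// True once assistant content or tool calls started streaming.
        content_streamed: bool,
    },
}

#[derive(Debug, PartialEq)]
pub struct TurnResult {
    pub handled_by: Option<String>,
    pub attempted: Vec<String>,
    pub markers: Vec<String>,
}

pub fn run_chain(
    primary: &str,
    fallbacks: &[&str],
    mut send: impl FnMut(&str) -> Outcome,
) -> TurnResult {
    let mut attempted = Vec::new();
    let mut markers = Vec::new();
    let mut current = primary.to_string();
    let mut remaining = fallbacks.iter();

    loop {
        attempted.push(current.clone());
        match send(current.as_str()) {
            Outcome::Success => {
                return TurnResult { handled_by: Some(current), attempted, markers };
            }
            // Eligible failure with nothing streamed: try the next provider.
            Outcome::Failed { reason: Some(reason), content_streamed: false } => {
                match remaining.next() {
                    Some(next) => {
                        // The marker is recorded before the fallback request starts.
                        markers.push(format!(
                            "[provider fallback: {} -> {}, reason: {}]",
                            current, next, reason
                        ));
                        current = next.to_string();
                    }
                    // Chain exhausted.
                    None => return TurnResult { handled_by: None, attempted, markers },
                }
            }
            // Non-eligible failure, or content already streamed: stop here.
            Outcome::Failed { .. } => {
                return TurnResult { handled_by: None, attempted, markers };
            }
        }
    }
}
```

Recording `attempted` separately from `handled_by` keeps the data that session receipts need: which providers were tried, and which one actually served the request.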
+
+## UI and Commands
+
+- `/provider` should show the primary route and the current fallback position.
+- `/provider reset` should return to the primary provider for future requests in
+  the current session.
+- The footer/statusline should surface the concrete provider/model that actually
+  handled the latest request.
+- Session receipts should record both the providers that were attempted and
+  the provider that succeeded, so cost and debugging information stay truthful.
+
+## Implementation Slices
+
+1. Config schema and validation:
+   - parse `fallback_providers` and `[provider_fallback]`
+   - validate known providers, duplicates, missing credentials, and primary
+     self-reference
+   - document the config surface
+2. Error classification:
+   - define fallback-eligible error kinds
+   - add unit tests for HTTP and transport failures
+3. Request-shape capability gate:
+   - evaluate tool, thinking, context, and image requirements
+   - add tests for incompatible fallbacks
+4. Fallback execution:
+   - run retries per provider before moving to the next provider
+   - rebuild auth/base-url/header state for each candidate
+   - block fallback after partial streaming
+5. UI/receipt integration:
+   - status event
+   - transcript marker
+   - `/provider reset`
+   - receipt fields for attempted and selected provider
+
+## Non-goals
+
+- No automatic cost optimization or weighted provider selection.
+- No silent fallback when authentication or permissions fail.
+- No fallback after partial assistant content or tool-call deltas have streamed.
+- No provider/model capability downgrades without an explicit future policy.
+- No sub-agent-specific fallback policy in the first implementation; sub-agents
+  inherit the same configured fallback chain unless they are given an explicit
+  provider/model override.
+
+## Credit
+
+This RFC is based on issue #2574 from @hsdbeebou and PR #2581 from @idling11.
+The original PR head currently has no net file changes, so this document
+preserves the useful design direction while tightening the v0.9 contract around
+truthful provider routing, billing visibility, and capability checks.