fix(engine): transparent retry on stream death with no content (#103 Phase 3)
When the chunked-transfer connection to DeepSeek dies mid-stream (the "Stream read error: error decoding response body" symptom), the engine previously surfaced the error to the user and ended the turn as Failed, even when no useful content had been received. The user's only recourse was to manually re-send the same message.

Phase 3 closes that loop. After the inner stream-consumption loop ends, detect "stream died with nothing actionable":

- stream_errors > 0 (the stream errored at some point)
- tool_uses.is_empty() (no tool call landed)
- current_text_visible is empty/whitespace
- current_thinking is empty/whitespace
- !pending_message_complete

If all hold AND stream_retry_attempts < MAX_STREAM_RETRIES (3), silently re-issue the SAME outer-loop iteration: rebuild the request from self.session.messages, call create_message_stream again, and start a fresh inner loop. Surface a "Connection interrupted; retrying (N/3)" status to the user so they know something is happening, but don't trip the engine-level Error event, so the failure is not double-displayed as a History cell.

Healthy rounds (stream_errors == 0) reset the retry budget so a single proxy hiccup doesn't poison subsequent rounds in the same turn.

Crucially: if we got partial output (any tool call, any visible text, or any thinking), we DON'T retry. Re-running the request would double-bill the user, so ship the partial state to the rest of the turn pipeline (existing tool execution, content_blocks finalization) and let the agent loop continue.

Combined with #103 Phase 1+2 (TCP/HTTP2 keepalives + diagnostic logging in client.rs), this should turn the user-visible "Turn failed: Stream read error" into either a fully recovered turn OR a clearly labeled 3-attempts-exhausted failure.

Refs #103.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
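The five conditions listed in the message can be read as a single predicate over what one streaming round produced. The following is a minimal sketch of that check, not the engine's code: the `RoundOutcome` struct and its `died_with_nothing` method are hypothetical, and the fields only mirror the locals named in the commit message (`tool_use_count` stands in for `tool_uses.is_empty()`).

```rust
/// Hypothetical snapshot of one streaming round. The struct is illustrative;
/// its fields mirror the locals named in the commit message.
#[derive(Debug, Default, Clone)]
pub struct RoundOutcome {
    /// Number of stream read errors seen during the round.
    pub stream_errors: u32,
    /// Number of tool calls that landed (0 corresponds to tool_uses.is_empty()).
    pub tool_use_count: usize,
    /// Visible assistant text accumulated so far.
    pub text_visible: String,
    /// Thinking text accumulated so far.
    pub thinking: String,
    /// Whether the message was marked complete before the stream ended.
    pub message_complete: bool,
}

impl RoundOutcome {
    /// True only when the stream errored AND left nothing actionable behind.
    /// Whitespace-only text or thinking counts as empty.
    pub fn died_with_nothing(&self) -> bool {
        self.stream_errors > 0
            && self.tool_use_count == 0
            && self.text_visible.trim().is_empty()
            && self.thinking.trim().is_empty()
            && !self.message_complete
    }
}
```

Under this reading, a round that errored after emitting any thinking or visible text is not retried, which matches the double-billing concern stated in the message.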
@@ -2022,6 +2022,16 @@ impl Engine {
        }
        let mut active_tool_names = initial_active_tools(&tool_catalog);

        // Transparent stream-retry counter: when the chunked-transfer
        // connection dies mid-stream and we got nothing useful out of it
        // (no tool calls, no completed text), we silently re-issue the
        // SAME request up to MAX_STREAM_RETRIES times before surfacing
        // the failure to the user. This is the #103 Phase 3 retry that
        // keeps long V4 thinking turns from being killed by transient
        // proxy disconnects.
        const MAX_STREAM_RETRIES: u32 = 3;
        let mut stream_retry_attempts: u32 = 0;

        loop {
            if self.cancel_token.is_cancelled() {
                let _ = self.tx_event.send(Event::status("Request cancelled")).await;
@@ -2566,6 +2576,49 @@ impl Engine {
                }
            }

            // #103 Phase 3: transparent retry. The inner loop above bails
            // when reqwest yields chunk decode errors three times in a row;
            // most of the time those are recoverable proxy / HTTP/2 issues
            // and the request can simply be re-issued. Re-issue silently up
            // to MAX_STREAM_RETRIES, but only when the stream produced
            // nothing actionable; if any tool call landed or text was
            // streamed, ship the partial state to the rest of the turn
            // pipeline so we don't double-bill the user by re-running it.
            let stream_died_with_nothing = stream_errors > 0
                && tool_uses.is_empty()
                && current_text_visible.trim().is_empty()
                && current_thinking.trim().is_empty()
                && !pending_message_complete;
            if stream_died_with_nothing {
                if stream_retry_attempts < MAX_STREAM_RETRIES {
                    stream_retry_attempts = stream_retry_attempts.saturating_add(1);
                    crate::logging::warn(format!(
                        "Stream died with no content (attempt {}/{}); retrying request",
                        stream_retry_attempts, MAX_STREAM_RETRIES
                    ));
                    let _ = self
                        .tx_event
                        .send(Event::status(format!(
                            "Connection interrupted; retrying ({}/{})",
                            stream_retry_attempts, MAX_STREAM_RETRIES
                        )))
                        .await;
                    // Don't preserve the per-stream `turn_error`: we're
                    // about to retry, and a successful retry should not
                    // surface the transient error as the turn outcome.
                    turn_error = None;
                    continue;
                }
                crate::logging::warn(format!(
                    "Stream retry budget exhausted ({} attempts); failing turn",
                    stream_retry_attempts
                ));
            } else if stream_errors == 0 {
                // Healthy round → reset retry budget so we don't carry over
                // state from a previous bad round.
                stream_retry_attempts = 0;
            }

            // Update turn usage
            turn.add_usage(&usage);

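The budget handling in this hunk can be viewed as a small state machine, separate from the engine's I/O. The sketch below is an assumption-laden illustration rather than the engine's implementation: `RetryBudget`, `Decision`, and `decide` are hypothetical names, and only `MAX_STREAM_RETRIES` and the branch structure are taken from the diff.

```rust
/// Same limit as in the diff.
pub const MAX_STREAM_RETRIES: u32 = 3;

/// Hypothetical outcome of the post-stream check.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Decision {
    /// Re-issue the same request; `attempt` is 1-based.
    Retry { attempt: u32 },
    /// Budget used up: fall through and let the turn fail.
    Exhausted,
    /// Healthy round or partial output: continue with the turn pipeline.
    Proceed,
}

/// Hypothetical wrapper around the `stream_retry_attempts` counter.
#[derive(Debug, Default)]
pub struct RetryBudget {
    attempts: u32,
}

impl RetryBudget {
    /// Mirrors the branch structure of the hunk above.
    pub fn decide(&mut self, stream_errors: u32, died_with_nothing: bool) -> Decision {
        if died_with_nothing {
            if self.attempts < MAX_STREAM_RETRIES {
                self.attempts = self.attempts.saturating_add(1);
                return Decision::Retry { attempt: self.attempts };
            }
            Decision::Exhausted
        } else {
            // Only an error-free round resets the budget; a round that
            // errored but produced partial output leaves it untouched.
            if stream_errors == 0 {
                self.attempts = 0;
            }
            Decision::Proceed
        }
    }

    /// Retries consumed since the last healthy round.
    pub fn attempts(&self) -> u32 {
        self.attempts
    }
}
```

One consequence visible in this form: a round that errors after producing partial output proceeds without resetting the counter, so only a round with `stream_errors == 0` restores the full budget.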