feat(compaction): debug telemetry on summary calls + document framing fork

Two follow-ups from the post-#572 cache-aligned compaction review (#584):

1. `should_use_cache_aligned_summary` now carries a doc comment that
   explains why the cache-aligned and fallback summary requests are
   framed differently. Cache-aligned replays the conversation as the
   model's own history under `system: None`; fallback reformats it
   into a `User:/Assistant:` transcript under a "concise summaries"
   system prompt. The fallback's external-transcript framing is more
   conservative for the older / smaller models the cache-aligned path
   explicitly excludes, so dropping the system prompt risks regressing
   those models without a corresponding gain. Unifying the two paths
   is a research question that wants an A/B summary-quality eval, not
   a drive-by cleanup — flagged here for a future PR rather than
   resolved silently.

2. `create_summary` now emits one `tracing::debug!` event per summary
   call carrying which path was chosen, the prompt-token count, and
   the cache-hit / miss split. Filter with `RUST_LOG=compaction=debug`
   (or the full module path
   `RUST_LOG=deepseek_tui::compaction=debug`). This makes the V4
   prefix-cache win from #572 observable post-deploy without adding
   UI surface — the compaction summary call is the request we most
   expect to benefit, and previously we had no per-call signal for it.

No UI surface changes. No model-facing prompt changes. Only adds the
path-choice variable and the debug log; existing compaction tests
(56 across `compaction::*` and `models::*`) still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Hunter Bown
2026-05-04 12:54:59 -05:00
parent a61da8de2c
commit 7712a37272
+78 -1
View File
@@ -820,7 +820,8 @@ async fn create_summary(
model: &str,
) -> Result<String> {
let limits = summary_input_limits_for_model(model);
let request = if should_use_cache_aligned_summary(model, messages) {
let used_cache_aligned = should_use_cache_aligned_summary(model, messages);
let request = if used_cache_aligned {
build_cache_aligned_summary_request(model, messages, limits)
} else {
build_formatted_summary_request(model, messages, limits)
@@ -832,6 +833,13 @@ async fn create_summary(
// matches the website (#526).
crate::cost_status::report(&response.model, &response.usage);
// #584: emit one debug-level event per summary call so the
// V4 cache-aligned win is observable post-deploy without
// adding UI surface. Filter with `RUST_LOG=compaction=debug`
// (or the full module path
// `RUST_LOG=deepseek_tui::compaction=debug`).
log_summary_cache_telemetry(used_cache_aligned, &response.usage);
// Extract text from response
let summary = response
.content
@@ -846,6 +854,75 @@ async fn create_summary(
Ok(summary)
}
/// Emit one `tracing::debug!` event per compaction summary call so the
/// path choice (cache-aligned vs fallback) and the resulting cache-hit
/// rate are observable. Both raw token counts and the percentage are
/// included; on providers that don't return cache-token fields the
/// counts are reported as `0` and the percentage as `0.0`. (#584)
fn log_summary_cache_telemetry(used_cache_aligned: bool, usage: &crate::models::Usage) {
let path = if used_cache_aligned {
"cache_aligned"
} else {
"fallback"
};
let cache_hit = usage.prompt_cache_hit_tokens.unwrap_or(0);
let cache_miss = usage.prompt_cache_miss_tokens.unwrap_or(0);
let cached_window = u64::from(cache_hit) + u64::from(cache_miss);
let cache_hit_pct = if cached_window > 0 {
(f64::from(cache_hit) * 100.0) / cached_window as f64
} else {
0.0
};
tracing::debug!(
target: "compaction",
"compaction summary call: path={} prompt_tokens={} cache_hit_tokens={} cache_miss_tokens={} cache_hit_pct={:.1}",
path,
usage.input_tokens,
cache_hit,
cache_miss,
cache_hit_pct,
);
}
/// Decide whether to use the cache-aligned summary path
/// ([`build_cache_aligned_summary_request`]) or the fallback
/// ([`build_formatted_summary_request`]). Returns `true` when both
/// gates hold:
///
/// 1. The model has a known large context window
/// (≥ `LARGE_CONTEXT_WINDOW_TOKENS`, currently V4-scale).
/// 2. Replaying the message prefix plus a ~512-token instruction
/// still fits within `CACHE_ALIGNED_SUMMARY_CONTEXT_BUDGET_PERCENT`
/// of that budget.
///
/// ## Why the two paths produce slightly different prompts (#584)
///
/// The two summary requests are *intentionally* framed differently:
///
/// - **Cache-aligned** replays the original `messages` verbatim
/// with `system: None` and appends the summary instruction as
/// the final `user` turn. The model sees the conversation as if
/// it were its own history. This is what lets the V4 prefix cache
/// hit on the bulk of the request (#572).
/// - **Fallback** reformats the conversation into a flat
/// `User:/Assistant:` transcript inside a single `user` message
/// and adds a "You are a helpful assistant that creates concise
/// conversation summaries." system prompt. The model sees a
/// transcript of someone else's conversation.
///
/// The empirical bar is that V4 produces equivalent summaries
/// either way; the post-#572 review noted this fork is worth
/// documenting but not yet worth unifying. The fallback's
/// external-transcript framing is also more conservative for the
/// older / smaller models the cache-aligned path explicitly
/// excludes, so dropping the system prompt would risk regressing
/// those models without a corresponding gain. If we ever want to
/// unify, land it in a separate PR backed by an A/B summary-quality
/// evaluation rather than as a drive-by cleanup.
///
/// `create_summary` emits a `tracing::debug!` event under
/// `target = "compaction"` after each call so the path choice and
/// cache-hit rate are observable post-deploy without UI surface.
fn should_use_cache_aligned_summary(model: &str, messages: &[Message]) -> bool {
let Some(window) = context_window_for_model(model) else {
return false;