fix(file_mention): preserve UTF-8 codepoint boundary when truncating mention contents

Closes #1441. When `@`-mentioning a file larger than the 128 KB `MAX_MENTION_FILE_BYTES` ceiling, the truncator clipped the buffer to exactly the cap, which on CJK / emoji content frequently landed mid-codepoint and left a stray U+FFFD replacement char at the cut point.

The fix uses `str::from_utf8(...).error_len()` to distinguish the two ways a truncated UTF-8 buffer can fail:

- `error_len() == None` means the failure is an incomplete tail sequence, which is exactly the boundary case we want to handle. Round `buffer.truncate()` down to `valid_up_to()` so the trailing bytes are dropped cleanly.
- `error_len() == Some(_)` means the file genuinely contains invalid UTF-8 bytes (not at the truncation boundary). Leave the buffer intact so the subsequent `from_utf8(&buffer)` call surfaces the canonical "file is not UTF-8" error rather than silently dropping the invalid bytes.

Collapsed the if-let-then-if pattern to `if let Err(e) = ... && e.error_len().is_none()` to satisfy the workspace's `collapsible_if` clippy gate.

Harvested from PR #1495 by @CrepuscularIRIS

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
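
For readers without the source diff at hand, the two `error_len()` cases described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the function name `truncate_at_codepoint_boundary` and its signature are hypothetical, and the constant only mirrors the 128 KB ceiling mentioned in the message. It uses a `match` guard instead of the let-chain so it compiles on older toolchains.

```rust
/// Assumed value: mirrors the 128 KB ceiling named in the commit message.
pub const MAX_MENTION_FILE_BYTES: usize = 128 * 1024;

/// Hypothetical helper: clip `buffer` to at most `cap` bytes without
/// leaving a partial UTF-8 sequence at the end.
pub fn truncate_at_codepoint_boundary(buffer: &mut Vec<u8>, cap: usize) {
    if buffer.len() <= cap {
        return; // Nothing to clip.
    }
    buffer.truncate(cap);

    // `error_len() == None` means the input ended in the middle of a
    // multi-byte sequence; `valid_up_to()` is the last clean boundary.
    let incomplete_tail = match std::str::from_utf8(buffer.as_slice()) {
        Err(e) if e.error_len().is_none() => Some(e.valid_up_to()),
        // Either valid UTF-8, or genuinely invalid bytes (`Some(_)`):
        // leave the buffer alone so a later decode reports the error.
        _ => None,
    };

    if let Some(valid) = incomplete_tail {
        buffer.truncate(valid);
    }
}
```

With this shape, a buffer cut in the middle of a CJK character loses only the partial trailing bytes, while a buffer containing a stray invalid byte is left at the cap and still fails a subsequent `from_utf8` call.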
2026-05-12 01:27:20 -05:00
parent dcc2c448eb
commit 6f70a2832e
2 changed files with 55 additions and 7 deletions
@@ -16,6 +16,18 @@ real world uses."

 ### Fixed

+- **`@`-mention truncation no longer splits multi-byte UTF-8
+  sequences** (#1441, harvested from PR #1495 by
+  **@CrepuscularIRIS / autoghclaw**). When `@`-mentioning a file
+  larger than 128 KB the composer truncated the buffer at exactly
+  `MAX_MENTION_FILE_BYTES`, which on CJK / emoji content landed
+  mid-codepoint and produced a stray U+FFFD at the cut point. The
+  truncator now uses `str::from_utf8(...).error_len()` to detect
+  the incomplete-tail case and rounds down to the last valid
+  codepoint boundary before decoding. Genuinely invalid UTF-8
+  files still surface the "file is not UTF-8" error (the rounding
+  is only applied when the error is an incomplete tail, not a
+  real decoding failure mid-buffer).
 - **vLLM provider: `reasoning_effort = "off"` now actually
  disables thinking on Qwen3 / DeepSeek-R1 servers, cutting
  TTFT from ~13s to ~270ms** (harvested from PR #1480 by