Files
codewhale/docs/OPERATIONS_RUNBOOK.md
T
Hunter Bown 66a3aed528 feat(#248): foreground shell backgrounding, cancel tool, and Ctrl+B TUI shell control
- Add exec_shell_cancel tool to cancel one or all running background shell tasks
- Add foreground-to-background detach path via ShellManager request_foreground_background()
- Add wait_for_shell_delta_cancellable() so exec_shell_wait observes turn cancellation
- Add ShellControlView (Ctrl+B) with Background/Cancel options for active foreground commands
- Add 'Ctrl+B opens shell controls' hint in transcript for running exec cells
- Register exec_shell_cancel in ToolRegistryBuilder::with_shell_tools()
- Cancel-token checks in ShellInteractTool poll loop
- Update keybinding registry and OPERATIONS_RUNBOOK with Ctrl+B documentation
- Update TOOL_SURFACE.md with exec_shell_cancel entry
- Update prompts (rlm first-class guidance, AGENTS.md issue-closure policy)
- Tests: foreground_background, wait_cancel_leaves_running, cancel_tool_single, cancel_tool_all

Closes #248
2026-04-30 18:12:15 -05:00

98 lines
3.4 KiB
Markdown

# DeepSeek TUI Operations Runbook
This runbook covers practical debugging and incident response for the local CLI/TUI runtime.
## Quick Triage
1. Confirm binary + config:
- `cargo run -- --version`
- `cat ~/.deepseek/config.toml` (or inspect configured profile)
2. Enable verbose logs:
- `RUST_LOG=deepseek_cli=debug cargo run`
- For HTTP retries/reconnects: `RUST_LOG=deepseek_cli::client=debug cargo run`
3. Capture current state:
- `ls ~/.deepseek/sessions`
- `ls ~/.deepseek/sessions/checkpoints`
- `ls ~/.deepseek/tasks`
## Incident: Turn Hangs or Stream Stops
Symptoms:
- TUI remains in loading state
- partial assistant output with no completion
Checks:
1. Inspect retry/health logs (`deepseek_cli::client`)
2. Verify endpoint connectivity:
- `curl -sS https://api.deepseek.com/v1/models -H "Authorization: Bearer $DEEPSEEK_API_KEY"`
3. Confirm no local sandbox/permission deadlock in tool output
Actions:
1. If a foreground shell command is running, press `Ctrl+B` and choose whether to background it or cancel the current turn.
2. If the command was started in the background, ask the assistant to cancel it with `exec_shell_cancel` and the returned task id.
3. Use `Esc` or `Ctrl+C` to interrupt the current turn when you want to stop the request itself.
4. Retry prompt; if still failing, restart TUI.
5. On restart, verify the previous queued/in-flight runtime turn is shown as interrupted rather than left in a running state.
## Incident: Network Outage / Offline Behavior
Expected behavior:
- New prompts are queued while offline mode is active
- Queue state persists to `~/.deepseek/sessions/checkpoints/offline_queue.json`
Checks:
1. Open queue in TUI: `/queue list`
2. Confirm persisted queue file exists and updates timestamp
Actions:
1. Restore connectivity
2. Re-send queued entries (from `/queue edit <n>` + Enter, or normal input flow)
3. Ensure queue file clears when queue is empty
## Incident: Crash Recovery Needed
Expected behavior:
- Checkpoint stored at `~/.deepseek/sessions/checkpoints/latest.json`
- Startup begins a fresh session unless `--resume`/`--continue` is supplied
Actions:
1. Resume prior work explicitly via `deepseek --resume <id>` or `Ctrl+R` in TUI
2. If checkpoint inspection is needed, inspect `latest.json` for schema mismatch/details
3. If schema is newer than binary supports, upgrade binary or remove stale checkpoint
## Incident: Persistent State Schema Errors
Symptoms:
- Errors like `schema vX is newer than supported vY`
Affected stores:
- sessions (`~/.deepseek/sessions/*.json`)
- runtime thread/turn/item records
- tasks (`~/.deepseek/tasks/tasks/*.json`)
Actions:
1. Confirm binary version and migration expectations
2. Back up the state directory before editing
3. Either:
- run with a newer compatible binary, or
- archive incompatible records and regenerate state
## Incident: MCP/Tool Execution Failures
Checks:
1. Validate `~/.deepseek/mcp.json` schema and server command paths
2. Confirm server process can start manually
3. Check sandbox denials in TUI history / logs
Actions:
1. Retry with required approvals (or YOLO only when appropriate)
2. Temporarily disable failing MCP server and isolate issue
3. Re-enable after verification with `/mcp` diagnostics
## Post-Incident Checklist
1. Preserve logs and relevant state files
2. Record trigger, impact, and mitigation
3. Add or update regression tests (retry/recovery/schema)
4. Update this runbook and architecture docs if behavior changed