`codewhale/docs/OPERATIONS_RUNBOOK.md`, last changed in commit 7b91169017 ("refactor: move source files into workspace crates") by Hunter Bown, 2026-03-11 20:00:38 -05:00.


# DeepSeek TUI Operations Runbook
This runbook covers practical debugging and incident response for the local CLI/TUI runtime.
## Quick Triage
1. Confirm binary + config:
   - `cargo run -- --version`
   - `cat ~/.deepseek/config.toml` (or inspect configured profile)
2. Enable verbose logs:
   - `RUST_LOG=deepseek_cli=debug cargo run`
   - For HTTP retries/reconnects: `RUST_LOG=deepseek_cli::client=debug cargo run`
3. Capture current state:
   - `ls ~/.deepseek/sessions`
   - `ls ~/.deepseek/sessions/checkpoints`
   - `ls ~/.deepseek/tasks`
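
The state-capture listings above can be bundled into one file before anything is changed. This is a minimal bash sketch that assumes the default `~/.deepseek` layout listed here; the function name `capture_triage_snapshot` and the output file format are hypothetical and not part of the CLI.

```bash
# Hypothetical helper: write the triage listings above into one text file.
# ROOT defaults to ~/.deepseek; OUT defaults to triage-<timestamp>.txt in $PWD.
# config.toml is deliberately not copied because it may contain an API key.
capture_triage_snapshot() {
  local root="${1:-$HOME/.deepseek}"
  local out="${2:-triage-$(date +%Y%m%d-%H%M%S).txt}"
  {
    echo "# captured: $(date)"
    for dir in sessions sessions/checkpoints tasks; do
      echo "## $root/$dir"
      # A missing directory is recorded in the file instead of aborting.
      ls -la "$root/$dir" 2>&1 || true
    done
  } > "$out"
  echo "$out"
}
```

Called with no arguments, it writes the snapshot to the current directory and prints the file path, which can then be attached to the incident record.
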
## Incident: Turn Hangs or Stream Stops
Symptoms:
- TUI remains in the loading state
- Partial assistant output that never completes
Checks:
1. Inspect retry/health logs (`deepseek_cli::client`)
2. Verify endpoint connectivity:
   - `curl -sS https://api.deepseek.com/v1/models -H "Authorization: Bearer $DEEPSEEK_API_KEY"`
3. Check tool output to rule out a local sandbox or permission deadlock
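
The connectivity check can be wrapped so that it cannot hang and reports a one-word verdict. This bash sketch assumes `curl` is installed; `check_endpoint`, `classify_status`, and the 10-second timeout are illustrative choices, not part of the CLI.

```bash
# Hypothetical helper: map curl's HTTP status code to a triage hint.
# curl reports 000 for %{http_code} when no HTTP response was received.
classify_status() {
  case "$1" in
    200) echo "ok" ;;
    401|403) echo "auth" ;;      # key missing, invalid, or not permitted
    000) echo "unreachable" ;;   # DNS, TLS, connection failure, or timeout
    *) echo "http-$1" ;;
  esac
}

# Probe the models endpoint with a hard timeout so the check itself cannot hang.
# curl's own error message (if any) is left on stderr for diagnosis.
check_endpoint() {
  local url="${1:-https://api.deepseek.com/v1/models}"
  local code
  code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 \
    -H "Authorization: Bearer ${DEEPSEEK_API_KEY:-}" "$url")
  classify_status "${code:-000}"
}
```

An `auth` result points at the key or profile rather than the network. An `unreachable` result points at connectivity, which overlaps with the offline-mode incident below.
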
Actions:
1. Cancel current turn (`Esc` in TUI while loading)
2. Retry the prompt; if it still fails, restart the TUI
3. On restart, verify that the previously queued or in-flight runtime turn is shown as interrupted rather than left in a running state
## Incident: Network Outage / Offline Behavior
Expected behavior:
- New prompts are queued while offline mode is active
- Queue state persists to `~/.deepseek/sessions/checkpoints/offline_queue.json`
Checks:
1. Open queue in TUI: `/queue list`
2. Confirm the persisted queue file exists and its modification timestamp updates as prompts are queued
Actions:
1. Restore connectivity
2. Re-send queued entries (via `/queue edit <n>` followed by Enter, or through the normal input flow)
3. Confirm the queue file is cleared once the queue is empty
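
To check the persisted file and its timestamp during an outage, a small bash sketch can help. It assumes only the path given above and makes no assumption about the file's JSON contents; `queue_file_status` is a hypothetical name.

```bash
# Hypothetical helper: report whether the offline queue file exists, when it
# last changed (epoch seconds), and its size in bytes.
queue_file_status() {
  local f="${1:-$HOME/.deepseek/sessions/checkpoints/offline_queue.json}"
  if [ ! -e "$f" ]; then
    echo "absent"
    return 0
  fi
  local mtime size
  # GNU stat uses -c; BSD/macOS stat uses -f.
  mtime=$(stat -c %Y "$f" 2>/dev/null || stat -f %m "$f")
  # tr strips the padding that BSD wc adds.
  size=$(wc -c < "$f" | tr -d ' ')
  echo "present mtime=$mtime bytes=$size"
}
```

If you run it before and after queuing a prompt, the `mtime` value should be newer. After the queue drains, the result shows whether the file was removed (`absent`) or truncated (a small `bytes` value). Which of the two counts as "cleared" depends on the implementation.
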
## Incident: Crash Recovery Needed
Expected behavior:
- Checkpoint stored at `~/.deepseek/sessions/checkpoints/latest.json`
- Startup begins a fresh session unless `--resume`/`--continue` is supplied
Actions:
1. Resume prior work explicitly via `deepseek --resume <id>` or `Ctrl+R` in the TUI
2. If the checkpoint needs inspection, check `latest.json` for schema mismatches or other details
3. If its schema is newer than the binary supports, upgrade the binary or remove the stale checkpoint
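
A stale checkpoint is safer to move aside than to delete, because the moved file can still be inspected afterward. This bash sketch assumes the checkpoint path above; `archive_checkpoint` and the `.stale-<timestamp>` suffix are hypothetical conventions.

```bash
# Hypothetical helper: move a stale checkpoint aside instead of deleting it,
# so it can still be inspected or restored later. Prints the new path.
archive_checkpoint() {
  local f="${1:-$HOME/.deepseek/sessions/checkpoints/latest.json}"
  if [ ! -f "$f" ]; then
    echo "no checkpoint at $f" >&2
    return 1
  fi
  local dest
  dest="$f.stale-$(date +%Y%m%d-%H%M%S)"
  mv -- "$f" "$dest"
  echo "$dest"
}
```
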
## Incident: Persistent State Schema Errors
Symptoms:
- Errors like `schema vX is newer than supported vY`
Affected stores:
- sessions (`~/.deepseek/sessions/*.json`)
- runtime thread/turn/item records
- tasks (`~/.deepseek/tasks/tasks/*.json`)
Actions:
1. Confirm the binary version and which schema versions it is expected to migrate
2. Back up the state directory before editing
3. Either:
   - run with a newer compatible binary, or
   - archive incompatible records and regenerate state
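
Step 2 above (back up the state directory) can be done with a single archive command. This bash sketch assumes `tar` with gzip support and the default `~/.deepseek` root; `backup_state_dir` is a hypothetical name.

```bash
# Hypothetical helper: archive the whole state directory before editing it.
# Writes <name>-backup-<timestamp>.tar.gz into DEST_DIR (default: $PWD) and
# prints the archive path.
backup_state_dir() {
  local root="${1:-$HOME/.deepseek}"
  local dest_dir="${2:-$PWD}"
  if [ ! -d "$root" ]; then
    echo "state dir not found: $root" >&2
    return 1
  fi
  local parent name archive
  parent=$(dirname "$root")
  name=$(basename "$root")
  # Drop a leading dot so the archive itself is not a hidden file.
  archive="$dest_dir/${name#.}-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
  # -C stores paths relative to the parent, e.g. .deepseek/sessions/...
  tar -czf "$archive" -C "$parent" "$name"
  echo "$archive"
}
```

Keep `DEST_DIR` outside the state directory so the archive does not try to include itself. To restore, extract the archive back into the parent directory, for example `tar -xzf <archive> -C ~` for the default root.
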
## Incident: MCP/Tool Execution Failures
Checks:
1. Validate `~/.deepseek/mcp.json` schema and server command paths
2. Confirm server process can start manually
3. Check sandbox denials in TUI history / logs
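
For the command-path check, resolving each configured server command from a shell rules out a missing or misspelled executable before you look at the TUI. This bash sketch uses the hypothetical name `check_mcp_command`. It does not parse `mcp.json`, because that file's schema is not described here. The TUI may also run with a different `PATH` than an interactive shell, so a pass here is necessary but not sufficient.

```bash
# Hypothetical helper: confirm an MCP server command resolves to something
# executable. Pass the command exactly as configured in mcp.json.
check_mcp_command() {
  local cmd="$1"
  if [ -z "$cmd" ]; then
    echo "missing command" >&2
    return 2
  fi
  local resolved
  if ! resolved=$(command -v "$cmd"); then
    echo "not found: $cmd" >&2
    return 1
  fi
  echo "$resolved"
}
```
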
Actions:
1. Retry with the required approvals (or with YOLO, only when appropriate)
2. Temporarily disable the failing MCP server to isolate the issue
3. Re-enable it after verifying with `/mcp` diagnostics
## Post-Incident Checklist
1. Preserve logs and relevant state files
2. Record trigger, impact, and mitigation
3. Add or update regression tests (retry/recovery/schema)
4. Update this runbook and architecture docs if behavior changed