Hunter Bown ab2c708ca7 feat: runtime API, task manager, and extensive improvements (v0.3.16)
2026-02-16 10:51:39 -06:00

# DeepSeek CLI Operations Runbook
This runbook covers practical debugging and incident response for the local CLI/TUI runtime.
## Quick Triage
1. Confirm binary + config:
   - `cargo run -- --version`
   - `cat ~/.deepseek/config.toml` (or inspect configured profile)
2. Enable verbose logs:
   - `RUST_LOG=deepseek_cli=debug cargo run`
   - For HTTP retries/reconnects: `RUST_LOG=deepseek_cli::client=debug cargo run`
3. Capture current state:
   - `ls ~/.deepseek/sessions`
   - `ls ~/.deepseek/sessions/checkpoints`
   - `ls ~/.deepseek/tasks`
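
The state capture in step 3 can be folded into one helper so the listing is saved to a file rather than lost in scrollback. The following is a minimal sketch, not part of the CLI: `capture_state` and its report format are hypothetical, and it assumes only the three directories named above.

```shell
# Snapshot the listing of each state directory into one text report.
# Usage: capture_state <state_root> <output_file>
# NOTE: hypothetical helper; the CLI does not ship this function.
capture_state() {
  local root="$1" out="$2" sub
  : > "$out"    # create or truncate the report
  for sub in sessions sessions/checkpoints tasks; do
    printf '== %s ==\n' "$root/$sub" >> "$out"
    if [ -d "$root/$sub" ]; then
      ls -la "$root/$sub" >> "$out"
    else
      printf '(missing)\n' >> "$out"
    fi
  done
}
```

Example use: `capture_state ~/.deepseek /tmp/deepseek-state.txt`. Writing the report outside the state directory keeps the snapshot from altering what it records.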
## Incident: Turn Hangs or Stream Stops
Symptoms:
- TUI remains in loading state
- Partial assistant output with no completion
Checks:
1. Inspect retry/health logs (`deepseek_cli::client`)
2. Verify endpoint connectivity:
   - `curl -sS https://api.deepseek.com/v1/models -H "Authorization: Bearer $DEEPSEEK_API_KEY"`
3. Confirm no local sandbox/permission deadlock in tool output
Actions:
1. Cancel current turn (`Esc` in TUI)
2. Retry prompt; if still failing, restart TUI
3. On restart, verify crash checkpoint recovery message appears
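
For the connectivity check, reducing the `curl` call to a status code can make the result easier to act on. The sketch below assumes generic HTTP semantics only. The hint strings and both function names are illustrative and do not come from the CLI or the DeepSeek API documentation.

```shell
# Map an HTTP status code to a triage hint (generic HTTP semantics).
# NOTE: hypothetical helper; the hint wording is an assumption.
classify_http_status() {
  case "$1" in
    000) echo "no response: DNS, TLS, proxy, or network failure" ;;
    2??) echo "ok: endpoint reachable and request accepted" ;;
    401|403) echo "auth: check DEEPSEEK_API_KEY" ;;
    429) echo "rate limited: back off before retrying" ;;
    5??) echo "server error: likely upstream, retry later" ;;
    *) echo "unexpected status: $1" ;;
  esac
}

# Print only the status code for the models endpoint used in the check above.
# curl reports 000 when no HTTP response was received.
probe_endpoint() {
  curl -sS -o /dev/null -m 10 -w '%{http_code}' \
    https://api.deepseek.com/v1/models \
    -H "Authorization: Bearer $DEEPSEEK_API_KEY"
}
```

Example use: `classify_http_status "$(probe_endpoint)"`. A `000` result points at the local network path. A `401` or `403` points at the key rather than at the stream handling.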
## Incident: Network Outage / Offline Behavior
Expected behavior:
- New prompts are queued while offline mode is active
- Queue state persists to `~/.deepseek/sessions/checkpoints/offline_queue.json`
Checks:
1. Open queue in TUI: `/queue list`
2. Confirm the persisted queue file exists and that its modification timestamp updates when the queue changes
Actions:
1. Restore connectivity
2. Re-send queued entries (via `/queue edit <n>` followed by Enter, or through the normal input flow)
3. Ensure queue file clears when queue is empty
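
One way to confirm that the queue file's timestamp really changes is to compare it against a marker file created just before queuing a prompt. This is a sketch under that assumption. `queue_updated_since` is a hypothetical helper, and `-nt` is a widely supported `test` extension (bash, dash, ksh) rather than strict POSIX.

```shell
# Succeeds when the queue file exists and was modified after the marker file.
# Usage: queue_updated_since <queue_file> <marker_file>
# NOTE: hypothetical helper; not a CLI command.
queue_updated_since() {
  [ -f "$1" ] && [ "$1" -nt "$2" ]
}
```

Example use: run `touch /tmp/queue.marker`, submit a prompt while offline, then run `queue_updated_since ~/.deepseek/sessions/checkpoints/offline_queue.json /tmp/queue.marker && echo updated`.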
## Incident: Crash Recovery Needed
Expected behavior:
- Checkpoint stored at `~/.deepseek/sessions/checkpoints/latest.json`
- Startup auto-restores checkpoint when no explicit `--resume` target is supplied
Actions:
1. Start TUI normally and verify "Recovered checkpoint session" status
2. If automatic recovery fails, inspect checkpoint JSON for schema mismatch
3. If schema is newer than binary supports, upgrade binary or remove stale checkpoint
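
Inspecting the checkpoint for a schema mismatch can be scripted roughly as follows. The key name `schema_version` is an assumption, since this runbook does not show the checkpoint layout, so substitute the field the real file uses. The extraction is text-based rather than a JSON parser, and expects the key and its integer value on the same line.

```shell
# Print the version from the first line that contains a "schema_version" key.
# NOTE: the key name is an assumption; adjust it to match the real checkpoint.
extract_schema_version() {
  sed -n 's/.*"schema_version"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Compare a checkpoint's version with the highest version the binary supports.
# Prints "ok", "newer", or "unknown" (no version found).
# Usage: schema_status <checkpoint.json> <supported_version>
schema_status() {
  local found
  found="$(extract_schema_version "$1")"
  if [ -z "$found" ]; then
    echo "unknown"
  elif [ "$found" -gt "$2" ]; then
    echo "newer"
  else
    echo "ok"
  fi
}
```

Example use: `schema_status ~/.deepseek/sessions/checkpoints/latest.json 2`, where `2` stands in for whatever version the installed binary supports. A result of `newer` corresponds to action 3 above.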
## Incident: Persistent State Schema Errors
Symptoms:
- Errors like `schema vX is newer than supported vY`
Affected stores:
- sessions (`~/.deepseek/sessions/*.json`)
- runtime thread/turn/item records
- tasks (`~/.deepseek/tasks/tasks/*.json`)
Actions:
1. Confirm binary version and migration expectations
2. Back up the state directory before editing
3. Either:
   - run with a newer compatible binary, or
   - archive incompatible records and regenerate state
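
The backup and archive actions can be sketched as two small helpers. Both names and the tarball naming scheme are hypothetical. The backup directory should live outside the state directory so the archive does not include itself.

```shell
# Create a timestamped tarball of the state directory and print its path.
# Usage: backup_state_dir <state_dir> <backup_dir>
# NOTE: hypothetical helper; the archive name format is an assumption.
backup_state_dir() {
  local src="$1" dest="$2" stamp archive
  stamp="$(date +%Y%m%d-%H%M%S)"
  archive="$dest/deepseek-state-$stamp.tar.gz"
  mkdir -p "$dest" || return 1
  # -C keeps paths in the tarball relative to the state directory's parent.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  echo "$archive"
}

# Move one incompatible record into an archive directory instead of deleting it.
# Usage: archive_record <record_file> <archive_dir>
archive_record() {
  mkdir -p "$2" && mv "$1" "$2/"
}
```

Example use: run `backup_state_dir ~/.deepseek ~/deepseek-backups` first, then call `archive_record` on each record that fails to load. Moving records rather than deleting them keeps the option of restoring them under a newer binary.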
## Incident: MCP/Tool Execution Failures
Checks:
1. Validate `~/.deepseek/mcp.json` schema and server command paths
2. Confirm server process can start manually
3. Check sandbox denials in TUI history / logs
Actions:
1. Retry with the required approvals (or in YOLO mode, only when appropriate)
2. Temporarily disable failing MCP server and isolate issue
3. Re-enable after verification with `/mcp` diagnostics
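
Checking server command paths can be done before involving the TUI at all. The sketch below takes the commands as arguments because the layout of `~/.deepseek/mcp.json` is not shown in this runbook, so copy each server command out of that file by hand. `check_server_commands` is a hypothetical helper.

```shell
# Report whether each given server command resolves to an executable.
# Prints one "ok <cmd>" or "missing <cmd>" line per argument and
# returns non-zero if any command is missing.
# NOTE: hypothetical helper; it does not read mcp.json itself.
check_server_commands() {
  local cmd status=0
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "ok $cmd"
    else
      echo "missing $cmd"
      status=1
    fi
  done
  return "$status"
}
```

Example use: `check_server_commands npx /usr/local/bin/my-mcp-server`, where both arguments are placeholders for the commands configured locally. A `missing` line usually means the path is wrong or the binary is not on the `PATH` that the CLI inherits.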
## Post-Incident Checklist
1. Preserve logs and relevant state files
2. Record trigger, impact, and mitigation
3. Add or update regression tests (retry/recovery/schema)
4. Update this runbook and architecture docs if behavior changed