codewhale/docs/OPERATIONS_RUNBOOK.md
Hunter Bown ab2c708ca7 feat: runtime API, task manager, and extensive improvements (v0.3.16)
2026-02-16 10:51:39 -06:00

DeepSeek CLI Operations Runbook

This runbook covers practical debugging and incident response for the local CLI/TUI runtime.

Quick Triage

  1. Confirm binary + config:
    • cargo run -- --version
    • cat ~/.deepseek/config.toml (or inspect the configured profile)
  2. Enable verbose logs:
    • RUST_LOG=deepseek_cli=debug cargo run
    • For HTTP retries/reconnects: RUST_LOG=deepseek_cli::client=debug cargo run
  3. Capture current state:
    • ls ~/.deepseek/sessions
    • ls ~/.deepseek/sessions/checkpoints
    • ls ~/.deepseek/tasks
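
To keep a record of step 3 for the incident notes, the three listings can be collected in one pass. This is a minimal sketch, not part of the CLI: the function name snapshot_state and the output format are hypothetical, and only the directory names come from the steps above.

```python
from datetime import datetime, timezone
from pathlib import Path

# Directories named in the triage steps, relative to the state root.
STATE_DIRS = ("sessions", "sessions/checkpoints", "tasks")


def snapshot_state(root: Path) -> list[str]:
    """Return one line per file directly inside each state directory.

    Each line holds the path relative to root, the size in bytes,
    and the modification time in UTC. Missing directories are reported
    instead of raising, since a missing directory is itself a finding.
    """
    lines = []
    for sub in STATE_DIRS:
        directory = root / sub
        if not directory.is_dir():
            lines.append(f"{sub}: missing")
            continue
        for entry in sorted(directory.iterdir()):
            if entry.is_file():
                st = entry.stat()
                mtime = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
                rel = entry.relative_to(root).as_posix()
                lines.append(f"{rel}  {st.st_size}B  {mtime:%Y-%m-%dT%H:%M:%SZ}")
    return lines
```

Calling print("\n".join(snapshot_state(Path.home() / ".deepseek"))) before touching anything gives a baseline to compare against after recovery.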

Incident: Turn Hangs or Stream Stops

Symptoms:

  • TUI remains in loading state
  • partial assistant output with no completion

Checks:

  1. Inspect the retry and health logs from the deepseek_cli::client log target
  2. Verify endpoint connectivity:
    • curl -sS https://api.deepseek.com/v1/models -H "Authorization: Bearer $DEEPSEEK_API_KEY"
  3. Confirm that tool output shows no local sandbox or permission deadlock

Actions:

  1. Cancel current turn (Esc in TUI)
  2. Retry prompt; if still failing, restart TUI
  3. On restart, verify crash checkpoint recovery message appears
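
When curl is unavailable, or the result should feed a script, the connectivity check can be reproduced with the Python standard library. This is a sketch under assumptions: the status grouping in classify is a triage convenience chosen here, not behavior of the CLI or the API, and the helper names are hypothetical.

```python
import urllib.error
import urllib.request

MODELS_URL = "https://api.deepseek.com/v1/models"  # endpoint from the check above


def build_request(api_key: str) -> urllib.request.Request:
    """Build the same authenticated GET that the curl check sends."""
    return urllib.request.Request(
        MODELS_URL, headers={"Authorization": f"Bearer {api_key}"}
    )


def classify(status: int) -> str:
    """Map an HTTP status to a coarse triage hint (grouping is an assumption)."""
    if 200 <= status < 300:
        return "ok"
    if status in (401, 403):
        return "auth"
    if status == 429:
        return "rate-limited"
    if status >= 500:
        return "server"
    return "other"


def probe(api_key: str, timeout: float = 10.0) -> str:
    """Send the request and return a triage hint instead of raising."""
    try:
        with urllib.request.urlopen(build_request(api_key), timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return classify(err.code)
    except OSError as err:
        # DNS failure, refused connection, timeout, and similar.
        return f"unreachable: {err}"
```

A result of "auth" points at the API key rather than the network, while "unreachable" suggests treating the problem as an outage.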

Incident: Network Outage / Offline Behavior

Expected behavior:

  • New prompts are queued while offline mode is active
  • Queue state persists to ~/.deepseek/sessions/checkpoints/offline_queue.json

Checks:

  1. Open queue in TUI: /queue list
  2. Confirm the persisted queue file exists and that its modification timestamp updates

Actions:

  1. Restore connectivity
  2. Re-send queued entries (via /queue edit <n> then Enter, or through the normal input flow)
  3. Confirm the queue file clears once the queue is empty
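
The file checks above (existence, and a timestamp that moves when prompts are queued) can be scripted as follows. This sketch only looks at file metadata, because the queue's JSON layout is not described here. The function name is hypothetical, and whether an emptied queue removes the file or leaves an empty one is not confirmed by this runbook.

```python
import time
from pathlib import Path
from typing import Optional

# Path taken from the expected-behavior note above.
QUEUE_FILE = Path.home() / ".deepseek/sessions/checkpoints/offline_queue.json"


def queue_file_status(path: Path, now: Optional[float] = None) -> str:
    """Describe the queue file: absent, or its size and age in seconds."""
    if not path.exists():
        return "absent"
    st = path.stat()
    age = (time.time() if now is None else now) - st.st_mtime
    return f"present, {st.st_size} bytes, modified {age:.0f}s ago"
```

Run queue_file_status(QUEUE_FILE) once before and once after queuing a prompt offline; the reported age should drop if the queue is being persisted.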

Incident: Crash Recovery Needed

Expected behavior:

  • The checkpoint is stored at ~/.deepseek/sessions/checkpoints/latest.json
  • Startup auto-restores the checkpoint when no explicit --resume target is supplied

Actions:

  1. Start TUI normally and verify "Recovered checkpoint session" status
  2. If automatic recovery fails, inspect checkpoint JSON for schema mismatch
  3. If schema is newer than binary supports, upgrade binary or remove stale checkpoint
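
Step 2 can be done without opening the file by hand. In the sketch below, the field name schema_version and the supported version number are hypothetical placeholders: check an actual checkpoint file and the binary's release notes for the real values before relying on the result.

```python
import json
from pathlib import Path

SUPPORTED_SCHEMA = 1             # hypothetical: highest version your binary supports
VERSION_KEY = "schema_version"   # hypothetical field name; confirm in your checkpoint


def check_checkpoint(path: Path, supported: int = SUPPORTED_SCHEMA) -> str:
    """Classify a checkpoint file as missing, corrupt, too new, or ok."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return "missing"
    except json.JSONDecodeError as err:
        return f"corrupt: {err.msg} at line {err.lineno}"
    version = data.get(VERSION_KEY) if isinstance(data, dict) else None
    if not isinstance(version, int):
        return "no schema version found"
    if version > supported:
        return f"schema v{version} is newer than supported v{supported}"
    return "ok"
```

A "corrupt" result usually means the checkpoint was cut off mid-write and should be archived rather than edited; a "newer than supported" result leads to step 3.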

Incident: Persistent State Schema Errors

Symptoms:

  • Errors such as "schema vX is newer than supported vY"

Affected stores:

  • sessions (~/.deepseek/sessions/*.json)
  • runtime thread/turn/item records
  • tasks (~/.deepseek/tasks/tasks/*.json)

Actions:

  1. Confirm binary version and migration expectations
  2. Back up the state directory before editing
  3. Either:
    • run with a newer compatible binary, or
    • archive incompatible records and regenerate state
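
The backup in step 2 can be taken as a timestamped archive so that repeated attempts never overwrite each other. This is a minimal sketch using the standard library; the function name and the archive naming scheme are choices made here, not CLI features. Keep the destination outside the state directory so the archive does not include itself.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_state(state_dir: Path, dest_root: Path) -> Path:
    """Write state_dir into a timestamped .tar.gz under dest_root.

    Returns the path of the created archive. Entries inside the archive
    are stored relative to state_dir's parent, so extracting it recreates
    the directory under its original name.
    """
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest_root.mkdir(parents=True, exist_ok=True)
    archive_base = dest_root / f"{state_dir.name}-backup-{stamp}"
    archive = shutil.make_archive(
        str(archive_base),
        "gztar",
        root_dir=state_dir.parent,
        base_dir=state_dir.name,
    )
    return Path(archive)
```

For example, backup_state(Path.home() / ".deepseek", Path.home() / "deepseek-backups") should be run with the TUI stopped so that no file changes mid-archive.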

Incident: MCP/Tool Execution Failures

Checks:

  1. Validate ~/.deepseek/mcp.json schema and server command paths
  2. Confirm server process can start manually
  3. Check sandbox denials in TUI history / logs

Actions:

  1. Retry with the required approvals (or use YOLO only when appropriate)
  2. Temporarily disable failing MCP server and isolate issue
  3. Re-enable after verification with /mcp diagnostics
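
The command-path check for ~/.deepseek/mcp.json can be automated. The layout assumed below (a top-level "mcpServers" or "servers" object mapping each server name to an entry with a "command" string) is a guess based on common MCP client configs, not a confirmed description of this CLI's schema; adjust the keys to match the actual file.

```python
import json
import shutil
from pathlib import Path


def unresolved_commands(config_path: Path) -> dict[str, str]:
    """Return {server_name: command} for servers whose command cannot be found.

    A command counts as found if shutil.which resolves it, either on PATH
    or as a direct path to an executable file.
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumed top-level keys; neither is confirmed for this CLI.
    servers = config.get("mcpServers") or config.get("servers") or {}
    missing = {}
    for name, spec in servers.items():
        command = spec.get("command", "")
        if not command or shutil.which(command) is None:
            missing[name] = command
    return missing
```

An empty result only shows that each command exists; the server can still fail at startup, so the manual start in check 2 remains necessary.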

Post-Incident Checklist

  1. Preserve logs and relevant state files
  2. Record trigger, impact, and mitigation
  3. Add or update regression tests (retry/recovery/schema)
  4. Update this runbook and architecture docs if behavior changed