codewhale/docs/rfcs/REMOTE_SETUP_DESIGN.md

# `codewhale remote-setup` — Design & Implementation Plan

Status: **design** (do not implement against the 0.8.48 release wrap; land on a
branch or after 0.8.48 ships). Author handoff doc, mirrors the style of
`REFACTOR_HANDOFF.md`.

## Goal

One command — `codewhale remote-setup` — that guides a user through standing up
a remote CodeWhale agent they can talk to from a phone chat app, across:

- **Cloud target:** Tencent Lighthouse **or** Azure (extensible to GCP/Hetzner/bare).
- **Chat bridge:** Feishu/Lark **or** Telegram (extensible to Slack/Discord).
- **Model provider:** any entry in the existing `PROVIDERS` registry
  (DeepSeek, OpenAI, NVIDIA NIM, Atlascloud, WanjieArk, OpenRouter, Novita,
  Fireworks, Moonshot, SGLang, vLLM, Ollama, Xiaomi).

Decisions locked with the user:
- **Form:** native Rust subcommand in-binary (touches `crates/cli` + `crates/tui`).
- **Scope:** generate the deploy bundle **and** optionally auto-provision via the
  cloud CLI (`az` / CNB), behind a confirmation gate.

## Prior art: Hermes Agent (reference only — do not copy)

`/Volumes/VIXinSSD/hermesagent` (Nous Research's Hermes Agent, Python) solves the
same problem and **validates this design**. Use it for ideas; keep CodeWhale's
style (Rust core, zero-dep Node bridges, plain-text replies).

- Its `gateway/platform_registry.py` is exactly the table-driven approach here: a
  `PlatformEntry { name, label, adapter_factory, check_fn, validate_config,
  required_env, install_hint, setup_fn, source }`. That maps 1:1 onto our
  `BridgeSpec`/`CloudTarget` rows, and its per-platform `setup_fn` + `required_env`
  are what our wizard reads to prompt. A single gateway process fans out to many
  platforms — the model we want.
- Its `gateway/pairing.py` mirrors our allowlist/first-pairing flow.

### Telegram hardening checklist (mined from `gateway/platforms/telegram.py`)

That adapter is battle-tested; its method names enumerate edge cases our MVP
bridge should handle. Status against `integrations/telegram-bridge`:

| Edge case | In Hermes | In our bridge |
|---|---|---|
| 409 polling conflict (two `getUpdates`) | `_looks_like_polling_conflict` | **done** — poll loop backs off 10s + warns |
| 429 `retry_after` | rate-limit handling | **done** — `telegramApi` honors `parameters.retry_after` |
| Forum "General topic = 1" send/typing asymmetry | `_message_thread_id_for_send` vs `_for_typing` | **done** — omit `message_thread_id` when id is 1 on send |
| "message to be replied not found" after restart | `_send_with_dm_topic_reply_anchor_retry` | **sidestepped** — we never set `reply_to_message_id` |
| Network/connect-timeout retry | `_looks_like_network_error` | partial — generic 3s backoff in poll loop |
| Text batching + progress-edit (edit one msg vs spam) | `test_telegram_text_batching` | **deferred** — we send a chunk every 15s |
| MarkdownV2 escaping + table rendering | `_escape_mdv2`, `_wrap_markdown_tables` | **deferred** — plain text (safe; tables look plain) |
| Webhook mode as an alternative to long polling | `_webhook_mode` | out of scope — long-poll only (no inbound ports) |

Deferred items are deliberate: progress-edit and MarkdownV2 add real UX polish
but also complexity and (for MDV2) a whole class of parser-escaping bugs. Revisit
after `remote-setup` lands.

## Design principle: table-driven, like `ProviderSpec`

The provider registry (`crates/config/src/lib.rs::PROVIDERS`) is the model to
copy: "adding a provider is one row." Apply the same to clouds and bridges so
the matrix grows by data, not by new control flow.

```
        CloudTarget  ×  BridgeSpec   +  ProviderSpec (existing registry)
        ───────────     ──────────      ────────────────────────────────
        lighthouse      feishu          deepseek / openai / nvidia-nim / …
        azure           telegram        (wizard reads PROVIDERS, prompts for
        (future…)       (future…)        that provider's env_keys[0])
```

Clean separation that the architecture already implies:
- **Provider = a `runtime.env` concern.** The runtime resolves the provider from
  `CODEWHALE_PROVIDER` and the provider's own key var. The bridge never needs to
  know which provider is behind the runtime — it only forwards `model` to
  `/v1/threads`. So "multi-provider" only touches `runtime.env` generation.
- **Cloud = where it runs + where secrets live.**
- **Bridge = pure transport** between a chat app and `127.0.0.1:7878`.

## Command surface

New variant in `crates/cli/src/lib.rs` `Commands`:

```rust
/// Provision and configure a remote CodeWhale agent (cloud + chat bridge).
RemoteSetup(RemoteSetupArgs),
```

`RemoteSetupArgs` (clap):

| Flag | Meaning |
|---|---|
| `--cloud <azure\|lighthouse>` | Skip the cloud prompt. |
| `--bridge <telegram\|feishu>` | Skip the bridge prompt. |
| `--provider <slug>` | Provider slug; validated against `PROVIDERS`. |
| `--out <dir>` | Bundle output dir (default `./codewhale-deploy/<cloud>-<bridge>`). |
| `--generate-only` | Emit the bundle, do not provision (default). |
| `--apply` | Run the cloud CLI to actually provision (the auto-provision path). |
| `--yes` | Skip the final confirmation gate (CI/non-interactive). |
| `--non-interactive` | Fail instead of prompting if any required value is missing. |

CLI delegates to the TUI binary exactly like `Serve`/`Setup` do
(`delegate_to_tui(&cli, &resolved_runtime, tui_args("remote-setup", args))`).
The implementation lives next to `run_setup` in `crates/tui/src/`.

## Code layout

New module `crates/tui/src/remote_setup/`:

```
remote_setup/
  mod.rs          # run_remote_setup(): wizard orchestration + dispatch
  registry.rs     # CloudTarget + BridgeSpec tables (the matrix)
  prompt.rs       # thin stdin prompt helpers (reuse existing patterns)
  bundle.rs       # render env files / systemd units / RUNBOOK.md to --out
  provision/
    mod.rs        # Provisioner trait + confirmation gate + dry-run printer
    azure.rs      # az preflight, RG, VM+cloud-init, Key Vault, NSG, start
    lighthouse.rs # cnb.yml + tag_deploy.yml generation, CNB guidance
  templates/      # runtime.env, <bridge>.env, *.service, cloud-init.yaml.tmpl
```

### Registry types

```rust
pub struct BridgeSpec {
    pub slug: &'static str,            // "telegram"
    pub display: &'static str,         // "Telegram"
    pub package_dir: &'static str,     // "integrations/telegram-bridge"
    pub service_unit: &'static str,    // "codewhale-telegram-bridge.service"
    pub env_template: &'static str,    // templates/telegram.env
    /// Bridge-specific secret env keys to prompt for (token, etc.).
    pub secret_keys: &'static [&'static str], // ["TELEGRAM_BOT_TOKEN"]
    /// One-liner shown before prompting (e.g. "Create a bot with @BotFather").
    pub setup_hint: &'static str,
}

pub struct CloudTarget {
    pub slug: &'static str,            // "azure"
    pub display: &'static str,         // "Azure VM"
    pub secret_store: SecretStore,     // KeyVault | EnvFile
    pub install: InstallMethod,        // Docker | NativeSystemd
    /// Builds the ordered list of provisioning steps as (description, command).
    /// Commands are returned as data so they can be dry-run printed, gated,
    /// and only then executed.
    pub plan: fn(&DeployInputs) -> Vec<ProvisionStep>,
}
```

A `ProvisionStep { description, program, args, secret_args }` is *data*, never a
shell string — so the confirmation gate can print every command, secrets are fed
via stdin/temp files (never argv/`history`), and `--apply` just executes the
already-printed plan.

## Wizard flow

1. **Cloud** — pick from `CLOUD_TARGETS` (or `--cloud`).
2. **Bridge** — pick from `BRIDGES` (or `--bridge`); print `setup_hint`.
3. **Provider** — list `PROVIDERS` (canonical names), pick (or `--provider`).
   Look up `spec.env_keys[0]` as the key var to prompt for.
4. **Secrets** — prompt for: provider API key, bridge token(s) from
   `secret_keys`, allowlist (chat ids). Generate a random `CODEWHALE_RUNTIME_TOKEN`.
5. **Mode** — generate-only vs `--apply`.
6. **Render bundle** to `--out` (always, even with `--apply`).
7. **Confirm + provision** (only if `--apply`): print the full ordered command
   plan, require `y` (unless `--yes`), then execute step by step with progress.
8. **Print RUNBOOK.md** path and the remaining manual steps.

## Generated bundle

Written to `./codewhale-deploy/<cloud>-<bridge>/`:

- `runtime.env` — **provider config lives here**:
  ```
  CODEWHALE_PROVIDER=openai
  OPENAI_API_KEY=…              # the provider's own key var, from registry
  CODEWHALE_MODEL=auto
  CODEWHALE_RUNTIME_TOKEN=<random>
  CODEWHALE_RUNTIME_PORT=7878
  CODEWHALE_RUNTIME_WORKERS=2
  RUST_LOG=info
  ```
- `<bridge>.env` — transport only: `CODEWHALE_RUNTIME_URL=http://127.0.0.1:7878`,
  matching `CODEWHALE_RUNTIME_TOKEN`, allowlist, `TELEGRAM_BOT_TOKEN` (or Feishu
  app id/secret), `CODEWHALE_WORKSPACE`, `CODEWHALE_MODEL`.
- `codewhale-runtime.service`, `codewhale-<bridge>.service`.
- Cloud artifact: `cloud-init.yaml` + `provision.sh` (Azure) or `cnb.yml` +
  `tag_deploy.yml` (Lighthouse).
- `RUNBOOK.md` — the exact remaining commands + first-pairing steps.

## Auto-provision

### Azure (`--apply --cloud azure`)
Preflight: `az account show` (fail with "run `az login`" if absent). Then the
`plan()` emits, in order:
1. `az group create` (region prompted; default `eastus`).
2. `az keyvault create` + `az keyvault secret set` for the provider key and the
   runtime token (secrets via stdin, not argv).
3. `az vm create` with `--custom-data cloud-init.yaml` and a **system-assigned
   managed identity**; cloud-init pulls `ghcr.io/hmbown/codewhale:latest`, reads
   the secrets from Key Vault via the identity, writes `/etc/codewhale/*.env`,
   installs both systemd units, `enable --now`.
4. NSG: SSH (22) only, scoped to the caller's IP; **7878 stays on `127.0.0.1`**.
5. Print the SSH tunnel command for `/status` from a laptop if desired.

### Lighthouse (`--apply --cloud lighthouse`)
Reuse the existing `deploy/tencent-lighthouse/cnb/*.example` pipeline: render
`cnb.yml` + `tag_deploy.yml` from inputs and walk the user through the CNB
trigger (CNB does the VM-side work). Systemd units mirror the existing
`codewhale-runtime.service`.

Safety (matches the harness rules for outward-facing actions):
- Every command printed before execution; `y` gate unless `--yes`.
- Secrets never in argv or shell history.
- `--generate-only` is the default; `--apply` is explicit.

## Namespace migration: `DEEPSEEK_*` → `CODEWHALE_*`

Follow the convention already in `crates/config/src/lib.rs`: **read
`CODEWHALE_X` first, fall back to `DEEPSEEK_X`.** Nothing breaks for existing
deployments.

Touch list:
1. **Bridges** (`integrations/feishu-bridge`, `integrations/telegram-bridge`):
   in `lib.mjs`/`index.mjs`, read `process.env.CODEWHALE_X ?? process.env.DEEPSEEK_X`
   for `RUNTIME_URL`, `RUNTIME_TOKEN`, `WORKSPACE`, `MODEL`, `MODE`, `ALLOW_SHELL`,
   `TRUST_MODE`, `AUTO_APPROVE`, `CHAT_ALLOWLIST`, `ALLOW_UNLISTED`, `TURN_TIMEOUT_MS`.
   Validators accept either; templates emit `CODEWHALE_*`.
2. **Deploy units** (`deploy/tencent-lighthouse/systemd/*`,
   `integrations/*/deploy/*`): `DEEPSEEK_RUNTIME_*` → `CODEWHALE_RUNTIME_*`,
   env file paths `/etc/deepseek/` → `/etc/codewhale/` (keep reading the old path
   if present).
3. **`.env.example` files + `config.example.toml`**: lead with `CODEWHALE_*`,
   document `DEEPSEEK_*` as legacy aliases.
4. **Drop DeepSeek-shaped defaults** in the bridge: no hardcoded
   `DEEPSEEK_MODEL=auto`; the provider lives in `runtime.env` via
   `CODEWHALE_PROVIDER` + the registry's key var.

Note: items 1–3 touch **tracked** files, so they are part of the same
"don't ship during 0.8.48" hold. The brand-new (untracked) Telegram bridge can
be converted to `CODEWHALE_*` first as the reference implementation.

## Tests

- `registry.rs`: every `CloudTarget`/`BridgeSpec` slug is unique; each bridge's
  `package_dir`/`service_unit`/`env_template` exists.
- `bundle.rs`: rendering a bundle for each cloud×bridge×provider triple produces
  files with `CODEWHALE_*` keys, a matching runtime/bridge token, and a non-empty
  RUNBOOK.
- `provision`: `plan()` returns the expected ordered steps; **commands are built
  but never executed** in tests (assert on program+args, secrets redacted).

## Extensibility check

- Add **GCP**: one `CloudTarget` row + a `provision/gcp.rs` + a cloud-init reuse.
- Add **Slack**: one `BridgeSpec` row + `integrations/slack-bridge` + template.
No changes to the wizard control flow — it iterates the registries.

## Suggested sequencing (given the 0.8.48 freeze)

1. **Now (safe, untracked):** convert the new Telegram bridge to `CODEWHALE_* ??
   DEEPSEEK_*`; finalize this design.
2. **Post-0.8.48, branch:** namespace migration on tracked bridges + deploy units.
3. **Then:** implement `remote-setup` (registry → bundle → Azure provisioner →
   Lighthouse provisioner), generate-only first, `--apply` second.