# OpenClaw Reliability & Redundancy Plan
Created: 2026-03-08

## Context
Mike is frustrated with recurring stalls/outages (March 2, 3, 4, 8). Customers will rely on Harvey + Leah. Need production-grade reliability.

## Root Causes of Current Issues
1. **MacBook Air as gateway host** - WiFi drops, sleep mode, lid closes = gateway goes dark
2. **Config validation errors** - Slack allowFrom misconfigured, required `openclaw doctor --fix`
3. **Stale connections** - websocket and Telegram polling connections drop silently
4. **No monitoring** - Mike only discovers outages when he messages and gets silence
5. **Running outdated version** (2026.3.2 vs 2026.3.7 available)

## Recommended Actions (Priority Order)

### 1. IMMEDIATE: Update OpenClaw (5 min)
- Current: 2026.3.2, Available: 2026.3.7
- We're missing up to five releases of bug fixes (2026.3.3 through 2026.3.7)
- Command: `openclaw update`

### 2. SHORT-TERM: Move Gateway to a VPS (1-2 hours)
**This is the single biggest reliability improvement.**

Options (cheapest first):
- **Oracle Cloud Always Free**: $0/month, 4 OCPU ARM, 24GB RAM (signup can be finicky)
- **Hetzner CX22**: ~$4/month, 2 vCPU, 4GB RAM (best price/perf)
- **DigitalOcean**: $6/month, 1 vCPU, 1GB RAM (easiest setup)

Benefits:
- Always-on (no sleep, no WiFi drops)
- systemd auto-restart on crash (`Restart=always`, `RestartSec=2`)
- Wired datacenter network (no wireless flakiness)
- Survives MacBook being closed/off/away
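
The auto-restart behavior above comes from two systemd settings. As a minimal sketch, the script below renders a unit file containing them. The start command, binary path, and user name are assumptions for illustration, not confirmed OpenClaw values; if the OpenClaw installer already sets up its own service, prefer that.

```python
# Sketch: render a systemd unit that restarts the gateway whenever it exits.
UNIT_TEMPLATE = """\
[Unit]
Description={description}
After=network-online.target
Wants=network-online.target

[Service]
ExecStart={exec_start}
Restart=always
RestartSec=2
User={user}

[Install]
WantedBy=multi-user.target
"""


def render_unit(exec_start: str, user: str, description: str = "OpenClaw gateway") -> str:
    """Return unit-file text using Restart=always and RestartSec=2 from the plan."""
    return UNIT_TEMPLATE.format(description=description, exec_start=exec_start, user=user)


if __name__ == "__main__":
    # ASSUMPTION: the command path, subcommand, and user are placeholders.
    print(render_unit(exec_start="/usr/local/bin/openclaw gateway", user="openclaw"))
```

The output would typically be saved under `/etc/systemd/system/` and enabled with `systemctl enable --now`. Note that `Restart=always` covers process exits; a process that hangs without exiting still needs the monitoring in the next section.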

Access from MacBook via:
- Tailscale (already familiar with it)
- SSH tunnel for dashboard
- Telegram/Slack channels work the same

### 3. SHORT-TERM: Health Monitoring & Alerts (30 min)
**So Mike knows immediately when something breaks, not hours later.**

Options:
- **Cron-based self-check**: Add a cron job that runs `openclaw health --json` every 5 min. If unhealthy, send alert to Mike via Telegram/Slack.
- **External uptime monitor**: Free services like UptimeRobot or BetterStack can ping the gateway health endpoint.
- **Heartbeat watchdog**: If Harvey misses 2+ heartbeat cycles, the gateway can auto-restart.
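
The cron-based self-check could look roughly like the sketch below. It runs `openclaw health --json` and sends a Telegram message through the Bot API `sendMessage` method if the check fails. The shape of the health JSON is not documented here, so the top-level `"ok"` field is an assumption, as is the idea that a non-zero exit code means unhealthy. The environment variable names are hypothetical.

```python
import json
import os
import subprocess
import urllib.parse
import urllib.request


def is_healthy(raw_json: str) -> bool:
    """Interpret health output. ASSUMPTION: a top-level boolean "ok" field."""
    try:
        report = json.loads(raw_json)
    except json.JSONDecodeError:
        return False  # unparseable output is treated as unhealthy
    return isinstance(report, dict) and report.get("ok") is True


def run_health_check(timeout_s: int = 30) -> bool:
    """Run the health command; errors, timeouts, and non-zero exits count as unhealthy."""
    try:
        proc = subprocess.run(
            ["openclaw", "health", "--json"],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0 and is_healthy(proc.stdout)


def send_telegram_alert(bot_token: str, chat_id: str, text: str) -> None:
    """POST a message to the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    with urllib.request.urlopen(url, data=data, timeout=10) as resp:
        resp.read()


def check_and_alert(bot_token: str, chat_id: str) -> bool:
    """Return True if healthy; otherwise send an alert and return False."""
    healthy = run_health_check()
    if not healthy:
        send_telegram_alert(bot_token, chat_id, "OpenClaw gateway health check failed")
    return healthy


if __name__ == "__main__":
    # ASSUMPTION: hypothetical variable names; nothing is sent if they are unset.
    token = os.environ.get("ALERT_BOT_TOKEN")
    chat = os.environ.get("ALERT_CHAT_ID")
    if token and chat:
        check_and_alert(token, chat)
```

A crontab entry such as `*/5 * * * * /usr/bin/python3 /path/to/health_check.py` (path hypothetical) would run it every 5 minutes. Because the alert goes straight to the Telegram API, it does not depend on the gateway being up. It does depend on the host being up, which is the gap the external uptime monitor covers.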

On a VPS with systemd, the service auto-restarts on crash. Combined with monitoring, downtime drops from "hours until Mike notices" to "seconds to auto-recover, Mike gets notified."
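
For the heartbeat watchdog option, the "2+ missed cycles" rule reduces to a small counter. This is an illustrative sketch of that rule only, not OpenClaw's own watchdog; the class and method names are made up for the example.

```python
class HeartbeatWatchdog:
    """Count consecutive missed heartbeats and signal when a restart is due."""

    def __init__(self, max_missed: int = 2) -> None:
        self.max_missed = max_missed  # from the plan: restart after 2+ missed cycles
        self.missed = 0

    def record(self, heartbeat_seen: bool) -> bool:
        """Record one cycle; return True when the restart threshold is reached."""
        if heartbeat_seen:
            self.missed = 0
            return False
        self.missed += 1
        if self.missed >= self.max_missed:
            self.missed = 0  # start counting afresh after triggering a restart
            return True
        return False
```

A single missed heartbeat is tolerated so that one slow cycle does not trigger a restart; only consecutive misses count.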

### 4. SHORT-TERM: Config Protection (15 min)
Prevent config-related outages:
- Run `openclaw doctor` after any config change to validate before restart
- Keep config-guard.sh backups (already have this)
- Fix the Slack allowFrom properly so it validates clean
- Consider `openclaw doctor --non-interactive` as a pre-restart hook
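
A pre-restart hook along these lines might be sketched as follows: validate first, and only restart when validation passes. This assumes `openclaw doctor --non-interactive` exits non-zero when the config is invalid, which should be verified before relying on it. The systemd unit name in the restart command is hypothetical.

```python
import subprocess
from typing import Callable, Sequence

VALIDATE_CMD = ["openclaw", "doctor", "--non-interactive"]
# ASSUMPTION: hypothetical unit name; substitute the real restart command.
RESTART_CMD = ["systemctl", "restart", "openclaw-gateway"]


def run_exit_code(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def validate_then_restart(
    validate_cmd: Sequence[str] = VALIDATE_CMD,
    restart_cmd: Sequence[str] = RESTART_CMD,
    run: Callable[[Sequence[str]], int] = run_exit_code,
) -> bool:
    """Restart only if validation exits 0; return True if the restart succeeded."""
    if run(validate_cmd) != 0:
        print("Config validation failed; leaving the running gateway untouched.")
        return False
    return run(restart_cmd) == 0
```

The point of the design is that a bad config never reaches a restart, so the previously working gateway process keeps running while the config is fixed.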

### 5. MEDIUM-TERM: Rescue Bot (optional, 30 min)
OpenClaw supports running a second "rescue" gateway on the same host:
- Separate profile, port, config, state
- If primary bot is down, rescue bot can diagnose/fix
- Useful for remote VPS where you can't always SSH in
- Command: `openclaw --profile rescue onboard`

### 6. MEDIUM-TERM: Model Failover (already built in!)
OpenClaw already handles:
- Auth profile rotation (multiple API keys)
- Model fallback chain (`agents.defaults.model.fallbacks`)
- Exponential backoff on rate limits
- Billing disable detection

Still to do: configure fallback models (e.g., Sonnet as fallback for Opus).
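
To make the fallback chain and exponential backoff concrete, here is a generic sketch of the technique. It is not OpenClaw's implementation; `RateLimited`, `call_with_fallback`, and the model labels are hypothetical stand-ins.

```python
import time
from typing import Callable, List, Optional, Sequence


class RateLimited(Exception):
    """Hypothetical error standing in for a provider rate-limit response."""


def backoff_delays(base_s: float, retries: int, cap_s: float = 60.0) -> List[float]:
    """Delays that double each attempt (base, 2*base, 4*base, ...), capped at cap_s."""
    return [min(cap_s, base_s * 2 ** attempt) for attempt in range(retries)]


def call_with_fallback(
    models: Sequence[str],
    call: Callable[[str], str],
    retries: int = 3,
    base_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order; retry rate limits with backoff before falling through."""
    last_error: Optional[Exception] = None
    for model in models:
        for delay in backoff_delays(base_s, retries):
            try:
                return call(model)
            except RateLimited as exc:
                last_error = exc
                sleep(delay)  # wait longer after each consecutive rate limit
    raise RuntimeError("all models exhausted") from last_error
```

With `models=["primary", "fallback"]`, the primary is attempted `retries` times with growing delays, and only then does the request move to the fallback.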

### 7. MEDIUM-TERM: Backup Strategy
On VPS:
- Git-backed workspace (push to private GitHub repo)
- Periodic tar backup of ~/.openclaw/ (config, creds, sessions)
- Can automate with cron
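
The periodic tar backup could be sketched like this, using only the Python standard library. The archive naming scheme and the retention count of 7 are assumptions chosen for the example. The destination directory must sit outside the directory being backed up.

```python
import tarfile
import time
from pathlib import Path
from typing import List


def prune(dest_dir: Path, keep: int) -> List[Path]:
    """Delete all but the newest `keep` archives; names sort chronologically."""
    archives = sorted(dest_dir.glob("openclaw-*.tar.gz"))
    stale = archives[:-keep] if keep > 0 else archives
    for path in stale:
        path.unlink()
    return stale


def backup_dir(source: Path, dest_dir: Path, keep: int = 7) -> Path:
    """Write a timestamped .tar.gz of `source` into `dest_dir`, then prune old ones."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"openclaw-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=source.name)
    # The archive contains credentials, so restrict it to the owner.
    archive.chmod(0o600)
    prune(dest_dir, keep)
    return archive
```

Calling `backup_dir(Path.home() / ".openclaw", Path.home() / "backups")` from a daily cron job would keep a rolling week of archives. Since the archive includes credentials, copy it off the VPS only over an encrypted channel and keep it out of the Git-backed workspace.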

## Recommendation
- **Priority 1: Update to 2026.3.7 NOW.**
- **Priority 2: Move to Hetzner VPS ($4/month). Best reliability-to-cost ratio.**
- **Priority 3: Set up health monitoring alerts.**

These three together address root causes 1, 4, and 5 above and, as a rough estimate, would eliminate ~95% of the outage issues we've been having.

## Cost
- Hetzner CX22: ~$4/month (confirm the current price in your billing currency at signup)
- Total new cost: ~$4/month for dramatically better uptime
- Oracle Cloud free tier is $0, but it runs on ARM and signup can be finicky
