Set up Gateway-Pool Failover

Setup

1. Prerequisites

At least 2 gateway peers configured (/peers, peer_type=gateway)
Both gateways must be able to reach the backend hosts in their LAN
License: gateway_pools=true (failover is usually included in the base plan, load balancing often requires Pro)

2. Create the Pool

/gateway-pools → "Create pool":

Name: Meaningful identifier (e.g. Home network)
Mode:
- Failover (prioritized) — primary gateway serves, backup only takes over on failure
- Load balancing — all alive members serve in parallel (license-dependent)
LB policy (load balancing only):
- round_robin — even distribution by order
- least_conn — member with fewest active connections
- ip_hash — sticky per client IP (same client → same member)
Failback cooldown: how long to wait after recovery before routes migrate back. Presets:
- 60 s — Linux container (LXC), fast reboot
- 180 s — Linux VM
- 600 s (10 min) — Proxmox host
- 900 s (15 min) — Synology / QNAP NAS
- 1800 s (30 min) — Windows server
- 3600 s (60 min) — Conservative
Outage message (optional): custom 503 body when ALL members are down
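
The three LB policies above can be sketched as follows. This is a minimal illustration, not the product's implementation: the member dictionaries, the connection counts, and the SHA-256 hash for ip_hash are assumptions chosen for the example.

```python
import hashlib
from itertools import count

# Hypothetical pool members; names and connection counts are illustrative only.
members = [
    {"name": "Home", "alive": True, "active_conns": 4},
    {"name": "DS918", "alive": True, "active_conns": 1},
]

_rr_counter = count()  # shared position for round_robin


def alive_members(pool):
    """Only alive members are eligible to serve."""
    return [m for m in pool if m["alive"]]


def round_robin(pool):
    """Cycle through alive members in pool order."""
    alive = alive_members(pool)
    return alive[next(_rr_counter) % len(alive)]


def least_conn(pool):
    """Pick the alive member with the fewest active connections."""
    return min(alive_members(pool), key=lambda m: m["active_conns"])


def ip_hash(pool, client_ip):
    """Map a client IP to the same member as long as the alive set is unchanged."""
    alive = alive_members(pool)
    digest = hashlib.sha256(client_ip.encode()).digest()
    return alive[int.from_bytes(digest[:8], "big") % len(alive)]
```

Note that in this sketch ip_hash is only sticky while the set of alive members stays the same; a member going down can remap clients.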

3. Add Members

In the modal on the right:

Pick a gateway from the dropdown → "Add"
Position #1 is primary (highest priority). Reorder via drag and drop.
Already added gateways disappear from the dropdown (no duplicates possible)
"Save" persists atomically (PUT /api/v1/gateway-pools/:id/members)

4. Wire Up Routes

For failover: Nothing to do. As soon as a route's target_peer_id belongs to a pool member, the failover logic kicks in automatically on the next state change.
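
How the backup is chosen can be sketched like this, assuming the rule is "first alive member by position, skipping the peer that went down". The function name and the data shapes are hypothetical.

```python
def pick_failover_target(members, alive, down_peer_id):
    """Return the peer id that should take over, or None.

    members: peer ids ordered by position (#1 first)
    alive:   dict peer_id -> bool
    """
    if down_peer_id not in members:
        return None  # the route's peer is not pooled, so nothing pivots
    for peer_id in members:
        if peer_id != down_peer_id and alive.get(peer_id, False):
            return peer_id
    return None  # every member is down: the pool is in outage


# Peer ids 79 (Home) and 84 (DS918) as used elsewhere in this guide.
print(pick_failover_target([79, 84], {79: False, 84: True}, 79))  # 84
```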

For load balancing:

/gateway-pools → "Migrate routes":

Pick the target pool
The list shows all gateway-pinned routes, grouped by source peer
Loopback routes (127.0.0.1) are highlighted yellow and unchecked by default. Reason: ssh.example.com → 127.0.0.1:22 means "on the gateway machine itself", so migrating that route to another member changes the destination machine.
Review the selection, "OK" → atomic DB update + Caddy resync
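
The loopback check behind the yellow highlight could look roughly like this. The route field names and the second example route are assumptions, and treating the name "localhost" as loopback is a choice made for this sketch.

```python
import ipaddress


def is_loopback_target(target_host):
    """True for 127.0.0.0/8, ::1, or the literal name 'localhost'."""
    if target_host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(target_host).is_loopback
    except ValueError:
        return False  # a hostname, not an IP literal


# Hypothetical route rows; field names are illustrative.
routes = [
    {"domain": "ssh.example.com", "target_host": "127.0.0.1", "target_port": 22},
    {"domain": "nas.example.com", "target_host": "192.168.1.5", "target_port": 5001},
]

# Only non-loopback routes are pre-checked for migration.
preselected = [r["domain"] for r in routes if not is_loopback_target(r["target_host"])]
print(preselected)  # ['nas.example.com']
```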

Alternatively: edit routes individually via /routes and set target_kind: pool, picking the target pool.

Failover Mode in Detail

Mechanism

Normal operation:
  routes.target_peer_id = 79 (Home)
  routes.original_peer_id = NULL
  ↓
  Frontend Caddy → 10.8.0.8:18080 (Home companion Caddy)
  → 192.168.1.5:5001 (NAS on Home LAN)

Home goes down:
  watchdogTick (every 30 s) → evaluatePeer → alive=0 → transition='alive_to_down'
  ↓
  _onTransition('alive_to_down', peerId=79):
    UPDATE routes
    SET target_peer_id = 84,         -- DS918, highest priority alive
        original_peer_id = 79,       -- remember source
        updated_at = NOW
    WHERE target_peer_id = 79 AND original_peer_id IS NULL AND target_kind = 'gateway'
    ↓
    syncToCaddy()  -- Caddy reload with new upstream
    notifyConfigChanged(79)  -- Home no longer holds the routes
    notifyConfigChanged(84)  -- DS918 picks them up
  ↓
  Frontend Caddy → 10.8.0.2:18080 (DS918 companion Caddy)
  → 192.168.1.5:5001 (NAS, now reached via DS918)

Home recovers:
  watchdogTick → cooldown_to_alive → transition fires AFTER failback_cooldown_s
  ↓
  _onTransition('cooldown_to_alive', peerId=79):
    UPDATE routes
    SET target_peer_id = original_peer_id,
        original_peer_id = NULL,
        updated_at = NOW
    WHERE original_peer_id = 79
    ↓
    syncToCaddy() + notifyConfigChanged for 79 + 84
  ↓
  Routes back on Home, business as usual.
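
The two UPDATE statements above can be tried against an in-memory SQLite database. This is a sketch under assumptions: the table is reduced to the columns named in this section, the domain is made up, and CURRENT_TIMESTAMP stands in for NOW. The Caddy resync and notifications are left out.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE routes (
    id INTEGER PRIMARY KEY, domain TEXT, target_kind TEXT,
    target_peer_id INTEGER, original_peer_id INTEGER, updated_at TEXT)""")
db.execute("INSERT INTO routes (domain, target_kind, target_peer_id) "
           "VALUES ('nas.example.com', 'gateway', 79)")


def pivot_routes(down_peer_id, backup_peer_id):
    """alive_to_down: move routes to the backup and remember the source."""
    db.execute(
        """UPDATE routes
           SET target_peer_id = ?, original_peer_id = ?,
               updated_at = CURRENT_TIMESTAMP
           WHERE target_peer_id = ? AND original_peer_id IS NULL
             AND target_kind = 'gateway'""",
        (backup_peer_id, down_peer_id, down_peer_id))


def restore_routes(recovered_peer_id):
    """cooldown_to_alive: send routes back to their original peer."""
    db.execute(
        """UPDATE routes
           SET target_peer_id = original_peer_id, original_peer_id = NULL,
               updated_at = CURRENT_TIMESTAMP
           WHERE original_peer_id = ?""",
        (recovered_peer_id,))


def state():
    return db.execute(
        "SELECT target_peer_id, original_peer_id FROM routes").fetchall()


pivot_routes(79, 84)
during_failover = state()   # [(84, 79)]
restore_routes(79)
after_failback = state()    # [(79, None)]
```

The restore works in a single statement because SQL evaluates the right-hand sides of SET against the old row values, so target_peer_id receives the remembered peer before original_peer_id is cleared.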

Boot Reconcile

Transitions only fire on state changes. If a peer is already offline at container start, no alive_to_down transition fires, so no routes pivot. gatewayHealth.reconcileFailoverState() runs once at server boot to catch up:

For each offline pool member: find an alive sibling, pivot routes
For each route with original_peer_id IS NOT NULL whose original peer is alive again: migrate the route back

This keeps the DB consistent with the real health state after a container restart.
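
The two reconcile rules can be sketched on plain in-memory data. This is not the actual gatewayHealth.reconcileFailoverState() code: the data shapes, the second pool (peers 90 and 91), and the choice to leave a route untouched when no alive sibling exists are assumptions.

```python
def reconcile_failover_state(routes, pools, alive):
    """One-shot catch-up at boot. Mutates `routes`, returns changed route ids.

    routes: list of dicts with id, target_peer_id, original_peer_id
    pools:  list of member peer id lists, each ordered by priority
    alive:  dict peer_id -> bool, the health state observed at boot
    """
    changed = []
    for route in routes:
        original = route["original_peer_id"]
        target = route["target_peer_id"]
        if original is not None:
            # Route is in failover: move it back if its original peer is alive.
            if alive.get(original, False):
                route["target_peer_id"], route["original_peer_id"] = original, None
                changed.append(route["id"])
            continue
        if alive.get(target, False):
            continue  # healthy and not in failover: nothing to do
        # Target is offline and no pivot happened yet: find an alive sibling.
        pool = next((p for p in pools if target in p), None)
        sibling = next((m for m in pool or [] if alive.get(m, False)), None)
        if sibling is not None:
            route["target_peer_id"], route["original_peer_id"] = sibling, target
            changed.append(route["id"])
    return changed


routes = [
    {"id": 1, "target_peer_id": 79, "original_peer_id": None},  # 79 offline at boot
    {"id": 2, "target_peer_id": 91, "original_peer_id": 90},    # 90 is alive again
]
changed = reconcile_failover_state(
    routes, pools=[[79, 84], [90, 91]],
    alive={79: False, 84: True, 90: True, 91: True})
print(changed)  # [1, 2]
```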

Observability

Activity log events:

gateway_down / gateway_alive — state change per peer
pool_failover_activated — routes pivoted onto a sibling (with fromPeerId/toPeerId)
pool_failover_restored — routes restored to the original peer
pool_outage_started / pool_outage_resolved — all pool members offline / first member alive again

Webhook (if configured): gateway_state_change with payload {peer_id, alive: bool}.
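
A receiver for this webhook might validate the payload as below. This sketch assumes the body is a JSON object with peer_id and alive at the top level and that peer_id is an integer; the function name is hypothetical.

```python
import json


def parse_gateway_state_change(raw_body):
    """Validate the documented fields and return (peer_id, alive)."""
    payload = json.loads(raw_body)
    peer_id, alive = payload["peer_id"], payload["alive"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(peer_id, bool) or not isinstance(peer_id, int):
        raise ValueError("peer_id must be an integer")
    if not isinstance(alive, bool):
        raise ValueError("alive must be a boolean")
    return peer_id, alive


peer_id, alive = parse_gateway_state_change('{"peer_id": 79, "alive": false}')
print(f"peer {peer_id} is {'alive' if alive else 'down'}")  # peer 79 is down
```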

SQL for the current state:

SELECT id, domain, target_peer_id, original_peer_id
FROM routes WHERE target_kind='gateway' AND enabled=1;

A single row tells you who routes where: original_peer_id IS NOT NULL means "currently in failover".
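
To flag failover rows directly in the query, the NULL test can be added as a column. The sketch below runs against an in-memory SQLite table reduced to the columns used here; the domains are made up. It also shows why the test has to be written with IS NOT NULL in SQL.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE routes (
    id INTEGER PRIMARY KEY, domain TEXT, target_kind TEXT, enabled INTEGER,
    target_peer_id INTEGER, original_peer_id INTEGER)""")
db.executemany(
    "INSERT INTO routes (domain, target_kind, enabled, target_peer_id, original_peer_id) "
    "VALUES (?, 'gateway', 1, ?, ?)",
    [("a.example.com", 79, None), ("b.example.com", 84, 79)])

# in_failover is 1 for routes currently served by a sibling, else 0.
rows = db.execute(
    """SELECT domain, target_peer_id, original_peer_id,
              original_peer_id IS NOT NULL AS in_failover
       FROM routes
       WHERE target_kind = 'gateway' AND enabled = 1
       ORDER BY id""").fetchall()

# A comparison with `!= NULL` yields NULL, so it never selects a row.
never_matches = db.execute(
    "SELECT COUNT(*) FROM routes WHERE original_peer_id != NULL").fetchone()[0]
```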

Example Setup: Home Network Failover

Typical home setup with two gateways: a Synology NAS (DS918, 192.168.2.151) and a Linux mini-PC ("Home", 192.168.2.5). Both on the same LAN, both can reach all backend hosts.

1. /peers → create "Home Gateway" and "DS918 Gateway", both enabled
2. /gateway-pools → create pool "Home network":
   - Mode: Failover
   - Failback cooldown: 900 s (NAS preset, since the DS918 takes longer to update)
   - Members: Home (position #1), DS918 (position #2)
3. Existing routes need no changes: failover kicks in automatically
4. Test:
   - Reboot DS918 → Home as primary stays online → no impact
   - Reboot Home → routes migrate to DS918 → services stay reachable
   - Home recovers → after the 15 min cooldown, back to Home
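
The timing of the test in step 4 can be simulated with a small state machine. This is a sketch, not the real watchdog: it assumes a single failed check marks a peer as down and that a failure during cooldown restarts the wait. The class name and the 'cooldown' state label are hypothetical; the 30 s tick and 900 s cooldown come from this guide.

```python
WATCHDOG_INTERVAL_S = 30     # watchdogTick interval
FAILBACK_COOLDOWN_S = 900    # NAS preset used in this example


class PeerHealth:
    """Tracks one peer as 'alive', 'down', or 'cooldown' (recovered, waiting)."""

    def __init__(self):
        self.state = "alive"
        self.recovered_at = None

    def evaluate(self, now_s, reachable):
        """Return the transition name fired by this tick, or None."""
        if self.state == "alive" and not reachable:
            self.state = "down"
            return "alive_to_down"
        if self.state == "down" and reachable:
            self.state, self.recovered_at = "cooldown", now_s
            return None  # recovered, but routes do not move yet
        if self.state == "cooldown":
            if not reachable:
                self.state = "down"  # failed again during cooldown: start over
                return None
            if now_s - self.recovered_at >= FAILBACK_COOLDOWN_S:
                self.state = "alive"
                return "cooldown_to_alive"
        return None


# Home is unreachable from t=60 s until t=300 s, then stays up.
home = PeerHealth()
events = []
for now_s in range(0, 1501, WATCHDOG_INTERVAL_S):
    reachable = not (60 <= now_s < 300)
    transition = home.evaluate(now_s, reachable)
    if transition:
        events.append((now_s, transition))
print(events)  # [(60, 'alive_to_down'), (1200, 'cooldown_to_alive')]
```

In this run the routes move to DS918 at t=60 s and return to Home at t=1200 s, which is 900 s after Home was first seen reachable again.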

For pure failover, stick with target_peer_id pinning. The "Migrate routes" function is only needed when you want load balancing (e.g. streaming + backup in parallel across both gateways).