Set up Gateway-Pool Failover
Setup
1. Prerequisites
- At least 2 gateway peers configured (/peers, peer_type=gateway)
- Both gateways must be able to reach the backend hosts in their LAN
- License: gateway_pools=true (failover is usually included in the base plan; load balancing often requires Pro)
2. Create the Pool
/gateway-pools → "Create pool":
- Name: Meaningful identifier (e.g. Home network)
- Mode:
  - Failover (prioritized): the primary gateway serves; the backup only takes over on failure
  - Load balancing: all alive members serve in parallel (license-dependent)
- LB policy (load balancing only):
  - round_robin: even distribution by order
  - least_conn: member with fewest active connections
  - ip_hash: sticky per client IP (same client → same member)
- Failback cooldown: how long to wait after recovery before routes migrate back. Presets:
  - 60 s: Linux container (LXC), fast reboot
  - 180 s: Linux VM
  - 600 s (10 min): Proxmox host
  - 900 s (15 min): Synology / QNAP NAS
  - 1800 s (30 min): Windows server
  - 3600 s (60 min): Conservative
- Outage message (optional): custom 503 body when ALL members are down
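To make the three LB policies concrete, here is a minimal Python sketch of how each one could pick a member. It only illustrates the selection rules described above and is not the product's or Caddy's implementation; the function signatures, the member names and the use of SHA-256 for ip_hash are assumptions made for this example.

```python
import hashlib

def round_robin(members, counter):
    """Rotate through members in pool order; `counter` is a running request index."""
    return members[counter % len(members)]

def least_conn(members, active):
    """Pick the member with the fewest active connections (ties go to pool order)."""
    return min(members, key=lambda m: active.get(m, 0))

def ip_hash(members, client_ip):
    """Map a client IP to a stable member so the same client keeps the same member."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return members[int.from_bytes(digest[:8], "big") % len(members)]

members = ["home", "ds918"]  # alive pool members, in pool order (hypothetical names)
print([round_robin(members, i) for i in range(4)])   # ['home', 'ds918', 'home', 'ds918']
print(least_conn(members, {"home": 3, "ds918": 1}))  # ds918
print(ip_hash(members, "203.0.113.7") == ip_hash(members, "203.0.113.7"))  # True
```

Note that ip_hash in this sketch is only sticky while the set of alive members stays the same: when a member drops out, the modulo changes and clients may land on a different member.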
3. Add Members
In the modal on the right:
- Pick a gateway from the dropdown → "Add"
- Position #1 is primary (highest priority). Reorder via drag and drop.
- Already added gateways disappear from the dropdown (no duplicates possible)
- "Save" persists atomically (PUT /api/v1/gateway-pools/:id/members)
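For scripting the same step, the sketch below builds (but does not send) a request against the endpoint named above. Only the method and path come from this page; the JSON body shape (a members array with peer_id and priority), the base URL and the function name are hypothetical and would need to be checked against the real API before use.

```python
import json
import urllib.request

def build_members_request(base_url, pool_id, ordered_peer_ids):
    """Build the PUT that replaces a pool's member list in one call."""
    if len(set(ordered_peer_ids)) != len(ordered_peer_ids):
        raise ValueError("duplicate gateway in member list")
    # Hypothetical body shape: position in the list becomes the priority,
    # so the first entry is the primary (position #1).
    body = {"members": [{"peer_id": pid, "priority": pos}
                        for pos, pid in enumerate(ordered_peer_ids, start=1)]}
    return urllib.request.Request(
        url=f"{base_url}/api/v1/gateway-pools/{pool_id}/members",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

req = build_members_request("https://vpn.example.com", 1, [79, 84])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending the whole ordered list in a single request matches the "persists atomically" behavior: the server either accepts the complete new ordering or none of it.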
4. Wire Up Routes
For failover: Nothing to do. As soon as a route's target_peer_id belongs to a pool member, the failover logic kicks in automatically on the next state change.
For load balancing:
/gateway-pools → "Migrate routes":
- Pick the target pool
- The list shows all gateway-pinned routes, grouped by source peer
- Loopback routes (127.0.0.1) are highlighted yellow and unchecked by default. Reason: ssh.example.com → 127.0.0.1:22 means "on the gateway machine itself". Migrating that to another member changes the destination machine.
- Review the selection, "OK" → atomic DB update + Caddy resync
Alternatively: edit routes individually via /routes and set target_kind: pool, picking the target pool.
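The loopback rule in the migration dialog can be sketched as follows. This is an illustration of the default selection, not the dialog's actual code; the target_host field name and both function names are assumptions.

```python
import ipaddress

def is_loopback_target(target_host):
    """True if the route target is a loopback IP literal or 'localhost'."""
    if target_host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(target_host).is_loopback
    except ValueError:
        return False  # a hostname, not an IP literal

def preselect(routes):
    """Return {domain: checked}, leaving loopback routes unchecked by default."""
    return {r["domain"]: not is_loopback_target(r["target_host"]) for r in routes}

routes = [
    {"domain": "nas.example.com", "target_host": "192.168.1.5"},
    {"domain": "ssh.example.com", "target_host": "127.0.0.1"},
]
print(preselect(routes))  # {'nas.example.com': True, 'ssh.example.com': False}
```

Using ipaddress rather than a string comparison also catches other loopback addresses such as 127.0.0.2 and ::1.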
Failover Mode in Detail
Mechanism
Normal operation:
    routes.target_peer_id = 79 (Home)
    routes.original_peer_id = NULL
        ↓
    Frontend Caddy → 10.8.0.8:18080 (Home companion Caddy)
                   → 192.168.1.5:5001 (NAS on Home LAN)

Home goes down:
    watchdogTick (every 30 s) → evaluatePeer → alive=0 → transition='alive_to_down'
        ↓
    _onTransition('alive_to_down', peerId=79):
        UPDATE routes
        SET target_peer_id = 84,      -- DS918, highest priority alive
            original_peer_id = 79,    -- remember source
            updated_at = NOW
        WHERE target_peer_id = 79 AND original_peer_id IS NULL AND target_kind = 'gateway'
        ↓
    syncToCaddy()              -- Caddy reload with new upstream
    notifyConfigChanged(79)    -- Home no longer holds the routes
    notifyConfigChanged(84)    -- DS918 picks them up
        ↓
    Frontend Caddy → 10.8.0.2:18080 (DS918 companion Caddy)
                   → 192.168.1.5:5001 (NAS, now reached via DS918)

Home recovers:
    watchdogTick → cooldown_to_alive → transition fires AFTER failback_cooldown_s
        ↓
    _onTransition('cooldown_to_alive', peerId=79):
        UPDATE routes
        SET target_peer_id = original_peer_id,
            original_peer_id = NULL,
            updated_at = NOW
        WHERE original_peer_id = 79
        ↓
    syncToCaddy() + notifyConfigChanged for 79 + 84
        ↓
    Routes back on Home, business as usual.
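The two UPDATE statements can be tried out against an in-memory SQLite database. The sketch below assumes a reduced routes table with only the columns used above, substitutes SQLite's CURRENT_TIMESTAMP for NOW, and uses hypothetical function names (pivot, restore); the real schema and database engine may differ.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE routes (
    id INTEGER PRIMARY KEY, domain TEXT, target_kind TEXT,
    target_peer_id INTEGER, original_peer_id INTEGER, updated_at TEXT
);
INSERT INTO routes (domain, target_kind, target_peer_id)
VALUES ('nas.example.com', 'gateway', 79), ('ssh.example.com', 'gateway', 84);
""")

def pivot(db, down_peer, sibling):
    """alive_to_down: move the down peer's routes to an alive sibling."""
    with db:  # commit on success, roll back on error
        db.execute(
            """UPDATE routes
               SET target_peer_id = ?, original_peer_id = ?,
                   updated_at = CURRENT_TIMESTAMP
               WHERE target_peer_id = ? AND original_peer_id IS NULL
                 AND target_kind = 'gateway'""",
            (sibling, down_peer, down_peer))

def restore(db, recovered_peer):
    """cooldown_to_alive: move pivoted routes back to their original peer."""
    with db:
        db.execute(
            """UPDATE routes
               SET target_peer_id = original_peer_id, original_peer_id = NULL,
                   updated_at = CURRENT_TIMESTAMP
               WHERE original_peer_id = ?""",
            (recovered_peer,))

def state(db):
    return db.execute(
        "SELECT domain, target_peer_id, original_peer_id FROM routes ORDER BY id"
    ).fetchall()

pivot(db, down_peer=79, sibling=84)
print(state(db))  # [('nas.example.com', 84, 79), ('ssh.example.com', 84, None)]
restore(db, recovered_peer=79)
print(state(db))  # [('nas.example.com', 79, None), ('ssh.example.com', 84, None)]
```

The route that was pinned to 84 from the start keeps original_peer_id = NULL throughout, so the restore step leaves it alone. That is what the original_peer_id IS NULL guard in the pivot and the WHERE original_peer_id = 79 filter in the restore are for.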
Boot Reconcile
Transitions only fire on state changes. If a peer is already offline at container start, there's no alive_to_down → no pivot. gatewayHealth.reconcileFailoverState() runs once at server boot to catch up:
- For each offline pool member: find an alive sibling, pivot routes
- For each route with original_peer_id IS NOT NULL whose original peer is alive again: migrate the route back
This keeps the DB consistent with the real health state after a container restart.
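As a rough model of the two reconcile rules, the sketch below works on plain dictionaries instead of the database. It is not the code of reconcileFailoverState(); the function signature and data shapes are assumptions, and it deliberately ignores cases the text does not describe (for example a pivoted route whose backup is also offline).

```python
def reconcile(routes, alive, pool_order):
    """Bring route targets in line with current health at boot.

    routes:     list of dicts with target_peer_id / original_peer_id
    alive:      set of peer ids that are currently alive
    pool_order: pool member ids, highest priority first
    """
    for r in routes:
        orig = r["original_peer_id"]
        target = r["target_peer_id"]
        if orig is not None and orig in alive:
            # Rule 2: the original peer is back, so undo the pivot.
            r["target_peer_id"], r["original_peer_id"] = orig, None
        elif orig is None and target in pool_order and target not in alive:
            # Rule 1: the target is offline and no pivot was recorded yet.
            sibling = next((p for p in pool_order if p in alive), None)
            if sibling is not None:
                r["original_peer_id"] = target
                r["target_peer_id"] = sibling
    return routes

routes = [
    {"id": 1, "target_peer_id": 79, "original_peer_id": None},
    {"id": 2, "target_peer_id": 84, "original_peer_id": 79},
]
# Peer 79 is offline at boot: route 1 pivots to 84, route 2 stays pivoted.
print(reconcile(routes, alive={84}, pool_order=[79, 84]))
```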
Observability
Activity log events:
- gateway_down / gateway_alive: state change per peer
- pool_failover_activated: routes pivoted onto a sibling (with fromPeerId / toPeerId)
- pool_failover_restored: routes restored to the original peer
- pool_outage_started / pool_outage_resolved: all pool members offline / first pool member alive again
Webhook (if configured): gateway_state_change with payload {peer_id, alive: bool}.
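A receiver for this webhook might validate the payload before acting on it. The sketch below assumes the request body is exactly the JSON object shown above, with no surrounding envelope; the function name is hypothetical.

```python
import json

def parse_gateway_state_change(body):
    """Validate a gateway_state_change payload and return (peer_id, alive)."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    peer_id, alive = payload.get("peer_id"), payload.get("alive")
    # bool is a subclass of int in Python, so reject it explicitly for peer_id.
    if not isinstance(peer_id, int) or isinstance(peer_id, bool):
        raise ValueError("peer_id must be an integer")
    if not isinstance(alive, bool):
        raise ValueError("alive must be a boolean")
    return peer_id, alive

print(parse_gateway_state_change('{"peer_id": 79, "alive": false}'))  # (79, False)
```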
SQL for the current state:
SELECT id, domain, target_peer_id, original_peer_id
FROM routes WHERE target_kind='gateway' AND enabled=1;
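Each row shows which peer a route currently targets; a row where original_peer_id IS NOT NULL is currently in failover. When filtering in SQL, use IS NOT NULL, because a != NULL comparison never matches.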
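To count the routes that are currently in failover, the query can be narrowed with IS NOT NULL. The sketch below runs it against an in-memory SQLite database with made-up rows; the reduced table definition is an assumption, and only the column names are taken from the query above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE routes (
    id INTEGER PRIMARY KEY, domain TEXT, target_kind TEXT, enabled INTEGER,
    target_peer_id INTEGER, original_peer_id INTEGER
);
INSERT INTO routes (domain, target_kind, enabled, target_peer_id, original_peer_id)
VALUES ('nas.example.com', 'gateway', 1, 84, 79),
       ('cam.example.com', 'gateway', 1, 84, 79),
       ('ssh.example.com', 'gateway', 1, 84, NULL);
""")

# Routes currently in failover, counted per original (failed) peer.
in_failover = db.execute("""
    SELECT original_peer_id, COUNT(*)
    FROM routes
    WHERE target_kind = 'gateway' AND enabled = 1
      AND original_peer_id IS NOT NULL
    GROUP BY original_peer_id
""").fetchall()
print(in_failover)  # [(79, 2)]

# Pitfall: a comparison with NULL is never true, so this matches no rows.
never = db.execute(
    "SELECT COUNT(*) FROM routes WHERE original_peer_id != NULL"
).fetchone()[0]
print(never)  # 0
```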
Example Setup: Home Network Failover
Typical home setup with two gateways: a Synology NAS (DS918, 192.168.2.151) and a Linux mini-PC ("Home", 192.168.2.5). Both on the same LAN, both can reach all backend hosts.
1. /peers → create "Home Gateway" and "DS918 Gateway", both enabled
2. /gateway-pools → create pool "Home network":
   - Mode: Failover
   - Failback cooldown: 900 s (NAS preset, since the DS918 takes longer to finish updates)
   - Members: Home (position #1), DS918 (position #2)
3. Existing routes need NOTHING: failover kicks in automatically
4. Test:
   - Reboot DS918 → Home as primary stays online → no impact
   - Reboot Home → routes migrate to DS918 → services stay reachable
   - Home recovers → after the 15 min cooldown, back to Home
For pure failover, stick with target_peer_id pinning. Only when you want load balancing (e.g. streaming + backup in parallel across both gateways) is the "Migrate routes" function needed.
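The timing of the "Reboot Home" test can be estimated with a small simulation of the watchdog. This is a simplified model and not the actual evaluatePeer logic: it assumes a single failed probe marks a peer as down, that a flap during the cooldown restarts the wait, and that the cooldown is measured from the first successful probe after the outage. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PeerHealth:
    state: str = "alive"               # "alive", "down" or "cooldown"
    recovered_at: Optional[int] = None

def evaluate(peer, reachable, now, cooldown_s):
    """Advance one peer by one watchdog tick; return a transition name or None."""
    if peer.state == "alive" and not reachable:
        peer.state = "down"
        return "alive_to_down"
    if peer.state == "down" and reachable:
        peer.state, peer.recovered_at = "cooldown", now
        return None  # recovery seen, but routes stay on the backup for now
    if peer.state == "cooldown":
        if not reachable:
            peer.state, peer.recovered_at = "down", None
            return None  # flapped during cooldown: start over
        if now - peer.recovered_at >= cooldown_s:
            peer.state, peer.recovered_at = "alive", None
            return "cooldown_to_alive"
    return None

TICK_S, COOLDOWN_S = 30, 900           # 30 s watchdog tick, 15 min NAS preset
home = PeerHealth()
events = []
for tick in range(40):
    now = tick * TICK_S
    reachable = not (60 <= now < 180)  # Home reboots between t=60 s and t=180 s
    transition = evaluate(home, reachable, now, COOLDOWN_S)
    if transition:
        events.append((now, transition))
print(events)  # [(60, 'alive_to_down'), (1080, 'cooldown_to_alive')]
```

In this model a two-minute reboot keeps the routes on DS918 for 17 minutes in total: the outage itself plus the full 900 s cooldown.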