Gateway Pools
Table of Contents
- Problem & Motivation
- Concepts
- Setup
- Failover Mode in Detail
- Load Balancing Mode in Detail
- Health Monitoring
- Edge Cases
- Troubleshooting
- Architecture Reference
Problem & Motivation
GateControl routes normally pin a domain (nas.example.com) to exactly one gateway peer. The frontend Caddy on the GateControl server proxies to <gateway-ip>:<proxy-port> over the WireGuard tunnel; the companion Caddy on the gateway forwards to the local backend address (192.168.1.5:5001, etc.).
What goes wrong without pools:
- Reboots of a gateway (Linux updates, Synology updates, Proxmox reboots) make all domains running through it unreachable for the duration of the reboot.
- Hardware failure = full outage until manually reconfigured.
- Scheduled maintenance windows require manually re-pinning every route.
What pools solve:
A gateway pool groups multiple gateway peers that can reach the same backend hosts. If one member fails, another takes over automatically — either via failover (priority order) or load balancing (parallel distribution).
Important: Pool members must actually be able to reach the backend hosts (192.168.1.5, etc.). If gateways sit on different LANs, failover does not help, because the failover member cannot reach the backend host at all.
Concepts
| Term | Meaning |
|---|---|
| Pool | Logical group of gateway peers (DB: gateway_pools) |
| Mode | failover or load_balancing; each pool has exactly one mode |
| Member | Peer with priority within a pool (DB: gateway_pool_members) |
| Priority | Lower = higher priority. Position 1 is primary. |
| failback_cooldown_s | After recovery: wait time before routes migrate back to the original member |
| outage_message | Custom 503 body when all members are down |
| target_peer_id | The peer ID currently serving a route (may change during failover) |
| original_peer_id | Where the route was originally pinned; set during failover, reset on recovery |
| target_pool_id | Alternative routing mode: route is bound explicitly to the pool (used for load balancing) |
Two Routing Paths
GateControl supports two separate ways for a route to use a pool:
A) Peer-pinned with implicit failover (most common case)
- Route has target_peer_id=<gw>, target_pool_id=NULL
- Frontend Caddy reads target_peer_id directly from the DB
- On failure: gatewayHealth._onTransition rewrites target_peer_id to the next-highest-priority alive pool member, remembering the original in original_peer_id
- No Caddy-side runtime resolution needed; the DB is the source of truth
B) Pool-routed via Caddy resolver (for load balancing)
- Route has target_pool_id=<pool>, target_peer_id=NULL
- Frontend Caddy resolves at runtime via gatewayPool.resolveActivePeer(s) using the health snapshot
- For load balancing, Caddy receives all alive members as upstreams[] plus the selection policy
- For failover mode at the pool level, exactly one member is selected (highest priority alive)
For plain failover you don't need to migrate anything. Pool membership is enough. For load balancing, you must explicitly switch routes to the pool via "Migrate routes".
Setup
1. Prerequisites
- At least 2 gateway peers configured (/peers, peer_type=gateway)
- Both gateways must be able to reach the backend hosts in their LAN
- License: gateway_pools=true (failover is usually included in the base plan, load balancing often requires Pro)
2. Create the Pool
/gateway-pools → "Create pool":
- Name: Meaningful identifier (e.g. Home network)
- Mode:
  - Failover (prioritized): primary gateway serves, backup only takes over on failure
  - Load balancing: all alive members serve in parallel (license-dependent)
- LB policy (load balancing only):
  - round_robin: even distribution by order
  - least_conn: member with fewest active connections
  - ip_hash: sticky per client IP (same client → same member)
- Failback cooldown: how long to wait after recovery before routes migrate back. Presets:
  - 60 s: Linux container (LXC), fast reboot
  - 180 s: Linux VM
  - 600 s (10 min): Proxmox host
  - 900 s (15 min): Synology / QNAP NAS
  - 1800 s (30 min): Windows server
  - 3600 s (60 min): Conservative
- Outage message (optional): custom 503 body when ALL members are down
3. Add Members
In the modal on the right:
- Pick a gateway from the dropdown → "Add"
- Position #1 is primary (highest priority). Reorder via drag and drop.
- Already added gateways disappear from the dropdown (no duplicates possible)
- "Save" persists atomically (PUT /api/v1/gateway-pools/:id/members)
4. Wire Up Routes
For failover: Nothing to do. As soon as a route's target_peer_id belongs to a pool member, the failover logic kicks in automatically on the next state change.
For load balancing:
/gateway-pools → "Migrate routes":
- Pick the target pool
- The list shows all gateway-pinned routes, grouped by source peer
- Loopback routes (127.0.0.1) are highlighted yellow and unchecked by default. Reason: ssh.example.com → 127.0.0.1:22 means "on the gateway machine itself". Migrating that to another member changes the destination machine.
- Review the selection, "OK" → atomic DB update + Caddy resync
Alternatively: edit routes individually via /routes and set target_kind: pool, picking the target pool.
Failover Mode in Detail
Mechanism
Normal operation:
routes.target_peer_id = 79 (Home)
routes.original_peer_id = NULL
↓
Frontend Caddy → 10.8.0.8:18080 (Home companion Caddy)
→ 192.168.1.5:5001 (NAS on Home LAN)
Home goes down:
watchdogTick (every 30 s) → evaluatePeer → alive=0 → transition='alive_to_down'
↓
_onTransition('alive_to_down', peerId=79):
UPDATE routes
SET target_peer_id = 84, -- DS918, highest priority alive
original_peer_id = 79, -- remember source
updated_at = NOW
WHERE target_peer_id = 79 AND original_peer_id IS NULL AND target_kind = 'gateway'
↓
syncToCaddy() -- Caddy reload with new upstream
notifyConfigChanged(79) -- Home no longer holds the routes
notifyConfigChanged(84) -- DS918 picks them up
↓
Frontend Caddy → 10.8.0.2:18080 (DS918 companion Caddy)
→ 192.168.1.5:5001 (NAS, now reached via DS918)
Home recovers:
watchdogTick → cooldown_to_alive → transition fires AFTER failback_cooldown_s
↓
_onTransition('cooldown_to_alive', peerId=79):
UPDATE routes
SET target_peer_id = original_peer_id,
original_peer_id = NULL,
updated_at = NOW
WHERE original_peer_id = 79
↓
syncToCaddy() + notifyConfigChanged for 79 + 84
↓
Routes back on Home, business as usual.
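The pivot and restore steps above can be sketched against an in-memory array instead of the routes table. This is an illustration only: pivotRoutes and restoreRoutes are hypothetical names, not the actual gatewayHealth functions, and the member/route shapes are assumptions that mirror the columns shown above.

```javascript
// Sketch only: an in-memory stand-in for the routes table.
function pivotRoutes(routes, downPeerId, members, aliveIds) {
  // Pick the alive sibling with the lowest priority number (1 = primary).
  const sibling = members
    .filter((m) => m.peerId !== downPeerId && aliveIds.has(m.peerId))
    .sort((a, b) => a.priority - b.priority)[0];
  if (!sibling) return 0; // no alive sibling: nothing to pivot onto

  let moved = 0;
  for (const r of routes) {
    const pinnedHere = r.target_kind === 'gateway' && r.target_peer_id === downPeerId;
    // Same guard as the UPDATE above: skip routes that are already in failover.
    if (pinnedHere && r.original_peer_id === null) {
      r.original_peer_id = downPeerId;
      r.target_peer_id = sibling.peerId;
      moved += 1;
    }
  }
  return moved;
}

function restoreRoutes(routes, recoveredPeerId) {
  let restored = 0;
  for (const r of routes) {
    if (r.original_peer_id === recoveredPeerId) {
      r.target_peer_id = recoveredPeerId;
      r.original_peer_id = null;
      restored += 1;
    }
  }
  return restored;
}

const members = [{ peerId: 79, priority: 1 }, { peerId: 84, priority: 2 }];
const routes = [{ id: 1, target_kind: 'gateway', target_peer_id: 79, original_peer_id: null }];

pivotRoutes(routes, 79, members, new Set([84]));
console.log(routes[0].target_peer_id, routes[0].original_peer_id); // 84 79

restoreRoutes(routes, 79);
console.log(routes[0].target_peer_id, routes[0].original_peer_id); // 79 null
```

In the real flow the restore only runs after failback_cooldown_s has elapsed; the sketch leaves timing out.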
Boot Reconcile
Transitions only fire on state changes. If a peer is already offline at container start, there's no alive_to_down → no pivot. gatewayHealth.reconcileFailoverState() runs once at server boot to catch up:
- For each offline pool member: find an alive sibling, pivot routes
- For each route with original_peer_id IS NOT NULL whose original peer is alive again: migrate the route back
This keeps the DB consistent with the real health state after a container restart.
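As a rough illustration of the two reconcile rules, the following sketch walks a list of routes once and applies both. reconcileSketch, the pool/route shapes, and the alive map are assumptions made for this example; the real reconcileFailoverState works on the database and may differ in detail.

```javascript
// Sketch only. alive: Map<peerId, boolean>; pools: [{ members: [{ peerId, priority }] }].
function reconcileSketch(routes, pools, alive) {
  const isAlive = (id) => alive.get(id) === true;

  for (const r of routes) {
    if (r.original_peer_id !== null) {
      // Rule 2: route is in failover; restore it if the original peer is back.
      if (isAlive(r.original_peer_id)) {
        r.target_peer_id = r.original_peer_id;
        r.original_peer_id = null;
      }
      continue;
    }
    if (r.target_peer_id === null || isAlive(r.target_peer_id)) continue;

    // Rule 1: pinned peer is offline; look for an alive sibling in one of its pools.
    for (const pool of pools) {
      if (!pool.members.some((m) => m.peerId === r.target_peer_id)) continue;
      const sibling = pool.members
        .filter((m) => isAlive(m.peerId))
        .sort((a, b) => a.priority - b.priority)[0];
      if (sibling) {
        r.original_peer_id = r.target_peer_id;
        r.target_peer_id = sibling.peerId;
        break;
      }
    }
  }
  return routes;
}

const pools = [{ members: [{ peerId: 79, priority: 1 }, { peerId: 84, priority: 2 }] }];
const bootRoutes = [{ id: 1, target_peer_id: 79, original_peer_id: null }];

// Peer 79 was already offline when the server started.
reconcileSketch(bootRoutes, pools, new Map([[79, false], [84, true]]));
console.log(bootRoutes[0].target_peer_id, bootRoutes[0].original_peer_id); // 84 79
```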
Observability
Activity log events:
- gateway_down / gateway_alive: state change per peer
- pool_failover_activated: routes pivoted onto a sibling (with fromPeerId / toPeerId)
- pool_failover_restored: routes restored to the original peer
- pool_outage_started / pool_outage_resolved: all pool members offline / first pool member back online
Webhook (if configured): gateway_state_change with payload {peer_id, alive: bool}.
SQL for the current state:
SELECT id, domain, target_peer_id, original_peer_id
FROM routes WHERE target_kind='gateway' AND enabled=1;
A single row tells you who routes where: a non-NULL original_peer_id means "currently in failover".
Load Balancing Mode in Detail
Activation
A route uses LB only when all three conditions are true:
1. route.target_pool_id is set (NOT target_peer_id)
2. Pool mode = 'load_balancing'
3. Pool lb_policy ∈ {round_robin, least_conn, ip_hash}
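The three conditions can be expressed as a single predicate. This is a minimal sketch; usesLoadBalancing is a hypothetical helper name, and the object shapes simply mirror the column names used in this document.

```javascript
const LB_POLICIES = new Set(['round_robin', 'least_conn', 'ip_hash']);

// Returns true only when all three activation conditions hold.
function usesLoadBalancing(route, pool) {
  if (route.target_pool_id == null) return false;          // 1. route is pool-bound
  if (!pool || pool.id !== route.target_pool_id) return false;
  if (pool.mode !== 'load_balancing') return false;         // 2. pool mode
  return LB_POLICIES.has(pool.lb_policy);                   // 3. known policy
}

const lbPool = { id: 3, mode: 'load_balancing', lb_policy: 'least_conn' };
console.log(usesLoadBalancing({ target_pool_id: 3, target_peer_id: null }, lbPool)); // true
console.log(usesLoadBalancing({ target_pool_id: null, target_peer_id: 79 }, lbPool)); // false
```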
Caddy Configuration (HTTP)
During the build cycle, caddyConfig.resolveRouteUpstreams calls gatewayPool.resolveActivePeers(snapshot) — which returns all alive pool members:
"reverse_proxy": {
"upstreams": [
{ "dial": "10.8.0.2:18080" },
{ "dial": "10.8.0.8:18080" }
],
"load_balancing": {
"selection_policy": { "policy": "round_robin" }
},
"health_checks": {
"passive": {
"fail_duration": "30s",
"max_fails": 3,
"unhealthy_status": [500, 502, 503, 504]
}
}
}
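A fragment like the one above could be assembled from the list of alive members roughly as follows. buildReverseProxy and the member fields (ip, proxyPort) are assumptions for illustration; the actual caddyConfig code may structure this differently.

```javascript
// Sketch only: builds the reverse_proxy fragment shown above from alive members.
function buildReverseProxy(aliveMembers, policy) {
  return {
    upstreams: aliveMembers.map((m) => ({ dial: `${m.ip}:${m.proxyPort}` })),
    load_balancing: { selection_policy: { policy } },
    health_checks: {
      passive: {
        fail_duration: '30s',
        max_fails: 3,
        unhealthy_status: [500, 502, 503, 504],
      },
    },
  };
}

const fragment = buildReverseProxy(
  [
    { ip: '10.8.0.2', proxyPort: 18080 },
    { ip: '10.8.0.8', proxyPort: 18080 },
  ],
  'round_robin',
);
console.log(JSON.stringify(fragment.upstreams));
// [{"dial":"10.8.0.2:18080"},{"dial":"10.8.0.8:18080"}]
```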
LB Policies
| Policy | Behavior | When to use |
|---|---|---|
| round_robin | Round-robin through the upstream list | Symmetric load, identical members |
| least_conn | Member with fewest active connections | Long streams (Jellyfin streaming, backups) |
| ip_hash | Hash over client IP → fixed member | Session affinity without cookies (legacy apps) |
Trusted Proxies (for ip_hash behind a CDN/LB)
If GateControl sits behind a private LB or a CDN, the server Caddy only sees the proxy IP as the remote IP — ip_hash would then be ineffective (all requests hit the same bucket). The workaround is built into the srv0 HTTP server:
{
"trusted_proxies": {
"source": "static",
"ranges": [
"10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
"100.64.0.0/10", "fd00::/8", "::1/128", "127.0.0.0/8"
]
},
"client_ip_headers": ["X-Forwarded-For"]
}
X-Forwarded-For is only honored when the connection originates from one of the listed ranges. Clients hitting GateControl directly from the internet cannot spoof XFF, because their source IP is not in the trust list.
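The trust rule can be illustrated in Node.js with net.BlockList, which supports CIDR membership checks. This is a simplified sketch of the idea, not Caddy's implementation: clientIp is a hypothetical helper, and taking the leftmost X-Forwarded-For entry is an assumption made to keep the example short.

```javascript
const net = require('node:net');

// Same ranges as in the trusted_proxies block above.
const trusted = new net.BlockList();
const ranges = [
  ['10.0.0.0', 8, 'ipv4'], ['172.16.0.0', 12, 'ipv4'], ['192.168.0.0', 16, 'ipv4'],
  ['100.64.0.0', 10, 'ipv4'], ['127.0.0.0', 8, 'ipv4'],
  ['fd00::', 8, 'ipv6'], ['::1', 128, 'ipv6'],
];
for (const [addr, prefix, family] of ranges) trusted.addSubnet(addr, prefix, family);

// Honor X-Forwarded-For only when the TCP peer is inside a trusted range.
function clientIp(remoteIp, xForwardedFor) {
  const family = net.isIPv6(remoteIp) ? 'ipv6' : 'ipv4';
  if (!xForwardedFor || !trusted.check(remoteIp, family)) return remoteIp;
  const first = xForwardedFor.split(',')[0].trim();
  return net.isIP(first) ? first : remoteIp; // ignore malformed header values
}

console.log(clientIp('10.0.0.5', '198.51.100.9'));    // 198.51.100.9 (via trusted proxy)
console.log(clientIp('203.0.113.7', '198.51.100.9')); // 203.0.113.7 (direct, XFF ignored)
```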
Passive Health Checks
In addition to the gatewayHealth watchdog (every 30 s), Caddy itself checks passively:
- Sends a request to member X
- X returns a 5xx status (or fails to connect)
- After max_fails=3 such failures, X is removed from rotation for fail_duration=30s
- After 30 s, retried
This makes circuit breaking kick in within seconds, instead of waiting for the global health logic's 90 s gateway_down_threshold.
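The counting behavior described above can be modeled with a small tracker: each failure is remembered for fail_duration, and an upstream is skipped while the number of remembered failures reaches max_fails. This is a conceptual sketch under that assumption, not Caddy's code; PassiveHealth is a hypothetical class.

```javascript
class PassiveHealth {
  constructor({ maxFails = 3, failDurationMs = 30_000 } = {}) {
    this.maxFails = maxFails;
    this.failDurationMs = failDurationMs;
    this.failures = new Map(); // upstream -> timestamps (ms) of recent failures
  }

  recordFailure(upstream, now = Date.now()) {
    const list = this.failures.get(upstream) ?? [];
    list.push(now);
    this.failures.set(upstream, list);
  }

  isAvailable(upstream, now = Date.now()) {
    // Forget failures older than fail_duration, then compare against max_fails.
    const recent = (this.failures.get(upstream) ?? [])
      .filter((t) => now - t < this.failDurationMs);
    this.failures.set(upstream, recent);
    return recent.length < this.maxFails;
  }
}

const health = new PassiveHealth();
health.recordFailure('10.8.0.2:18080', 0);
health.recordFailure('10.8.0.2:18080', 1_000);
health.recordFailure('10.8.0.2:18080', 2_000);
console.log(health.isAvailable('10.8.0.2:18080', 3_000));  // false (3 recent failures)
console.log(health.isAvailable('10.8.0.2:18080', 30_000)); // true (oldest failure expired)
```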
L4 Load Balancing (TCP/UDP)
Works the same way for L4 routes — caddy-l4's proxy handler accepts multiple upstreams[] plus load_balancing.selection_policy:
{
"handler": "proxy",
"upstreams": [
{ "dial": ["10.8.0.2:13389"] },
{ "dial": ["10.8.0.8:13389"] }
],
"load_balancing": {
"selection_policy": { "policy": "round_robin" }
},
"health_checks": {
"passive": { "fail_duration": "30s", "max_fails": 3 }
}
}
L4 routes must have route_type='l4' and be pool-bound via target_pool_id.
Health Monitoring
Heartbeat Model
- The companion on each gateway sends a heartbeat (HTTPS) to the server every ~10 s
- Server stores gateway_meta.last_seen_at
- evaluatePeer (every 30 s per watchdog tick) checks now - last_seen_at against gateway_down_threshold_s (default 90 s)
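The staleness check itself is a single comparison. The sketch below assumes last_seen_at is stored in epoch seconds and treats a missing value as stale; both are assumptions, since the unit and the null handling are not specified here. detectionWindowS is a hypothetical helper that shows how the 30 s tick widens the reaction time.

```javascript
// Assumption: timestamps are epoch seconds.
function isStale(lastSeenAtS, nowS, thresholdS = 90) {
  if (lastSeenAtS == null) return true; // never seen (assumed to count as stale)
  return nowS - lastSeenAtS > thresholdS;
}

// Time between the last heartbeat and the tick that notices the gap.
// Best case: a tick lands right after the threshold; worst case: one full tick later.
function detectionWindowS(thresholdS = 90, tickS = 30) {
  return { min: thresholdS, max: thresholdS + tickS };
}

console.log(isStale(1000, 1080));  // false (80 s since last heartbeat)
console.log(isStale(1000, 1091));  // true  (91 s since last heartbeat)
console.log(detectionWindowS());   // { min: 90, max: 120 }
```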
State Machine
evaluatePeer() drives every transition:
alive (alive=1) → isStale → down (alive=0)
down (alive=0) → heartbeat received (still within threshold) → cooldown (alive=0, recovered_first_hb_at set)
cooldown → cooldown elapsed (cooldown_to_alive) → alive (alive=1)
cooldown → heartbeat gap (cooldown_reset) → down (alive=0)
Transitions:
- alive_to_down: peer was alive, now stale → pivot routes
- down_to_cooldown: peer was down, first heartbeat arrived → start cooldown timer
- cooldown_to_alive: cooldown period elapsed without a heartbeat gap → recovery, restore routes
- cooldown_reset: heartbeat gap during cooldown → back to down (no restore!)
- first_alive: very first heartbeat with no went_down_at marker (e.g. after a DB wipe)
failback_cooldown_s is configurable per pool — getMaxCooldownForPeer(peerId) takes the maximum across all pools the peer belongs to.
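The transitions can be sketched as one function that inspects a peer's state on each watchdog tick and returns the transition name, or null when nothing changes. This is an interpretation of the state machine described above, not the actual evaluatePeer: evaluatePeerSketch and maxCooldown are hypothetical names, timestamps are assumed to be epoch seconds, and the first_alive handling in particular is a guess.

```javascript
// state: { alive, last_seen_at, went_down_at, recovered_first_hb_at }
function evaluatePeerSketch(state, nowS, thresholdS, cooldownS) {
  const stale = state.last_seen_at == null || nowS - state.last_seen_at > thresholdS;

  if (state.alive) {
    if (!stale) return null;
    Object.assign(state, { alive: 0, went_down_at: nowS, recovered_first_hb_at: null });
    return 'alive_to_down';
  }

  if (state.recovered_first_hb_at == null) {        // plain "down"
    if (stale) return null;
    if (state.went_down_at == null) {               // never marked down before
      state.alive = 1;
      return 'first_alive';
    }
    state.recovered_first_hb_at = nowS;             // start the cooldown timer
    return 'down_to_cooldown';
  }

  if (stale) {                                      // heartbeat gap during cooldown
    state.recovered_first_hb_at = null;
    return 'cooldown_reset';
  }
  if (nowS - state.recovered_first_hb_at >= cooldownS) {
    Object.assign(state, { alive: 1, went_down_at: null, recovered_first_hb_at: null });
    return 'cooldown_to_alive';
  }
  return null;                                      // still cooling down
}

// Longest cooldown across all pools the peer belongs to.
function maxCooldown(pools) {
  return pools.reduce((max, p) => Math.max(max, p.failback_cooldown_s), 0);
}

const peer = { alive: 1, last_seen_at: 0, went_down_at: null, recovered_first_hb_at: null };
console.log(evaluatePeerSketch(peer, 100, 90, 60)); // alive_to_down
peer.last_seen_at = 200;
console.log(evaluatePeerSketch(peer, 205, 90, 60)); // down_to_cooldown
peer.last_seen_at = 260;
console.log(evaluatePeerSketch(peer, 265, 90, 60)); // cooldown_to_alive
```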
Snapshot
gatewayHealth.getSnapshot() is the in-memory cache with {[peerId]: {alive, last_seen_at, went_down_at, recovered_first_hb_at}}. Updated by evaluatePeer. caddyConfig.resolveRouteUpstreams reads it during the build.
Boot edge case: snapshot starts empty. resolveRouteUpstreams seeds it from the DB (gateway_meta.alive) so pool routes resolve correctly during the boot Caddy build instead of returning 503.
Edge Cases
All Members Offline
resolveActivePeer returns null / resolveActivePeers returns [] → resolveRouteUpstreams returns outage: true → caddyConfig renders a 503 static response with outage_message (instead of a reverse_proxy to a dead backend). Users see a controlled outage page.
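A minimal sketch of that branch: when no alive peers are left, emit a static 503 handler instead of a reverse proxy. renderPoolHandler and the fallback text are assumptions for this example; static_response with status_code and body is Caddy's JSON handler for fixed responses.

```javascript
// Sketch only: choose between an outage page and a proxy handler.
function renderPoolHandler(alivePeers, outageMessage) {
  if (alivePeers.length === 0) {
    return {
      handler: 'static_response',
      status_code: 503,
      // Hypothetical fallback text when the pool has no outage_message.
      body: outageMessage || 'Service temporarily unavailable',
    };
  }
  return {
    handler: 'reverse_proxy',
    upstreams: alivePeers.map((p) => ({ dial: p.dial })),
  };
}

console.log(renderPoolHandler([], 'Home network is offline').status_code); // 503
console.log(renderPoolHandler([{ dial: '10.8.0.2:18080' }], null).handler); // reverse_proxy
```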
Multiple Failovers
T=0: Home alive, DS918 alive → routes target_peer_id=79
T=1: Home → down → routes target_peer_id=84, original=79
T=2: DS918 → down (Home still down) → no pivot (no alive sibling), routes stay on 84
T=3: Home → recovered → routes target_peer_id=79, original=NULL
(pivot logic finds original=79 → restored)
The WHERE original_peer_id IS NULL guard in the failover UPDATE prevents "double pivot" — a route already pivoted away is NOT touched again on a second failover. This way original_peer_id always points to the truly original pin.
Removing a Peer From a Pool While Routes Are Pinned to It
gatewayPool.replaceMembers checks when emptying a pool (members.length === 0):
- Are there still routes with target_pool_id = poolId?
- Are there still RDP routes with gateway_pool_id = poolId?
- If yes → last_member_in_use error, save aborted
For target_peer_id (peer-pinned without pool binding), removing from the pool is not blocked — the route stays pinned to the peer, just without failover protection.
Peer in Multiple Pools
listPoolsForPeer(peerId) returns all pools the peer belongs to. _onTransition iterates them all:
- On the first pool with an alive sibling: pivot routes, break out of the loop
- Subsequent pools would find nothing on this pivot anyway (WHERE original_peer_id IS NULL no longer matches)
Multi-pool membership is supported, but only one pool wins per state change.
Changing Priority via Drag and Drop
PUT /api/v1/gateway-pools/:id/members with the full member list in DOM order. Backend acts atomically:
DELETE FROM gateway_pool_members WHERE pool_id = ?;
INSERT INTO gateway_pool_members ... -- per member, new priority = index+1
in a transaction. UI order = new priority. Then applyPoolMutationWithSequencing: a single Caddy sync + companion confirm.
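On the client side, turning the dragged order into priorities is a one-line mapping. The sketch below assumes a request body made of peer_id/priority pairs; the exact payload shape of the members endpoint is not documented here, so treat the field names as hypothetical.

```javascript
// Sketch only: list order becomes priority (index + 1), duplicates are rejected.
function toMemberPayload(orderedPeerIds) {
  if (new Set(orderedPeerIds).size !== orderedPeerIds.length) {
    throw new Error('duplicate peer in member list');
  }
  return orderedPeerIds.map((peerId, index) => ({ peer_id: peerId, priority: index + 1 }));
}

// DS918 (84) dragged above Home (79):
console.log(toMemberPayload([84, 79]));
// [ { peer_id: 84, priority: 1 }, { peer_id: 79, priority: 2 } ]
```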
Architecture Reference
File Map
| File | Responsibility |
|---|---|
| src/services/gatewayPool.js | Pool/member CRUD, replaceMembers (atomic), resolveActivePeer(s) |
| src/services/gatewayHealth.js | Heartbeat watchdog, _onTransition (DB pivot on state change), reconcileFailoverState (boot) |
| src/services/caddyConfig.js | buildCaddyConfig, resolveRouteUpstreams (for target_pool_id), HTTP LB + passive HC |
| src/services/l4.js | L4 Caddy server config including multi-upstream + LB |
| src/services/gatewayPoolSync.js | applyPoolMutationWithSequencing: companion confirm + Caddy sync after pool mutation |
| src/services/gateways.js | getGatewayConfig (companion pull), notifyConfigChanged (server push), hash computing |
| src/routes/api/gatewayPools.js | REST API: CRUD, members bulk PUT, migrate-routes |
| src/db/migrationList.js | v40: pools+members, v42: proxy_port, v43: routes.original_peer_id |
DB Schema (relevant)
CREATE TABLE gateway_pools (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL UNIQUE,
mode TEXT NOT NULL, -- 'failover' | 'load_balancing'
lb_policy TEXT, -- NULL for failover, otherwise round_robin/least_conn/ip_hash
failback_cooldown_s INTEGER NOT NULL,
outage_message TEXT,
enabled INTEGER NOT NULL DEFAULT 1,
created_at, updated_at
);
CREATE TABLE gateway_pool_members (
pool_id INTEGER REFERENCES gateway_pools(id) ON DELETE CASCADE,
peer_id INTEGER REFERENCES peers(id) ON DELETE CASCADE,
priority INTEGER NOT NULL,
PRIMARY KEY (pool_id, peer_id)
);
-- routes table (relevant for pools):
-- target_kind TEXT 'gateway' for gateway routes
-- target_peer_id INTEGER currently serving peer (may change via pivot)
-- target_pool_id INTEGER alternative: explicit pool routing (LB)
-- original_peer_id INTEGER v43: original pin for failover restore (NULL = not in failover)
CREATE TABLE gateway_meta (
peer_id INTEGER PRIMARY KEY,
api_port INTEGER NOT NULL DEFAULT 9876,
api_token_hash TEXT NOT NULL,
push_token_encrypted TEXT NOT NULL,
alive INTEGER NOT NULL DEFAULT 1,
last_seen_at INTEGER,
went_down_at INTEGER,
recovered_first_hb_at INTEGER,
last_config_hash TEXT,
proxy_port INTEGER NOT NULL DEFAULT 8080, -- v42: per-peer proxy port (Synology DSM uses 18080)
created_at INTEGER NOT NULL
);
REST API (excerpt)
GET /api/v1/gateway-pools List pools
POST /api/v1/gateway-pools Create pool
GET /api/v1/gateway-pools/:id Get pool with members
PUT /api/v1/gateway-pools/:id Update pool
DELETE /api/v1/gateway-pools/:id Delete pool (rejected if in use)
GET /api/v1/gateway-pools/:id/members List members
POST /api/v1/gateway-pools/:id/members Add single member
PUT /api/v1/gateway-pools/:id/members Bulk replace member set (atomic)
PUT /api/v1/gateway-pools/:id/members/:peerId Set priority of a single member
DELETE /api/v1/gateway-pools/:id/members/:peerId Remove member
GET /api/v1/gateway-pools/migration-candidates List peer-pinned routes + pools (for the UI)
POST /api/v1/gateway-pools/:id/migrate-routes { route_ids: [...] } — bulk migrate
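As a usage illustration for the bulk migrate endpoint, the sketch below builds the request without sending it. The host name is a placeholder, buildMigrateRoutesRequest is a hypothetical helper, and authentication is deliberately left out because the auth scheme is not described in this document.

```javascript
// Sketch only: prepares a POST for /api/v1/gateway-pools/:id/migrate-routes.
function buildMigrateRoutesRequest(baseUrl, poolId, routeIds) {
  const path = `/api/v1/gateway-pools/${encodeURIComponent(poolId)}/migrate-routes`;
  return {
    url: new URL(path, baseUrl).toString(),
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ route_ids: routeIds }),
    },
  };
}

const req = buildMigrateRoutesRequest('https://gatecontrol.example.com', 3, [11, 12]);
console.log(req.url);
// https://gatecontrol.example.com/api/v1/gateway-pools/3/migrate-routes
console.log(req.init.body); // {"route_ids":[11,12]}

// To actually send it (Node 18+): await fetch(req.url, req.init)
```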
Configuration
gateway_down_threshold_s (in the settings table) — global, default 90 s. Lower = faster reaction to outages, but more sensitive to network hiccups. Adjustable via /settings → "Gateway failover" (UI limit: 30–600 s).
Active-Active Pattern (Two Pools With Mirrored Priority)
A single pool has exactly one mode. Active-active is achieved with two pools using mirrored priority:
| Pool | GW1 prio | GW2 prio |
|---|---|---|
| Home network A | 1 | 2 |
| Home network B | 2 | 1 |
Routes for service group A are mapped to pool A (primary GW1), routes for service group B to pool B (primary GW2). When a gateway fails, the other one takes over routes from both pools — load is distributed during normal operation while failover is still fully covered.
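The effect of mirrored priorities can be checked with a small model: each pool picks its highest-priority alive member independently. activeMember and the pool shapes are assumptions made for this example.

```javascript
// Sketch only: highest-priority alive member of a failover pool, or null.
function activeMember(pool, aliveGateways) {
  const winner = pool.members
    .filter((m) => aliveGateways.has(m.gateway))
    .sort((a, b) => a.priority - b.priority)[0];
  return winner ? winner.gateway : null;
}

const poolA = { name: 'Home network A', members: [{ gateway: 'GW1', priority: 1 }, { gateway: 'GW2', priority: 2 }] };
const poolB = { name: 'Home network B', members: [{ gateway: 'GW1', priority: 2 }, { gateway: 'GW2', priority: 1 }] };

// Normal operation: load is split across both gateways.
const both = new Set(['GW1', 'GW2']);
console.log(activeMember(poolA, both), activeMember(poolB, both)); // GW1 GW2

// GW1 fails: GW2 serves both service groups.
const onlyGw2 = new Set(['GW2']);
console.log(activeMember(poolA, onlyGw2), activeMember(poolB, onlyGw2)); // GW2 GW2
```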
Example Setup: Home Network Failover
Typical home setup with two gateways: a Synology NAS (DS918, 192.168.2.151) and a Linux mini-PC ("Home", 192.168.2.5). Both on the same LAN, both can reach all backend hosts.
1. /peers → create "Home Gateway" and "DS918 Gateway", both enabled
2. /gateway-pools → create pool "Home network":
- Mode: Failover
- Failback cooldown: 900s (NAS preset, since DS918 has longer updates)
- Members: Home (position #1), DS918 (position #2)
3. Existing routes need NOTHING — failover kicks in automatically
4. Test:
- Reboot DS918 → Home as primary stays online → no impact
- Reboot Home → routes migrate to DS918 → services stay reachable
- Home recovers → after the 15 min cooldown, back to Home
For pure failover, stick with target_peer_id pinning. Only when you want load balancing (e.g. streaming + backup in parallel across both gateways) is the "Migrate routes" function needed.