
Gateway Pools

v1.0 · Updated 1 month ago

Table of Contents

  1. Problem & Motivation
  2. Concepts
  3. Setup
  4. Failover Mode in Detail
  5. Load Balancing Mode in Detail
  6. Health Monitoring
  7. Edge Cases
  8. Troubleshooting
  9. Architecture Reference

Problem & Motivation

GateControl routes normally pin a domain (nas.example.com) to exactly one gateway peer. The frontend Caddy on the GateControl server proxies to <gateway-ip>:<proxy-port> over the WireGuard tunnel; the companion Caddy on the gateway forwards to the local backend address (192.168.1.5:5001, etc.).

What goes wrong without pools:

  • Reboots of a gateway (Linux updates, Synology updates, Proxmox reboots) make all domains running through it unreachable for the duration of the reboot.
  • Hardware failure = full outage until manually reconfigured.
  • Scheduled maintenance windows require manually re-pinning every route.

What pools solve:

A gateway pool groups multiple gateway peers that can reach the same backend hosts. If one member fails, another takes over automatically — either via failover (priority order) or load balancing (parallel distribution).

Important: Pool members must actually be able to reach the backend hosts (192.168.1.5, etc.). If gateways sit on different LANs, failover does not help, because the failover member cannot reach the backend host at all.


Concepts

Term                  Meaning
Pool                  Logical group of gateway peers (DB: gateway_pools)
Mode                  failover or load_balancing; each pool has exactly one mode
Member                Peer with a priority within a pool (DB: gateway_pool_members)
Priority              Lower number = higher priority. Position 1 is primary.
failback_cooldown_s   After recovery: wait time before routes migrate back to the original member
outage_message        Custom 503 body when all members are down
target_peer_id        The peer ID currently serving a route (may change during failover)
original_peer_id      Where the route was originally pinned; set during failover, reset on recovery
target_pool_id        Alternative routing mode: route is bound explicitly to the pool (used for load balancing)

Two Routing Paths

GateControl supports two separate ways for a route to use a pool:

A) Peer-pinned with implicit failover (most common case)

  • Route has target_peer_id=<gw>, target_pool_id=NULL
  • Frontend Caddy reads target_peer_id directly from the DB
  • On failure: gatewayHealth._onTransition rewrites target_peer_id to the next-highest-priority alive pool member, remembering the original in original_peer_id
  • No Caddy-side runtime resolution needed — the DB is the source of truth

B) Pool-routed via Caddy resolver (for load balancing)

  • Route has target_pool_id=<pool>, target_peer_id=NULL
  • Frontend Caddy resolves at runtime via gatewayPool.resolveActivePeer(s) using the health snapshot
  • For load balancing, Caddy receives all alive members as upstreams[] plus the selection policy
  • For failover mode at the pool level, exactly one member is selected (highest priority alive)

For plain failover you don't need to migrate anything. Pool membership is enough. For load balancing, you must explicitly switch routes to the pool via "Migrate routes".
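
The distinction between the two paths can be sketched as a small classifier over a route row. This is an illustration only: routingPath is a hypothetical helper, not a GateControl function, and it assumes the two columns are mutually exclusive as described above.

```javascript
// Hypothetical helper: classify a route row by the pool path it uses.
// Column names come from the routes table; the return labels are made up.
function routingPath(route) {
  const hasPeer = route.target_peer_id != null;
  const hasPool = route.target_pool_id != null;
  if (hasPeer && !hasPool) return 'peer-pinned'; // path A: implicit failover
  if (hasPool && !hasPeer) return 'pool-routed'; // path B: Caddy resolver
  return 'invalid'; // both or neither set: not covered by the text above
}
```

A route with target_peer_id=79 and target_pool_id=NULL is classified as peer-pinned, so it needs no migration for plain failover.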


Setup

1. Prerequisites

  • At least 2 gateway peers configured (/peers, peer_type=gateway)
  • Both gateways must be able to reach the backend hosts in their LAN
  • License: gateway_pools=true (failover is usually included in the base plan, load balancing often requires Pro)

2. Create the Pool

/gateway-pools → "Create pool":

  1. Name: Meaningful identifier (e.g. Home network)
  2. Mode:
    • Failover (prioritized) — primary gateway serves, backup only takes over on failure
    • Load balancing — all alive members serve in parallel (license-dependent)
  3. LB policy (load balancing only):
    • round_robin — even distribution by order
    • least_conn — member with fewest active connections
    • ip_hash — sticky per client IP (same client → same member)
  4. Failback cooldown: how long to wait after recovery before routes migrate back. Presets:
    • 60 s — Linux container (LXC), fast reboot
    • 180 s — Linux VM
    • 600 s (10 min) — Proxmox host
    • 900 s (15 min) — Synology / QNAP NAS
    • 1800 s (30 min) — Windows server
    • 3600 s (60 min) — Conservative
  5. Outage message (optional): custom 503 body when ALL members are down
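
The form fields above map onto the gateway_pools columns. A minimal sketch of validating them before a create request follows; buildPoolPayload is a hypothetical helper, and the exact request body accepted by POST /api/v1/gateway-pools is an assumption derived from the DB column names.

```javascript
// Assumption: the request body mirrors the gateway_pools column names.
const POOL_LB_POLICIES = ['round_robin', 'least_conn', 'ip_hash'];

function buildPoolPayload({ name, mode, lbPolicy = null, failbackCooldownS, outageMessage = null }) {
  if (!name) throw new Error('name is required');
  if (mode !== 'failover' && mode !== 'load_balancing') {
    throw new Error(`unknown mode: ${mode}`);
  }
  if (mode === 'load_balancing' && !POOL_LB_POLICIES.includes(lbPolicy)) {
    throw new Error(`unknown lb_policy: ${lbPolicy}`);
  }
  if (!Number.isInteger(failbackCooldownS) || failbackCooldownS < 0) {
    throw new Error('failback_cooldown_s must be a non-negative integer');
  }
  return {
    name,
    mode,
    lb_policy: mode === 'failover' ? null : lbPolicy, // NULL for failover pools
    failback_cooldown_s: failbackCooldownS,
    outage_message: outageMessage,
  };
}
```

For example, buildPoolPayload({ name: 'Home network', mode: 'failover', failbackCooldownS: 900 }) yields a body with lb_policy set to null.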

3. Add Members

In the modal on the right:

  1. Pick a gateway from the dropdown → "Add"
  2. Position #1 is primary (highest priority). Reorder via drag and drop.
  3. Already added gateways disappear from the dropdown (no duplicates possible)
  4. "Save" persists atomically (PUT /api/v1/gateway-pools/:id/members)

4. Wire Up Routes

For failover: Nothing to do. As soon as a route's target_peer_id points to a peer that is a pool member, the failover logic applies automatically on that peer's next state change.

For load balancing:

/gateway-pools → "Migrate routes":

  1. Pick the target pool
  2. The list shows all gateway-pinned routes, grouped by source peer
  3. Loopback routes (127.0.0.1) are highlighted yellow and unchecked by default — reason: ssh.example.com → 127.0.0.1:22 means "on the gateway machine itself". Migrating that to another member changes the destination machine.
  4. Review the selection, "OK" → atomic DB update + Caddy resync

Alternatively: edit routes individually via /routes and set target_kind: pool, picking the target pool.
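
The loopback rule from step 3 can be illustrated with a small check. This is a sketch under assumptions: isLoopbackTarget and defaultSelection are hypothetical names, the target field holds a host:port string (IPv6 in brackets), and treating all of 127.0.0.0/8, ::1 and localhost as loopback goes beyond the 127.0.0.1 case named in the text.

```javascript
// Hypothetical check: does a backend target point at the gateway itself?
// Expects "host:port", with IPv6 hosts in brackets, e.g. "[::1]:22".
function isLoopbackTarget(target) {
  const host = target.startsWith('[')
    ? target.slice(1, target.indexOf(']'))
    : target.split(':')[0];
  return host === 'localhost'
    || host === '::1'
    || /^127\.\d{1,3}\.\d{1,3}\.\d{1,3}$/.test(host);
}

// Loopback routes start unchecked; everything else is preselected.
function defaultSelection(routes) {
  return routes.map((r) => ({ id: r.id, checked: !isLoopbackTarget(r.target) }));
}
```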


Failover Mode in Detail

Mechanism

Normal operation:
  routes.target_peer_id = 79 (Home)
  routes.original_peer_id = NULL
  ↓
  Frontend Caddy → 10.8.0.8:18080 (Home companion Caddy)
  → 192.168.1.5:5001 (NAS on Home LAN)

Home goes down:
  watchdogTick (every 30 s) → evaluatePeer → alive=0 → transition='alive_to_down'
  ↓
  _onTransition('alive_to_down', peerId=79):
    UPDATE routes
    SET target_peer_id = 84,         -- DS918, highest priority alive
        original_peer_id = 79,       -- remember source
        updated_at = NOW
    WHERE target_peer_id = 79 AND original_peer_id IS NULL AND target_kind = 'gateway'
    ↓
    syncToCaddy()  -- Caddy reload with new upstream
    notifyConfigChanged(79)  -- Home no longer holds the routes
    notifyConfigChanged(84)  -- DS918 picks them up
  ↓
  Frontend Caddy → 10.8.0.2:18080 (DS918 companion Caddy)
  → 192.168.1.5:5001 (NAS, now reached via DS918)

Home recovers:
  watchdogTick → cooldown_to_alive → transition fires AFTER failback_cooldown_s
  ↓
  _onTransition('cooldown_to_alive', peerId=79):
    UPDATE routes
    SET target_peer_id = original_peer_id,
        original_peer_id = NULL,
        updated_at = NOW
    WHERE original_peer_id = 79
    ↓
    syncToCaddy() + notifyConfigChanged for 79 + 84
  ↓
  Routes back on Home, business as usual.
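
The pivot and restore steps above can be modelled in memory to make the guard conditions explicit. This is a simplified sketch, not the gatewayHealth implementation: routes and members are plain arrays instead of DB rows, and pickFailoverTarget, pivotOnDown and restoreOnRecover are hypothetical names.

```javascript
// members: [{ peer_id, priority }], snapshot: { [peerId]: { alive } }
function pickFailoverTarget(members, snapshot, downPeerId) {
  const alive = members
    .filter((m) => m.peer_id !== downPeerId && snapshot[m.peer_id]?.alive)
    .sort((a, b) => a.priority - b.priority); // lower number = higher priority
  return alive.length ? alive[0].peer_id : null;
}

// Mirrors the alive_to_down UPDATE: only routes not already pivoted are moved.
function pivotOnDown(routes, downPeerId, members, snapshot) {
  const target = pickFailoverTarget(members, snapshot, downPeerId);
  if (target === null) return 0; // no alive sibling, nothing to pivot to
  let moved = 0;
  for (const r of routes) {
    if (r.target_kind === 'gateway' && r.target_peer_id === downPeerId && r.original_peer_id === null) {
      r.original_peer_id = downPeerId; // remember the source
      r.target_peer_id = target;
      moved++;
    }
  }
  return moved;
}

// Mirrors the cooldown_to_alive UPDATE.
function restoreOnRecover(routes, recoveredPeerId) {
  let restored = 0;
  for (const r of routes) {
    if (r.original_peer_id === recoveredPeerId) {
      r.target_peer_id = r.original_peer_id;
      r.original_peer_id = null;
      restored++;
    }
  }
  return restored;
}
```

With Home (79, priority 1) down and DS918 (84, priority 2) alive, pivotOnDown moves the route to 84 and records 79 as the original; restoreOnRecover(routes, 79) reverses it.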

Boot Reconcile

Transitions only fire on state changes. If a peer is already offline at container start, there's no alive_to_down → no pivot. gatewayHealth.reconcileFailoverState() runs once at server boot to catch up:

  1. For each offline pool member: find an alive sibling, pivot routes
  2. For each route with original_peer_id != NULL whose original peer is alive again: migrate the route back

This keeps the DB consistent with the real health state after a container restart.
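
The two reconcile steps could look roughly like the following in-memory model. It is a sketch under assumptions: it covers a single pool, ignores the failback cooldown, and reconcileSketch is a hypothetical stand-in rather than the real reconcileFailoverState.

```javascript
// members: [{ peer_id, priority }], snapshot: { [peerId]: { alive } }
function reconcileSketch(routes, members, snapshot) {
  const isAlive = (id) => Boolean(snapshot[id] && snapshot[id].alive);
  const byPriority = [...members].sort((a, b) => a.priority - b.priority);

  // Step 1: for each offline member, pivot its unpivoted routes to an alive sibling.
  for (const m of byPriority) {
    if (isAlive(m.peer_id)) continue;
    const sibling = byPriority.find((s) => s.peer_id !== m.peer_id && isAlive(s.peer_id));
    if (!sibling) continue;
    for (const r of routes) {
      if (r.target_peer_id === m.peer_id && r.original_peer_id === null) {
        r.original_peer_id = m.peer_id;
        r.target_peer_id = sibling.peer_id;
      }
    }
  }

  // Step 2: migrate routes back when their original peer is alive again.
  for (const r of routes) {
    if (r.original_peer_id !== null && isAlive(r.original_peer_id)) {
      r.target_peer_id = r.original_peer_id;
      r.original_peer_id = null;
    }
  }
  return routes;
}
```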

Observability

Activity log events:

  • gateway_down / gateway_alive — state change per peer
  • pool_failover_activated — routes pivoted onto a sibling (with fromPeerId/toPeerId)
  • pool_failover_restored — routes restored to the original peer
  • pool_outage_started / pool_outage_resolved: all pool members offline / first pool member back online

Webhook (if configured): gateway_state_change with payload {peer_id, alive: bool}.

SQL for the current state:

SELECT id, domain, target_peer_id, original_peer_id
FROM routes WHERE target_kind='gateway' AND enabled=1;

A single row tells you who routes where — original_peer_id != NULL means "currently in failover".


Load Balancing Mode in Detail

Activation

A route uses LB only when all three conditions are true:

  1. route.target_pool_id is set (NOT target_peer_id)
  2. Pool mode = 'load_balancing'
  3. Pool lb_policy ∈ {round_robin, least_conn, ip_hash}
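
The three conditions can be expressed as a single predicate. This is a minimal sketch; usesLoadBalancing is a hypothetical helper, and the object shapes simply mirror the column names used in this document.

```javascript
// Returns true only when all three activation conditions hold.
function usesLoadBalancing(route, pool) {
  return route.target_pool_id != null        // 1. route is pool-bound
    && pool != null
    && route.target_pool_id === pool.id
    && pool.mode === 'load_balancing'        // 2. pool mode
    && ['round_robin', 'least_conn', 'ip_hash'].includes(pool.lb_policy); // 3. valid policy
}
```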

Caddy Configuration (HTTP)

During the build cycle, caddyConfig.resolveRouteUpstreams calls gatewayPool.resolveActivePeers(snapshot) — which returns all alive pool members:

"reverse_proxy": {
  "upstreams": [
    { "dial": "10.8.0.2:18080" },
    { "dial": "10.8.0.8:18080" }
  ],
  "load_balancing": {
    "selection_policy": { "policy": "round_robin" }
  },
  "health_checks": {
    "passive": {
      "fail_duration": "30s",
      "max_fails": 3,
      "unhealthy_status": [500, 502, 503, 504]
    }
  }
}
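
A fragment like the one above could be assembled from the alive members as follows. This is a sketch, not the caddyConfig code: buildReverseProxyFragment is a hypothetical name, the member field ip is an assumption (proxy_port comes from gateway_meta), and the returned object mirrors the excerpt above rather than a complete Caddy handler.

```javascript
// aliveMembers: [{ ip, proxy_port }] as returned by a resolver (assumed shape).
function buildReverseProxyFragment(aliveMembers, policy) {
  return {
    upstreams: aliveMembers.map((m) => ({ dial: `${m.ip}:${m.proxy_port}` })),
    load_balancing: { selection_policy: { policy } },
    health_checks: {
      passive: {
        fail_duration: '30s',
        max_fails: 3,
        unhealthy_status: [500, 502, 503, 504],
      },
    },
  };
}
```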

LB Policies

Policy        Behavior                                When to use
round_robin   Round-robin through the upstream list   Symmetric load, identical members
least_conn    Member with fewest active connections   Long streams (Jellyfin streaming, backups)
ip_hash       Hash over client IP → fixed member      Session affinity without cookies (legacy apps)

Trusted Proxies (for ip_hash behind a CDN/LB)

If GateControl sits behind a private LB or a CDN, the server Caddy only sees the proxy IP as the remote IP — ip_hash would then be ineffective (all requests hit the same bucket). The workaround is built into the srv0 HTTP server:

{
  "trusted_proxies": {
    "source": "static",
    "ranges": [
      "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
      "100.64.0.0/10", "fd00::/8", "::1/128", "127.0.0.0/8"
    ]
  },
  "client_ip_headers": ["X-Forwarded-For"]
}

X-Forwarded-For is only honored when the connection originates from one of the listed ranges. Clients hitting GateControl directly from the internet cannot spoof XFF, because their source IP is not in the trust list.
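
The trust rule can be illustrated with a small model of client IP resolution. This is a simplified sketch and not Caddy's implementation: it handles IPv4 only, takes the leftmost X-Forwarded-For entry, and ipv4ToInt, inCidr and effectiveClientIp are hypothetical helpers.

```javascript
// Parse dotted-quad IPv4 into an unsigned 32-bit integer, or null if invalid.
function ipv4ToInt(ip) {
  const parts = ip.split('.');
  if (parts.length !== 4 || !parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255)) return null;
  const [a, b, c, d] = parts.map(Number);
  return ((a << 24) | (b << 16) | (c << 8) | d) >>> 0;
}

function inCidr(ip, cidr) {
  const [base, bitsStr] = cidr.split('/');
  const bits = Number(bitsStr);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null) return false; // IPv6 is not handled here
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipInt & mask) >>> 0) === ((baseInt & mask) >>> 0);
}

// IPv4 subset of the trusted ranges listed above.
const TRUSTED_V4 = ['10.0.0.0/8', '172.16.0.0/12', '192.168.0.0/16', '100.64.0.0/10', '127.0.0.0/8'];

// X-Forwarded-For is honored only when the connection comes from a trusted range.
function effectiveClientIp(remoteIp, xForwardedFor) {
  const trusted = TRUSTED_V4.some((cidr) => inCidr(remoteIp, cidr));
  if (!trusted || !xForwardedFor) return remoteIp;
  return xForwardedFor.split(',')[0].trim();
}
```

A direct internet client at 203.0.113.7 that sends its own X-Forwarded-For header is still hashed by 203.0.113.7, while a request relayed by a proxy at 10.0.0.5 is hashed by the forwarded address.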

Passive Health Checks

In addition to the gatewayHealth watchdog (every 30 s), Caddy itself checks passively:

  • Sends a request to member X
  • X returns a 5xx status (or fails to connect)
  • After max_fails=3 such failures, X is removed from rotation for fail_duration=30s
  • After 30 s, retried

This lets circuit breaking kick in within seconds instead of waiting for the global health logic's gateway_down_threshold_s (default 90 s).
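
The counting behavior can be modelled with a small tracker. This is a simplified sketch of passive health checking, not Caddy's code: it assumes each failure is remembered for fail_duration and an upstream is skipped while max_fails or more failures are remembered. PassiveHealthTracker is a hypothetical class.

```javascript
class PassiveHealthTracker {
  constructor({ maxFails = 3, failDurationMs = 30000 } = {}) {
    this.maxFails = maxFails;
    this.failDurationMs = failDurationMs;
    this.failures = new Map(); // upstream -> array of failure timestamps (ms)
  }

  // Call on a 5xx response or a connection error.
  recordFailure(upstream, nowMs) {
    const list = this.failures.get(upstream) || [];
    list.push(nowMs);
    this.failures.set(upstream, list);
  }

  // An upstream is in rotation while fewer than maxFails failures are remembered.
  isHealthy(upstream, nowMs) {
    const recent = (this.failures.get(upstream) || [])
      .filter((t) => nowMs - t < this.failDurationMs); // drop expired failures
    this.failures.set(upstream, recent);
    return recent.length < this.maxFails;
  }
}
```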

L4 Load Balancing (TCP/UDP)

Works the same way for L4 routes — caddy-l4's proxy handler accepts multiple upstreams[] plus load_balancing.selection_policy:

{
  "handler": "proxy",
  "upstreams": [
    { "dial": ["10.8.0.2:13389"] },
    { "dial": ["10.8.0.8:13389"] }
  ],
  "load_balancing": {
    "selection_policy": { "policy": "round_robin" }
  },
  "health_checks": {
    "passive": { "fail_duration": "30s", "max_fails": 3 }
  }
}

L4 routes must have route_type='l4' and be pool-bound via target_pool_id.


Health Monitoring

Heartbeat Model

  • The companion on each gateway sends a heartbeat (HTTPS) to the server every ~10 s
  • Server stores gateway_meta.last_seen_at
  • evaluatePeer (every 30 s per watchdog tick) checks now - last_seen_at against gateway_down_threshold_s (default 90 s)

State Machine

       evaluatePeer()                 evaluatePeer()
            │                              │
            ▼                              ▼
    ┌───────────┐  isStale       ┌───────────┐
    │  alive=1  │ ────────────▶ │  alive=0  │
    │           │                │           │
    │  (alive)  │ ◀──────────── │  (down)   │
    └───────────┘  cooldown      └───────────┘
                   _to_alive            │
                       ▲                │ heartbeat received
                       │                │ (still within threshold)
                       │                ▼
                       │          ┌───────────┐
                       │          │  alive=0  │
                       │          │recovered_ │
                       │          │first_hb_at│
                       └───────── │(cooldown) │
                                  └───────────┘
                                       │
                                       │ heartbeat gap
                                       ▼
                                  cooldown_reset
                                  (back to down)

Transitions:

  • alive_to_down — peer was alive, now stale → pivot routes
  • down_to_cooldown — peer was down, first heartbeat arrived → start cooldown timer
  • cooldown_to_alive — cooldown period elapsed without a heartbeat gap → recovery, restore routes
  • cooldown_reset — heartbeat gap during cooldown → back to down (no restore!)
  • first_alive — very first heartbeat with no went_down_at marker (e.g. after a DB wipe)
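
The transitions listed above can be sketched as a single evaluation function over the per-peer state. This is a model under assumptions, not the gatewayHealth source: times are plain seconds, the cooldown timer starts at the evaluation that first sees a fresh heartbeat, and evaluatePeerSketch is a hypothetical name.

```javascript
// Evaluates one peer at time `now` (seconds). Returns the transition name,
// or null when nothing changes. Mutates `state` in place.
function evaluatePeerSketch(state, now, { thresholdS = 90, cooldownS = 60 } = {}) {
  const stale = state.last_seen_at == null || now - state.last_seen_at > thresholdS;

  if (state.alive) {
    if (!stale) return null;
    state.alive = 0;
    state.went_down_at = now;
    state.recovered_first_hb_at = null;
    return 'alive_to_down';
  }

  if (state.recovered_first_hb_at == null) {
    // Down, no recovery in progress.
    if (stale) return null;
    if (state.went_down_at == null) {
      state.alive = 1; // no went_down_at marker, e.g. after a DB wipe
      return 'first_alive';
    }
    state.recovered_first_hb_at = now; // start the cooldown timer
    return 'down_to_cooldown';
  }

  // Cooldown in progress.
  if (stale) {
    state.recovered_first_hb_at = null; // heartbeat gap: back to down
    return 'cooldown_reset';
  }
  if (now - state.recovered_first_hb_at >= cooldownS) {
    state.alive = 1;
    state.went_down_at = null;
    state.recovered_first_hb_at = null;
    return 'cooldown_to_alive';
  }
  return null;
}
```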

failback_cooldown_s is configurable per pool — getMaxCooldownForPeer(peerId) takes the maximum across all pools the peer belongs to.
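
Taking the maximum across pools could look like this. It is a sketch only: maxCooldownForPeer is a hypothetical stand-in for getMaxCooldownForPeer, it works on plain arrays instead of DB queries, and returning 0 for a peer without pools is an assumption.

```javascript
// pools: [{ id, failback_cooldown_s }], members: [{ pool_id, peer_id }]
function maxCooldownForPeer(peerId, pools, members) {
  const poolIds = new Set(
    members.filter((m) => m.peer_id === peerId).map((m) => m.pool_id)
  );
  const cooldowns = pools
    .filter((p) => poolIds.has(p.id))
    .map((p) => p.failback_cooldown_s);
  return cooldowns.length ? Math.max(...cooldowns) : 0; // assumption: 0 when not in any pool
}
```

A peer that belongs to one pool with a 60 s cooldown and another with 900 s would wait 900 s before its routes are restored.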

Snapshot

gatewayHealth.getSnapshot() is the in-memory cache with {[peerId]: {alive, last_seen_at, went_down_at, recovered_first_hb_at}}. Updated by evaluatePeer. caddyConfig.resolveRouteUpstreams reads it during the build.

Boot edge case: snapshot starts empty. resolveRouteUpstreams seeds it from the DB (gateway_meta.alive) so pool routes resolve correctly during the boot Caddy build instead of returning 503.


Edge Cases

All Members Offline

resolveActivePeer returns null / resolveActivePeers returns [] → resolveRouteUpstreams returns outage: true → caddyConfig renders a 503 static response with outage_message (instead of a reverse_proxy to a dead backend). Users see a controlled outage page.
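
The outage path can be sketched end to end as follows. This is an illustration, not the caddyConfig source: resolveUpstreamsSketch and renderHandlerSketch are hypothetical names, the member field dial is an assumed precomputed address, and the fallback message text is made up.

```javascript
// members: [{ peer_id, priority, dial }], snapshot: { [peerId]: { alive } }
function resolveUpstreamsSketch(pool, members, snapshot) {
  const alive = members
    .filter((m) => snapshot[m.peer_id] && snapshot[m.peer_id].alive)
    .sort((a, b) => a.priority - b.priority);
  if (alive.length === 0) return { outage: true, upstreams: [] };
  // Failover pools use only the highest-priority alive member.
  const selected = pool.mode === 'load_balancing' ? alive : [alive[0]];
  return { outage: false, upstreams: selected.map((m) => ({ dial: m.dial })) };
}

function renderHandlerSketch(pool, resolved) {
  if (resolved.outage) {
    return {
      handler: 'static_response',
      status_code: 503,
      body: pool.outage_message || 'Service temporarily unavailable', // fallback text is an assumption
    };
  }
  return { handler: 'reverse_proxy', upstreams: resolved.upstreams };
}
```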

Multiple Failovers

T=0: Home alive, DS918 alive          → routes target_peer_id=79
T=1: Home → down                      → routes target_peer_id=84, original=79
T=2: DS918 → down (Home still down)   → no pivot (no alive sibling), routes stay on 84
T=3: Home → recovered                 → routes target_peer_id=79, original=NULL
                                        (pivot logic finds original=79 → restored)

The WHERE original_peer_id IS NULL guard in the failover UPDATE prevents "double pivot" — a route already pivoted away is NOT touched again on a second failover. This way original_peer_id always points to the truly original pin.

Removing a Peer From a Pool While Routes Are Pinned to It

When a save would empty a pool (members.length === 0), gatewayPool.replaceMembers checks:

  • Are there still routes with target_pool_id = poolId?
  • Are there still RDP routes with gateway_pool_id = poolId?
  • If yes → last_member_in_use error, save aborted

For peer-pinned routes (target_peer_id set, no pool binding), removing the peer from the pool is not blocked: the route stays pinned to the peer, just without failover protection.
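
The guard could be sketched as a check that runs before the member set is replaced. This is a model only: assertPoolCanBeEmptied is a hypothetical name, routes and RDP routes are plain arrays, and attaching the error code as err.code is an assumption.

```javascript
// Throws when the new member list is empty while pool-bound routes still exist.
function assertPoolCanBeEmptied(poolId, newMembers, routes, rdpRoutes) {
  if (newMembers.length > 0) return; // the guard only applies when emptying the pool
  const inUse = routes.some((r) => r.target_pool_id === poolId)
    || rdpRoutes.some((r) => r.gateway_pool_id === poolId);
  if (inUse) {
    const err = new Error('last_member_in_use');
    err.code = 'last_member_in_use';
    throw err;
  }
}
```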

Peer in Multiple Pools

listPoolsForPeer(peerId) returns all pools the peer belongs to. _onTransition iterates them all:

  • On the first pool with an alive sibling: pivot routes, break out of the loop
  • Subsequent pools would find nothing on this pivot anyway (WHERE original_peer_id IS NULL no longer matches)

Multi-pool membership is supported, but only one pool wins per state change.

Changing Priority via Drag and Drop

PUT /api/v1/gateway-pools/:id/members with the full member list in DOM order. Backend acts atomically:

DELETE FROM gateway_pool_members WHERE pool_id = ?;
INSERT INTO gateway_pool_members ... -- per member, new priority = index+1

Both statements run in a single transaction, and the order shown in the UI becomes the new priority. applyPoolMutationWithSequencing then performs a single Caddy sync plus companion confirm.
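
Deriving the rows to insert from the submitted order can be sketched as below. toMemberRows is a hypothetical helper, and the assumption that the request carries an ordered list of peer IDs is not confirmed by the text above.

```javascript
// orderedPeerIds: peer IDs in UI order, first entry is primary.
function toMemberRows(poolId, orderedPeerIds) {
  if (new Set(orderedPeerIds).size !== orderedPeerIds.length) {
    throw new Error('duplicate peer in member list');
  }
  return orderedPeerIds.map((peerId, index) => ({
    pool_id: poolId,
    peer_id: peerId,
    priority: index + 1, // position 1 is primary
  }));
}
```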


Architecture Reference

File Map

File                              Responsibility
src/services/gatewayPool.js       Pool/member CRUD, replaceMembers (atomic), resolveActivePeer(s)
src/services/gatewayHealth.js     Heartbeat watchdog, _onTransition (DB pivot on state change), reconcileFailoverState (boot)
src/services/caddyConfig.js       buildCaddyConfig, resolveRouteUpstreams (for target_pool_id), HTTP LB + passive HC
src/services/l4.js                L4 Caddy server config including multi-upstream + LB
src/services/gatewayPoolSync.js   applyPoolMutationWithSequencing: companion confirm + Caddy sync after pool mutation
src/services/gateways.js          getGatewayConfig (companion pull), notifyConfigChanged (server push), hash computing
src/routes/api/gatewayPools.js    REST API: CRUD, members bulk PUT, migrate-routes
src/db/migrationList.js           v40: pools+members, v42: proxy_port, v43: routes.original_peer_id

DB Schema (relevant)

CREATE TABLE gateway_pools (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL UNIQUE,
  mode TEXT NOT NULL,            -- 'failover' | 'load_balancing'
  lb_policy TEXT,                -- NULL for failover, otherwise round_robin/least_conn/ip_hash
  failback_cooldown_s INTEGER NOT NULL,
  outage_message TEXT,
  enabled INTEGER NOT NULL DEFAULT 1,
  created_at, updated_at
);

CREATE TABLE gateway_pool_members (
  pool_id INTEGER REFERENCES gateway_pools(id) ON DELETE CASCADE,
  peer_id INTEGER REFERENCES peers(id) ON DELETE CASCADE,
  priority INTEGER NOT NULL,
  PRIMARY KEY (pool_id, peer_id)
);

-- routes table (relevant for pools):
--   target_kind        TEXT       'gateway' for gateway routes
--   target_peer_id     INTEGER    currently serving peer (may change via pivot)
--   target_pool_id     INTEGER    alternative: explicit pool routing (LB)
--   original_peer_id   INTEGER    v43: original pin for failover restore (NULL = not in failover)

CREATE TABLE gateway_meta (
  peer_id INTEGER PRIMARY KEY,
  api_port INTEGER NOT NULL DEFAULT 9876,
  api_token_hash TEXT NOT NULL,
  push_token_encrypted TEXT NOT NULL,
  alive INTEGER NOT NULL DEFAULT 1,
  last_seen_at INTEGER,
  went_down_at INTEGER,
  recovered_first_hb_at INTEGER,
  last_config_hash TEXT,
  proxy_port INTEGER NOT NULL DEFAULT 8080,    -- v42: per-peer proxy port (Synology DSM uses 18080)
  created_at INTEGER NOT NULL
);

REST API (excerpt)

GET    /api/v1/gateway-pools                                List pools
POST   /api/v1/gateway-pools                                Create pool
GET    /api/v1/gateway-pools/:id                            Get pool with members
PUT    /api/v1/gateway-pools/:id                            Update pool
DELETE /api/v1/gateway-pools/:id                            Delete pool (rejected if in use)

GET    /api/v1/gateway-pools/:id/members                    List members
POST   /api/v1/gateway-pools/:id/members                    Add single member
PUT    /api/v1/gateway-pools/:id/members                    Bulk replace member set (atomic)
PUT    /api/v1/gateway-pools/:id/members/:peerId            Set priority of a single member
DELETE /api/v1/gateway-pools/:id/members/:peerId            Remove member

GET    /api/v1/gateway-pools/migration-candidates           List peer-pinned routes + pools (for the UI)
POST   /api/v1/gateway-pools/:id/migrate-routes             { route_ids: [...] } — bulk migrate

Configuration

gateway_down_threshold_s (in the settings table) — global, default 90 s. Lower = faster reaction to outages, but more sensitive to network hiccups. Adjustable via /settings → "Gateway failover" (UI limit: 30–600 s).
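
The trade-off can be made concrete with a little arithmetic. This sketch assumes the watchdog ticks at a fixed 30 s interval and that staleness is measured from the last heartbeat; clampThreshold and worstCaseDetectionS are hypothetical helpers.

```javascript
const WATCHDOG_TICK_S = 30;

// Keep the setting inside the UI limit of 30 to 600 seconds.
function clampThreshold(valueS) {
  return Math.min(600, Math.max(30, Math.round(valueS)));
}

// Upper bound on the time between the last heartbeat and the alive_to_down
// transition: the threshold must pass, then the next tick must run.
function worstCaseDetectionS(thresholdS, tickS = WATCHDOG_TICK_S) {
  return thresholdS + tickS;
}
```

With the default of 90 s, a silent gateway is detected at most about 120 s after its last heartbeat under these assumptions.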

Active-Active Pattern (Two Pools With Mirrored Priority)

A single pool has exactly one mode. Active-active is achieved with two pools using mirrored priority:

Pool             GW1 prio   GW2 prio
Home network A   1          2
Home network B   2          1

Routes for service group A are mapped to pool A (primary GW1), routes for service group B to pool B (primary GW2). When a gateway fails, the other one takes over routes from both pools — load is distributed during normal operation while failover is still fully covered.
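
The mirrored-priority pattern can be checked with a small simulation. This is a sketch with made-up data structures: activeGateway is a hypothetical helper that picks the highest-priority alive member of a failover pool.

```javascript
// poolMembers: [{ gw, priority }], alive: Set of gateway names currently up.
function activeGateway(poolMembers, alive) {
  const candidates = poolMembers
    .filter((m) => alive.has(m.gw))
    .sort((a, b) => a.priority - b.priority);
  return candidates.length ? candidates[0].gw : null;
}

const poolA = [{ gw: 'GW1', priority: 1 }, { gw: 'GW2', priority: 2 }];
const poolB = [{ gw: 'GW1', priority: 2 }, { gw: 'GW2', priority: 1 }];
```

With both gateways up, pool A resolves to GW1 and pool B to GW2; with GW1 down, both pools resolve to GW2.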


Example Setup: Home Network Failover

Typical home setup with two gateways: a Synology NAS (DS918, 192.168.2.151) and a Linux mini-PC ("Home", 192.168.2.5). Both on the same LAN, both can reach all backend hosts.

1. /peers → create "Home Gateway" and "DS918 Gateway", both enabled
2. /gateway-pools → create pool "Home network":
   - Mode: Failover
   - Failback cooldown: 900s (NAS preset, since DS918 has longer updates)
   - Members: Home (position #1), DS918 (position #2)
3. Existing routes need NOTHING — failover kicks in automatically
4. Test:
   - Reboot DS918 → Home as primary stays online → no impact
   - Reboot Home → routes migrate to DS918 → services stay reachable
   - Home recovers → after the 15 min cooldown, back to Home

For pure failover, stick with target_peer_id pinning. Only when you want load balancing (e.g. streaming + backup in parallel across both gateways) is the "Migrate routes" function needed.
