Monitoring & Resilience

What does it do?

Uptime Monitoring checks at regular intervals whether the backends of your routes are reachable. Without monitoring, you only see an error when a user reports it.

Without monitoring:

Client  →  Caddy  →  Backend (crashed)  →  502 Bad Gateway
                                           ↑ nobody knows

With monitoring:

Monitor probes every 60s  →  Backend does not respond  →  Status: DOWN (red)
                                                         → Email alert
                                                         → Webhook: route_down
                                                         → Circuit Breaker reacts

How it works internally

GateControl starts a poller that, at the configured interval (default: 60 seconds), checks all routes with monitoring_enabled = 1.

HTTP routes (Layer 7):

HTTP GET to http(s)://<peer IP>:<target port>/
User-Agent: GateControl-Monitor/1.0
Expectation: status code 200-399 = UP, anything else = DOWN
With backend_https: HTTPS with rejectUnauthorized: false (accepts self-signed)
Timeout: configured in config.timeouts.monitorHttp
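
The HTTP check described above can be sketched roughly as follows. This is a minimal illustration, not GateControl's actual implementation: the function names `buildProbeUrl`, `classifyHttpStatus` and `probeHttp` are hypothetical, and the sketch assumes an IPv4 peer address.

```javascript
const http = require('node:http');
const https = require('node:https');

// Hypothetical helper: builds the probe target http(s)://<peer IP>:<target port>/
function buildProbeUrl(peerIp, targetPort, backendHttps) {
  const scheme = backendHttps ? 'https' : 'http';
  return `${scheme}://${peerIp}:${targetPort}/`;
}

// Status codes 200-399 count as UP, anything else as DOWN.
function classifyHttpStatus(statusCode) {
  return statusCode >= 200 && statusCode <= 399 ? 'up' : 'down';
}

// Hypothetical probe: resolves to 'up' or 'down' and never rejects.
function probeHttp(peerIp, targetPort, backendHttps, timeoutMs) {
  const client = backendHttps ? https : http;
  const options = {
    method: 'GET',
    headers: { 'User-Agent': 'GateControl-Monitor/1.0' },
    timeout: timeoutMs,
  };
  // Accept self-signed certificates on HTTPS backends.
  if (backendHttps) options.rejectUnauthorized = false;

  return new Promise((resolve) => {
    const url = buildProbeUrl(peerIp, targetPort, backendHttps);
    const req = client.request(url, options, (res) => {
      res.resume(); // discard the body, only the status code matters
      resolve(classifyHttpStatus(res.statusCode));
    });
    // Node only emits 'timeout'; the request has to be aborted manually.
    req.on('timeout', () => req.destroy(new Error('timeout')));
    req.on('error', () => resolve('down'));
    req.end();
  });
}
```

A redirect (3xx) therefore still counts as UP, while a 404 or 500 from the backend counts as DOWN.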

L4 routes (TCP/UDP):

TCP Connect to the backend port
Successful connection establishment = UP, timeout/error = DOWN
Timeout: configured in config.timeouts.monitorTcp
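
A TCP connect check of this kind could look like the sketch below. The function name `checkTcp` and the shape of its result are assumptions made for illustration; only the rule "connection established = UP, timeout/error = DOWN" comes from the description above.

```javascript
const net = require('node:net');

// Hypothetical helper: resolves with { status, responseTimeMs } and never rejects.
function checkTcp(host, port, timeoutMs) {
  return new Promise((resolve) => {
    const started = Date.now();
    const socket = net.connect({ host, port });
    let settled = false;

    const finish = (status) => {
      if (settled) return; // only the first outcome counts
      settled = true;
      socket.destroy();
      resolve({ status, responseTimeMs: Date.now() - started });
    };

    socket.setTimeout(timeoutMs);
    socket.on('connect', () => finish('up'));   // handshake completed
    socket.on('timeout', () => finish('down')); // no answer in time
    socket.on('error', () => finish('down'));   // refused, unreachable, ...
  });
}
```

Note that a successful connect only proves that something is listening on the port, not that the service behind it is healthy.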

Parallelization: Max 10 concurrent checks per cycle.
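
One common way to cap the number of concurrent checks is a small worker pool. The sketch below illustrates the idea; `runWithLimit` is a hypothetical helper, and how GateControl actually enforces the limit of 10 is not shown in this document.

```javascript
// Runs async task functions with at most `limit` of them in flight at once.
// Results keep the order of `tasks`; a throwing task is recorded as DOWN.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  async function worker() {
    while (next < tasks.length) {
      const index = next++; // claim the next unprocessed task
      try {
        results[index] = await tasks[index]();
      } catch (err) {
        results[index] = { status: 'down', error: String(err) };
      }
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With 25 monitored routes and a limit of 10, the first 10 checks start immediately and each remaining check starts as soon as a slot frees up.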

Auto-WoL for gateway routes: If a gateway route with wol_enabled = 1 switches from UP to DOWN, the monitor calls handleRouteDownDetected, which sends a magic packet via the gateway (gateways.notifyWol, timeout 60s). This automatically wakes the LAN host when it has gone down — see concepts/home-gateway.md.

Fields stored per route:

Field	Description
`monitoring_status`	`up`, `down` or `unknown`
`monitoring_last_check`	Timestamp of the last check (ISO 8601)
`monitoring_response_time`	Response time in milliseconds
`monitoring_last_change`	Timestamp of the last status change
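
How a check result might be written into these fields is sketched below. The field names come from the table above; the function `applyCheckResult` and the shape of `result` are assumptions for illustration.

```javascript
// Returns an updated copy of the route record after one check.
// `result` is assumed to look like { status: 'up' | 'down', responseTimeMs: number }.
function applyCheckResult(route, result, now = new Date()) {
  const timestamp = now.toISOString(); // ISO 8601
  const changed = route.monitoring_status !== result.status;
  return {
    ...route,
    monitoring_status: result.status,
    monitoring_last_check: timestamp,
    monitoring_response_time: result.responseTimeMs,
    // Only touched when the status flips (e.g. up -> down).
    monitoring_last_change: changed ? timestamp : route.monitoring_last_change,
  };
}
```

The distinction between the two timestamps is what allows a dashboard to show both "last checked" and "down since".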

Use cases

Monitoring a Synology NAS

Route nas.example.com → Port 5001 (DSM). Monitoring detects when the NAS restarts after an update. You get an email when it goes down, and a second one when it's reachable again.

Multiple services on one server

Three routes point to the same peer, but different ports (3000, 8080, 5432). One service crashes — monitoring shows exactly which one. The others stay green.

Enabling the Circuit Breaker

Monitoring is a prerequisite for the Circuit Breaker. Only once monitoring detects an outage can the Circuit Breaker block the route and return 503 instead of sending requests into the void.

Combination with other features

Combination	Effect
Monitoring + Circuit Breaker	Monitoring checks drive the Circuit Breaker's state machine
Monitoring + Webhooks	Events `route_down` / `route_up` to external systems (Slack, Discord, etc.)
Monitoring + Email alerts	Immediate notification on status change
Monitoring + L4 routes	TCP check instead of HTTP check, detects port reachability

Important notes

The first check runs 10 seconds after GateControl starts — so all services have time to come up.
Monitoring checks the direct connection to the backend (peer IP + port), not the public domain access via Caddy.
For backend_https routes HTTPS is used, but the certificate is not validated — self-signed works.
Email alerts require a working SMTP configuration under Settings → Email.
Webhook events are called route_down and route_up (not route_monitor_down/route_monitor_up).
The monitoring interval applies globally to all routes — individual intervals per route are not possible.
If monitoring is disabled, the last status remains (it is not reset to unknown).

What does it do?

The Circuit Breaker detects when a backend is repeatedly unreachable and switches the route into a blocked state. Instead of sending requests into the void (and keeping clients waiting), Caddy responds immediately with 503.

Without Circuit Breaker:

Client 1  →  Caddy  →  Backend (dead)  →  30s timeout  →  502
Client 2  →  Caddy  →  Backend (dead)  →  30s timeout  →  502
Client 3  →  Caddy  →  Backend (dead)  →  30s timeout  →  502
... 100 clients wait 30 seconds simultaneously ...

With Circuit Breaker:

Monitoring: Backend dead (5x in a row)  →  Circuit Breaker: OPEN
Client 1  →  Caddy  →  503 "Service temporarily unavailable" (immediately, <1ms)
Client 2  →  Caddy  →  503 (immediately)
... after 30s timeout ...
Monitoring: Backend back  →  Circuit Breaker: CLOSED
Client 3  →  Caddy  →  Backend  →  200 OK  ✓

How it works internally

The Circuit Breaker implements a state machine with three states:

         Threshold failures reached
 CLOSED ──────────────────────────────→ OPEN
   ↑                                      │
   │  Check successful                    │ Timeout elapsed
   │                                      ↓
   └──────────────────────────────── HALF-OPEN
           Check failed ──→ OPEN

States:

Status	Caddy behavior	Badge color
Closed	Normal operation, requests are forwarded	Green
Open	Caddy immediately returns `503` with `Retry-After` header	Red
Half-Open	Monitoring check is allowed through; on success → Closed, on failure → Open	Amber

Configurable values:

Parameter	Default	Description
Threshold	5	Consecutive failures before the circuit opens
Timeout	30s	Seconds in open state before a half-open test takes place

Detailed flow:

1. Monitoring probes the backend periodically
2. On failure: failure counter is incremented (cb_failure_count in the routes table, persisted across restarts)
3. On success: counter is reset to 0
4. Counter reaches threshold → status switches to open, cb_opened_at is set
5. Caddy config is rebuilt: the route serves a static 503 response
6. After timeout seconds → status switches to half-open
7. Next monitoring check decides:
   - Success → closed, Caddy config is restored
   - Failure → open, timer restarts
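
The flow above can be condensed into a small pure function. This is a simplified sketch under stated assumptions, not the actual GateControl code: `nextBreakerState` and the field `cb_status` are hypothetical names, `cb_opened_at` is stored as a millisecond timestamp here, and the open → half-open transition is evaluated at the moment a check result arrives.

```javascript
// Computes the next circuit breaker state from one monitoring check result.
function nextBreakerState(route, checkOk, nowMs, { threshold = 5, timeoutSec = 30 } = {}) {
  let status = route.cb_status;
  let failures = route.cb_failure_count;
  let openedAt = route.cb_opened_at;

  // After `timeoutSec` in the open state, the circuit becomes half-open.
  if (status === 'open' && openedAt !== null && nowMs - openedAt >= timeoutSec * 1000) {
    status = 'half-open';
  }

  if (checkOk) {
    failures = 0;
    if (status === 'half-open') {
      status = 'closed'; // test check succeeded, traffic flows again
      openedAt = null;
    }
    // Assumption: a success while still open does not close the circuit early.
  } else {
    failures += 1;
    const tripped = status === 'closed' && failures >= threshold;
    if (tripped || status === 'half-open') {
      status = 'open'; // (re)open and restart the timer
      openedAt = nowMs;
    }
  }

  return { ...route, cb_status: status, cb_failure_count: failures, cb_opened_at: openedAt };
}
```

Whenever the returned status differs from the previous one, the Caddy config would have to be rebuilt so that the route either serves the static 503 response or proxies normally again.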

Caddy configuration in the open state (from src/services/caddyConfig.js):

{
  "handle": [{
    "handler": "static_response",
    "status_code": "503",
    "body": "Service temporarily unavailable",
    "headers": { "Retry-After": ["30"] }
  }]
}

The Retry-After value matches the configured timeout (default 30).
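
A builder for this handler might look like the following sketch. The function name `buildOpenCircuitHandlers` is hypothetical; the emitted structure mirrors the JSON shown above.

```javascript
// Builds the Caddy handler list for a route whose circuit is open.
// `timeoutSec` is the configured circuit breaker timeout in seconds.
function buildOpenCircuitHandlers(timeoutSec = 30) {
  return [{
    handler: 'static_response',
    status_code: '503',
    body: 'Service temporarily unavailable',
    // Caddy expects header values as arrays of strings.
    headers: { 'Retry-After': [String(timeoutSec)] },
  }];
}
```

Deriving Retry-After from the same timeout value keeps the hint sent to clients consistent with the moment the half-open test takes place.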

Use cases

Preventing request pile-ups when the backend is dead

Without Circuit Breaker, all incoming requests wait for Caddy's timeout (30s). With 100 concurrent clients that means 100 blocked connections. With Circuit Breaker, all of them are answered with 503 immediately.

Preventing thundering herd on recovery

The backend was down for 5 minutes, and 1000 clients are waiting to retry. Without the Circuit Breaker, all 1000 requests hit the freshly started backend at once. In the half-open state, the Circuit Breaker lets only a single monitoring check through; only after that check succeeds does the circuit close and client traffic reach the backend again.

Fast feedback for better UX

Instead of waiting 30 seconds for a timeout, the user immediately sees a "Service temporarily unavailable" page. The response carries a Retry-After header, which well-behaved clients such as API consumers and crawlers can use to schedule their next attempt.

Combination with other features

Combination	Effect
Circuit Breaker + Monitoring	Mandatory: monitoring checks drive the state machine
Circuit Breaker + Retry	Retry tries with a closed circuit; with an open circuit: immediately 503
Circuit Breaker + Load Balancing	Circuit Breaker kicks in when all backends are down
Circuit Breaker + Webhooks	Events `circuit_breaker_open` / `circuit_breaker_closed`

Important notes

Monitoring is mandatory. Without Uptime Monitoring enabled, the Circuit Breaker has no data source and always stays in the closed state.
Failure counter and open timestamp are persisted in the database (cb_failure_count, cb_opened_at). Open circuits survive restarts; if the timestamp is missing after a restart, it is re-set on the first check run.
The Circuit Breaker operates per route, not per backend. For load balancing with multiple backends, the circuit opens when the monitoring target is unreachable.
In the open state no requests are forwarded to the backend; Caddy responds with 503 Service Unavailable plus a Retry-After header. There is no bypass via the API or for individual requests.
Manual reset (since v1.50.4): POST /api/v1/routes/:id/circuit-breaker/reset or the Reset circuit breaker button in the route edit modal (only visible when status ≠ closed). Sets cb_failure_count = 0, cb_opened_at = NULL, status to closed, and re-renders the Caddy config immediately. Without this reset, an open breaker waits for the next monitoring cycle and runs through the normal open → half-open → closed path.
Circuit Breaker is only available for HTTP routes, not for L4 (TCP/UDP).
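
Calling the manual reset endpoint from a script could look like the sketch below. Only the path POST /api/v1/routes/:id/circuit-breaker/reset is taken from the notes above; the function name `resetCircuitBreaker`, the Bearer token authentication and the base URL are assumptions, so check your installation's API authentication before using it.

```javascript
// Hypothetical client helper. `fetchImpl` defaults to the global fetch (Node 18+)
// and can be swapped out, e.g. for testing.
async function resetCircuitBreaker(baseUrl, routeId, apiToken, fetchImpl = fetch) {
  const url = `${baseUrl}/api/v1/routes/${encodeURIComponent(routeId)}/circuit-breaker/reset`;
  const res = await fetchImpl(url, {
    method: 'POST',
    // Assumption: token-based auth via an Authorization header.
    headers: { Authorization: `Bearer ${apiToken}` },
  });
  if (!res.ok) {
    throw new Error(`Circuit breaker reset failed with HTTP ${res.status}`);
  }
  return res.status;
}
```

For example, `await resetCircuitBreaker('https://gatecontrol.example.com', 42, token)` would reset the breaker of route 42, assuming that host and token exist in your setup.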

What does it do?

Rate Limiting counts each client IP's requests and blocks further requests once the limit is reached. The client then receives HTTP 429 (Too Many Requests) instead of a normal response.

Without Rate Limiting:

Bot sends 10,000 requests/minute  →  Backend processes all  →  Server overloaded

With Rate Limiting (100 requests/minute):

Bot sends 100 requests     →  Backend processes all  ✓
Bot sends request #101     →  Caddy: 429 Too Many Requests  ✕
Bot sends request #102     →  Caddy: 429 Too Many Requests  ✕
... after 1 minute ...
Bot sends request #1       →  Backend processes  ✓  (new time window)

How it works internally

GateControl uses the caddy-ratelimit plugin for route traffic. The rate-limit handler is inserted into Caddy's handler chain before the reverse proxy (src/services/caddyConfig.js, rate_limit_enabled block).

Not to be confused with the admin API limiters (src/middleware/rateLimit.js): these protect the GateControl Admin UI (/login, /api/v1/*) and are configured separately in Express. The rate limiting described here concerns exclusively the client traffic of a configured route.

Caddy JSON configuration:

{
  "handler": "rate_limit",
  "rate_limits": {
    "static": {
      "key": "{http.request.remote.host}",
      "window": "1m",
      "max_events": 100
    }
  }
}

Key: {http.request.remote.host} — each client IP gets its own quota.

Configurable values:

Parameter	Range	Default	Description
Requests	1 – 100,000	100	Maximum requests per time window
Window	1s, 1m, 5m, 1h	1m	Duration of the time window
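
The sketch below shows how the two UI values could be turned into the handler JSON. The function name `buildRateLimitHandler` is hypothetical. Falling back to 1m for unknown windows follows the documented behavior; clamping the request count to the documented range and defaulting to 100 for non-numeric input are assumptions.

```javascript
const ALLOWED_WINDOWS = ['1s', '1m', '5m', '1h'];

// Builds the rate_limit handler for one route.
function buildRateLimitHandler(requests, window) {
  // Unknown window values are normalized to 1m.
  const normalizedWindow = ALLOWED_WINDOWS.includes(window) ? window : '1m';

  // Assumption: non-numeric input falls back to the default of 100,
  // numeric input is clamped to the range 1 to 100,000.
  const parsed = Number.parseInt(requests, 10);
  const maxEvents = Number.isNaN(parsed) ? 100 : Math.min(100000, Math.max(1, parsed));

  return {
    handler: 'rate_limit',
    rate_limits: {
      static: {
        key: '{http.request.remote.host}', // one quota per client IP
        window: normalizedWindow,
        max_events: maxEvents,
      },
    },
  };
}
```

For example, `buildRateLimitHandler(10, '1m')` yields the configuration for a login page limited to 10 requests per minute and IP.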

Handler order in Caddy:

1. ACL / Forward Auth (if active)
2. Custom Request Headers (if present)
3. Rate Limit ← here
4. Request Mirroring (if active)
5. Compression (if active)
6. Reverse Proxy

Use cases

Protecting login pages against brute force

Route app.example.com → Web app with login. Rate Limit: 10 requests / 1 minute. An attacker can only make 10 password attempts per minute — this significantly slows down brute-force attacks.

Protecting an API against abuse

Route api.example.com → REST API. Rate Limit: 1000 requests / 5 minutes. Normal usage remains unaffected, but a single client cannot overload the API.

Preventing scraping

Route shop.example.com → Webshop. Rate Limit: 60 requests / 1 minute. Bots scraping prices are throttled after 60 page views per minute.

Recommended values:

Use case	Requests	Window
Login page	10–20	1m
REST API	500–1000	5m
Webshop / website	60–120	1m
Static assets	1000–5000	1m
Webhook endpoint	50–100	1m

Combination with other features

Combination	Effect
Rate Limit + Route Auth	Rate limit after the auth check — protects the backend, not the login page
Rate Limit + Basic Auth	Rate limit before auth — also protects against brute force on Basic Auth
Rate Limit + ACL	Only VPN peers get through, then get rate-limited
Rate Limit + IP filter	IP filter blocks known IPs, rate limit throttles the rest
Rate Limit + Compression	No conflict — Rate Limit counts requests, Compression compresses responses

Important notes

Rate Limiting is per IP address, not global. 100 requests/minute means: each individual IP may make 100 requests.
Behind a NAT router all clients share the same IP — the limit then applies to all of them together.
Allowed window values: 1s, 1m, 5m, 1h. Other values are normalized to 1m.
HTTP 429 contains no Retry-After header — the client must wait on its own until the window expires.
Rate Limiting is only available for HTTP routes, not for L4 (TCP/UDP).
For routes with Forward Auth (Route Auth or IP filter), rate limiting is applied after the auth check.
For WebSocket connections, only the initial HTTP upgrade counts as a request.

What does it do?

If the backend returns an error or is unreachable, Caddy retries the request automatically instead of immediately sending an error to the client.

Without Retry:

Client  →  Caddy  →  Backend (just restarted)  →  502 Bad Gateway  →  Client sees error

With Retry (3 attempts):

Client  →  Caddy  →  Backend (attempt 1: 502)
                  →  Backend (attempt 2: 502)
                  →  Backend (attempt 3: 200 OK)  →  Client sees a normal response

With Retry + multiple backends:

Client  →  Caddy  →  Backend A (502)
                  →  Backend B (200 OK)  →  Client sees a normal response

How it works internally

GateControl configures Caddy's load_balancing.retries mechanism in the reverse-proxy handler (src/services/caddyConfig.js, retry_enabled block):

Caddy JSON configuration:

{
  "handler": "reverse_proxy",
  "upstreams": [
    { "dial": "10.8.0.3:8080" }
  ],
  "load_balancing": {
    "retries": 3
  }
}

Behavior:

Caddy retries the request up to retries times on connection errors
With one backend: all retries go to the same backend
With multiple backends: retries rotate to the next backend (round robin or weighted)
The retry logic is part of Caddy's load balancer — not a separate handler
Retry is triggered on connection errors and on the status codes listed in the UI field Retry Status Codes. Since v1.50.4 this list is forwarded to Caddy (reverse_proxy.load_balancing.retry_match) together with try_duration: 5s; without try_duration, Caddy would ignore retries. Invalid tokens (non-numeric or outside 100–599) are silently discarded.

Configurable values:

Parameter	Range	Default	Description
Retry Count	1 – 10	3	Number of retry attempts
Retry Status Codes	CSV	502,503,504	Which response codes trigger a retry
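
The validation of the Retry Status Codes field can be sketched as follows. Both function names are hypothetical. The exact shape GateControl writes into retry_match is not shown in this document, so the sketch only covers parsing plus the two documented load_balancing values.

```javascript
// Parses the CSV field and silently drops invalid tokens
// (non-numeric or outside 100-599).
function parseRetryStatusCodes(csv) {
  return String(csv)
    .split(',')
    .map((token) => token.trim())
    .filter((token) => /^\d+$/.test(token))
    .map(Number)
    .filter((code) => code >= 100 && code <= 599);
}

// Builds the documented part of the load_balancing block.
// retry_match is intentionally left out (its structure is not documented here).
function buildLoadBalancing(retryCount = 3) {
  return { retries: retryCount, try_duration: '5s' };
}
```

With the input `502, 503,abc,99,600,504` only 502, 503 and 504 survive; the other tokens are discarded without an error.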

Use cases

Catching a backend restart

Route app.example.com → Node.js app on port 3000. On deployment the app is briefly restarted (2-3 seconds of downtime). With 3 retries and a single backend, Caddy bridges this gap — at best the client notices a slightly longer load time.

Load balancing with failover

Route api.example.com → 3 API servers (Backend A, B, C). Server B fails. Caddy tries B, gets an error, and automatically forwards the request to C. The client notices nothing.

Temporary 503 errors under high load

Route service.example.com → Microservice that returns 503 under overload. With retries, the service has a moment to recover, and the next request goes through.

Combination with other features

Combination	Effect
Retry + Load Balancing	Retries rotate between backends — more effective than with a single backend
Retry + Circuit Breaker	Circuit Breaker prevents retries when the backend is permanently down
Retry + Monitoring	Monitoring detects if the backend is permanently down; Retry helps with short outages
Retry + Rate Limiting	Each retry attempt counts as one request to the backend, not against the client's rate limit

Important notes

POST/PUT/DELETE are retried as well. GateControl performs no automatic idempotency check — the admin must know whether the backend supports retryable write operations. Example: a retry on POST /api/orders could trigger a duplicate order. Only enable retry if the backend supports idempotent operations or only handles GET requests.
Retry is only available for HTTP routes, not for L4 (TCP/UDP).
Retries happen back-to-back — there is no exponential backoff.
With a single backend, retries can additionally load the server if it is already overloaded.
Retry Count of 1 means: 1 initial attempt + 1 retry = maximum 2 requests to the backend.
Retries are invisible to the client: it receives either the successful response or the last error.
In combination with Circuit Breaker: when the Circuit Breaker is open, no retries are attempted (Caddy immediately serves 503).
