Monitoring & Resilience
What does it do?
Uptime Monitoring checks at regular intervals whether the backends of your routes are reachable. Without monitoring, you only see an error when a user reports it.
Without monitoring:
Client → Caddy → Backend (crashed) → 502 Bad Gateway
↑ nobody knows
With monitoring:
Monitor probes every 60s → Backend does not respond → Status: DOWN (red)
→ Email alert
→ Webhook: route_down
→ Circuit Breaker reacts
How it works internally
GateControl starts a poller that, at the configured interval (default: 60 seconds), checks all routes with monitoring_enabled = 1.
HTTP routes (Layer 7):
- HTTP GET to `http(s)://<peer IP>:<target port>/`
- User-Agent: `GateControl-Monitor/1.0`
- Expectation: status code 200-399 = UP, anything else = DOWN
- With `backend_https`: HTTPS with `rejectUnauthorized: false` (accepts self-signed)
- Timeout: configured in `config.timeouts.monitorHttp`
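The HTTP probe rules above can be sketched as two small pure functions. This is a minimal illustration, not GateControl's actual code: `buildProbeUrl` and `classifyHttpStatus` are hypothetical names, and the peer IP and port are example values.

```javascript
// Hypothetical helpers illustrating the HTTP probe rules; not taken from GateControl's source.

// Build the probe URL from the peer IP, the target port and the backend_https flag.
function buildProbeUrl(peerIp, targetPort, backendHttps) {
  const scheme = backendHttps ? 'https' : 'http';
  return `${scheme}://${peerIp}:${targetPort}/`;
}

// Status codes 200-399 count as UP, everything else as DOWN.
function classifyHttpStatus(statusCode) {
  return statusCode >= 200 && statusCode <= 399 ? 'up' : 'down';
}

console.log(buildProbeUrl('10.8.0.3', 8080, false)); // http://10.8.0.3:8080/
console.log(classifyHttpStatus(302)); // up
console.log(classifyHttpStatus(502)); // down
```

Note that a redirect (3xx) still counts as UP under this rule, so a backend that redirects `/` to a login page is reported as reachable.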
L4 routes (TCP/UDP):
- TCP Connect to the backend port
- Successful connection establishment = UP, timeout/error = DOWN
- Timeout: configured in `config.timeouts.monitorTcp`
Parallelization: Max 10 concurrent checks per cycle.
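One way to cap a check cycle at 10 concurrent probes is a small worker pool. The sketch below is an assumption about how such a limit could be implemented, not a copy of GateControl's poller; `runLimited` is a hypothetical helper, and each task is expected to be a function returning a Promise.

```javascript
// Sketch of a concurrency-limited runner; runLimited is a hypothetical helper name.
// tasks is an array of functions that each return a Promise.
async function runLimited(tasks, limit = 10) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker pulls the next unclaimed task until none are left.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      try {
        results[index] = await tasks[index]();
      } catch (err) {
        // Assumption: a failed check is recorded as DOWN and does not abort the cycle.
        results[index] = { status: 'down', error: String(err) };
      }
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Results keep the order of the input tasks, so they can be matched back to their routes by index.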
Auto-WoL for gateway routes: If a gateway route with wol_enabled = 1 switches from UP to DOWN, the monitor calls handleRouteDownDetected, which sends a magic packet via the gateway (gateways.notifyWol, timeout 60s). This automatically wakes the LAN host when it has gone down — see concepts/home-gateway.md.
Fields stored per route:
| Field | Description |
|---|---|
| `monitoring_status` | `up`, `down` or `unknown` |
| `monitoring_last_check` | Timestamp of the last check (ISO 8601) |
| `monitoring_response_time` | Response time in milliseconds |
| `monitoring_last_change` | Timestamp of the last status change |
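To make the relationship between these fields concrete, the following sketch updates a route record after one check. The field names come from the list above; the function `applyCheckResult` and the shape of `result` are assumptions made for illustration.

```javascript
// Hypothetical sketch of updating the per-route monitoring fields after one check.
// Only the field names are taken from the documentation; the function is an assumption.
function applyCheckResult(route, result, now = new Date()) {
  const timestamp = now.toISOString(); // ISO 8601
  const statusChanged = route.monitoring_status !== result.status;
  return {
    ...route,
    monitoring_status: result.status, // 'up' or 'down'
    monitoring_last_check: timestamp,
    monitoring_response_time: result.responseTimeMs,
    // Only move the change timestamp when the status actually flips.
    monitoring_last_change: statusChanged ? timestamp : route.monitoring_last_change,
  };
}
```

The point of the sketch is that `monitoring_last_check` moves on every cycle, while `monitoring_last_change` only moves on an UP/DOWN transition.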
Use cases
Monitoring a Synology NAS
Route nas.example.com → Port 5001 (DSM). Monitoring detects when the NAS restarts after an update. You get an email when it goes down, and a second one when it's reachable again.
Multiple services on one server
Three routes point to the same peer, but different ports (3000, 8080, 5432). One service crashes — monitoring shows exactly which one. The others stay green.
Enabling the Circuit Breaker
Monitoring is a prerequisite for the Circuit Breaker. Only once monitoring detects an outage can the Circuit Breaker block the route and return 503 instead of sending requests into the void.
Combination with other features
| Combination | Effect |
|---|---|
| Monitoring + Circuit Breaker | Monitoring checks drive the Circuit Breaker's state machine |
| Monitoring + Webhooks | Events route_down / route_up to external systems (Slack, Discord, etc.) |
| Monitoring + Email alerts | Immediate notification on status change |
| Monitoring + L4 routes | TCP check instead of HTTP check, detects port reachability |
Important notes
- The first check runs 10 seconds after GateControl starts — so all services have time to come up.
- Monitoring checks the direct connection to the backend (peer IP + port), not the public domain access via Caddy.
- For `backend_https` routes HTTPS is used, but the certificate is not validated, so self-signed certificates work.
- Email alerts require a working SMTP configuration under Settings → Email.
- Webhook events are called `route_down` and `route_up` (not `route_monitor_down` / `route_monitor_up`).
- The monitoring interval applies globally to all routes; individual intervals per route are not possible.
- If monitoring is disabled, the last status remains (it is not reset to `unknown`).
What does it do?
The Circuit Breaker detects when a backend is repeatedly unreachable and switches the route into a blocked state. Instead of sending requests into the void (and keeping clients waiting), Caddy responds immediately with 503.
Without Circuit Breaker:
Client 1 → Caddy → Backend (dead) → 30s timeout → 502
Client 2 → Caddy → Backend (dead) → 30s timeout → 502
Client 3 → Caddy → Backend (dead) → 30s timeout → 502
... 100 clients wait 30 seconds simultaneously ...
With Circuit Breaker:
Monitoring: Backend dead (5x in a row) → Circuit Breaker: OPEN
Client 1 → Caddy → 503 "Service temporarily unavailable" (immediately, <1ms)
Client 2 → Caddy → 503 (immediately)
... after 30s timeout ...
Monitoring: Backend back → Circuit Breaker: CLOSED
Client 3 → Caddy → Backend → 200 OK ✓
How it works internally
The Circuit Breaker implements a state machine with three states:
Threshold failures reached
CLOSED ──────────────────────────────→ OPEN
↑ │
│ Check successful │ Timeout elapsed
│ ↓
└──────────────────────────────── HALF-OPEN
Check failed ──→ OPEN
States:
| Status | Caddy behavior | Badge color |
|---|---|---|
| Closed | Normal operation, requests are forwarded | Green |
| Open | Caddy immediately returns 503 with `Retry-After` header | Red |
| Half-Open | Monitoring check is allowed through; on success → Closed, on failure → Open | Amber |
Configurable values:
| Parameter | Default | Description |
|---|---|---|
| Threshold | 5 | Consecutive failures before the circuit opens |
| Timeout | 30s | Seconds in open state before a half-open test takes place |
Detailed flow:
- Monitoring probes the backend periodically
- On failure: failure counter is incremented (`cb_failure_count` in the `routes` table, persisted across restarts)
- On success: counter is reset to 0
- Counter reaches threshold → status switches to `open`, `cb_opened_at` is set
- Caddy config is rebuilt: the route serves a static 503 response
- After timeout seconds → status switches to `half-open`
- Next monitoring check decides:
  - Success → `closed`, Caddy config is restored
  - Failure → `open`, timer restarts
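The flow above can be sketched as a pure transition function. This is an illustrative model under stated assumptions, not GateControl's implementation: `nextBreakerState` is a hypothetical name, only `cb_failure_count` and `cb_opened_at` are taken from the text, and the behavior for a check that arrives while the timeout is still running (state unchanged) is an assumption.

```javascript
// Sketch of the circuit breaker transitions driven by monitoring checks.
// cb_failure_count and cb_opened_at are names from the docs; the rest is assumed.
function nextBreakerState(breaker, checkOk, nowMs, opts = {}) {
  const threshold = opts.threshold ?? 5;
  const timeoutMs = (opts.timeoutSec ?? 30) * 1000;
  let { status, cb_failure_count, cb_opened_at } = breaker;

  if (status === 'open') {
    // Assumption: until the timeout has elapsed, checks do not change the state.
    if (nowMs - cb_opened_at < timeoutMs) return breaker;
    status = 'half-open';
  }

  if (checkOk) {
    // Success in closed or half-open: counter reset, circuit closed.
    return { status: 'closed', cb_failure_count: 0, cb_opened_at: null };
  }

  if (status === 'half-open') {
    // Failed half-open test: back to open, timer restarts.
    return { status: 'open', cb_failure_count, cb_opened_at: nowMs };
  }

  cb_failure_count += 1;
  if (cb_failure_count >= threshold) {
    return { status: 'open', cb_failure_count, cb_opened_at: nowMs };
  }
  return { status: 'closed', cb_failure_count, cb_opened_at: null };
}
```

Because the function takes the current time as an argument and returns a new state object, it can be exercised without a running monitor or database.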
Caddy configuration in the open state (from src/services/caddyConfig.js):
{
"handle": [{
"handler": "static_response",
"status_code": "503",
"body": "Service temporarily unavailable",
"headers": { "Retry-After": ["30"] }
}]
}
The Retry-After value matches the configured timeout (default 30).
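As a sketch of how the `Retry-After` value could be tied to the configured timeout, the function below generates the open-state route fragment shown above. `buildOpenCircuitRoute` is a hypothetical name; the JSON shape mirrors the example in this section.

```javascript
// Hypothetical builder for the open-state route fragment shown above.
// The Retry-After header is derived from the configured timeout (seconds).
function buildOpenCircuitRoute(timeoutSec = 30) {
  return {
    handle: [{
      handler: 'static_response',
      status_code: '503',
      body: 'Service temporarily unavailable',
      headers: { 'Retry-After': [String(timeoutSec)] },
    }],
  };
}

console.log(JSON.stringify(buildOpenCircuitRoute(60), null, 2));
```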
Use cases
Preventing request pile-ups when the backend is dead
Without Circuit Breaker, all incoming requests wait for Caddy's timeout (30s). With 100 concurrent clients that means 100 blocked connections. With Circuit Breaker, all of them are answered with 503 immediately.
Preventing thundering herd on recovery
The backend was down for 5 minutes, and 1000 clients are waiting to retry. Without Circuit Breaker, all 1000 requests hit the just-started backend simultaneously. In the half-open state the Circuit Breaker only lets a single monitoring check through, and client traffic is forwarded again only after that check succeeds.
Fast feedback for better UX
Instead of waiting 30 seconds for a timeout, the user immediately sees a "Service temporarily unavailable" page. The response includes a Retry-After header, which well-behaved clients can use to decide when to try again.
Combination with other features
| Combination | Effect |
|---|---|
| Circuit Breaker + Monitoring | Mandatory: monitoring checks drive the state machine |
| Circuit Breaker + Retry | Retry tries with a closed circuit; with an open circuit: immediately 503 |
| Circuit Breaker + Load Balancing | Circuit Breaker kicks in when all backends are down |
| Circuit Breaker + Webhooks | Events circuit_breaker_open / circuit_breaker_closed |
Important notes
- Monitoring is mandatory. Without Uptime Monitoring enabled, the Circuit Breaker has no data source and always stays in the closed state.
- Failure counter and open timestamp are persisted in the database (`cb_failure_count`, `cb_opened_at`). Open circuits survive restarts; if the timestamp is missing after a restart, it is re-set on the first check run.
- The Circuit Breaker operates per route, not per backend. For load balancing with multiple backends, the circuit opens when the monitoring target is unreachable.
- In the open state no requests are forwarded to the backend: Caddy responds with `503 Service Unavailable` + `Retry-After` header. No bypass per API or individual request.
- Manual reset (since v1.50.4): `POST /api/v1/routes/:id/circuit-breaker/reset` or the Reset circuit breaker button in the route edit modal (only visible when status ≠ closed). Sets `cb_failure_count = 0`, `cb_opened_at = NULL`, status to `closed`, and re-renders the Caddy config immediately. Without this reset, an open breaker waits for the next monitoring cycle and runs through the normal `open → half-open → closed` path.
- Circuit Breaker is only available for HTTP routes, not for L4 (TCP/UDP).
What does it do?
Rate Limiting counts each client IP's requests and blocks further requests once the limit is reached. The client then receives HTTP 429 (Too Many Requests) instead of a normal response.
Without Rate Limiting:
Bot sends 10,000 requests/minute → Backend processes all → Server overloaded
With Rate Limiting (100 requests/minute):
Bot sends 100 requests → Backend processes all ✓
Bot sends request #101 → Caddy: 429 Too Many Requests ✕
Bot sends request #102 → Caddy: 429 Too Many Requests ✕
... after 1 minute ...
Bot sends request #1 → Backend processes ✓ (new time window)
How it works internally
GateControl uses the caddy-ratelimit plugin for route traffic. The rate-limit handler is inserted into Caddy's handler chain before the reverse proxy (src/services/caddyConfig.js, rate_limit_enabled block).
Not to be confused with the admin API limiters (src/middleware/rateLimit.js): these protect the GateControl Admin UI (/login, /api/v1/*) and are configured separately in Express. The rate limiting described here concerns exclusively the client traffic of a configured route.
Caddy JSON configuration:
{
"handler": "rate_limit",
"rate_limits": {
"static": {
"key": "{http.request.remote.host}",
"window": "1m",
"max_events": 100
}
}
}
Key: {http.request.remote.host} — each client IP gets its own quota.
Configurable values:
| Parameter | Range | Default | Description |
|---|---|---|---|
| Requests | 1 – 100,000 | 100 | Maximum requests per time window |
| Window | 1s, 1m, 5m, 1h | 1m | Duration of the time window |
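The mapping from these two settings to the Caddy handler can be sketched as follows. The function names are hypothetical and the code is not taken from `src/services/caddyConfig.js`; it only assumes the JSON shape shown above and the rule that windows outside the allowed set fall back to `1m`.

```javascript
// Hypothetical sketch: turn the two UI settings into the rate_limit handler JSON.
const ALLOWED_WINDOWS = ['1s', '1m', '5m', '1h'];

// Windows outside the allowed set are normalized to '1m'.
function normalizeWindow(window) {
  return ALLOWED_WINDOWS.includes(window) ? window : '1m';
}

function buildRateLimitHandler(requests = 100, window = '1m') {
  return {
    handler: 'rate_limit',
    rate_limits: {
      static: {
        key: '{http.request.remote.host}', // one quota per client IP
        window: normalizeWindow(window),
        max_events: requests,
      },
    },
  };
}

console.log(JSON.stringify(buildRateLimitHandler(10, '1m')));
```

The sketch does not validate the 1 to 100,000 request range; how out-of-range values are handled is not specified here.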
Handler order in Caddy:
- ACL / Forward Auth (if active)
- Custom Request Headers (if present)
- Rate Limit ← here
- Request Mirroring (if active)
- Compression (if active)
- Reverse Proxy
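The ordering above can be expressed as a function that assembles the chain conditionally. This is a simplified model: the route flag names and the string labels are placeholders chosen for this example, not real Caddy handler names or GateControl fields (apart from `rate_limit_enabled`, which the text mentions).

```javascript
// Simplified model of the handler order; labels and most flag names are placeholders.
function buildHandlerChain(route) {
  const chain = [];
  if (route.auth_enabled) chain.push('acl_or_forward_auth');
  if (route.custom_headers) chain.push('custom_request_headers');
  if (route.rate_limit_enabled) chain.push('rate_limit');
  if (route.mirror_enabled) chain.push('request_mirroring');
  if (route.compress_enabled) chain.push('compression');
  chain.push('reverse_proxy'); // always last
  return chain;
}

console.log(buildHandlerChain({ auth_enabled: true, rate_limit_enabled: true }));
// [ 'acl_or_forward_auth', 'rate_limit', 'reverse_proxy' ]
```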
Use cases
Protecting login pages against brute force
Route app.example.com → Web app with login. Rate Limit: 10 requests / 1 minute. An attacker can only make 10 password attempts per minute — this significantly slows down brute-force attacks.
Protecting an API against abuse
Route api.example.com → REST API. Rate Limit: 1000 requests / 5 minutes. Normal usage remains unaffected, but a single client cannot overload the API.
Preventing scraping
Route shop.example.com → Webshop. Rate Limit: 60 requests / 1 minute. Bots scraping prices are throttled after 60 page views per minute.
Recommended values:
| Use case | Requests | Window |
|---|---|---|
| Login page | 10–20 | 1m |
| REST API | 500–1000 | 5m |
| Webshop / website | 60–120 | 1m |
| Static assets | 1000–5000 | 1m |
| Webhook endpoint | 50–100 | 1m |
Combination with other features
| Combination | Effect |
|---|---|
| Rate Limit + Route Auth | Rate limit after the auth check — protects the backend, not the login page |
| Rate Limit + Basic Auth | Rate limit before auth — also protects against brute force on Basic Auth |
| Rate Limit + ACL | Only VPN peers get through, then get rate-limited |
| Rate Limit + IP filter | IP filter blocks known IPs, rate limit throttles the rest |
| Rate Limit + Compression | No conflict — Rate Limit counts requests, Compression compresses responses |
Important notes
- Rate Limiting is per IP address, not global. 100 requests/minute means: each individual IP may make 100 requests.
- Behind a NAT router all clients share the same IP — the limit then applies to all of them together.
- Allowed window values: `1s`, `1m`, `5m`, `1h`. Other values are normalized to `1m`.
- HTTP 429 contains no `Retry-After` header; the client must wait on its own until the window expires.
- Rate Limiting is only available for HTTP routes, not for L4 (TCP/UDP).
- For routes with Forward Auth (Route Auth or IP filter), rate limiting is applied after the auth check.
- WebSocket connections only count the initial HTTP upgrade as one request.
What does it do?
If the backend returns an error or is unreachable, Caddy retries the request automatically instead of immediately sending an error to the client.
Without Retry:
Client → Caddy → Backend (just restarted) → 502 Bad Gateway → Client sees error
With Retry (3 attempts):
Client → Caddy → Backend (attempt 1: 502)
→ Backend (attempt 2: 502)
→ Backend (attempt 3: 200 OK) → Client sees a normal response
With Retry + multiple backends:
Client → Caddy → Backend A (502)
→ Backend B (200 OK) → Client sees a normal response
How it works internally
GateControl configures Caddy's load_balancing.retries mechanism in the reverse-proxy handler (src/services/caddyConfig.js, retry_enabled block):
Caddy JSON configuration:
{
"handler": "reverse_proxy",
"upstreams": [
{ "dial": "10.8.0.3:8080" }
],
"load_balancing": {
"retries": 3
}
}
Behavior:
- Caddy retries the request up to `retries` times on connection errors
- With one backend: all retries go to the same backend
- With multiple backends: retries rotate to the next backend (round robin or weighted)
- The retry logic is part of Caddy's load balancer, not a separate handler
- Retry is triggered on connect errors and on the status codes from the UI field Retry Status Codes. Since v1.50.4 this list is actually forwarded to Caddy (`reverse_proxy.load_balancing.retry_match`), along with `try_duration: 5s` (without it, Caddy would ignore `retries`). Invalid tokens (non-numeric, outside 100–599) are silently discarded.
Configurable values:
| Parameter | Range | Default | Description |
|---|---|---|---|
| Retry Count | 1 – 10 | 3 | Number of retry attempts |
| Retry Status Codes | CSV | 502,503,504 | Which response codes trigger a retry |
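The filtering rule for the Retry Status Codes field (non-numeric tokens and values outside 100–599 are dropped) can be sketched as a small parser. `parseRetryStatusCodes` is a hypothetical name; how the resulting list is embedded in `retry_match` is deliberately left out.

```javascript
// Hypothetical parser for the Retry Status Codes CSV field.
// Non-numeric tokens and codes outside 100-599 are silently discarded.
function parseRetryStatusCodes(csv) {
  return csv
    .split(',')
    .map((token) => token.trim())
    .filter((token) => /^\d+$/.test(token))
    .map(Number)
    .filter((code) => code >= 100 && code <= 599);
}

console.log(parseRetryStatusCodes('502,503,504')); // [ 502, 503, 504 ]
console.log(parseRetryStatusCodes('502, abc, 99, 600, 503')); // [ 502, 503 ]
```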
Use cases
Catching a backend restart
Route app.example.com → Node.js app on port 3000. On deployment the app is briefly restarted (2-3 seconds of downtime). With 3 retries and a single backend, Caddy bridges this gap — at best the client notices a slightly longer load time.
Load balancing with failover
Route api.example.com → 3 API servers (Backend A, B, C). Server B fails. Caddy tries B, gets an error, and automatically forwards the request to C. The client notices nothing.
Temporary 503 errors under high load
Route service.example.com → Microservice that returns 503 under overload. With retries, the service has a moment to recover, and the next request goes through.
Combination with other features
| Combination | Effect |
|---|---|
| Retry + Load Balancing | Retries rotate between backends — more effective than with a single backend |
| Retry + Circuit Breaker | Circuit Breaker prevents retries when the backend is permanently down |
| Retry + Monitoring | Monitoring detects if the backend is permanently down; Retry helps with short outages |
| Retry + Rate Limiting | Each retry attempt counts as one request to the backend, not against the client's rate limit |
Important notes
- POST/PUT/DELETE are retried as well. GateControl performs no automatic idempotency check; the admin must know whether the backend supports retryable write operations. Example: a retry on `POST /api/orders` could trigger a duplicate order. Only enable retry if the backend supports idempotent operations or only handles GET requests.
- Retry is only available for HTTP routes, not for L4 (TCP/UDP).
- Retries happen back-to-back — there is no exponential backoff.
- With a single backend, retries can additionally load the server if it is already overloaded.
- Retry Count of 1 means: 1 initial attempt + 1 retry = maximum 2 requests to the backend.
- Retries are invisible to the client — they either receive the successful response or the last error.
- In combination with Circuit Breaker: when the Circuit Breaker is open, no retries are attempted (Caddy immediately serves 503).