Troubleshooting
Audience: ops staff in field deployments. Entries follow the pattern symptom → causes → check → fix. Every command can be pasted directly into a terminal.
For system-wide references, see deployment.md, upgrade.md, and backup-and-restore.md.
Table of contents
- Container / Host
- DNS & TLS
- WireGuard
- Home-Gateway
- RDP
- Database
- Logging & Debugging
- Known Ops gotchas
- When to ask for help
Container / Host
GateControl container doesn't start / crashes after start
Causes:
- .env missing or contains invalid mandatory values (GC_ADMIN_PASSWORD, GC_WG_HOST).
- Port 53 on 127.0.0.1 is occupied (another DNS resolver on the host).
- Ports 80/443/51820 are occupied by another service.
- Docker volume corrupted (rare SQLite lock leaks).
- New version has a migration problem (see upgrade.md §10).
Check:
docker logs --tail 100 gatecontrol
Look for:
- ERROR: GC_ADMIN_PASSWORD is not set or still default → check .env.
- ERROR: GC_WG_HOST is not set or still the example value → check .env.
- ERROR: 127.0.0.1:53 is already bound — dnsmasq cannot start → find the port-53 holder:
  ss -lntup | grep ':53 '
  Typical candidates: named/bind9, unbound, dnsmasq (from NetworkManager or libvirt), pihole directly on the host.
- Error: SQLITE_CORRUPT → restore from backup (see backup-and-restore.md).
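The mapping from log line to next step can be scripted for quick triage. The following is a minimal sketch that only matches the messages listed above; triage_startup_log is a hypothetical helper name and not part of GateControl.

```shell
# triage_startup_log: read container log text on stdin and print one hint
# per known startup error. Hypothetical helper, not shipped with GateControl.
triage_startup_log() {
  log=$(cat)
  case $log in *"GC_ADMIN_PASSWORD is not set"*|*"GC_WG_HOST is not set"*)
    echo "env: check .env" ;; esac
  case $log in *"127.0.0.1:53 is already bound"*)
    echo "port53: run ss -lntup | grep ':53 '" ;; esac
  case $log in *"SQLITE_CORRUPT"*)
    echo "db: restore from backup" ;; esac
}

# Usage on the Docker host:
#   docker logs --tail 100 gatecontrol 2>&1 | triage_startup_log
```

No output means none of the known messages appeared, so the full log needs a manual read.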
Fix:
- After env change: docker compose up -d --force-recreate gatecontrol.
- Port conflict: stop the conflicting process or map to other ports (not easily possible with GateControl; see network_mode: host).
- If the container loops and you don't see anything concrete:
docker inspect gatecontrol --format '{{ .State.Status }} — exit {{ .State.ExitCode }} — {{ .State.Error }}'
Health check stays on starting
Causes:
- Caddy is still waiting for certificates (first start, Let's Encrypt HTTP-01 is running).
- Database migrations take time (large existing DB, index rebuild).
- Node hangs at init step (e.g. DNS hosts file build).
Check:
docker logs --tail 50 gatecontrol
docker exec gatecontrol curl -s http://127.0.0.1:3000/health
- checks.db: false → migrations have not finished; look further in the log.
- checks.wireguard: false → wg0 is not up yet; wg show wg0 should confirm that.
Fix:
First wait 60–90 seconds. If the state still does not change after that, it is probably an error state; see "GateControl container doesn't start" above.
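The waiting period can be automated with a small retry loop. This is a sketch under one assumption: that /health answers with a non-2xx status while a check is still failing, so that curl -f exits non-zero. wait_until is a hypothetical helper name.

```shell
# wait_until TIMEOUT_S INTERVAL_S CMD...: retry CMD until it succeeds
# or TIMEOUT_S seconds have been spent sleeping between attempts.
wait_until() {
  timeout_s=$1; interval_s=$2; shift 2
  waited=0
  while ! "$@" >/dev/null 2>&1; do
    [ "$waited" -ge "$timeout_s" ] && return 1
    sleep "$interval_s"
    waited=$(( $waited + $interval_s ))
  done
  return 0
}

# Usage: wait up to 90 s, probing every 5 s, then fall back to the log.
#   wait_until 90 5 docker exec gatecontrol curl -fs http://127.0.0.1:3000/health \
#     || docker logs --tail 50 gatecontrol
```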
High CPU/RAM load
Causes:
- Uptime monitoring runs on many targets with a short interval.
- Request tracing (debug_enabled) is on for a route with a lot of traffic.
- Traffic snapshot interval too short (GC_TRAFFIC_INTERVAL).
- Too many simultaneous Caddy reloads (every peer/route save triggers a /load).
Check:
docker stats gatecontrol --no-stream
docker exec gatecontrol top -b -n 1 | head -20
Process names in the output: node (admin server), caddy, dnsmasq, wg-quick, supervisord. Any process that stays above 60 % CPU is the outlier.
- node high → check in the UI which routes have debug_enabled=1 and switch it off per route if necessary.
- caddy high → many active connections or uptime checks; look at /api/v1/caddy/status.
Fix:
- Increase the monitoring intervals (the default of 60 s is conservative, below 15 s is problematic).
- Switch off debug_enabled when done debugging.
- Set GC_TRAFFIC_INTERVAL to 120 s.
DNS & TLS
Certificate is not issued
Causes (the most common):
- DNS record does not point to the server IP or is not yet propagated.
- Cloudflare proxy mode (orange cloud) is on and the HTTP-01 challenge is intercepted.
- Port 80 is blocked by the host (ufw) or occupied by another service.
- Let's Encrypt rate limit hit (50 certs/week per domain).
- GC_CADDY_EMAIL is empty: Caddy then uses the ZeroSSL fallback, and if the ZeroSSL account cannot be registered, no certificate is issued.
Check:
# DNS
dig +short A gate.example.com
# Port 80 from outside
curl -I http://gate.example.com/.well-known/acme-challenge/test
# Caddy internal state
docker exec gatecontrol curl -s http://127.0.0.1:2019/config/ | jq '.apps.tls'
# Caddy log (ACME events)
docker logs gatecontrol 2>&1 | grep -i acme | tail -20
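The DNS check can be turned into a yes/no answer by comparing the resolved addresses with the server's public IP. A minimal sketch: dns_matches is a hypothetical helper, and 203.0.113.10 is a placeholder address that has to be replaced with the real server IP.

```shell
# dns_matches EXPECTED_IP RESOLVED_IP...: succeed if EXPECTED_IP is among
# the resolved addresses. Hypothetical helper for illustration.
dns_matches() {
  expected=$1; shift
  for ip in "$@"; do
    [ "$ip" = "$expected" ] && return 0
  done
  return 1
}

# Usage (203.0.113.10 stands in for the server's public IP):
#   dns_matches 203.0.113.10 $(dig +short A gate.example.com) \
#     || echo "A record does not point at this server"
```

With the Cloudflare proxy enabled, the A record resolves to Cloudflare addresses, so this check fails in that case as well.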
Fix:
- Cloudflare proxy: orange → grey, then reload Caddy: docker exec gatecontrol caddy reload --config /app/config/Caddyfile
- For staging tests (no rate limit risk): in .env set GC_CADDY_ACME_CA=https://acme-staging-v02.api.letsencrypt.org/directory.
- Rate limit hit: wait 7 days or set up a DNS challenge (out of scope here).
Further reading: see Caddy server log + /data/caddy/ for local
ACME state files.
Admin UI not reachable although container healthy
Causes:
- Caddy is running, but the routing for the admin host is broken (external custom route placed on the admin domain).
- Firewall blocks port 443 at host level.
- DNS record still points to the old server.
Check:
# From inside
docker exec gatecontrol curl -sk https://localhost/ -I
# Should return 200/301 from the Node admin.
# From outside
curl -I https://gate.example.com/
# Caddy routes
docker exec gatecontrol curl -s http://127.0.0.1:2019/config/ \
| jq '.apps.http.servers'
Fix:
- If someone has accidentally created the admin domain as a custom route: delete the route in the UI or directly in the DB and reload Caddy.
- For ufw: see deployment.md §10.
Routes return 502
Causes:
- Backend (target IP:port) is not reachable.
- Backend is only reachable via VPN and the referenced peer has no handshake.
- Backend requires HTTPS, but route is configured on HTTP (or vice versa).
- A failed health check forces the route to "down".
Check:
# Test from inside towards the backend
docker exec gatecontrol curl -v http://<target-ip>:<port>/
# If the backend is reached via a peer → check the peer handshake
docker exec gatecontrol wg show wg0 latest-handshakes
# Caddy log for 502
docker logs gatecontrol 2>&1 | grep -E 'dial|502|upstream' | tail -20
Fix:
- Bring the peer online first (see "Handshake fails" below).
- If the backend uses HTTPS: activate Backend HTTPS in the route config and set backend_tls_insecure if the certificate is self-signed.
- Check health check failure via UI (route detail → uptime status).
WireGuard
Peer doesn't get an IP / config is faulty
Causes:
- Peer was created in UI, but the server hasn't reloaded the config yet.
- Allowed-IPs conflict (two peers on the same IP).
- GC_WG_SUBNET misconfigured: VPN subnet and peer IP block don't match.
Check:
docker exec gatecontrol wg show wg0
docker exec gatecontrol cat /etc/wireguard/wg0.conf | grep -A 2 AllowedIPs
Fix:
- In the UI save the peer again → triggers a wg syncconf.
- For IP conflicts: re-assign peer IPs in the UI.
Handshake fails
Symptoms: peer says "connected", but wg show wg0 shows for that peer
latest handshake: 0 or "(none)", transfer: 0 B received.
Causes:
- Port 51820/UDP blocked on the server firewall.
- Symmetric NAT at the client (mobile carrier).
- MTU problem — handshake works, but data packets are fragmented and dropped.
- Client has wrong public key or endpoint.
Check:
# Port test from the client (UDP is connectionless: nc only fails if an ICMP "port unreachable" comes back)
nc -uvz gate.example.com 51820
# Server firewall
iptables -L INPUT -n -v | grep 51820
ufw status | grep 51820
# Packet capture on the server
docker exec gatecontrol tcpdump -ni any udp port 51820 -c 20
# MTU on wg0
docker exec gatecontrol ip link show wg0
Fix:
- Open UDP 51820 in ufw/iptables: ufw allow 51820/udp.
- Lower MTU: GC_WG_MTU=1380 in .env, recreate container, redistribute client config.
- Client-side: compare endpoint + server public key against the UI.
Peer is online but no traffic
Symptoms: wg show shows handshake, transfer counts up. But client
says "no internet/no server reachability".
Causes:
- MASQUERADE rule missing or points to wrong interface.
- net.ipv4.ip_forward = 0 (very rare: entrypoint.sh enables it, but restricted host namespaces can override this).
- Split tunnel config on the client excludes 0.0.0.0/0.
Check:
docker exec gatecontrol iptables -t nat -L POSTROUTING -n -v | grep 10.8.0
docker exec gatecontrol sysctl net.ipv4.ip_forward
# Expected: net.ipv4.ip_forward = 1
docker exec gatecontrol ip route get 1.1.1.1
Is the interface in the MASQUERADE rule (-o eth0 or -o ens18) the
actual egress interface? ip route shows the default interface.
If that doesn't match, NAT doesn't take effect.
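The comparison between the MASQUERADE interface and the actual egress interface can be sketched with two small parsers. route_dev and masq_devs are hypothetical helper names; the sketch reads the rules via iptables -S because that output carries the interface as an explicit -o argument.

```shell
# route_dev: print the interface that follows "dev" in `ip route get` output (stdin).
route_dev() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# masq_devs: print the -o interface of every MASQUERADE rule in
# `iptables -t nat -S POSTROUTING` output (stdin).
masq_devs() {
  awk '/MASQUERADE/ { for (i = 1; i < NF; i++) if ($i == "-o") print $(i + 1) }'
}

# Usage:
#   egress=$(docker exec gatecontrol ip route get 1.1.1.1 | route_dev)
#   docker exec gatecontrol iptables -t nat -S POSTROUTING | masq_devs | grep -qx "$egress" \
#     || echo "MASQUERADE does not use egress interface $egress"
```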
Fix:
- Set GC_NET_INTERFACE explicitly (default is auto-detect). The value must match ip route show default.
- entrypoint.sh rewrites existing wg0.conf PostUp rules at startup if the interface no longer exists (see line 149 ff.).
- For Synology/OVH: often the interface is ens18 or eth0.VLAN, not plain eth0.
Multiple VPN conflicts
Symptom: the client connects, but has another VPN/DNS connection active in parallel (corporate VPN, Tailscale, …). Consequence: DNS resolution is inconsistent and the API hostname resolves to the wrong IP.
Cause: Android/iOS client caches the old DNS entry across the tunnel change.
Fix:
- Client side: end the VPN before the GateControl connect and flush the DNS cache.
- The GateControl Android client codebase solves this programmatically with preResolveDns() before tunnel start and vpnSafeDns in OkHttp. That is client side, not server side, but good to know when debugging.
Home-Gateway
A home gateway is a separate Docker container that runs on a NAS/RPi/mini PC in the home network and builds an outbound tunnel to the GateControl server. Concept: concepts/home-gateway.md.
Gateway card stays "offline" despite running container
Cause (fixed since v1.50.1, but relevant for older instances):
Previously the UI took its own self-check flags as truth; since v1.50.1
route_reachability (empirically from measured traffic) is the primary value.
If it still shows "offline" today:
- Heartbeat is not arriving.
- route_reachability is stale.
Check:
# Gateway peers + last heartbeat
docker exec gatecontrol sqlite3 /data/gatecontrol.db <<'SQL'
SELECT p.name, p.peer_type, gm.last_seen_at, gm.last_config_hash, gm.last_health
FROM peers p
LEFT JOIN gateway_meta gm ON gm.peer_id = p.id
WHERE p.peer_type = 'gateway';
SQL
# WireGuard handshake
docker exec gatecontrol wg show wg0 latest-handshakes
Fix:
- If the heartbeat is NULL or old: check the companion logs on the gateway host (docker logs gateway). Typical causes: token revoked or GC_BASE_URL wrong in the gateway .env.
- If the handshake is recent but the heartbeat is old: the gateway container is restarting cyclically; check the logs on the gateway host.
Config hash out of sync
Symptom: UI shows "gateway configuration not in sync" or the heartbeat response contains hash-mismatch warnings.
Cause: a Caddy /load race when L4 routes change in quick succession on the server. The server computes the hash before the push, Caddy commits asynchronously, and in between the gateway saw the old hash.
Self-heal: the server reconciles automatically after 60 s (on every heartbeat cycle that sees a mismatch). If not resolved after 2 minutes:
Check:
# Server view
docker exec gatecontrol curl -s http://127.0.0.1:3000/api/v1/gateways \
-H "Cookie: <admin-session>" | jq '.[].config_hash_status'
Fix:
- On the server: save a small edit on any L4 route → forces a full config push.
- Last resort: restart the gateway container (docker compose restart gateway on the NAS).
Heartbeat doesn't arrive
Causes:
- WG tunnel to the gateway is down (handshake > 3 min old).
- Gateway token revoked or expired.
- Firewall at the server blocks heartbeat HTTPS (rare — goes through Caddy).
- Gateway .env has the wrong GC_BASE_URL.
Check:
# On the NAS
docker logs gateway --tail 50
docker exec gateway wg show
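The "handshake > 3 min old" criterion can be checked mechanically. wg show <interface> latest-handshakes prints one public key and one Unix timestamp per line, with 0 for peers that never completed a handshake. The sketch below filters that output; stale_peers is a hypothetical helper name.

```shell
# stale_peers NOW MAX_AGE_S: read `wg show wg0 latest-handshakes` output on stdin
# and print the peers whose last handshake is missing or older than MAX_AGE_S.
stale_peers() {
  awk -v now="$1" -v max="$2" '
    $2 == 0        { print $1, "never"; next }
    now - $2 > max { print $1, (now - $2) "s" }
  '
}

# Usage on the server, with the 3-minute threshold from the list above:
#   docker exec gatecontrol wg show wg0 latest-handshakes | stale_peers "$(date +%s)" 180
```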
Fix:
- Open peer in UI → re-issue gateway pairing → deploy new .env, recreate container.
- Also see backup-and-restore.md §5.
Gateway traffic has high latency / RDP drops
Cause: an MTU issue. Path MTU discovery often fails through the WG tunnel because ICMP packets are filtered along the way.
Fix: MSS clamping has been enabled by default in the entrypoint since v1.41 (TCPMSS --clamp-mss-to-pmtu on the FORWARD chain). If that's not enough:
# In .env
GC_WG_MTU=1380
Then recreate the container. The Synology firewall does not allow additional packet manipulation; there, the WG tunnel has to be tuned manually on the NAS side.
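Where the MTU values come from can be shown with simple arithmetic. WireGuard adds 32 bytes of its own, UDP adds 8, and the outer IP header adds 20 (IPv4) or 40 (IPv6), so the tunnel MTU is the link MTU minus 60 or 80. The helper names wg_mtu and tcp_mss below are hypothetical and serve only as a sketch.

```shell
# wg_mtu LINK_MTU IP_VERSION: largest tunnel MTU that fits into LINK_MTU
# when the outer packets use IPv4 (4) or IPv6 (6).
wg_mtu() {
  case $2 in
    4) echo $(( $1 - 60 )) ;;
    6) echo $(( $1 - 80 )) ;;
    *) echo "usage: wg_mtu LINK_MTU 4|6" >&2; return 2 ;;
  esac
}

# tcp_mss TUNNEL_MTU: TCP MSS for IPv4 traffic inside the tunnel
# (20 bytes inner IP header + 20 bytes TCP header).
tcp_mss() { echo $(( $1 - 40 )); }

wg_mtu 1500 6   # 1420 on a standard Ethernet link
wg_mtu 1492 6   # 1412, for example behind PPPoE
tcp_mss 1380    # 1340
```

A value of 1380 stays below both results, which is why it works as a conservative setting on links with unknown overhead.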
RDP
RDP connection drops after login
Cause: almost always MTU/MSS. The login still runs with small TLS handshakes; as soon as the desktop session sends larger frames (window updates), packets drop.
Check:
docker exec gatecontrol ip link show wg0 | grep mtu
# wg0 should show mtu 1420 or 1380.
Fix:
- GC_WG_MTU=1380 on the server side.
- On the home gateway analogously.
- MSS clamping is on by default (see above). If problems persist: additionally set MSS clamping on the NAS router.
RDP credentials don't work
Cause: master key rotation is active and the last rotation was not acknowledged on the gateway yet.
Check:
# Pending rotations
docker exec gatecontrol curl -s http://127.0.0.1:3000/api/v1/rdp/rotation/pending \
-H "Cookie: <admin-session>" | jq
Fix:
- In the UI go to the RDP host → Confirm rotation.
- Or via API: POST /api/v1/rdp/:id/rotation/ack.
Wake-on-LAN does not trigger
Causes:
- MAC address in the RDP host profile wrong or empty.
- MAC cache on the gateway is cold — the gateway only knows the host MAC if it has communicated at least once.
- Target host is not reachable via broadcast in the home network (managed switch filters broadcasts).
Check:
# UI: RDP host detail → shows the configured MAC
# Or directly from the DB:
docker exec gatecontrol sqlite3 /data/gatecontrol.db \
"SELECT name, wol_enabled, wol_mac_address FROM rdp_routes;"
Fix:
- Enter the MAC manually in the UI.
- First bring the host online once and ping briefly (so that the gateway discovers the MAC).
- Explicitly set the subnet broadcast as the broadcast address (e.g. 192.168.1.255 instead of 255.255.255.255).
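The subnet broadcast address follows from the host address and prefix length: all host bits are set to 1. A minimal sketch in plain shell arithmetic, with hypothetical helper names (ip_to_int, int_to_ip, broadcast_addr):

```shell
# ip_to_int A.B.C.D: convert a dotted IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# int_to_ip N: convert an integer back to dotted notation.
int_to_ip() {
  echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# broadcast_addr IP/PREFIX: set all host bits of the address to 1.
broadcast_addr() {
  prefix=${1#*/}
  host_bits=$(( (1 << (32 - $prefix)) - 1 ))
  int_to_ip $(( $(ip_to_int "${1%/*}") | $host_bits ))
}

broadcast_addr 192.168.1.10/24   # 192.168.1.255
```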
Database
SQLite lock errors
Symptom: log entries SQLITE_BUSY: database is locked, UI requests time out sporadically.
Causes:
- Several processes writing in parallel (shouldn't happen — Node is single writer).
- WAL checkpoint is blocked because a long-running read (backup, export) keeps the WAL open.
- Disk slow/full.
Check:
docker exec gatecontrol ls -lh /data/gatecontrol.db*
df -h
Expectation: .db, .db-wal, .db-shm. WAL should rarely be > 50 MB.
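The 50 MB expectation can be checked without reading the ls output by eye. This sketch assumes the WAL file sits next to the database as /data/gatecontrol.db-wal (SQLite's naming convention for the paths above); size_over_mb is a hypothetical helper name.

```shell
# size_over_mb BYTES LIMIT_MB: succeed if BYTES exceeds LIMIT_MB mebibytes.
# An empty BYTES argument (file missing) counts as 0.
size_over_mb() {
  [ "$(( ${1:-0} ))" -gt "$(( $2 * 1024 * 1024 ))" ]
}

# Usage:
#   wal_bytes=$(docker exec gatecontrol sh -c 'wc -c < /data/gatecontrol.db-wal')
#   size_over_mb "$wal_bytes" 50 && echo "WAL above 50 MB: consider a checkpoint"
```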
Fix:
- Force checkpoint: docker exec gatecontrol sqlite3 /data/gatecontrol.db "PRAGMA wal_checkpoint(TRUNCATE);"
- Free disk space.
- When in doubt: restart the container; that solves 99 % of these problems.
Migrations failed
Symptom: container crashes after upgrade with a stack trace in the log starting with
Migration failed: <name>.
Check:
docker logs gatecontrol 2>&1 | grep -B 2 -A 20 'Migration failed'
# Current migration state in the DB
docker exec gatecontrol sqlite3 /data/gatecontrol.db \
"SELECT version, name, applied_at FROM migration_history ORDER BY version DESC LIMIT 10;"
Fix:
- Document the failing migration, check the issue.
- Rollback to the previous version (see upgrade.md §6).
- If rollback fails due to newly added columns: restore backup onto a fresh volume of the old version.
DB corrupt
Symptom: SQLITE_CORRUPT, disk image is malformed, container crashes.
Causes:
- Host OS crash during write (non-graceful power off).
- Disk failure.
- Container hard-killed during DDL (very rare).
Check:
docker exec gatecontrol sqlite3 /data/gatecontrol.db "PRAGMA integrity_check;"
Fix:
- Rebuild from JSON backup (see backup-and-restore.md §4).
- Alternatively: rename the DB file and restart the container; the schema is created fresh, then restore. Keep /data/.encryption_key, otherwise peer private keys can't be decrypted.
Logging & Debugging
Where the logs are
- App log (Node + Caddy + dnsmasq + supervisord) → docker logs gatecontrol
  docker logs -f gatecontrol            # follow
  docker logs --since 10m gatecontrol
- Caddy access log → /data/caddy/access.log (only if explicitly enabled in the Caddy config).
- Auto-update log → /var/log/gatecontrol-update.log (only if you have integrated update.sh via cron).
- Activity log in DB → /api/v1/logs or in the UI under Settings → Logs.
Turn up the log level
# In .env
GC_LOG_LEVEL=debug
docker compose up -d --force-recreate gatecontrol
This outputs Node Pino debug lines. Caution: left on permanently in production, it produces too much log volume. Switch back to info after debugging.
Debugging a single route
In the UI turn on the Debug tracing flag on the route (debug_enabled=1). Each request then writes a trace entry into the trace table.
Read via:
curl -sS -H "Authorization: Bearer $GC_API_TOKEN" \
"https://gate.example.com/api/v1/routes/<route-id>/trace?limit=50" | jq
Careful: this costs performance — switch off again as soon as you have isolated the case.
Inspect Caddy config
docker exec gatecontrol curl -s http://127.0.0.1:2019/config/ | jq
docker exec gatecontrol caddy validate --config /data/caddy/runtime.json
WireGuard state
docker exec gatecontrol wg show wg0
docker exec gatecontrol wg show wg0 dump
docker exec gatecontrol cat /etc/wireguard/wg0.conf
Known Ops gotchas
Collection from production incidents that came up more than once.
Synology gateway routing with /32 subnet
When the Windows/desktop client interprets its Address = 10.8.0.6/32 (instead of /24), the kill-switch logic of the client computes only its own IP as "VPN subnet" instead of 10.8.0.0/24. Kill-switch and DNS then fail at the server level (routes to 10.8.0.1 are rejected as off-VPN). If users report that the kill-switch kills the VPN route, the cause is almost always the /32 computation on the client side. That is a client-code problem, not a server problem.
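The effect of the prefix length can be reproduced with plain shell arithmetic. The sketch below is not the client's actual code; ip_to_int and in_subnet are hypothetical helpers that only illustrate why 10.8.0.1 falls outside a /32 but inside a /24.

```shell
# ip_to_int A.B.C.D: convert a dotted IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_subnet IP NET/PREFIX: succeed if IP and NET share the same network bits.
in_subnet() {
  net=${2%/*}; prefix=${2#*/}
  mask=$(( (4294967295 << (32 - $prefix)) & 4294967295 ))
  [ $(( $(ip_to_int "$1") & $mask )) -eq $(( $(ip_to_int "$net") & $mask )) ]
}

in_subnet 10.8.0.1 10.8.0.6/24 && echo "/24: server 10.8.0.1 is inside the VPN subnet"
in_subnet 10.8.0.1 10.8.0.6/32 || echo "/32: server 10.8.0.1 is treated as off-VPN"
```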
Node runs as root in the container
Node needs root because it has to call the WireGuard CLI (wg, wg-quick) to manage the interface. That is by design, not a bug. Security scanners flag this; the finding can be ignored.
COEP disabled, bcryptjs + argon2 in parallel
Deliberate design decisions:
- COEP (Cross-Origin-Embedder-Policy) is off because the Caddy admin UI renders iframe previews. This is tracked as an accepted risk.
- bcryptjs + argon2 in parallel because existing passwords (old installations) are still bcrypt and are only upgraded to argon2 on the next login.
This comes up regularly in security reviews — do not report as a finding.
systemd-resolved vs. dnsmasq
systemd-resolved binds on 127.0.0.53:53 and does not collide with the
container-internal dnsmasq (which binds on 127.0.0.1:53 and 10.8.0.1:53).
Other host DNS services (bind9, unbound, NetworkManager dnsmasq, libvirt
dnsmasq) on 127.0.0.1:53 abort the container start with a clear message
— see Container / Host "Container doesn't start".
Cloudflare proxy off = mandatory for HTTP-01
Orange cloud on Cloudflare intercepts port 80 traffic. The Let's Encrypt HTTP-01 validation then reaches Cloudflare instead of the server. Either switch the proxy off (grey) or configure a DNS challenge.
When to ask for help
If you can't narrow down the case yourself, collect a diagnostic bundle:
mkdir -p /tmp/gc-diag-$(date -u +%F) && cd /tmp/gc-diag-$(date -u +%F)
# Container logs (last 500 lines)
docker logs --tail 500 gatecontrol > app.log 2>&1
# Container metadata (state, config, mounts)
docker inspect gatecontrol > inspect.json
# Versions
docker image inspect ghcr.io/callmetechie/gatecontrol:latest \
--format '{{ index .RepoDigests 0 }} / {{ .Created }}' > image-version.txt
# WireGuard
docker exec gatecontrol wg show wg0 > wg-show.txt 2>&1
docker exec gatecontrol wg show wg0 dump > wg-dump.txt 2>&1
# Caddy config
docker exec gatecontrol curl -s http://127.0.0.1:2019/config/ > caddy-config.json
# Health check
curl -s http://127.0.0.1:3000/health > health.json
# Host metadata
uname -a > host.txt
ip route > routes.txt
ss -lntup > ports.txt
# Anonymize (optional, but a good idea):
sed -i -E 's/[a-zA-Z0-9+/]{43}=/REDACTED_PUBKEY/g' wg-*.txt caddy-*.json
tar czf ../diag-bundle.tar.gz -C .. "gc-diag-$(date -u +%F)"
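What the anonymization step matches can be tried on a sample line before sending a bundle. WireGuard keys are 44-character base64 strings ending in "=", so the pattern catches public, private and preshared keys alike. redact_keys is a hypothetical wrapper around the same expression; the sketch uses | as the sed delimiter so that the / inside the bracket expression cannot be mistaken for a separator.

```shell
# redact_keys: replace every 44-character base64 key (43 characters plus "=")
# in the text on stdin. Same pattern as the sed call in the bundle script.
redact_keys() {
  sed -E 's|[a-zA-Z0-9+/]{43}=|REDACTED_PUBKEY|g'
}

# Usage: verify that no key material is left before sharing.
#   redact_keys < wg-dump.txt | grep -c REDACTED_PUBKEY
```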
Add the relevant CHANGELOG lines (which version is running, what the last change was) and screenshots of UI warnings. With that, someone can remotely categorise within 5 minutes where the problem sits, instead of spending half an hour on follow-up questions.
For the standard reference of API and concepts:
- API.md — endpoint reference
- USER-GUIDE.md — feature operation
- concepts/home-gateway.md — gateway architecture
- concepts/routing.md — route model