
Troubleshooting

Troubleshooting · Updated 1 month ago

Audience: operators running GateControl in the field. Each entry follows the pattern symptom → causes → check → fix. Every command can be pasted directly into a terminal.

For systemic references see deployment.md, upgrade.md, backup-and-restore.md.



Container / Host

GateControl container doesn't start / crashes after start

Causes:

  1. .env missing or contains invalid mandatory values (GC_ADMIN_PASSWORD, GC_WG_HOST).
  2. Port 53 on 127.0.0.1 is occupied (another DNS resolver on the host).
  3. Ports 80/443/51820 are occupied by another service.
  4. Docker volume corrupted (rare SQLite lock leaks).
  5. New version has a migration problem (see upgrade.md §10).

Check:

docker logs --tail 100 gatecontrol

Look for:

  • ERROR: GC_ADMIN_PASSWORD is not set or still default → check .env.
  • ERROR: GC_WG_HOST is not set or still the example value → check .env.
  • ERROR: 127.0.0.1:53 is already bound — dnsmasq cannot start → port-53 holder:
    ss -lntup | grep ':53 '
    
    Typical candidates: named/bind9, unbound, dnsmasq (from NetworkManager or libvirt), pihole directly on the host.
  • Error: SQLITE_CORRUPT → restore from backup (see backup-and-restore.md).
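The two .env messages above can be anticipated before the container is started. The following is a minimal sketch, assuming .env consists of plain KEY=value lines. missing_env_keys is a hypothetical helper, not part of GateControl, and it only detects missing or empty values; it cannot tell whether a value is still the default or example value.

```shell
# Print every mandatory key that is missing or empty in an env file.
# Usage: missing_env_keys <env-file> KEY...
# Assumption: the file uses simple KEY=value lines (no "export", no multi-line values).
missing_env_keys() {
  env_file=$1; shift
  for key in "$@"; do
    # Take the last assignment of the key, drop the "KEY=" prefix and any quotes.
    value=$(grep -E "^${key}=" "$env_file" 2>/dev/null | tail -n 1 | cut -d= -f2- | tr -d "\"'")
    [ -n "$value" ] || printf '%s\n' "$key"
  done
}

# Example:
#   missing_env_keys .env GC_ADMIN_PASSWORD GC_WG_HOST
```

No output means both keys carry a value; any key printed needs attention before docker compose up.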

Fix:

  • After env change: docker compose up -d --force-recreate gatecontrol.
  • Port conflict: stop the conflicting process. Mapping to other ports is not easily possible with GateControl (see network_mode: host).
  • If the container loops and you don't see anything concrete:
    docker inspect gatecontrol --format '{{ .State.Status }} — exit {{ .State.ExitCode }} — {{ .State.Error }}'
    

Health check stays on starting

Causes:

  1. Caddy is still waiting for certificates (first start, Let's Encrypt HTTP-01 is running).
  2. Database migrations take time (large existing DB, index rebuild).
  3. Node hangs at init step (e.g. DNS hosts file build).

Check:

docker logs --tail 50 gatecontrol
docker exec gatecontrol curl -s http://127.0.0.1:3000/health

checks.db: false → migrations have not finished; look further in the log.
checks.wireguard: false → wg0 is not yet up; wg show wg0 should confirm that.

Fix:

First wait 60–90 seconds. If the status still does not change, the container is probably in an error state; see "GateControl container doesn't start / crashes after start" above.
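The waiting period can be scripted instead of watched by hand. This is a sketch under assumptions: wait_until_healthy and gc_health_status are hypothetical helper names, and the status is read from Docker's own health state, which is only populated if the image defines a health check.

```shell
# Poll a status command until it prints "healthy" or the timeout expires.
# Usage: wait_until_healthy <timeout-s> <interval-s> <command...>
wait_until_healthy() {
  timeout=$1; interval=$2; shift 2
  elapsed=0
  while :; do
    # A failing status command is treated as "unknown" rather than aborting.
    status=$("$@" 2>/dev/null) || status=unknown
    if [ "$status" = "healthy" ]; then
      echo "healthy after ${elapsed}s"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "still '${status}' after ${elapsed}s" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Hypothetical helper: Docker's health status for the container.
gc_health_status() {
  docker inspect gatecontrol --format '{{ .State.Health.Status }}'
}

# Example: wait up to 90 s, polling every 5 s.
#   wait_until_healthy 90 5 gc_health_status || docker logs --tail 50 gatecontrol
```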


High CPU/RAM load

Causes:

  1. Uptime monitoring runs on many targets with a short interval.
  2. Request tracing (debug_enabled) is on for a route with a lot of traffic.
  3. Traffic snapshot interval too short (GC_TRAFFIC_INTERVAL).
  4. Too many simultaneous Caddy reloads (every peer/route save triggers a /load).

Check:

docker stats gatecontrol --no-stream
docker exec gatecontrol top -b -n 1 | head -20

Process names in the output: node (admin server), caddy, dnsmasq, wg-quick, supervisord. Any process that sustains more than 60 % CPU is the outlier.

node high → in the UI, check which routes have debug_enabled=1 and switch it off per route where it is no longer needed.
caddy high → many active connections or uptime checks; look at /api/v1/caddy/status.

Fix:

  • Increase monitoring intervals (the default of 60 s is conservative; anything below 15 s is problematic).
  • Switch off debug_enabled when done debugging.
  • Set GC_TRAFFIC_INTERVAL to 120 s.

DNS & TLS

Certificate is not issued

Causes (the most common):

  1. DNS record does not point to the server IP or is not yet propagated.
  2. Cloudflare proxy mode (orange cloud) is on and the HTTP-01 challenge is intercepted.
  3. Port 80 is blocked by the host (ufw) or occupied by another service.
  4. Let's Encrypt rate limit hit (50 certs/week per domain).
  5. GC_CADDY_EMAIL is empty — Caddy then uses ZeroSSL fallback, and if the ZeroSSL account is not registerable, no cert.

Check:

# DNS
dig +short A gate.example.com

# Port 80 from outside
curl -I http://gate.example.com/.well-known/acme-challenge/test

# Caddy internal state
docker exec gatecontrol curl -s http://127.0.0.1:2019/config/ | jq '.apps.tls'

# Caddy log (ACME events)
docker logs gatecontrol 2>&1 | grep -i acme | tail -20
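The DNS check above can be turned into a pass/fail test. A minimal sketch: resolves_to is a hypothetical helper, and 203.0.113.10 is a placeholder for the server's public IP. With the Cloudflare proxy enabled, dig returns Cloudflare addresses, so the same test also flags cause 2.

```shell
# Succeeds if <expected-ip> is one of the addresses read from stdin (one per line).
resolves_to() {
  grep -qxF -- "$1"
}

# Example (203.0.113.10 is a placeholder for the server IP):
#   dig +short A gate.example.com | resolves_to 203.0.113.10 \
#     || echo "A record does not point to this server"
```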

Fix:

  • Cloudflare proxy: orange → grey, then reload Caddy:
    docker exec gatecontrol caddy reload --config /app/config/Caddyfile
    
  • For staging tests (no rate limit risk): in .env GC_CADDY_ACME_CA=https://acme-staging-v02.api.letsencrypt.org/directory.
  • Rate limit hit: wait 7 days or set up a DNS challenge (out of scope here).

Further reading: see Caddy server log + /data/caddy/ for local ACME state files.


Admin UI not reachable although container healthy

Causes:

  1. Caddy is running, but the routing for the admin host is broken (external custom route placed on the admin domain).
  2. Firewall blocks port 443 at host level.
  3. DNS record still points to the old server.

Check:

# From inside
docker exec gatecontrol curl -sk https://localhost/ -I
# Should return 200/301 from the Node admin server.

# From outside
curl -I https://gate.example.com/

# Caddy routes
docker exec gatecontrol curl -s http://127.0.0.1:2019/config/ \
  | jq '.apps.http.servers'

Fix:

  • If someone has accidentally created the admin domain as a custom route: delete the route in the UI or directly in the DB and reload Caddy.
  • For ufw: see deployment.md §10.

Routes return 502

Causes:

  1. Backend (target IP:port) is not reachable.
  2. Backend is only reachable via VPN and the referenced peer has no handshake.
  3. Backend requires HTTPS, but route is configured on HTTP (or vice versa).
  4. A failed health check forces the route to "down".

Check:

# Test from inside toward the backend
docker exec gatecontrol curl -v http://<target-ip>:<port>/

# If the backend is reached via a peer → check the peer handshake
docker exec gatecontrol wg show wg0 latest-handshakes

# Caddy log for 502 errors
docker logs gatecontrol 2>&1 | grep -E 'dial|502|upstream' | tail -20

Fix:

  • Bring the peer online first (see "Handshake fails" below).
  • If the backend speaks HTTPS: enable Backend HTTPS in the route config and set backend_tls_insecure if its certificate is self-signed.
  • Check health check failure via UI (route detail → uptime status).

WireGuard

Peer doesn't get an IP / config is faulty

Causes:

  1. Peer was created in UI, but the server hasn't reloaded the config yet.
  2. Allowed-IPs conflict (two peers on the same IP).
  3. GC_WG_SUBNET misconfigured — VPN subnet and peer IP block don't match.

Check:

docker exec gatecontrol wg show wg0
docker exec gatecontrol cat /etc/wireguard/wg0.conf | grep -A 2 AllowedIPs

Fix:

  • In the UI save the peer again → triggers a wg syncconf.
  • For IP conflicts: re-assign peer IPs in the UI.

Handshake fails

Symptoms: peer says "connected", but wg show wg0 shows for that peer latest handshake: 0 or "(none)", transfer: 0 B received.

Causes:

  1. Port 51820/UDP blocked on the server firewall.
  2. Symmetric NAT at the client (mobile carrier).
  3. MTU problem — handshake works, but data packets are fragmented and dropped.
  4. Client has wrong public key or endpoint.

Check:

# Port test from the client
nc -uvz gate.example.com 51820

# Server firewall
iptables -L INPUT -n -v | grep 51820
ufw status | grep 51820

# Packet capture on the server
docker exec gatecontrol tcpdump -ni any udp port 51820 -c 20

# MTU on wg0
docker exec gatecontrol ip link show wg0

Fix:

  • Open UDP 51820 in ufw/iptables: ufw allow 51820/udp.
  • Lower MTU: GC_WG_MTU=1380 in .env, recreate container, redistribute client config.
  • Client-side: compare endpoint + server public key against the UI.

Peer is online but no traffic

Symptoms: wg show shows handshake, transfer counts up. But client says "no internet/no server reachability".

Causes:

  1. MASQUERADE rule missing or points to wrong interface.
  2. net.ipv4.ip_forward = 0 (very rare: entrypoint.sh enables forwarding, but restricted host namespaces can override this).
  3. Split tunnel config on the client excludes 0.0.0.0/0.

Check:

docker exec gatecontrol iptables -t nat -L POSTROUTING -n -v | grep 10.8.0
docker exec gatecontrol sysctl net.ipv4.ip_forward
# Expected: net.ipv4.ip_forward = 1
docker exec gatecontrol ip route get 1.1.1.1

Is the interface in the MASQUERADE rule (-o eth0 or -o ens18) the actual egress interface? ip route shows the default interface. If that doesn't match, NAT doesn't take effect.
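The comparison described above can be automated. The following is a sketch, not GateControl tooling: route_dev, masq_devs and nat_matches_egress are hypothetical helpers, and the rule listing uses iptables -S (rule-spec output) instead of -L because it is easier to parse.

```shell
# Interface named after "dev" in "ip route get <ip>" output (stdin).
route_dev() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Outgoing interface (-o) of every MASQUERADE rule in
# "iptables -t nat -S POSTROUTING" output (stdin).
masq_devs() {
  awk '/MASQUERADE/ { for (i = 1; i < NF; i++) if ($i == "-o") print $(i + 1) }'
}

# Succeeds if the egress interface ($1) is among the remaining arguments.
nat_matches_egress() {
  egress=$1; shift
  for dev in "$@"; do
    [ "$dev" = "$egress" ] && return 0
  done
  return 1
}

# Example (run on the host):
#   egress=$(docker exec gatecontrol ip route get 1.1.1.1 | route_dev)
#   nat_matches_egress "$egress" \
#     $(docker exec gatecontrol iptables -t nat -S POSTROUTING | masq_devs) \
#     || echo "MASQUERADE does not use $egress"
```

A rule without an -o option matches every interface and produces no output from masq_devs, so a failed comparison in that case needs a manual look at the rule.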

Fix:

  • Set GC_NET_INTERFACE explicitly (default is auto-detect). Value must match ip route show default.
  • entrypoint.sh rewrites existing wg0.conf PostUp rules at startup if the interface no longer exists — see line 149 ff.
  • For Synology/OVH: often the interface is ens18 or eth0.VLAN, not plain eth0.

Multiple VPN conflicts

Symptom: the client connects, but has another VPN/DNS connection active in parallel (corporate VPN, Tailscale, …). Consequence: DNS resolution is inconsistent and the API hostname resolves to the wrong IP.

Cause: Android/iOS client caches the old DNS entry across the tunnel change.

Fix:

  • Client side: end the VPN before the GateControl connect and flush the DNS cache.
  • The GateControl Android client codebase handles this programmatically with preResolveDns() before tunnel start and vpnSafeDns in OkHttp. That is client side, not server side, but good to know when debugging.

Home-Gateway

A home gateway is a separate Docker container that runs on a NAS/RPi/mini PC in the home network and builds an outbound tunnel to the GateControl server. Concept: concepts/home-gateway.md.

Gateway card stays "offline" despite running container

Cause (fixed since v1.50.1, but relevant for older instances): Previously the UI took its own self-check flags as truth; since v1.50.1 route_reachability (empirically from measured traffic) is the primary value.

If it still shows "offline" today:

  1. Heartbeat is not arriving.
  2. route_reachability is stale.

Check:

# Gateway peers + last heartbeat
docker exec gatecontrol sqlite3 /data/gatecontrol.db <<'SQL'
SELECT p.name, p.peer_type, gm.last_seen_at, gm.last_config_hash, gm.last_health
FROM peers p
LEFT JOIN gateway_meta gm ON gm.peer_id = p.id
WHERE p.peer_type = 'gateway';
SQL

# WireGuard handshake
docker exec gatecontrol wg show wg0 latest-handshakes
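wg show wg0 latest-handshakes prints one line per peer: the public key, a tab, and the Unix timestamp of the last handshake (0 if there has never been one). A small filter makes stale peers stand out. This is a sketch: stale_peers is a hypothetical helper, and the 180 s default mirrors the "handshake > 3 min old" threshold used in the heartbeat section of this page.

```shell
# Print peers whose last handshake is missing (0) or older than <max-age> seconds.
# Reads "wg show <interface> latest-handshakes" output on stdin.
# Usage: stale_peers [max-age-seconds] [now-as-unix-timestamp]
stale_peers() {
  max_age=${1:-180}
  now=${2:-$(date +%s)}
  awk -v max="$max_age" -v now="$now" '
    $2 == 0        { print $1, "never"; next }
    now - $2 > max { print $1, (now - $2) "s ago" }
  '
}

# Example:
#   docker exec gatecontrol wg show wg0 latest-handshakes | stale_peers 180
```

No output means every peer has completed a handshake within the window.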

Fix:

  • If heartbeat is NULL or old: on the gateway host check the companion logs (docker logs gateway). Typical: token revoked or GC_BASE_URL in gateway .env wrong.
  • If handshake is recent but heartbeat old: gateway container is restarting cyclically — check logs on the gateway host.

Config hash out of sync

Symptom: UI shows "gateway configuration not in sync" or the heartbeat response contains hash-mismatch warnings.

Cause: Caddy /load race on fast change of L4 routes on the server. The server computes the hash before the push, Caddy commits asynchronously, and the gateway in between saw the old hash.

Self-heal: the server reconciles automatically after 60 s (on every heartbeat cycle that sees a mismatch). If not resolved after 2 minutes:

Check:

# Server view
docker exec gatecontrol curl -s http://127.0.0.1:3000/api/v1/gateways \
  -H "Cookie: <admin-session>" | jq '.[].config_hash_status'

Fix:

  • On the server: save a small edit on any L4 route → forces a full config push.
  • Last resort: restart the gateway container (docker compose restart gateway on the NAS).

Heartbeat doesn't arrive

Causes:

  1. WG tunnel to the gateway is down (handshake > 3 min old).
  2. Gateway token revoked or expired.
  3. Firewall at the server blocks heartbeat HTTPS (rare — goes through Caddy).
  4. Gateway .env has wrong GC_BASE_URL.

Check:

# On the NAS
docker logs gateway --tail 50
docker exec gateway wg show

Fix:


Gateway traffic has high latency / RDP drops

Cause: an MTU problem. Path MTU discovery often fails across the WG tunnel because ICMP packets are filtered along the way.

Fix: MSS clamping is enabled by default in the entrypoint since v1.41 (TCPMSS --clamp-mss-to-pmtu on the FORWARD chain). If that's not enough:

# In .env
GC_WG_MTU=1380

Then recreate the container. The Synology firewall does not allow additional manipulation; there, manual tuning of the NAS-side wg tunnel is needed.


RDP

RDP connection drops after login

Cause: almost always MTU/MSS. The login still runs with small TLS handshakes; as soon as the desktop session sends larger frames (window updates), packets drop.

Check:

docker exec gatecontrol ip link show wg0 | grep mtu
# wg0 should show mtu 1420 or 1380.
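Whether a given MTU actually fits the path can be probed with a non-fragmenting ping through the tunnel. The arithmetic below is a sketch for IPv4: an ICMP echo adds 28 bytes of headers (20 IPv4 + 8 ICMP) and TCP adds 40 (20 IPv4 + 20 TCP). The helper names are hypothetical, and the ping flags assume Linux iputils on the client.

```shell
# ICMP payload size that yields an IPv4 packet of exactly <mtu> bytes.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Largest TCP MSS that fits into <mtu> over IPv4.
tcp_mss_for_mtu() {
  echo $(( $1 - 40 ))
}

# Example probe from a Linux client through the tunnel (-M do forbids fragmentation):
#   ping -M do -c 3 -s "$(icmp_payload_for_mtu 1380)" 10.8.0.1
```

If the probe at the configured MTU fails while a smaller payload succeeds, the path MTU is below the configured value and GC_WG_MTU should be lowered accordingly.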

Fix:

  • GC_WG_MTU=1380 on the server side.
  • On the home gateway analogously.
  • MSS clamping is enabled by default (see above). If problems persist: additionally set MSS clamping on the NAS router.

RDP credentials don't work

Cause: master key rotation is active and the last rotation was not acknowledged on the gateway yet.

Check:

# Pending rotations
docker exec gatecontrol curl -s http://127.0.0.1:3000/api/v1/rdp/rotation/pending \
  -H "Cookie: <admin-session>" | jq

Fix:

  • In the UI go to the RDP host → Confirm rotation.
  • Or via API: POST /api/v1/rdp/:id/rotation/ack.

Wake-on-LAN does not trigger

Causes:

  1. MAC address in the RDP host profile wrong or empty.
  2. MAC cache on the gateway is cold — the gateway only knows the host MAC if it has communicated at least once.
  3. Target host is not reachable via broadcast in the home network (managed switch filters broadcasts).

Check:

# UI: RDP host detail → shows the configured MAC
# Or directly from the DB:
docker exec gatecontrol sqlite3 /data/gatecontrol.db \
  "SELECT name, wol_enabled, wol_mac_address FROM rdp_routes;"

Fix:

  • Enter the MAC manually in the UI.
  • First bring the host online once and ping briefly (so that the gateway discovers the MAC).
  • Explicitly set the subnet broadcast as the broadcast address (e.g. 192.168.1.255 instead of 255.255.255.255).

Database

SQLite lock errors

Symptom: log entries SQLITE_BUSY: database is locked, UI requests time out sporadically.

Causes:

  1. Several processes writing in parallel (shouldn't happen — Node is single writer).
  2. WAL checkpoint is blocked because a long-running read (backup, export) keeps the WAL open.
  3. Disk slow/full.

Check:

docker exec gatecontrol ls -lh /data/gatecontrol.db*
df -h

Expectation: .db, .db-wal, .db-shm. WAL should rarely be > 50 MB.
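The 50 MB expectation can be checked without reading ls output by eye. A minimal sketch: wal_over_limit is a hypothetical helper, and the example assumes wc is available inside the container.

```shell
# Succeeds if <bytes> exceeds <limit-mb> megabytes (default 50).
# Usage: wal_over_limit <bytes> [limit-mb]
wal_over_limit() {
  bytes=$(( $1 + 0 ))          # normalise possible whitespace around the number
  limit_mb=${2:-50}
  [ "$bytes" -gt $(( limit_mb * 1024 * 1024 )) ]
}

# Example:
#   wal_bytes=$(docker exec gatecontrol sh -c 'wc -c < /data/gatecontrol.db-wal')
#   wal_over_limit "$wal_bytes" && echo "WAL is larger than 50 MB, consider a checkpoint"
```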

Fix:

  • Force checkpoint:
    docker exec gatecontrol sqlite3 /data/gatecontrol.db "PRAGMA wal_checkpoint(TRUNCATE);"
    
  • Free disk space.
  • When in doubt: restart the container, solves 99 % of these problems.

Migrations failed

Symptom: container crashes after upgrade with a stack trace in the log starting with Migration failed: <name>.

Check:

docker logs gatecontrol 2>&1 | grep -B 2 -A 20 'Migration failed'

# Current migration state in the DB
docker exec gatecontrol sqlite3 /data/gatecontrol.db \
  "SELECT version, name, applied_at FROM migration_history ORDER BY version DESC LIMIT 10;"

Fix:

  • Document the failing migration, check the issue.
  • Rollback to the previous version (see upgrade.md §6).
  • If rollback fails due to newly added columns: restore backup onto a fresh volume of the old version.

DB corrupt

Symptom: SQLITE_CORRUPT, disk image is malformed, container crashes.

Causes:

  1. Host OS crash during write (non-graceful power off).
  2. Disk failure.
  3. Container hard-killed during DDL (very rare).

Check:

docker exec gatecontrol sqlite3 /data/gatecontrol.db "PRAGMA integrity_check;"

Fix:

  • Rebuild from JSON backup (see backup-and-restore.md §4).
  • Alternatively: rename DB file, restart container, schema is created fresh, then restore. Keep /data/.encryption_key, otherwise peer private keys can't be decrypted.

Logging & Debugging

Where the logs are

  • App log (Node + Caddy + dnsmasq + supervisord) →
    docker logs gatecontrol
    docker logs -f gatecontrol      # follow
    docker logs --since 10m gatecontrol
    
  • Caddy access log → /data/caddy/access.log (only if explicitly enabled in the Caddy config).
  • Auto-update log → /var/log/gatecontrol-update.log (only if you have integrated update.sh via cron).
  • Activity log in DB → /api/v1/logs or in UI Settings → Logs.

Turn up the log level

# In .env
GC_LOG_LEVEL=debug

docker compose up -d --force-recreate gatecontrol

This outputs Node Pino debug lines. Caution: running at debug level permanently in production generates too much log volume. Set the level back to info when you are done debugging.

Debugging a single route

In the UI, turn on the Debug tracing flag on the route (debug_enabled=1). Each request then writes a trace entry to the trace table. Read the entries via:

curl -sS -H "Authorization: Bearer $GC_API_TOKEN" \
  "https://gate.example.com/api/v1/routes/<route-id>/trace?limit=50" | jq

Careful: this costs performance — switch off again as soon as you have isolated the case.

Inspect Caddy config

docker exec gatecontrol curl -s http://127.0.0.1:2019/config/ | jq
docker exec gatecontrol caddy validate --config /data/caddy/runtime.json

WireGuard state

docker exec gatecontrol wg show wg0
docker exec gatecontrol wg show wg0 dump
docker exec gatecontrol cat /etc/wireguard/wg0.conf

Known Ops gotchas

Collection from production incidents that came up more than once.

Synology gateway routing with /32 subnet

When the Windows/desktop client interprets its Address = 10.8.0.6/32 (instead of /24), the kill-switch logic of the client computes only its own IP as "VPN subnet" instead of 10.8.0.0/24. Then kill-switch and DNS fail at the server level (routes to 10.8.0.1 are rejected as off-VPN). If users report that the kill-switch kills the VPN route, the cause is almost always the /32 computation client-side — that's a client-code problem, not server.
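The effect of the prefix length can be reproduced with plain integer arithmetic. This sketch is illustrative only and is not the client's actual kill-switch code; ip_to_int and ip_in_cidr are hypothetical helpers for IPv4.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1                    # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeeds if <ip> lies inside the network described by <address>/<prefix>.
ip_in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  prefix=${2#*/}
  if [ "$prefix" -eq 0 ]; then
    mask=0
  else
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  fi
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Example:
#   ip_in_cidr 10.8.0.1 10.8.0.6/24 && echo "server is inside the VPN subnet"
#   ip_in_cidr 10.8.0.1 10.8.0.6/32 || echo "server counts as off-VPN"
```

With /24 the server address 10.8.0.1 falls inside the computed subnet; with /32 only 10.8.0.6 itself does, which is why traffic to 10.8.0.1 is treated as off-VPN.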

Node runs as root in the container

Node needs root because it has to call the WireGuard CLI (wg, wg-quick) to manage the interface. That is by design, not a bug. Security scanners flag this; it can be ignored.

COEP disabled, bcryptjs + argon2 in parallel

Deliberate design decisions:

  • COEP (Cross-Origin-Embedder-Policy) is off because the Caddy admin UI renders iframe previews. This is tracked as an accepted risk.
  • bcryptjs + argon2 in parallel because existing passwords (old installations) are still bcrypt and are only upgraded to argon2 on the next login.

This comes up regularly in security reviews — do not report as a finding.

systemd-resolved vs. dnsmasq

systemd-resolved binds on 127.0.0.53:53 and does not collide with the container-internal dnsmasq (which binds on 127.0.0.1:53 and 10.8.0.1:53). Other host DNS services (bind9, unbound, NetworkManager dnsmasq, libvirt dnsmasq) on 127.0.0.1:53 abort the container start with a clear message — see Container / Host "Container doesn't start".

Cloudflare proxy off = mandatory for HTTP-01

Orange cloud on Cloudflare intercepts port 80 traffic. Let's Encrypt HTTP-01 then doesn't see the server but Cloudflare. Either proxy off (grey) or configure a DNS challenge.


When to ask for help

If you can't narrow down the case yourself, collect a diagnostic bundle:

mkdir -p /tmp/gc-diag-$(date -u +%F) && cd /tmp/gc-diag-$(date -u +%F)

# Container logs (last 500 lines)
docker logs --tail 500 gatecontrol > app.log 2>&1

# Container metadata (state, config, mounts)
docker inspect gatecontrol > inspect.json

# Versions
docker image inspect ghcr.io/callmetechie/gatecontrol:latest \
  --format '{{ index .RepoDigests 0 }} / {{ .Created }}' > image-version.txt

# WireGuard
docker exec gatecontrol wg show wg0 > wg-show.txt 2>&1
docker exec gatecontrol wg show wg0 dump > wg-dump.txt 2>&1

# Caddy config
docker exec gatecontrol curl -s http://127.0.0.1:2019/config/ > caddy-config.json

# Health check
curl -s http://127.0.0.1:3000/health > health.json

# Host metadata
uname -a > host.txt
ip route > routes.txt
ss -lntup > ports.txt

# Anonymize (optional, but a good idea):
sed -i -E 's/[a-zA-Z0-9+/]{43}=/REDACTED_PUBKEY/g' wg-*.txt caddy-*.json

tar czf ../diag-bundle.tar.gz -C .. "gc-diag-$(date -u +%F)"
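The anonymization pattern in the script can be tried on sample input before it is trusted with real files. A minimal sketch with a hypothetical helper name: WireGuard keys are 44 base64 characters ending in "=", so the same expression also catches the private and preshared keys that wg show wg0 dump prints, not only public keys.

```shell
# Replace every 44-character base64 key (43 characters plus "=") read from stdin.
redact_wg_keys() {
  sed -E 's#[a-zA-Z0-9+/]{43}=#REDACTED_KEY#g'
}

# Example: verify that no key survives in the collected files.
#   redact_wg_keys < wg-dump.txt | grep -cE '[a-zA-Z0-9+/]{43}='
```

The example prints 0 when every key has been replaced.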

Add the relevant CHANGELOG lines (which version is running, what the last change was) and screenshots of UI warnings. With that, someone can remotely narrow down where the problem sits in about 5 minutes, instead of spending half an hour on follow-up questions.

For the standard reference of API and concepts:
