Production operations

Backups, restoration, scaling, log aggregation, alerting — what you need to keep self-hosted AppEngine alive.

A self-hosted AppEngine in production needs the same operational machinery as any other Node-plus-database service. This page is the runbook companion — what to monitor, what to alert on, when to scale. The source for the policy below is appengine/PRODUCTION_OPERATIONS_CONTROL_GUIDE.md.

What to monitor

Health endpoints

AppEngine exposes a small monitoring API:

Endpoint                          Returns
GET /monitoring/health            Overall: status, mongo state, redis state
GET /monitoring/overview          System overview dashboard data
GET /monitoring/system-metrics    CPU, memory, event-loop lag
GET /monitoring/queues            BullMQ queue depths, processing rates
GET /monitoring/alerts            Recent alert history
GET /monitoring/user-activity     Activity counters
GET /monitoring/usage             Per-org usage stats

Wire these to whatever you use — Prometheus scraper, Datadog HTTP check, or a simple uptime ping. The default is to alert on any /monitoring/health response that isn't 200 {status: "ok"}.
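
A minimal probe along those lines, assuming the API is reachable at $APPENGINE_URL (a placeholder); it works from cron, an uptime tool, or a blackbox exporter:

#!/bin/sh
# Fail (non-zero exit) unless /monitoring/health answers 200 with status "ok".
# APPENGINE_URL is a placeholder for your deployment's base URL.
resp=$(curl -fsS -m 5 "$APPENGINE_URL/monitoring/health") || exit 1
echo "$resp" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"' || exit 1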

Key metrics

Metric                     Source                       Alert threshold
API response time p95      response-time interceptor    >500ms sustained 5min
Error rate                 logs                         >1% over 5min
Event loop lag             system-metrics               >100ms
Heap usage                 system-metrics               >1.5 GB
MongoDB query time p95     mongo profiler               >100ms
Redis hit rate             redis info                   <70%
Queue depth                /monitoring/queues           >10000 jobs
Failed jobs                /monitoring/queues           >100/5min
Disk free                  host                         <10%
Cert expiry                TLS endpoint                 <14 days

Auth-specific alerts

Authentication is the most attacked surface. Track:

  • Failed login attempts per IP — alert >10/min
  • Token refresh frequency per user — alert >5/min
  • JWT validation failures — alert >20/min
  • API key auth failures — alert >50/5min

The default rate limit is 100 req/min/IP (RATE_LIMIT_MAX). Tune based on your patterns.
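
A rough sketch for the first bullet, assuming the structured JSON logs carry an ip field and a recognizable failed-login message — both are assumptions; adjust the pattern and field name to your actual log shape:

# Count failed-login attempts per source IP from the combined log.
# "login failed" and the .ip field are assumptions about the log format.
grep -i 'login failed' logs/combined.log \
  | jq -r '.ip // empty' \
  | sort | uniq -c | sort -rn | head -20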

Logging

AppEngine writes:

  • stdout — structured logs (JSON when NODE_ENV=production)
  • ./error.log — fatal/error level
  • ./logs/combined.log — everything

In Kubernetes, scrape stdout via your cluster log collector (Fluent Bit → Loki, or the cloud's native — CloudWatch, Stackdriver). On a single Docker host, use Docker's log driver (json-file, gelf, syslog, journald).
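
On a single host, capping and rotating the json-file driver is usually enough (the image name is a placeholder):

# Rotate container logs locally so they can't fill the disk;
# swap json-file for gelf/syslog/journald to ship logs elsewhere.
docker run -d --name appengine \
  --log-driver json-file \
  --log-opt max-size=50m \
  --log-opt max-file=5 \
  your-registry/appengine:latest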

Critical log points the source guide calls out:

  • /src/users/auth/jwt.auth.guard.ts — JWT validation failures
  • /src/users/auth/apikey.auth.guard.ts — API key auth
  • /src/middlewares/current-user.middleware.ts — user context errors
  • Authentication events (every login attempt, token op)
  • Permission violations
  • Database connection failures
  • External-service failures (timeouts, rate limits)
  • Business-logic errors (validation, workflow)

Backups

MongoDB

Daily full + hourly oplog tail is the production standard.

Full daily (Kubernetes CronJob shown earlier in Kubernetes setup; equivalent on a single box):

mongodump \
  --uri "$MONGODB_CONN" \
  --gzip \
  --archive=/backup/mongo-$(date +%Y%m%d-%H%M%S).gz

Retain 14 days local, 90+ days off-site (S3 or equivalent with object versioning).
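
A sketch of the off-site step with the AWS CLI (the bucket name is a placeholder; any S3-compatible tool works the same way):

# Ship the newest dump off-site, then prune local copies older than 14 days.
latest=$(ls -t /backup/mongo-*.gz | head -1)
aws s3 cp "$latest" s3://your-backup-bucket/mongo/
find /backup -name 'mongo-*.gz' -mtime +14 -delete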

Hourly oplog for point-in-time recovery to within ~1 hour:

mongodump \
  --uri "$MONGODB_CONN" \
  --db local --collection oplog.rs \
  --query "{\"ts\":{\"\$gt\":{\"\$timestamp\":{\"t\":$(date -d '1 hour ago' +%s),\"i\":1}}}}" \
  --out /backup/oplog-$(date +%Y%m%d-%H)

Redis

Redis persistence is configured via --appendonly yes (the manifests in earlier pages set this). The AOF file gives you replay-from-disk on restart. For backups proper:

redis-cli --rdb /backup/redis-$(date +%Y%m%d).rdb

Redis holds ephemeral state (sessions, queues). It's tolerable to lose recent Redis state in a disaster — Mongo is the system of record. Skip Redis backups entirely if they add meaningful cost or complexity.

Object storage

Whatever you use (S3, Spaces, R2, MinIO) — turn on versioning + lifecycle rules. The blast radius of bad code deleting files is too large to operate without it.
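
With the AWS CLI this is two calls (the bucket name is a placeholder; MinIO's mc and other providers have equivalents):

# Enable object versioning, then expire noncurrent versions after 90 days.
aws s3api put-bucket-versioning \
  --bucket your-appengine-files \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-lifecycle-configuration \
  --bucket your-appengine-files \
  --lifecycle-configuration '{"Rules":[{"ID":"expire-old-versions","Status":"Enabled","Filter":{},"NoncurrentVersionExpiration":{"NoncurrentDays":90}}]}'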

Test restoration

A backup you've never restored isn't a backup. Quarterly minimum:

  1. Spin up a staging AppEngine pointing at an empty Mongo
  2. Restore last night's dump (mongorestore --gzip --archive=...)
  3. Boot AppEngine, verify root login, verify a known record exists
  4. Time the whole thing — that's your RTO

If RTO is unacceptable (>4h for a 100GB DB), look into MongoDB Atlas managed backups or a hot-standby replica.
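
A minimal drill script, timed end to end; the scratch connection string, archive path, database name, and smoke-test collection are all placeholders:

# Restore last night's dump into a scratch MongoDB and time it (that's your RTO),
# then run a trivial smoke test against a known collection.
time mongorestore \
  --uri "$SCRATCH_MONGODB_CONN" \
  --gzip --archive=/backup/mongo-latest.gz --drop
mongosh "$SCRATCH_MONGODB_CONN" --quiet \
  --eval 'db.getSiblingDB("appengine").users.countDocuments({})'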

Scaling

When to scale AppEngine pods

Symptoms: response time creeping up, event-loop lag >100ms, CPU >70% sustained. AppEngine is mostly stateless — adding replicas is safe. Start at 3 replicas; add 1-2 more before peak periods.

kubectl scale deployment/appengine --replicas=5

If a HorizontalPodAutoscaler is in place, raise its maxReplicas and let it grow automatically.
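
Either way works from the CLI; the HPA name below is an assumption (kubectl autoscale names it after the deployment):

# No autoscaler yet: create a basic CPU-based one.
kubectl autoscale deployment/appengine --min=3 --max=10 --cpu-percent=70
# Already have one: just raise the ceiling.
kubectl patch hpa appengine -p '{"spec":{"maxReplicas":10}}'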

When to scale MongoDB

Symptoms: query time >100ms p95, working set spilling to disk, IOPS pegged. Options in order of disruption:

  1. More RAM — keep working set in memory. Cheapest fix.
  2. Faster disk — gp3 → io2 / NVMe. Doubles IOPS without code change.
  3. Indexes — run explain() on slow queries to find collection scans; db.runCommand({collStats: 'collection'}) shows index sizes (see the sketch after this list).
  4. Replica set + read preference — split reads across secondaries. Requires changing MONGODB_CONN to a replica-set URI.
  5. Sharding — split by org id. Major undertaking; do this only when single-replica capacity is exhausted.
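
A sketch of the index check, assuming mongosh is available; the database, collection, and filter are placeholders:

# Show execution stats for a suspect query (look for COLLSCAN stages)
# and the indexes the collection already has.
mongosh "$MONGODB_CONN" --quiet --eval '
  const c = db.getSiblingDB("appengine").getCollection("records");
  printjson(c.find({ orgId: "acme" }).explain("executionStats").executionStats);
  printjson(c.getIndexes());
'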

When to scale Redis

Symptoms: command latency >5ms, memory >80%. Options:

  1. More memory — raise maxmemory to actual RAM minus headroom.
  2. Eviction policy — allkeys-lru for a pure cache, noeviction for queues (the default). If the workload is mixed, split into two Redis instances (commands below).
  3. Cluster mode — split keyspace. AppEngine supports Redis Cluster URIs.
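
The memory knobs can be inspected and changed at runtime (values here are examples; persist them in the config or manifest so a restart doesn't undo them):

# Check current usage, then adjust the ceiling and eviction policy.
redis-cli INFO memory | grep -E 'used_memory_human|maxmemory_human'
redis-cli CONFIG SET maxmemory 6gb
redis-cli CONFIG SET maxmemory-policy noeviction   # queues; allkeys-lru for a pure cache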

When to scale workers

Symptoms: queue depth growing, jobs taking longer than expected, /monitoring/queues showing stalled jobs. Run a separate AppEngine deployment with worker-only env:

# worker deployment
env:
  - name: ENABLE_SYNC_PROCESSORS
    value: 'true'
  - name: ENABLE_SYNC_CONSUMERS
    value: 'true'
  - name: ENABLE_SYNC_JOBS
    value: 'true'
  # API not exposed; readiness probe on a different path or none

Turn those flags off on the API replicas (example below); this lets you scale workers independently of the API.
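
A sketch with kubectl set env, targeting the appengine deployment used above:

# Stop the API pods from picking up queue work; only the worker
# deployment keeps the sync flags enabled.
kubectl set env deployment/appengine \
  ENABLE_SYNC_PROCESSORS=false \
  ENABLE_SYNC_CONSUMERS=false \
  ENABLE_SYNC_JOBS=false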

Common runbook entries

Authentication failure spike

Symptoms: alerts firing on failed logins. Steps:

  1. Check /monitoring/alerts for the offender IPs.
  2. Block at the ingress / firewall (Cloudflare, AWS WAF, etc.).
  3. If the attack is distributed: tighten rate limiting (e.g. RATE_LIMIT_MAX=30).
  4. Check whether legitimate user traffic is impacted — if so, change tactics (CAPTCHA, MFA enforcement).
  5. Audit logs for any successful logins from suspicious IPs in the same window.

MongoDB connection lost

Symptoms: 500s on every endpoint, /monitoring/health returns mongodb: down.

  1. kubectl get pods -n appmint — is Mongo running?
  2. kubectl logs mongodb-0 — connection errors? OOM?
  3. Verify connection string in the secret (MONGODB_CONN).
  4. If MongoDB is up but AppEngine can't reach it: network policy? DNS? (See the check below.)
  5. Roll AppEngine pods (kubectl rollout restart deployment/appengine) — they reconnect on boot.
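
A quick reachability check from inside the cluster, assuming the mongo:7 image and that you export MONGODB_CONN from the secret first; if the ping succeeds, the problem is in AppEngine's configuration rather than the network:

# Run a throwaway pod in the same namespace and ping MongoDB from it.
kubectl run -it --rm mongo-check -n appmint --restart=Never --image=mongo:7 -- \
  mongosh "$MONGODB_CONN" --quiet --eval 'db.adminCommand({ ping: 1 })'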

High memory usage

Symptoms: heap >1.5 GB, container OOM-killed.

  1. Check /monitoring/system-metrics for the trend.
  2. Take a heap dump: start the inspector (Node.js listens for SIGUSR1) and capture a snapshot from Chrome DevTools, or use kubectl exec + the inspector.
  3. Look for handle leaks: lsof -p <pid> | wc -l should be stable (containerized variant below).
  4. Check for runaway queries — a single query returning 1M docs without pagination is the usual culprit.
  5. As a temporary fix, raise MEMORY_LIMIT and add a replica.
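
lsof is often missing from slim container images; counting /proc file-descriptor entries gives the same signal (assumes the Node process is PID 1):

# Open-handle count for the main process; watch it over time, it should stay stable.
kubectl exec deploy/appengine -- sh -c 'ls /proc/1/fd | wc -l'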

Queue backed up

Symptoms: /monitoring/queues showing >10k pending.

  1. Identify the queue: bull-board UI is wired by default.
  2. Are workers running? kubectl get pods -l role=worker.
  3. Are jobs failing? Look at the failed-jobs list — usually a vendor API outage (Stripe, Twilio).
  4. Scale workers up: kubectl scale deployment/appengine-worker --replicas=10.
  5. If a specific job is poison-pill: drain it manually (Bull-board → remove job).

External service down

Symptoms: errors mentioning Twilio/Stripe/etc.; /monitoring/usage shows external-service errors.

  1. Check the vendor's status page first.
  2. AppEngine retries with exponential backoff by default — most transient outages self-heal.
  3. For prolonged outages, enable a feature flag (STRIPE_DISABLED=true) so the UI gracefully degrades.
  4. Notify customers via the chat module's broadcast mechanism if customer-facing.

Daily, weekly, monthly tasks

Daily

  • Skim /monitoring/health and /monitoring/alerts
  • Verify last night's backup completed and was uploaded off-site
  • Check error log for new patterns

Weekly

  • Review API usage trends — anyone hitting quota limits?
  • Validate restore — at least a smoke test against a recent dump in a scratch env
  • Check certificate expiries (cert-manager, Let's Encrypt) for next 30 days
  • Review security-related logs for anomalies

Monthly

  • Full restore drill (timed)
  • Vulnerability scan: npm audit against your fork; trivy image against the deployed image (commands below)
  • Review JWT secret age — rotate if >90 days
  • Capacity plan: project growth, decide on scaling moves
  • Review and tune monitoring thresholds based on the past 30 days
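
The two scans from the list, sketched (the image reference is a placeholder):

# Dependency audit on your fork, then an image scan of what's actually deployed.
npm audit --audit-level=high
trivy image your-registry/appengine:latest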

What good operations looks like

  • The on-call engineer can find any error in the logs within 60 seconds
  • An AppEngine pod can be killed by hand and traffic doesn't notice
  • A MongoDB restore from yesterday's backup completes in under an hour
  • New deploys roll out with zero downtime
  • Alerts fire before customers complain, not after

If you're not there yet, prioritize in that order: log aggregation first (you can't fix what you can't see), then health checks + autoscaling, then backups, then performance tuning.