A self-hosted AppEngine in production needs the same operational machinery as any other Node-plus-database service. This page is the runbook companion — what to monitor, what to alert on, when to scale. The source for the policy below is appengine/PRODUCTION_OPERATIONS_CONTROL_GUIDE.md.
What to monitor
Health endpoints
AppEngine exposes a small monitoring API:
| Endpoint | Returns |
|---|---|
| `GET /monitoring/health` | Overall: status, mongo state, redis state |
| `GET /monitoring/overview` | System overview dashboard data |
| `GET /monitoring/system-metrics` | CPU, memory, event-loop lag |
| `GET /monitoring/queues` | BullMQ queue depths, processing rates |
| `GET /monitoring/alerts` | Recent alert history |
| `GET /monitoring/user-activity` | Activity counters |
| `GET /monitoring/usage` | Per-org usage stats |
Wire these to whatever you use — Prometheus scraper, Datadog HTTP check, simple uptime ping. The default is to alert on any `/monitoring/health` response that isn't `200 {status: "ok"}`.
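A minimal probe sketch; the host and port (`appengine:3000`) are assumptions to adjust for your deployment:

```bash
# Hypothetical host/port; the expected {status: "ok"} shape is documented above.
STATUS=$(curl -fsS --max-time 5 http://appengine:3000/monitoring/health | jq -r '.status')
[ "$STATUS" = "ok" ] || echo "ALERT: /monitoring/health returned status=$STATUS"
```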
Key metrics
| Metric | Source | Alert threshold |
|---|---|---|
| API response time p95 | response-time interceptor | >500ms sustained 5min |
| Error rate | logs | >1% over 5min |
| Event loop lag | system-metrics | >100ms |
| Heap usage | system-metrics | >1.5 GB |
| MongoDB query time p95 | mongo profiler | >100ms |
| Redis hit rate | redis info | <70% |
| Queue depth | /monitoring/queues | >10000 jobs |
| Failed jobs | /monitoring/queues | >100/5min |
| Disk free | host | <10% |
| Cert expiry | TLS endpoint | <14 days |
Auth-specific alerts
Authentication is the most attacked surface. Track:
- Failed login attempts per IP — alert >10/min
- Token refresh frequency per user — alert >5/min
- JWT validation failures — alert >20/min
- API key auth failures — alert >50/5min
The default rate limit is 100 req/min/IP (RATE_LIMIT_MAX). Tune based on your patterns.
Logging
AppEngine writes:
- `stdout` — structured logs (JSON when `NODE_ENV=production`)
- `./error.log` — fatal/error level
- `./logs/combined.log` — everything
In Kubernetes, scrape stdout via your cluster log collector (Fluent Bit → Loki, or the cloud's native — CloudWatch, Stackdriver). On a single Docker host, use Docker's log driver (json-file, gelf, syslog, journald).
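On a single host, bounding the json-file driver keeps logs from filling the disk; a sketch (the image name is a placeholder for your build):

```bash
# Rotate at 10 MB, keep 5 files; image name is a placeholder.
docker run -d --name appengine \
  --log-driver json-file \
  --log-opt max-size=10m --log-opt max-file=5 \
  your-registry/appengine:latest
```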
Critical log points the source guide calls out:
- `/src/users/auth/jwt.auth.guard.ts` — JWT validation failures
- `/src/users/auth/apikey.auth.guard.ts` — API key auth
- `/src/middlewares/current-user.middleware.ts` — user context errors
- Authentication events (every login attempt, token op)
- Permission violations
- Database connection failures
- External-service failures (timeouts, rate limits)
- Business-logic errors (validation, workflow)
Backups
MongoDB
Daily full + hourly oplog tail is the production standard.
Full daily (Kubernetes CronJob shown earlier in Kubernetes setup; equivalent on a single box):
```bash
mongodump \
  --uri "$MONGODB_CONN" \
  --gzip \
  --archive=/backup/mongo-$(date +%Y%m%d-%H%M%S).gz
```
Retain 14 days local, 90+ days off-site (S3 or equivalent with object versioning).
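A sketch of the off-site leg, assuming an S3 bucket named `appmint-backups` (a placeholder) and the dump path from the command above:

```bash
# Ship the newest dump off-site, then prune local copies past the 14-day window.
LATEST=$(ls -t /backup/mongo-*.gz | head -1)
aws s3 cp "$LATEST" s3://appmint-backups/mongo/
find /backup -name 'mongo-*.gz' -mtime +14 -delete
```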
Hourly oplog for point-in-time recovery to within ~1 hour:
```bash
mongodump \
  --uri "$MONGODB_CONN" \
  --db local --collection oplog.rs \
  --query "{\"ts\":{\"\$gt\":{\"\$timestamp\":{\"t\":$(date -d '1 hour ago' +%s),\"i\":1}}}}" \
  --out /backup/oplog-$(date +%Y%m%d-%H)
```
Redis
Redis persistence is configured via --appendonly yes (the manifests in earlier pages set this). The AOF file gives you replay-from-disk on restart. For backups proper:
```bash
redis-cli --rdb /backup/redis-$(date +%Y%m%d).rdb
```
Redis holds ephemeral state (sessions, queues), and losing recent Redis state in a disaster is tolerable; Mongo is the system of record. If backing up Redis adds meaningful cost or complexity, skip it.
Object storage
Whatever you use (S3, Spaces, R2, MinIO) — turn on versioning + lifecycle rules. The blast radius of bad code deleting files is too large to operate without it.
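With the AWS CLI that looks roughly like this (bucket name is a placeholder; other stores have equivalents):

```bash
# Versioning plus a rule that expires old object versions after 90 days.
aws s3api put-bucket-versioning --bucket appmint-backups \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-lifecycle-configuration --bucket appmint-backups \
  --lifecycle-configuration '{"Rules":[{"ID":"expire-old-versions","Status":"Enabled","Filter":{},"NoncurrentVersionExpiration":{"NoncurrentDays":90}}]}'
```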
Test restoration
A backup you've never restored isn't a backup. Quarterly minimum:
- Spin up a staging AppEngine pointing at an empty Mongo
- Restore last night's dump (`mongorestore --gzip --archive=...`)
- Boot AppEngine, verify root login, verify a known record exists
- Time the whole thing — that's your RTO
If RTO is unacceptable (>4h for a 100GB DB), look into MongoDB Atlas managed backups or a hot-standby replica.
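The timing step can be a two-liner; a sketch with placeholder names (`$STAGING_MONGODB_CONN`, the archive path):

```bash
# Time the restore end-to-end; the wall-clock result is your RTO floor.
START=$(date +%s)
mongorestore --uri "$STAGING_MONGODB_CONN" --gzip --archive=/backup/mongo-latest.gz --drop
echo "Restore took $(( $(date +%s) - START ))s"
```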
Scaling
When to scale AppEngine pods
Symptoms: response time creeping up, event-loop lag >100ms, CPU >70% sustained. AppEngine is mostly stateless — adding replicas is safe. Start at 3 replicas; add 1-2 more before peak periods.
```bash
kubectl scale deployment/appengine --replicas=5
```
If an autoscaler is in place, raise `maxReplicas` and let it grow automatically.
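If there's no HPA yet, a CPU-based one is a reasonable starting point (the thresholds here are suggestions, not AppEngine defaults):

```bash
# Scale between 3 and 10 replicas, targeting 70% average CPU.
kubectl autoscale deployment/appengine --min=3 --max=10 --cpu-percent=70
```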
When to scale MongoDB
Symptoms: query time >100ms p95, working set spilling to disk, IOPS pegged. Options in order of disruption:
- More RAM — keep the working set in memory. Cheapest fix.
- Faster disk — gp3 → io2 / NVMe. Doubles IOPS without a code change.
- Indexes — run `explain()` on slow queries to find collection scans that need an index; `db.runCommand({collStats: 'collection'})` shows how large existing indexes are. See the sketch after this list.
- Replica set + read preference — split reads across secondaries. Requires changing `MONGODB_CONN` to a replica-set URI.
- Sharding — split by org id. Major undertaking; do this only when single-replica capacity is exhausted.
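A sketch of the profiler workflow referenced in the Indexes item, using stock `mongosh` commands; the 100ms threshold mirrors the alert table above:

```bash
# Log queries slower than 100ms, then print the worst recent offenders.
mongosh "$MONGODB_CONN" --eval 'db.setProfilingLevel(1, { slowms: 100 })'
mongosh "$MONGODB_CONN" --eval '
  db.system.profile.find({ millis: { $gt: 100 } }).sort({ millis: -1 }).limit(5)
    .forEach(p => printjson({ op: p.op, ns: p.ns, millis: p.millis }))'
```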
When to scale Redis
Symptoms: command latency >5ms, memory >80%. Options:
- More memory — set `maxmemory` to actual RAM minus headroom.
- Eviction policy — `allkeys-lru` if cache, `noeviction` if queues (the default). Mixed workload — split into two Redis instances so each gets the right policy.
- Cluster mode — split the keyspace. AppEngine supports Redis Cluster URIs.
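Before reaching for any of these, a quick triage with stock `redis-cli`:

```bash
# Memory pressure, hit-rate inputs, and live command latency.
redis-cli info memory | grep -E 'used_memory_human|maxmemory_human|mem_fragmentation_ratio'
redis-cli info stats  | grep -E 'keyspace_hits|keyspace_misses'
redis-cli --latency
```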
When to scale workers
Symptoms: queue depth growing, jobs taking longer than expected, `/monitoring/queues` showing stalled jobs. Run a separate AppEngine deployment with worker-only env:
```yaml
# worker deployment
env:
  - name: ENABLE_SYNC_PROCESSORS
    value: 'true'
  - name: ENABLE_SYNC_CONSUMERS
    value: 'true'
  - name: ENABLE_SYNC_JOBS
    value: 'true'
# API not exposed; readiness probe on a different path or none
```
Turn those flags off on the API replicas; that lets you scale workers independently of the API.
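A sketch of flipping the flags off on a live API deployment (the deployment name is from the scaling section; verify it matches yours):

```bash
# API replicas: serve HTTP only, leave queue processing to the workers.
kubectl set env deployment/appengine \
  ENABLE_SYNC_PROCESSORS=false ENABLE_SYNC_CONSUMERS=false ENABLE_SYNC_JOBS=false
```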
Common runbook entries
Authentication failure spike
Symptoms: alerts firing on failed logins. Steps:
- Check `/monitoring/alerts` for the offending IPs.
- Block at the ingress / firewall (Cloudflare, AWS WAF, etc.).
- If distributed: enable rate limiting more aggressively (`RATE_LIMIT_MAX=30`).
- Check whether the legitimate user pattern is impacted — if so, rotate strategy (CAPTCHA, MFA enforcement).
- Audit logs for any successful logins from suspicious IPs in the same window.
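For the first step, a rough log query; the `event` and `ip` field names are assumptions, so match them to your actual JSON log schema:

```bash
# Top source IPs for failed logins in the last 10 minutes (field names assumed).
kubectl logs deploy/appengine --since=10m \
  | jq -r 'select(.event == "login_failed") | .ip' \
  | sort | uniq -c | sort -rn | head
```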
MongoDB connection lost
Symptoms: 500s on every endpoint, `/monitoring/health` returns `mongodb: down`.
- `kubectl get pods -n appmint` — is Mongo running?
- `kubectl logs mongodb-0` — connection errors? OOM?
- Verify the connection string in the secret (`MONGODB_CONN`).
- If MongoDB is up but AppEngine can't reach it: network policy? DNS?
- Roll the AppEngine pods (`kubectl rollout restart deployment/appengine`) — they reconnect on boot.
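A connectivity probe from inside the cluster; the secret name `appengine-secrets` is an assumption:

```bash
# Pull the URI from the secret, then ping Mongo from a throwaway pod.
URI=$(kubectl get secret appengine-secrets -n appmint -o jsonpath='{.data.MONGODB_CONN}' | base64 -d)
kubectl run mongo-probe -n appmint --rm -it --restart=Never --image=mongo:7 -- \
  mongosh "$URI" --eval 'db.adminCommand({ ping: 1 })'
```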
High memory usage
Symptoms: heap >1.5 GB, container OOM-killed.
- Check `/monitoring/system-metrics` for the trend.
- Take a heap dump: send `SIGUSR1` to start Node's built-in inspector and capture a snapshot from DevTools (`SIGUSR2` only triggers a dump if a heapdump library is wired in), or use `kubectl exec` + the inspector; see the sketch after this list.
- Look for handle leaks: `lsof -p <pid> | wc -l` should be stable.
- Check for runaway queries — a single query returning 1M docs without pagination is the usual culprit.
- As a temporary fix, raise `MEMORY_LIMIT` and add a replica.
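A sketch of the inspector route, assuming node runs as PID 1 in the container:

```bash
# SIGUSR1 makes Node start its inspector; port-forward and attach DevTools
# (chrome://inspect), then take a heap snapshot from the Memory tab.
kubectl exec deploy/appengine -- kill -USR1 1
kubectl port-forward deploy/appengine 9229:9229
```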
Queue backed up
Symptoms: `/monitoring/queues` showing >10k pending.
- Identify the queue: the `bull-board` UI is wired by default.
- Are workers running? `kubectl get pods -l role=worker`.
- Are jobs failing? Look at the failed-jobs list — usually a vendor API outage (Stripe, Twilio).
- Scale workers up: `kubectl scale deployment/appengine-worker --replicas=10`.
- If a specific job is a poison pill: drain it manually (Bull-board → remove job).
External service down
Symptoms: errors mentioning Twilio/Stripe/etc.; `/monitoring/usage` shows external-service errors.
- Check the vendor's status page first.
- AppEngine retries with exponential backoff by default — most transient outages self-heal.
- For prolonged outages, enable a feature flag (`STRIPE_DISABLED=true`) so the UI gracefully degrades.
- Notify customers via the chat module's broadcast mechanism if customer-facing.
Daily, weekly, monthly tasks
Daily
- Skim `/monitoring/health` and `/monitoring/alerts`
- Verify last night's backup completed and was uploaded off-site
- Check error log for new patterns
Weekly
- Review API usage trends — anyone hitting quota limits?
- Validate restore — at least a smoke test against a recent dump in a scratch env
- Check certificate expiries (cert-manager, Let's Encrypt) for next 30 days
- Review security-related logs for anomalies
Monthly
- Full restore drill (timed)
- Vulnerability scan: `npm audit` against your fork; `trivy image` against the deployed image
- Review JWT secret age — rotate if >90 days
- Capacity plan: project growth, decide on scaling moves
- Review and tune monitoring thresholds based on the past 30 days
What good operations looks like
- The on-call engineer can find any error in the logs within 60 seconds
- An AppEngine pod can be killed by hand and traffic doesn't notice
- A MongoDB restore from yesterday's backup completes in under an hour
- New deploys roll out with zero downtime
- Alerts fire before customers complain, not after
If you're not there yet, prioritize in that order: log aggregation first (you can't fix what you can't see), then health checks + autoscaling, then backups, then performance tuning.