A self-hosted AppEngine in production needs the same operational machinery as any other Node-plus-database service. This page is the runbook companion — what to monitor, what to alert on, when to scale. The source for the policy below is appengine/PRODUCTION_OPERATIONS_CONTROL_GUIDE.md.
What to monitor
Health endpoints
AppEngine exposes a small monitoring API:
| Endpoint | Returns |
|---|---|
| `GET /monitoring/health` | Overall: status, mongo state, redis state |
| `GET /monitoring/overview` | System overview dashboard data |
| `GET /monitoring/system-metrics` | CPU, memory, event-loop lag |
| `GET /monitoring/queues` | BullMQ queue depths, processing rates |
| `GET /monitoring/alerts` | Recent alert history |
| `GET /monitoring/user-activity` | Activity counters |
| `GET /monitoring/usage` | Per-org usage stats |
Wire these to whatever you use — Prometheus scraper, Datadog HTTP check, simple uptime ping. The default is to alert on any `/monitoring/health` response that isn't `200 {status: "ok"}`.
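A minimal probe sketch; the host and port (`appengine:3000`) are assumptions to adjust for your deployment:

```bash
# Hypothetical host/port; the expected {status: "ok"} shape is documented above.
STATUS=$(curl -fsS --max-time 5 http://appengine:3000/monitoring/health | jq -r '.status')
[ "$STATUS" = "ok" ] || echo "ALERT: /monitoring/health returned status=$STATUS"
```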
Key metrics
| Metric | Source | Alert threshold |
|---|---|---|
| API response time p95 | response-time interceptor | >500ms sustained 5min |
| Error rate | logs | >1% over 5min |
| Event loop lag | system-metrics | >100ms |
| Heap usage | system-metrics | >1.5 GB |
| MongoDB query time p95 | mongo profiler | >100ms |
| Redis hit rate | redis info | <70% |
| Queue depth | /monitoring/queues | >10000 jobs |
| Failed jobs | /monitoring/queues | >100/5min |
| Disk free | host | <10% |
| Cert expiry | TLS endpoint | <14 days |
Auth-specific alerts
Authentication is the most attacked surface. Track:
- Failed login attempts per IP — alert >10/min
- Token refresh frequency per user — alert >5/min
- JWT validation failures — alert >20/min
- API key auth failures — alert >50/5min
The default rate limit is 100 req/min/IP (RATE_LIMIT_MAX). Tune based on your patterns.
Logging
AppEngine writes:
- `stdout` — structured logs (JSON when `NODE_ENV=production`)
- `./error.log` — fatal/error level
- `./logs/combined.log` — everything
In Kubernetes, scrape stdout via your cluster log collector (Fluent Bit → Loki, or the cloud's native — CloudWatch, Stackdriver). On a single Docker host, use Docker's log driver (json-file, gelf, syslog, journald).
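On a single host, bounding the json-file driver keeps logs from filling the disk; a sketch (the image name is a placeholder for your build):

```bash
# Rotate at 10 MB, keep 5 files; image name is a placeholder.
docker run -d --name appengine \
  --log-driver json-file \
  --log-opt max-size=10m --log-opt max-file=5 \
  your-registry/appengine:latest
```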
Critical log points the source guide calls out:
- `/src/users/auth/jwt.auth.guard.ts` — JWT validation failures
- `/src/users/auth/apikey.auth.guard.ts` — API key auth
- `/src/middlewares/current-user.middleware.ts` — user context errors
- Authentication events (every login attempt, token op)
- Permission violations
- Database connection failures
- External-service failures (timeouts, rate limits)
- Business-logic errors (validation, workflow)
Backups
MongoDB
Daily full + hourly oplog tail is the production standard.
Full daily (Kubernetes CronJob shown earlier in Kubernetes setup; equivalent on a single box):
```bash
mongodump \
  --uri "$MONGODB_CONN" \
  --gzip \
  --archive=/backup/mongo-$(date +%Y%m%d-%H%M%S).gz
```
Retain 14 days local, 90+ days off-site (S3 or equivalent with object versioning).
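A sketch of the off-site leg, assuming an S3 bucket named `appmint-backups` (a placeholder) and the dump path from the command above:

```bash
# Ship the newest dump off-site, then prune local copies past the 14-day window.
LATEST=$(ls -t /backup/mongo-*.gz | head -1)
aws s3 cp "$LATEST" s3://appmint-backups/mongo/
find /backup -name 'mongo-*.gz' -mtime +14 -delete
```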
Hourly oplog for point-in-time recovery to within ~1 hour:
```bash
mongodump \
  --uri "$MONGODB_CONN" \
  --db local --collection oplog.rs \
  --query "{\"ts\":{\"\$gt\":{\"\$timestamp\":{\"t\":$(date -d '1 hour ago' +%s),\"i\":1}}}}" \
  --out /backup/oplog-$(date +%Y%m%d-%H)
```
Redis
Redis persistence is configured via --appendonly yes (the manifests in earlier pages set this). The AOF file gives you replay-from-disk on restart. For backups proper:
```bash
redis-cli --rdb /backup/redis-$(date +%Y%m%d).rdb
```
Redis holds ephemeral state (sessions, queues), and losing recent Redis state in a disaster is tolerable; Mongo is the system of record. If backing up Redis adds meaningful cost or complexity, skip it.
Object storage
Whatever you use (S3, Spaces, R2, MinIO) — turn on versioning + lifecycle rules. The blast radius of bad code deleting files is too large to operate without it.
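With the AWS CLI that looks roughly like this (bucket name is a placeholder; other stores have equivalents):

```bash
# Versioning plus a rule that expires old object versions after 90 days.
aws s3api put-bucket-versioning --bucket appmint-backups \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-lifecycle-configuration --bucket appmint-backups \
  --lifecycle-configuration '{"Rules":[{"ID":"expire-old-versions","Status":"Enabled","Filter":{},"NoncurrentVersionExpiration":{"NoncurrentDays":90}}]}'
```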
Test restoration
A backup you've never restored isn't a backup. Quarterly minimum:
- Spin up a staging AppEngine pointing at an empty Mongo
- Restore last night's dump (`mongorestore --gzip --archive=...`)
- Boot AppEngine, verify root login, verify a known record exists
- Time the whole thing — that's your RTO
If RTO is unacceptable (>4h for a 100GB DB), look into MongoDB Atlas managed backups or a hot-standby replica.
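The timing step can be a two-liner; a sketch with placeholder names (`$STAGING_MONGODB_CONN`, the archive path):

```bash
# Time the restore end-to-end; the wall-clock result is your RTO floor.
START=$(date +%s)
mongorestore --uri "$STAGING_MONGODB_CONN" --gzip --archive=/backup/mongo-latest.gz --drop
echo "Restore took $(( $(date +%s) - START ))s"
```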
Scaling
When to scale AppEngine pods
Symptoms: response time creeping up, event-loop lag >100ms, CPU >70% sustained. AppEngine is mostly stateless — adding replicas is safe. Start at 3 replicas; add 1-2 more before peak periods.
```bash
kubectl scale deployment/appengine --replicas=5
```
If an autoscaler is in place, raise `maxReplicas` and let it grow automatically.
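If there's no HPA yet, a CPU-based one is a reasonable starting point (the thresholds here are suggestions, not AppEngine defaults):

```bash
# Scale between 3 and 10 replicas, targeting 70% average CPU.
kubectl autoscale deployment/appengine --min=3 --max=10 --cpu-percent=70
```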
When to scale MongoDB
Symptoms: query time >100ms p95, working set spilling to disk, IOPS pegged. Options in order of disruption:
- More RAM — keep the working set in memory. Cheapest fix.
- Faster disk — gp3 → io2 / NVMe. Doubles IOPS without a code change.
- Indexes — run `explain()` on slow queries to find collection scans that need an index; `db.runCommand({collStats: 'collection'})` shows how large existing indexes are. See the sketch after this list.
- Replica set + read preference — split reads across secondaries. Requires changing `MONGODB_CONN` to a replica-set URI.
- Sharding — split by org id. Major undertaking; do this only when single-replica capacity is exhausted.
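A sketch of the profiler workflow referenced in the Indexes item, using stock `mongosh` commands; the 100ms threshold mirrors the alert table above:

```bash
# Log queries slower than 100ms, then print the worst recent offenders.
mongosh "$MONGODB_CONN" --eval 'db.setProfilingLevel(1, { slowms: 100 })'
mongosh "$MONGODB_CONN" --eval '
  db.system.profile.find({ millis: { $gt: 100 } }).sort({ millis: -1 }).limit(5)
    .forEach(p => printjson({ op: p.op, ns: p.ns, millis: p.millis }))'
```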
When to scale Redis
Symptoms: command latency >5ms, memory >80%. Options:
- More memory — set `maxmemory` to actual RAM minus headroom.
- Eviction policy — `allkeys-lru` if cache, `noeviction` if queues (the default). Mixed workload — split into two Redis instances so each gets the right policy.
- Cluster mode — split the keyspace. AppEngine supports Redis Cluster URIs.
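Before reaching for any of these, a quick triage with stock `redis-cli`:

```bash
# Memory pressure, hit-rate inputs, and live command latency.
redis-cli info memory | grep -E 'used_memory_human|maxmemory_human|mem_fragmentation_ratio'
redis-cli info stats  | grep -E 'keyspace_hits|keyspace_misses'
redis-cli --latency
```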
When to scale workers
Symptoms: queue depth growing, jobs taking longer than expected, `/monitoring/queues` showing stalled jobs. Run a separate AppEngine deployment with worker-only env:
```yaml
# worker deployment
env:
  - name: ENABLE_SYNC_PROCESSORS
    value: 'true'
  - name: ENABLE_SYNC_CONSUMERS
    value: 'true'
  - name: ENABLE_SYNC_JOBS
    value: 'true'
# API not exposed; readiness probe on a different path or none
```
Turn those flags off on the API replicas; that lets you scale workers independently of the API.
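A sketch of flipping the flags off on a live API deployment (the deployment name is from the scaling section; verify it matches yours):

```bash
# API replicas: serve HTTP only, leave queue processing to the workers.
kubectl set env deployment/appengine \
  ENABLE_SYNC_PROCESSORS=false ENABLE_SYNC_CONSUMERS=false ENABLE_SYNC_JOBS=false
```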
Common runbook entries
Authentication failure spike
Symptoms: alerts firing on failed logins. Steps:
- Check `/monitoring/alerts` for the offending IPs.
- Block at the ingress / firewall (Cloudflare, AWS WAF, etc.).
- If distributed: enable rate limiting more aggressively (`RATE_LIMIT_MAX=30`).
- Check whether the legitimate user pattern is impacted — if so, rotate strategy (CAPTCHA, MFA enforcement).
- Audit logs for any successful logins from suspicious IPs in the same window.
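For the first step, a rough log query; the `event` and `ip` field names are assumptions, so match them to your actual JSON log schema:

```bash
# Top source IPs for failed logins in the last 10 minutes (field names assumed).
kubectl logs deploy/appengine --since=10m \
  | jq -r 'select(.event == "login_failed") | .ip' \
  | sort | uniq -c | sort -rn | head
```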
MongoDB connection lost
Symptoms: 500s on every endpoint, `/monitoring/health` returns `mongodb: down`.
- `kubectl get pods -n appmint` — is Mongo running?
- `kubectl logs mongodb-0` — connection errors? OOM?
- Verify the connection string in the secret (`MONGODB_CONN`).
- If MongoDB is up but AppEngine can't reach it: network policy? DNS?
- Roll the AppEngine pods (`kubectl rollout restart deployment/appengine`) — they reconnect on boot.
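A connectivity probe from inside the cluster; the secret name `appengine-secrets` is an assumption:

```bash
# Pull the URI from the secret, then ping Mongo from a throwaway pod.
URI=$(kubectl get secret appengine-secrets -n appmint -o jsonpath='{.data.MONGODB_CONN}' | base64 -d)
kubectl run mongo-probe -n appmint --rm -it --restart=Never --image=mongo:7 -- \
  mongosh "$URI" --eval 'db.adminCommand({ ping: 1 })'
```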
High memory usage
Symptoms: heap >1.5 GB, container OOM-killed.
- Check `/monitoring/system-metrics` for the trend.
- Take a heap dump: send `SIGUSR1` to start Node's built-in inspector and capture a snapshot from DevTools (`SIGUSR2` only triggers a dump if a heapdump library is wired in), or use `kubectl exec` + the inspector; see the sketch after this list.
- Look for handle leaks: `lsof -p <pid> | wc -l` should be stable.
- Check for runaway queries — a single query returning 1M docs without pagination is the usual culprit.
- As a temporary fix, raise `MEMORY_LIMIT` and add a replica.
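A sketch of the inspector route, assuming node runs as PID 1 in the container:

```bash
# SIGUSR1 makes Node start its inspector; port-forward and attach DevTools
# (chrome://inspect), then take a heap snapshot from the Memory tab.
kubectl exec deploy/appengine -- kill -USR1 1
kubectl port-forward deploy/appengine 9229:9229
```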
Queue backed up
Symptoms: `/monitoring/queues` showing >10k pending.
- Identify the queue: the `bull-board` UI is wired by default.
- Are workers running? `kubectl get pods -l role=worker`.
- Are jobs failing? Look at the failed-jobs list — usually a vendor API outage (Stripe, Twilio).
- Scale workers up: `kubectl scale deployment/appengine-worker --replicas=10`.
- If a specific job is a poison pill: drain it manually (Bull-board → remove job).
External service down
Symptoms: errors mentioning Twilio/Stripe/etc.; `/monitoring/usage` shows external-service errors.
- Check the vendor's status page first.
- AppEngine retries with exponential backoff by default — most transient outages self-heal.
- For prolonged outages, enable a feature flag (`STRIPE_DISABLED=true`) so the UI gracefully degrades.
- Notify customers via the chat module's broadcast mechanism if customer-facing.
Daily, weekly, monthly tasks
Daily
- Skim `/monitoring/health` and `/monitoring/alerts`
- Verify last night's backup completed and was uploaded off-site
- Check error log for new patterns
Weekly
- Review API usage trends — anyone hitting quota limits?
- Validate restore — at least a smoke test against a recent dump in a scratch env
- Check certificate expiries (cert-manager, Let's Encrypt) for next 30 days
- Review security-related logs for anomalies
Monthly
- Full restore drill (timed)
- Vulnerability scan: `npm audit` against your fork; `trivy image` against the deployed image
- Review JWT secret age — rotate if >90 days
- Capacity plan: project growth, decide on scaling moves
- Review and tune monitoring thresholds based on the past 30 days
What good operations looks like
- The on-call engineer can find any error in the logs within 60 seconds
- An AppEngine pod can be killed by hand and traffic doesn't notice
- A MongoDB restore from yesterday's backup completes in under an hour
- New deploys roll out with zero downtime
- Alerts fire before customers complain, not after
If you're not there yet, prioritize in that order: log aggregation first (you can't fix what you can't see), then health checks + autoscaling, then backups, then performance tuning.