Monitoring

Health checks, queue stats, system metrics, alerts, user activity, and operational dashboards.

The Monitoring module exposes operational visibility — health checks, system metrics, queue depth, alerts, user activity, and platform-wide rollups for company creation and domain mappings. Endpoints sit at /monitoring/* and are unauthenticated by design so external uptime checks and dashboards don't need to manage credentials.

Health checks

GET /monitoring/health (No auth)

The canonical liveness probe — what Kubernetes hits, what your status page polls. Runs every component check (DB connectivity, Redis, queue, vendor connectivity for the configured integrations) and returns aggregate status:

{
  "status": "ok",
  "info": {
    "mongodb": { "status": "up" },
    "redis": { "status": "up" },
    "queue": { "status": "up" },
    "stripe": { "status": "up" }
  },
  "error": {},
  "details": { ... }
}

A 503 response means at least one component is down; the response body identifies which one.

This endpoint is @PublicRoute() and rate-limit-skipped — high-frequency probes don't get throttled.
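As a rough illustration of how a probe might interpret the response, the sketch below takes the HTTP status code and the parsed JSON body and reports which components are not "up". The body shape follows the sample above. The function name `evaluateHealth` and the assumption that failing components appear under "error" (or "details") with a status other than "up" are illustrative, not confirmed by this page.

```typescript
// Component entry as shown in the sample response, e.g. { "status": "up" }.
type ComponentStatus = { status: string; [key: string]: unknown };

// Body shape modelled on the sample /monitoring/health response.
interface HealthResponse {
  status: string;
  info?: Record<string, ComponentStatus>;
  error?: Record<string, ComponentStatus>;
  details?: Record<string, ComponentStatus>;
}

interface HealthVerdict {
  healthy: boolean;
  failing: string[];
}

// Hypothetical helper: decides health from the status code and body.
// Assumption: a failing component is any entry whose status is not "up".
function evaluateHealth(httpStatus: number, body: HealthResponse): HealthVerdict {
  // Later spreads win, so an entry under "error" overrides the same name elsewhere.
  const merged: Record<string, ComponentStatus> = {
    ...body.details,
    ...body.info,
    ...body.error,
  };
  const failing = Object.keys(merged)
    .filter((name) => merged[name].status !== "up")
    .sort();
  return {
    healthy: httpStatus === 200 && body.status === "ok" && failing.length === 0,
    failing,
  };
}
```

Feed it the status code and parsed body from whichever HTTP client the probe uses; a 503 with `failing` set to `["redis"]`, for example, points straight at the degraded dependency.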

System overview

GET /monitoring/overview (No auth)
GET /monitoring/system-metrics (No auth)

overview is the aggregate dashboard payload: it combines health, queue stats, system metrics, and recent alerts in one round-trip. system-metrics returns CPU, memory, event-loop lag, request count, and error rate per process. Both endpoints power the operations dashboard at /monitoring/ui/.

Historical metrics

GET /monitoring/historical (No auth)

Accepts a range query parameter: ?range=1h|24h|7d|30d. Returns time-series data for the same metrics (request volume, error rate, queue depth, response time) over the chosen window. Used for trend charts.
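A minimal sketch of building the request URL on the client, validating the range against the four documented values before any call is made. The helper name `historicalUrl` and the `baseUrl` argument are placeholders introduced for illustration.

```typescript
// The four range values documented for /monitoring/historical.
const RANGES = ["1h", "24h", "7d", "30d"] as const;
type HistoricalRange = (typeof RANGES)[number];

// Type guard: true only for one of the documented range values.
function isHistoricalRange(value: string): value is HistoricalRange {
  return (RANGES as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical helper: builds the URL, rejecting unsupported ranges early.
function historicalUrl(baseUrl: string, range: string): string {
  if (!isHistoricalRange(range)) {
    throw new RangeError(
      `Unsupported range "${range}"; expected one of ${RANGES.join(", ")}`
    );
  }
  // Strip trailing slashes so the path is not doubled up.
  const root = baseUrl.replace(/\/+$/, "");
  return `${root}/monitoring/historical?range=${range}`;
}
```

Rejecting an unknown range on the client keeps a typo such as `2h` from turning into a confusing server-side response.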

Queue monitoring

GET /monitoring/queues (No auth)
GET /monitoring/queues/:queueName (No auth)

queues returns a row per queue: depth, in-flight, completed, failed, throughput. The Sync module's queues (datatype, one-off, schedule, social-sync, notification, escalation, billing) all show up here. :queueName returns detailed per-queue stats including the most recent failed jobs and their error messages — useful for debugging without tailing application logs.
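To show how the per-queue rows might be consumed, the sketch below flags queues that are backed up or have failed jobs. The JSON key names (`name`, `depth`, `inFlight`, `completed`, `failed`, `throughput`) are assumptions derived from the column list above; the actual payload may use different names.

```typescript
// Assumed row shape; key names are illustrative, not confirmed by the API.
interface QueueRow {
  name: string;
  depth: number;
  inFlight: number;
  completed: number;
  failed: number;
  throughput: number;
}

// Hypothetical helper: returns the names of queues whose depth exceeds
// maxDepth or that report at least one failed job.
function queuesNeedingAttention(rows: QueueRow[], maxDepth: number): string[] {
  return rows
    .filter((row) => row.depth > maxDepth || row.failed > 0)
    .map((row) => row.name);
}
```

Any name it returns is a natural candidate for a follow-up call to /monitoring/queues/:queueName to read the recent failed jobs and their error messages.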

Alerts and notifications

GET /monitoring/alerts (No auth)
GET /monitoring/alert-notifications (No auth)

alerts returns recent in-process alerts (queue depth high, integration expired, AI provider rate-limited). alert-notifications returns the broader cross-org notification feed — what was sent through the broadcast module to which audiences across shared_org and root_org.

User activity

GET /monitoring/user-activity (No auth)

Live user-activity feed: signed-in users right now, recent logins, top routes by traffic. Used by the operations dashboard to show "who's using the platform right now".

Platform-wide metrics (root-org)

These are root-org rollups — visibility across all orgs on the platform, not per-org. They're public for the operator dashboard but most data is anonymised (counts, not names).

GET /monitoring/company-creation (No auth)
GET /monitoring/domain-mappings (No auth)
GET /monitoring/usage (No auth)
GET /monitoring/web-activity (No auth)

company-creation tracks new org signups per day. domain-mappings tracks how many domains have been mapped to AppEngine sites. usage rolls up platform-wide usage and cost (see Usage and pricing). web-activity aggregates page views and visitor counts.

Operational dashboard

A bundled dashboard ships at /monitoring/ui/index.html (served from src/monitoring/ui/). It calls the endpoints above and renders the live state. Open it in any browser to inspect the platform without setting up Grafana — useful for support, on-call, and small-team operations.

For larger operations, the JSON endpoints feed any standard tool: Datadog, New Relic, Grafana with a JSON datasource, custom internal dashboards.

Alerts and notifications wiring

The Monitoring module emits alerts via the Sync notification processor. To configure where alerts land:

  • Email — set monitoring.alertEmail in the org config.
  • Slack — connect Slack via the integrations module and set monitoring.alertSlackChannel.
  • PagerDuty — connect via webhook URL.

Critical alerts (DB down, queue stuck > 30 minutes, AI provider error rate > 50%) page the on-call rotation. Warning alerts (slow query, queue depth > threshold) email but don't page.
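The severity rules above can be sketched as a small classifier. This is an illustration of the stated thresholds only, not the module's actual implementation; the `AlertSignals` shape, the function names, and the caller-supplied `queueDepthThreshold` are all hypothetical.

```typescript
type Severity = "critical" | "warning" | "none";

// Hypothetical snapshot of the signals named in the text.
interface AlertSignals {
  dbUp: boolean;
  queueStuckMinutes: number;   // minutes the queue has been stuck
  aiProviderErrorRate: number; // fraction between 0 and 1
  slowQuery: boolean;
  queueDepth: number;
}

// Critical: DB down, queue stuck > 30 minutes, or AI provider error rate > 50%.
// Warning: slow query, or queue depth above the configured threshold.
function classifyAlert(signals: AlertSignals, queueDepthThreshold: number): Severity {
  if (
    !signals.dbUp ||
    signals.queueStuckMinutes > 30 ||
    signals.aiProviderErrorRate > 0.5
  ) {
    return "critical";
  }
  if (signals.slowQuery || signals.queueDepth > queueDepthThreshold) {
    return "warning";
  }
  return "none";
}

// Critical alerts page the on-call rotation; warnings email but do not page.
function deliveryFor(severity: Severity): "page" | "email" | null {
  if (severity === "critical") return "page";
  if (severity === "warning") return "email";
  return null;
}
```

Checking the critical conditions first means a signal set that satisfies both tiers, such as a down database alongside a deep queue, is always classified as critical.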

Logs

The Monitoring module surfaces summary metrics; for raw logs, use the platform's logging pipeline (Winston by default, configurable to Loki/CloudWatch/etc.). Each request log line includes orgid, userId (or customerId), endpoint, duration, and status so log-side filtering matches the metric breakdown.
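As an example of log-side filtering that lines up with the metric breakdown, the sketch below computes a per-endpoint error rate for one org from request log lines. It assumes the logs are emitted as one JSON object per line with the field names listed above, and that a status of 500 or higher counts as an error; both are assumptions, as is the helper name `errorRateByEndpoint`.

```typescript
// Assumed log entry shape, using the field names as spelled in the text.
interface RequestLog {
  orgid: string;
  userId?: string;
  customerId?: string;
  endpoint: string;
  duration: number;
  status: number;
}

// Hypothetical helper: per-endpoint error rate (0..1) for a single org.
// Assumption: each line is a JSON object and status >= 500 is an error.
function errorRateByEndpoint(lines: string[], orgid: string): Record<string, number> {
  const totals: Record<string, { count: number; errors: number }> = {};
  lines.forEach((line) => {
    const entry = JSON.parse(line) as RequestLog;
    if (entry.orgid !== orgid) return; // skip other orgs
    const bucket = totals[entry.endpoint] || { count: 0, errors: 0 };
    bucket.count += 1;
    if (entry.status >= 500) bucket.errors += 1;
    totals[entry.endpoint] = bucket;
  });
  const rates: Record<string, number> = {};
  Object.keys(totals).forEach((endpoint) => {
    rates[endpoint] = totals[endpoint].errors / totals[endpoint].count;
  });
  return rates;
}
```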

Why no auth?

Health and metric endpoints are deliberately unauthenticated. This is a tradeoff:

  • Pro: external probes, status pages, and ops tools don't need credentials, eliminating one source of "alerts firing because the auth token expired".
  • Con: anyone can read metric data. The data is non-PII (queue depth, request counts) and considered safe to expose.

If your org's policy requires authenticated metrics, put a reverse proxy (Cloudflare Access, an authenticated Nginx) in front of /monitoring/* and gate access at the proxy.

What this module is not

  • Not the per-org analytics surface — that's the Analytics module at /analytics/*.
  • Not the audit log — that's the Activities and audit tracking layer.
  • Not application-level error tracking — pair with Sentry or an APM for stack-trace-level errors.

For per-org dashboards (showing one customer's usage, queue activity, and integration health), the org-management module exposes scoped endpoints. The Monitoring module is platform-operator-facing.