Phase 6: Observability

Deploy the observability stack -- Grafana Alloy, Loki, Prometheus, and Grafana -- so that every service has centralized logging, metrics collection, and dashboards accessible at grafana.redactedworld.com.

Prerequisites

  • Phase 1 -- Traefik must be routing traffic so the Grafana IngressRoute can be created.
  • Phase 5 -- Scan services should be running to provide meaningful metrics and logs to observe (though the stack can be deployed earlier for general cluster monitoring).

Blocks

| ID | Block | Description | Acceptance Criteria |
| --- | --- | --- | --- |
| 6.1 | Deploy Grafana Alloy + Loki | Deploy Grafana Alloy as a DaemonSet to collect container logs from every node and ship them to a Loki instance with persistent storage. | Queries equivalent to kubectl logs work in Grafana Explore via the Loki data source; logs from all namespaces are indexed. |
| 6.2 | Deploy Prometheus | Deploy Prometheus via Helm with service discovery for all K8s pods, persistent storage, and a 30-day retention policy. | PromQL queries return metrics for cluster nodes, pods, and custom application metrics; Prometheus UI is accessible internally. |
| 6.3 | Deploy Grafana | Deploy Grafana with persistent storage, configure Loki and Prometheus as data sources, and expose it at grafana.redactedworld.com behind Traefik with Keycloak OAuth2 proxy for SSO. | Grafana loads at the public URL; users log in via Keycloak; both data sources show "connected" in the data source settings. |
| 6.4 | Service metrics instrumentation | Add Prometheus client libraries to every NestJS service (api-gateway, auth, user, org, chat, forum, notification, domain, scan, report), exposing /metrics with standard HTTP, gRPC, and business metrics. | Each service's /metrics endpoint returns valid Prometheus exposition format; Prometheus scrapes all targets successfully. |
| 6.5 | Dashboards (grafana.redactedworld.com) | Create pre-built Grafana dashboards: Cluster Overview, Service Health, Scan Pipeline (active jobs, durations, failure rates), and NATS Throughput. Export dashboards as JSON and store in version control. | All four dashboards render with live data; dashboard JSON files are committed to the repository under grafana/dashboards/. |
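
To make the acceptance criterion for block 6.4 more concrete, the sketch below shows roughly what "valid Prometheus exposition format" looks like for a labeled counter. It is a dependency-free illustration only: a real NestJS service would most likely rely on a Prometheus client library rather than hand-rolled code, and the `Counter` class, the metric name `http_requests_total`, and the label keys are all assumptions made for this example, not names taken from the plan.

```typescript
// Illustrative only: a tiny labeled counter that renders the Prometheus
// text exposition format. All names here are hypothetical.

type Labels = Record<string, string>;

class Counter {
  private readonly name: string;
  private readonly help: string;
  private readonly series = new Map<string, { labels: Labels; value: number }>();

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  // Increment the series identified by this label set (label order is ignored).
  inc(labels: Labels = {}, amount = 1): void {
    if (amount < 0) throw new Error("counters can only increase");
    const sorted = Object.entries(labels).sort(([a], [b]) => a.localeCompare(b));
    const key = JSON.stringify(sorted);
    const entry = this.series.get(key) ?? { labels, value: 0 };
    entry.value += amount;
    this.series.set(key, entry);
  }

  // Render HELP and TYPE headers followed by one sample line per label set.
  render(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    for (const { labels, value } of this.series.values()) {
      lines.push(`${this.name}${formatLabels(labels)} ${value}`);
    }
    return lines.join("\n") + "\n";
  }
}

// Label values escape backslash, double quote, and line feed.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

function formatLabels(labels: Labels): string {
  const pairs = Object.entries(labels).map(
    ([key, value]) => `${key}="${escapeLabelValue(value)}"`,
  );
  return pairs.length > 0 ? `{${pairs.join(",")}}` : "";
}

// Usage: count two requests for one (hypothetical) label set.
const httpRequests = new Counter("http_requests_total", "Total HTTP requests handled.");
httpRequests.inc({ service: "api-gateway", method: "GET", status: "200" });
httpRequests.inc({ service: "api-gateway", method: "GET", status: "200" });
const exposition = httpRequests.render();
```

The rendered text is what a /metrics handler would return as its response body. Keeping label sets small and bounded (service, method, status) is a deliberate choice in this sketch, since every distinct label combination becomes its own time series in Prometheus.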

Estimated Scope

| Area | Files / Resources |
| --- | --- |
| Helm charts | k8s/grafana-alloy/, k8s/loki/, k8s/prometheus/, k8s/grafana/ |
| Service instrumentation | services/*/src/metrics/ (Prometheus client setup in each NestJS service) |
| Grafana dashboards | grafana/dashboards/cluster-overview.json, service-health.json, scan-pipeline.json, nats-throughput.json |
| Grafana provisioning | grafana/provisioning/datasources/, grafana/provisioning/dashboards/ |
| Keycloak | OAuth2 client registration for Grafana SSO |
| Kubernetes | IngressRoute for grafana.redactedworld.com |
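
Since the four dashboard exports are expected to live in version control, a small pre-commit or CI check could catch a missing or malformed file early. The following is a minimal sketch under stated assumptions: `checkDashboard` and `missingDashboards` are hypothetical helpers, and the rule that a usable export has a non-empty `title`, a `uid`, and at least one entry in `panels` is this example's own definition of "valid", not a requirement stated in the plan. File contents are passed in as strings so the sketch stays independent of how files are read.

```typescript
// Illustrative only: sanity checks for exported Grafana dashboard JSON.

interface DashboardCheck {
  file: string;
  problems: string[];
}

// File names taken from the scope table; the check logic itself is an assumption.
const EXPECTED_DASHBOARDS = [
  "cluster-overview.json",
  "service-health.json",
  "scan-pipeline.json",
  "nats-throughput.json",
];

// Parse one export and list anything that looks wrong with it.
function checkDashboard(file: string, rawJson: string): DashboardCheck {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return { file, problems: ["not valid JSON"] };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { file, problems: ["top level is not a JSON object"] };
  }

  const dashboard = parsed as Record<string, unknown>;
  const title = dashboard["title"];
  const uid = dashboard["uid"];
  const panels = dashboard["panels"];

  const problems: string[] = [];
  if (typeof title !== "string" || title.trim() === "") problems.push("missing title");
  if (typeof uid !== "string" || uid === "") problems.push("missing uid");
  if (!Array.isArray(panels) || panels.length === 0) problems.push("no panels");
  return { file, problems };
}

// Report which of the expected dashboard files are absent from a directory listing.
function missingDashboards(presentFiles: string[]): string[] {
  return EXPECTED_DASHBOARDS.filter((name) => !presentFiles.includes(name));
}

// Usage with an inline, made-up export.
const sample = JSON.stringify({
  title: "Scan Pipeline",
  uid: "scan-pipeline",
  panels: [{ type: "timeseries", title: "Active jobs" }],
});
const result = checkDashboard("scan-pipeline.json", sample);
const stillMissing = missingDashboards(["scan-pipeline.json"]);
```

In a real pipeline the directory listing and file contents would come from grafana/dashboards/, and a non-empty `problems` or `stillMissing` list would fail the build.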