Deployment Healthcheck Runbook

Source: docs/operations/DEPLOYMENT_HEALTHCHECK_RUNBOOK.md

Purpose

This runbook is the operator path for validating that the deployed Earthbond stack is:

  1. running,
  2. healthy,
  3. reachable through the intended local and WAN entrypoints,
  4. safe to keep online after configuration changes.

The healthcheck script is scripts/ops/check_stack_health.sh, run from the project root.

What the script checks

Container state

The script inspects:

  1. earthbond-postgres
  2. earthbond-minio
  3. earthbond-control-plane
  4. earthbond-data-plane
  5. earthbond-api-gateway
  6. earthbond-admin-web
  7. earthbond-client-web
  8. earthbond-edge-web
  9. earthbond-public-proxy
  10. earthbond-worker-ingest
  11. earthbond-worker-crs
  12. earthbond-worker-audit

It records:

  1. Docker runtime status
  2. healthcheck status where available
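
The two recorded values can be read with docker inspect. The following is a minimal sketch, not the script's actual code: the function name container_state and the "missing" and "none" placeholder values are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print "<name> <runtime status> <health status>" for one container.
# Assumes the docker CLI is on PATH. container_state is a hypothetical helper.

container_state() {
  local name="$1" status health
  # .State.Status is e.g. "running" or "exited"; fall back to "missing"
  # when docker inspect fails (for example, the container does not exist).
  status="$(docker inspect --format '{{.State.Status}}' "$name" 2>/dev/null)" || status="missing"
  # .State.Health is only present when a healthcheck is defined,
  # so report "none" when there is nothing to read.
  health="$(docker inspect --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' "$name" 2>/dev/null)" || health="none"
  printf '%s %s %s\n' "$name" "$status" "$health"
}
```

Calling container_state earthbond-api-gateway prints one line with the container name, its runtime status, and its health status, which is convenient for looping over the twelve containers listed above.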

Local endpoint checks

The script validates:

  1. http://127.0.0.1:8080/healthz
  2. http://127.0.0.1:8081/healthz
  3. http://127.0.0.1:8082/healthz
  4. http://127.0.0.1:3101/healthz
  5. http://127.0.0.1:3100/healthz
  6. http://127.0.0.1:19000/minio/health/live
  7. https://127.0.0.1/
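
An endpoint probe of this kind can be sketched with curl. This is an illustration under assumptions, not the script's implementation: check_endpoints is a hypothetical name, the 5-second timeout is arbitrary, and -k is included on the assumption that the local HTTPS entrypoint may present a certificate that does not validate for 127.0.0.1.

```shell
#!/usr/bin/env bash
# Sketch: probe each URL given as an argument and report ok/FAIL per URL.
# check_endpoints is a hypothetical helper; the timeout and -k are assumptions.

check_endpoints() {
  local url failed=0
  for url in "$@"; do
    # -f: treat HTTP status >= 400 as failure
    # -sS: no progress output, but still print errors
    # -k: skip certificate verification (assumed necessary for https://127.0.0.1/)
    # --max-time: bound each probe so a hung service cannot stall the run
    if curl -fsSk -o /dev/null --max-time 5 "$url"; then
      echo "ok   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  # Non-zero when at least one endpoint failed.
  return "$failed"
}
```

Passing the seven URLs above as arguments probes them in order; the function returns non-zero if any probe fails, so it can gate a release step directly.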

WAN endpoint check

The script also checks:

  1. ${PUBLIC_BASE_URL}/

This confirms the public web entrypoint is still serving.
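
A hedged sketch of the WAN probe follows. The helper names wan_url and check_wan, the 10-second timeout, and the trailing-slash normalization are assumptions; the actual script may build the URL differently.

```shell
#!/usr/bin/env bash
# Sketch: build ${PUBLIC_BASE_URL}/ and probe it once.
# wan_url and check_wan are hypothetical helpers.

wan_url() {
  if [ -z "${PUBLIC_BASE_URL:-}" ]; then
    echo "PUBLIC_BASE_URL is not set" >&2
    return 1
  fi
  # Strip one trailing slash (if any) so the result ends in exactly one "/".
  printf '%s/\n' "${PUBLIC_BASE_URL%/}"
}

check_wan() {
  local url
  url="$(wan_url)" || return 1
  # Certificate verification stays on here: the public entrypoint
  # is expected to present a valid certificate.
  curl -fsS -o /dev/null --max-time 10 "$url"
}
```

Failing early on an unset PUBLIC_BASE_URL avoids probing a bare "/" and reporting a misleading connection error.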

Standard run


cd "/Users/robertwilhelm/Documents/New project"
./scripts/ops/check_stack_health.sh

Safe remediation mode


cd "/Users/robertwilhelm/Documents/New project"
./scripts/ops/check_stack_health.sh --attempt-fix

Remediation mode restarts only non-stateful services, and only when they are:

  1. not running, or
  2. marked unhealthy

It does not attempt automatic repair for:

  1. PostgreSQL
  2. MinIO

That is intentional. Stateful systems should not be restarted automatically as a first reaction.
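
The restart rule described above can be expressed as a small predicate. This is a sketch of the decision only, assuming the status and health strings reported by Docker ("running", "unhealthy"); should_restart is a hypothetical name and the real script may structure this differently.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a container qualifies for automatic restart.
# Arguments: container name, Docker runtime status, Docker health status.
# should_restart is a hypothetical helper.

should_restart() {
  local name="$1" status="$2" health="$3"
  case "$name" in
    # Stateful services are never restarted automatically.
    earthbond-postgres|earthbond-minio) return 1 ;;
  esac
  # Restart when the container is not running, or is running but unhealthy.
  [ "$status" != "running" ] || [ "$health" = "unhealthy" ]
}
```

A caller could then wrap the restart itself, for example `if should_restart "$name" "$status" "$health"; then docker restart "$name"; fi`. Whether the script uses docker restart or a compose-level restart is not stated here, so that call is an assumption.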

Interpretation

Good result

The script exits zero and prints:

All health checks passed.

Bad result

The script exits non-zero and prints:

  1. failed container states
  2. failed endpoints

Treat any of these as release blockers:

  1. earthbond-data-plane unhealthy
  2. earthbond-api-gateway unhealthy
  3. public proxy endpoint failure
  4. MinIO health failure
  5. PostgreSQL container not running

After a compose or secret change

Run in this order:


docker compose up -d --build
./scripts/ops/check_stack_health.sh

If failures remain:


./scripts/ops/check_stack_health.sh --attempt-fix

If failures still remain after remediation:

  1. inspect docker compose ps
  2. inspect docker compose logs --tail=200 <service>
  3. stop and correct configuration before reopening public access
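
The sequence above can be wrapped in one function for repeatable use. This is a sketch under assumptions: post_change_check and the CHECK variable are hypothetical names, and the plain check is re-run after remediation because the exit status of --attempt-fix is not documented here.

```shell
#!/usr/bin/env bash
# Sketch: rebuild, check, attempt one safe remediation pass, then re-check.
# post_change_check and CHECK are hypothetical names.

CHECK="${CHECK:-./scripts/ops/check_stack_health.sh}"

post_change_check() {
  docker compose up -d --build || return 1

  # First pass: a clean result ends the procedure.
  "$CHECK" && return 0

  echo "Health check failed; attempting safe remediation" >&2
  # The exit status of --attempt-fix is ignored (assumption: it may be
  # non-zero even when the restart succeeded); the re-check below decides.
  "$CHECK" --attempt-fix || true
  "$CHECK" && return 0

  echo "Failures remain; inspect service state and logs before reopening public access" >&2
  docker compose ps
  return 1
}
```

The function deliberately stops after one remediation pass. Repeated automatic restarts tend to hide a configuration error that needs manual correction.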

Security expectations

The deployment is only acceptable when:

  1. .env contains non-default secrets
  2. ADMIN_GATE_PASSWORD_HASH is set explicitly
  3. AUTH_TOKEN_SECRET is set explicitly
  4. POSTGRES_PASSWORD and MINIO_ROOT_PASSWORD are not defaults
  5. S3_PUBLIC_ENDPOINT points at the intended reachable object endpoint
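
Part of this list can be pre-checked mechanically. The sketch below only verifies that each key is present and not an obvious default; the helper names and the list of default-looking values are assumptions for illustration (minioadmin is MinIO's documented default, the others are common placeholders). It does not confirm that S3_PUBLIC_ENDPOINT is actually reachable, and it does not strip quotes around values.

```shell
#!/usr/bin/env bash
# Sketch: flag required .env keys that are missing or look like defaults.
# env_value and check_env_secrets are hypothetical helpers; the list of
# default-looking values is an illustrative assumption, not project data.

env_value() {
  # Print the value of KEY from FILE; the last assignment wins.
  local file="$1" key="$2"
  sed -n "s/^${key}=//p" "$file" | tail -n 1
}

check_env_secrets() {
  local file="${1:-.env}" key value failed=0
  for key in ADMIN_GATE_PASSWORD_HASH AUTH_TOKEN_SECRET \
             POSTGRES_PASSWORD MINIO_ROOT_PASSWORD S3_PUBLIC_ENDPOINT; do
    value="$(env_value "$file" "$key")"
    case "$value" in
      "")
        echo "FAIL $key is not set"
        failed=1 ;;
      changeme|password|postgres|minioadmin)
        echo "FAIL $key looks like a default value"
        failed=1 ;;
    esac
  done
  return "$failed"
}
```

Running check_env_secrets .env before bringing the stack up catches the most common omissions, but a passing result is not proof that the secrets are strong.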

Scope

This script is an operational gate, not a substitute for:

  1. backups
  2. vulnerability scanning
  3. patch management
  4. log review
  5. application-level correctness testing