Deployment Healthcheck Runbook
Purpose
This runbook is the operator path for validating that the deployed Earthbond stack is:
- running,
- healthy,
- reachable through the intended local and WAN entrypoints,
- safe to keep online after configuration changes.
The healthcheck script is:
scripts/ops/check_stack_health.sh
What the script checks
Container state
The script inspects:
- earthbond-postgres
- earthbond-minio
- earthbond-control-plane
- earthbond-data-plane
- earthbond-api-gateway
- earthbond-admin-web
- earthbond-client-web
- earthbond-edge-web
- earthbond-public-proxy
- earthbond-worker-ingest
- earthbond-worker-crs
- earthbond-worker-audit
It records:
- Docker runtime status
- healthcheck status where available
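The container check can be approximated by hand with docker inspect. The following is a minimal sketch, not the script's actual implementation: the function names container_ok and inspect_container are hypothetical, and treating a healthcheck status of "starting" as a failure is an assumption.

```bash
#!/usr/bin/env bash

# Hypothetical helper: decide whether a container counts as passing.
# $1 = Docker runtime status (e.g. "running", "exited")
# $2 = healthcheck status ("healthy", "unhealthy", "starting"), or "none"
#      when the container defines no healthcheck
container_ok() {
  local status="$1" health="$2"
  if [ "$status" != "running" ]; then
    return 1
  fi
  case "$health" in
    healthy|none) return 0 ;;
    *)            return 1 ;;   # assumption: "starting" is not yet passing
  esac
}

# Print "name status health" for one container. A container that does not
# exist is reported as "missing none".
inspect_container() {
  local name="$1" state
  state="$(docker inspect --format \
    '{{.State.Status}} {{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' \
    "$name" 2>/dev/null)" || state="missing none"
  printf '%s %s\n' "$name" "$state"
}
```

For example, `inspect_container earthbond-api-gateway` prints one line that can be split and passed to container_ok.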
Local endpoint checks
The script validates:
- http://127.0.0.1:8080/healthz
- http://127.0.0.1:8081/healthz
- http://127.0.0.1:8082/healthz
- http://127.0.0.1:3101/healthz
- http://127.0.0.1:3100/healthz
- http://127.0.0.1:19000/minio/health/live
- https://127.0.0.1/
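A single endpoint can be probed manually with curl. This is a sketch under stated assumptions rather than the script's own logic: the function names are hypothetical, the 5-second timeout is arbitrary, and counting only 2xx responses as passing is an assumption.

```bash
#!/usr/bin/env bash

# Return success only for 2xx status codes. Treating redirects and all other
# codes as failures is an assumption of this sketch.
http_code_ok() {
  case "$1" in
    2??) return 0 ;;
    *)   return 1 ;;
  esac
}

# Fetch a URL and print its HTTP status code. curl prints 000 when no HTTP
# response was received (connection refused, timeout, TLS failure).
probe() {
  curl --silent --output /dev/null --max-time 5 --write-out '%{http_code}' "$1"
}

# Print an OK/FAIL line for one URL and return non-zero on failure.
check_endpoint() {
  local url="$1" code
  code="$(probe "$url")" || true
  if http_code_ok "$code"; then
    echo "OK   $url ($code)"
  else
    echo "FAIL $url ($code)"
    return 1
  fi
}
```

Usage: `check_endpoint http://127.0.0.1:8080/healthz`. For https://127.0.0.1/, curl verifies the certificate by default, so the probe reports 000 if the certificate does not validate for that address; how the script handles this case is not stated in this runbook.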
WAN endpoint check
The script also checks:
${PUBLIC_BASE_URL}/
This confirms the public web entrypoint is still serving.
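When reproducing the WAN check by hand, it helps to confirm that PUBLIC_BASE_URL is set before probing it. A minimal sketch follows; the helper name wan_url is hypothetical, and normalizing the trailing slash is a convenience of this sketch, not documented script behaviour.

```bash
#!/usr/bin/env bash

# Print the WAN URL to probe, with exactly one trailing slash.
# Fails if PUBLIC_BASE_URL is unset or empty.
wan_url() {
  if [ -z "${PUBLIC_BASE_URL:-}" ]; then
    echo "PUBLIC_BASE_URL is not set" >&2
    return 1
  fi
  printf '%s/\n' "${PUBLIC_BASE_URL%/}"
}
```

Usage: `url="$(wan_url)" && curl --silent --fail --output /dev/null --max-time 10 "$url"`. The command exits non-zero if the variable is missing or the entrypoint does not return a successful response.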
Standard run
cd "/Users/robertwilhelm/Documents/New project"
./scripts/ops/check_stack_health.sh
Safe remediation mode
cd "/Users/robertwilhelm/Documents/New project"
./scripts/ops/check_stack_health.sh --attempt-fix
This remediation mode restarts only non-stateful services, and only when they are:
- not running, or
- marked unhealthy
It does not attempt automatic repair for:
- PostgreSQL
- MinIO
That is intentional. Stateful systems should not be restarted automatically as a first reaction.
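The restart policy described above can be sketched as a pair of predicates. The function names are hypothetical, and the use of docker restart is an assumption; this runbook does not state which command the script uses to restart a service.

```bash
#!/usr/bin/env bash

# Stateful containers are never restarted automatically.
is_stateful() {
  case "$1" in
    earthbond-postgres|earthbond-minio) return 0 ;;
    *) return 1 ;;
  esac
}

# $1 = container name, $2 = runtime status, $3 = healthcheck status.
# Succeeds only for non-stateful containers that are not running or are
# marked unhealthy.
needs_restart() {
  local name="$1" status="$2" health="$3"
  if is_stateful "$name"; then
    return 1
  fi
  if [ "$status" != "running" ] || [ "$health" = "unhealthy" ]; then
    return 0
  fi
  return 1
}

# Assumption: a plain container restart is the remediation step.
remediate() {
  local name="$1" status="$2" health="$3"
  if needs_restart "$name" "$status" "$health"; then
    docker restart "$name"
  fi
}
```

Keeping the stateful check separate makes the exclusion explicit: a stopped earthbond-postgres is reported as a failure but never touched.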
Interpretation
Good result
All health checks passed.
Bad result
The script exits non-zero and prints:
- failed container states
- failed endpoints
Treat any of these as release blockers:
- earthbond-data-plane unhealthy
- earthbond-api-gateway unhealthy
- public proxy endpoint failure
- MinIO health failure
- PostgreSQL container not running
After a compose or secret change
Run in this order:
docker compose up -d --build
./scripts/ops/check_stack_health.sh
If failures remain:
./scripts/ops/check_stack_health.sh --attempt-fix
If failures still remain after remediation:
- inspect docker compose ps
- inspect docker compose logs --tail=200 <service>
- stop and correct configuration before reopening public access
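The escalation order above can be wrapped in a small function to run after docker compose up -d --build. This is a sketch: the CHECK variable and the function name post_change_gate are hypothetical. Because the exit code of the --attempt-fix run is not documented here, the sketch ignores it and re-runs the plain check to decide the outcome.

```bash
#!/usr/bin/env bash

# Path to the healthcheck script; overridable for a dry run.
CHECK="${CHECK:-./scripts/ops/check_stack_health.sh}"

post_change_gate() {
  # Step 1: plain check.
  if "$CHECK"; then
    echo "healthy"
    return 0
  fi

  # Step 2: safe remediation, then verify with a fresh plain check.
  echo "check failed, attempting safe remediation"
  "$CHECK" --attempt-fix || true
  if "$CHECK"; then
    echo "healthy after remediation"
    return 0
  fi

  # Step 3: hand over to the operator.
  echo "still failing: inspect 'docker compose ps' and service logs"
  return 1
}
```

A non-zero return from post_change_gate means public access should stay closed until the configuration is corrected.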
Security expectations
The deployment is only acceptable when:
- .env contains non-default secrets
- ADMIN_GATE_PASSWORD_HASH is set explicitly
- AUTH_TOKEN_SECRET is set explicitly
- POSTGRES_PASSWORD and MINIO_ROOT_PASSWORD are not defaults
- S3_PUBLIC_ENDPOINT points at the intended reachable object endpoint
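Part of this list can be pre-checked mechanically. The sketch below is not part of the healthcheck script. It assumes .env uses plain KEY=VALUE lines, and the placeholder values it rejects (changeme, password) are hypothetical examples; minioadmin is MinIO's stock root password. Replace the list with the project's real defaults. It cannot verify that S3_PUBLIC_ENDPOINT points at the intended endpoint, only that the variable is set.

```bash
#!/usr/bin/env bash

REQUIRED_KEYS="ADMIN_GATE_PASSWORD_HASH AUTH_TOKEN_SECRET POSTGRES_PASSWORD MINIO_ROOT_PASSWORD S3_PUBLIC_ENDPOINT"

# Print the last value assigned to KEY in a KEY=VALUE file (empty if absent),
# with one pair of surrounding double quotes removed.
env_value() {
  local key="$1" file="$2" value
  value="$(sed -n "s/^${key}=//p" "$file" | tail -n 1)"
  value="${value#\"}"
  value="${value%\"}"
  printf '%s\n' "$value"
}

# Report every required key that is unset, empty, or a known placeholder.
# Returns non-zero if any key fails.
check_env_file() {
  local file="$1" key value failed=0
  for key in $REQUIRED_KEYS; do
    value="$(env_value "$key" "$file")"
    case "$value" in
      ""|changeme|password|minioadmin)
        echo "FAIL $key is unset or looks like a placeholder"
        failed=1
        ;;
    esac
  done
  return "$failed"
}
```

Usage: `check_env_file .env` from the project root. The function prints one line per offending key and never prints secret values.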
Scope
This script is an operational gate, not a substitute for:
- backups
- vulnerability scanning
- patch management
- log review
- application-level correctness testing