CA Well Test Pack Guide

Purpose

Build a California test pack from real public records for oil/drilling prioritization and audit-ready POC validation.

The pack includes:

CalGEM WellSTAR oil/gas wells (county-filtered)
CalGEM WellSTAR well-stimulation records (drilling activity)
CA DWR well completion reports
CA DWR geologic log intervals
CA DWR well report PDF links
USGS California groundwater sites + groundwater level sample

Run

From repository root:


python3 scripts/dev/fetch_ca_well_test_pack.py

Or via make:


make fetch-ca-well-pack

Build stage-2 OCR queue from the fetched links:


python3 scripts/dev/build_ca_well_stage2_queue.py

Run stage-2 link status checks (pending/failed/successful):


python3 scripts/dev/run_ca_stage2_download_status.py \
  --pack-root data/external/ca_well_test_pack \
  --max-check 250

Run stage-2 OCR extraction status checks (moves OCR from pending -> successful/failed):


python3 scripts/dev/run_ca_stage2_ocr_status.py \
  --pack-root data/external/ca_well_test_pack \
  --max-check 250 \
  --max-pages 2 \
  --min-chars 40

Run full API workflow (upload + register-job execution + report):


python3 scripts/dev/run_ca_well_pack_api_workflow.py

Run full self-validation with expected-vs-actual comparison:


python3 scripts/dev/run_well_poc_self_validation.py \
  --source-tag ca_well_test_pack_compare \
  --stage2-max-links 60 \
  --stage2-status-max-check 60 \
  --expected-outcome docs/operations/WELL_POC_EXPECTED_BASELINE_CA_RUN.json \
  --workflow-output docs/operations/CA_WELL_TEST_PACK_INGEST_REPORT_COMPARE.json \
  --output-json docs/operations/WELL_POC_SELF_VALIDATION_RESULT_COMPARE.json \
  --output-md docs/operations/WELL_POC_SELF_VALIDATION_RESULT_COMPARE.md

Run audit alignment validation (processed stage1 vs audit manifest):


python3 scripts/dev/validate_ca_well_audit_alignment.py \
  --audit-manifest data/external/ca_well_test_pack/audit/manifest.json \
  --workflow-report docs/operations/CA_WELL_TEST_PACK_INGEST_REPORT_COMPARE_20260303.json \
  --output-json docs/operations/CA_WELL_AUDIT_ALIGNMENT_REPORT.json \
  --output-md docs/operations/CA_WELL_AUDIT_ALIGNMENT_REPORT.md

Default output:

data/external/ca_well_test_pack

Typical Drilling-Focused Run


python3 scripts/dev/fetch_ca_well_test_pack.py \
  --county Kern \
  --county Ventura \
  --well-type OG \
  --well-type DG \
  --max-wells 12000 \
  --max-wst 12000 \
  --max-wcr-rows 15000 \
  --max-geologic-rows 15000 \
  --max-pdf-link-rows 15000

Outputs

normalized/:

calgem_wells_oil_gas.csv
calgem_wells_oil_gas.jsonl
calgem_wst_stimulation.csv
calgem_wst_stimulation.jsonl
dwr_well_completion_reports.csv
dwr_well_completion_reports.jsonl
dwr_geologic_log_intervals.csv
dwr_geologic_log_intervals.jsonl
dwr_well_report_links.csv
dwr_well_report_links.jsonl
usgs_groundwater_sites.csv
usgs_groundwater_sites.jsonl
usgs_groundwater_levels_sample.json
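
A quick sanity check on the normalized outputs is to confirm that each CSV/JSONL pair holds the same number of records. The sketch below assumes each CSV has a single header row and each JSONL file has one JSON object per line. The helper names (count_csv_rows, count_jsonl_rows, compare_pairs) are illustrative and are not part of the repository scripts. Files without a twin, such as usgs_groundwater_levels_sample.json, are skipped.

```python
from __future__ import annotations

import csv
import json
from pathlib import Path


def count_csv_rows(path: Path) -> int:
    """Count data rows in a CSV file, excluding the (assumed) header row."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row if the file is not empty
        return sum(1 for _ in reader)


def count_jsonl_rows(path: Path) -> int:
    """Count non-blank lines in a JSONL file, parsing each to surface bad JSON."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises json.JSONDecodeError on a malformed record
                count += 1
    return count


def compare_pairs(normalized_dir: Path) -> dict[str, tuple[int, int]]:
    """Return {file stem: (csv_rows, jsonl_rows)} for every CSV that has a JSONL twin."""
    result = {}
    for csv_path in sorted(normalized_dir.glob("*.csv")):
        jsonl_path = csv_path.with_suffix(".jsonl")
        if jsonl_path.exists():
            result[csv_path.stem] = (
                count_csv_rows(csv_path),
                count_jsonl_rows(jsonl_path),
            )
    return result
```

Calling compare_pairs(Path("data/external/ca_well_test_pack/normalized")) and looking for stems whose two counts differ gives a fast first signal before running the heavier validation scripts.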

audit/:

manifest.json (row counts, checks, hashes, source URLs)
summary.md (quick human-readable status)
poc_ingest_manifest.json (suggested ingest order for POC workflow)
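
The hash and size entries recorded in the manifest can be re-checked independently of the pack scripts. The sketch below computes a SHA256 digest and byte size for one file and compares them with an expected record. The record keys sha256 and size_bytes are assumptions made for illustration; the actual layout of manifest.json is not described here, so adapt the key names to what the file contains.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path: Path, chunk_size: int = 1 << 20) -> dict:
    """Return the SHA256 hex digest and byte size of a file, read in chunks."""
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {"sha256": digest.hexdigest(), "size_bytes": size}


def matches_expected(path: Path, expected: dict) -> bool:
    """Compare a file against an expected record.

    The keys "sha256" and "size_bytes" are hypothetical names, not a
    documented manifest schema.
    """
    actual = file_fingerprint(path)
    return (
        actual["sha256"] == expected["sha256"].lower()
        and actual["size_bytes"] == expected["size_bytes"]
    )
```

Reading in chunks keeps memory use flat even when the row limits are raised and the normalized files grow large.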

stage2/:

pdf_ocr_queue.csv
pdf_ocr_queue.jsonl
pdf_ocr_queue_summary.json
pdf_ocr_status_latest.json
pdf_ocr_ocr_latest.json
status_runs/download_status_<timestamp>.json
status_runs/ocr_status_<timestamp>.json

docs/operations/:

CA_WELL_TEST_PACK_INGEST_REPORT.json (API workflow execution report)
WELL_POC_EXPECTED_BASELINE_CA_RUN.json (expected baseline for comparison runs)
WELL_POC_SELF_VALIDATION_RESULT*.json|.md (captured-vs-processed comparison outputs)
CA_WELL_AUDIT_ALIGNMENT_REPORT.json|.md (audit manifest alignment report)

Validation Included

Required field completeness checks by dataset.
Coordinate checks:
  coordinate presence
  California bounds plausibility (lat 31..43, lon -125..-113)
Geologic log interval sanity:
  INTERVALSTART <= INTERVALEND
File integrity:
  SHA256 + size for each output artifact.
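
The coordinate and interval checks listed above can be sketched as follows. This is a minimal illustration, not the pack's own implementation: the function names are hypothetical, values are assumed to arrive as strings or numbers, and the bounds are treated as inclusive, which the list above does not specify.

```python
from typing import Optional

# Plausibility bounds taken from the validation list above.
CA_LAT_RANGE = (31.0, 43.0)
CA_LON_RANGE = (-125.0, -113.0)


def parse_float(value) -> Optional[float]:
    """Convert a raw field to float, returning None for blank or non-numeric input."""
    if value is None:
        return None
    try:
        return float(str(value).strip())
    except ValueError:
        return None


def check_coordinates(lat, lon) -> str:
    """Return "missing", "out_of_bounds", or "ok" for a latitude/longitude pair."""
    lat_f, lon_f = parse_float(lat), parse_float(lon)
    if lat_f is None or lon_f is None:
        return "missing"
    in_bounds = (
        CA_LAT_RANGE[0] <= lat_f <= CA_LAT_RANGE[1]
        and CA_LON_RANGE[0] <= lon_f <= CA_LON_RANGE[1]
    )
    return "ok" if in_bounds else "out_of_bounds"


def interval_is_sane(start, end) -> bool:
    """True when both interval ends parse as numbers and start <= end."""
    start_f, end_f = parse_float(start), parse_float(end)
    if start_f is None or end_f is None:
        return False
    return start_f <= end_f
```

For example, check_coordinates("35.37", "-119.02") falls inside the bounds, while a row with a blank latitude is reported as missing rather than out of bounds, which keeps the two failure modes separate in any summary counts.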

Notes

This tool intentionally pulls real public records and writes sampled datasets for rapid POC iteration.
Increase the --max-* limits (for example --max-wells) when you need larger coverage for model training or stress tests.
In the client UI (http://localhost:3101), open **Well Data Processing Visibility** to inspect stage1/stage2 datasets and profile rows/columns stored by register jobs.
Live workflow reflection in UI:
  Leave the source tag empty to watch all runs, or set a specific run tag.
  Keep **Live API sync** enabled and set the interval (2-120s) for near real-time updates.
  Summary cards show the stage split, register-job completion, and stage2 download/OCR status totals.
Audit validation in UI:
  Click **Run Audit Validation** to compare processed stage1 data to the audit manifest hashes/row counts.
  In the default Docker setup, use the **Audit manifest object key** (S3/MinIO), not a host file path.
  Reports are generated as both JSON and Markdown and can be downloaded with **Download Audit JSON/MD**.
  Generated report artifacts are stored under demo_tenant/reports/audit_validation/<source_tag>/.
Status meanings:
  pending: queued but not yet checked/downloaded
  successful: link check/download probe succeeded
  failed: request failed or HTTP status not in success/redirect range
OCR status meanings:
  pending: queued but OCR not attempted yet
  successful: PDF text extracted (minimum character threshold passed)
  failed: OCR could not run (non-PDF response, download failure, or extraction produced no text)
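
The status meanings above can be expressed as two small classification functions. This is a sketch under stated assumptions, not the logic of the stage-2 scripts: the function names and parameters are hypothetical, the "success/redirect range" is taken to mean HTTP 200-399, and text shorter than the threshold is treated as failed, which the definitions above leave open. The default min_chars=40 mirrors the --min-chars value used in the Run section.

```python
from typing import Optional


def classify_download(http_status: Optional[int], checked: bool = True) -> str:
    """Map a link probe result to pending / successful / failed.

    http_status is None when the request itself failed (timeout, DNS error, ...).
    """
    if not checked:
        return "pending"
    if http_status is not None and 200 <= http_status < 400:
        return "successful"
    return "failed"


def classify_ocr(
    attempted: bool,
    is_pdf: bool,
    extracted_text: Optional[str],
    min_chars: int = 40,
) -> str:
    """Map an OCR attempt to pending / successful / failed.

    extracted_text is None when the download or the extraction step failed.
    """
    if not attempted:
        return "pending"
    if not is_pdf or extracted_text is None:
        return "failed"
    # Assumption: text below the character threshold counts as a failure.
    if len(extracted_text.strip()) >= min_chars:
        return "successful"
    return "failed"
```

Keeping the two classifiers separate mirrors the split between the download status and OCR status runs, so a link can be successful at the download stage and still pending or failed at the OCR stage.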