CA Well Test Pack Guide
Purpose
Build a test pack from real California public records for oil/drilling prioritization and audit-ready POC validation.
The pack includes:
- CalGEM WellSTAR oil/gas wells (county-filtered)
- CalGEM WellSTAR well-stimulation records (drilling activity)
- CA DWR well completion reports
- CA DWR geologic log intervals
- CA DWR well report PDF links
- USGS California groundwater sites + groundwater level sample
Run
From repository root:
python3 scripts/dev/fetch_ca_well_test_pack.py
Or via make:
make fetch-ca-well-pack
Build stage-2 OCR queue from the fetched links:
python3 scripts/dev/build_ca_well_stage2_queue.py
Run stage-2 link status checks (pending/failed/successful):
python3 scripts/dev/run_ca_stage2_download_status.py \
--pack-root data/external/ca_well_test_pack \
--max-check 250
Run stage-2 OCR extraction status checks (moves OCR from pending -> successful/failed):
python3 scripts/dev/run_ca_stage2_ocr_status.py \
--pack-root data/external/ca_well_test_pack \
--max-check 250 \
--max-pages 2 \
--min-chars 40
Run full API workflow (upload + register-job execution + report):
python3 scripts/dev/run_ca_well_pack_api_workflow.py
Run full self-validation with expected-vs-actual comparison:
python3 scripts/dev/run_well_poc_self_validation.py \
--source-tag ca_well_test_pack_compare \
--stage2-max-links 60 \
--stage2-status-max-check 60 \
--expected-outcome docs/operations/WELL_POC_EXPECTED_BASELINE_CA_RUN.json \
--workflow-output docs/operations/CA_WELL_TEST_PACK_INGEST_REPORT_COMPARE.json \
--output-json docs/operations/WELL_POC_SELF_VALIDATION_RESULT_COMPARE.json \
--output-md docs/operations/WELL_POC_SELF_VALIDATION_RESULT_COMPARE.md
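The expected-vs-actual comparison can be pictured as a key-by-key diff between the baseline and the observed run. The sketch below is only an illustration of that idea: the flat metric layout and the key names (`stage1_datasets`, `stage2_links_checked`) are hypothetical and do not describe the real schema of the baseline JSON or the behavior of `run_well_poc_self_validation.py`.

```python
def compare_outcomes(expected: dict, actual: dict) -> dict:
    """Compare two flat metric dicts and report every key that differs."""
    mismatches = {}
    for key, want in expected.items():
        got = actual.get(key)  # a missing key shows up as None
        if got != want:
            mismatches[key] = {"expected": want, "actual": got}
    return {"passed": not mismatches, "mismatches": mismatches}


# Hypothetical metrics, invented for illustration only.
expected = {"stage1_datasets": 6, "stage2_links_checked": 60}
actual = {"stage1_datasets": 6, "stage2_links_checked": 58}
result = compare_outcomes(expected, actual)
```

Reporting both the expected and the actual value per key keeps the result usable for either a JSON or a Markdown rendering.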
Run audit alignment validation (processed stage1 vs audit manifest):
python3 scripts/dev/validate_ca_well_audit_alignment.py \
--audit-manifest data/external/ca_well_test_pack/audit/manifest.json \
--workflow-report docs/operations/CA_WELL_TEST_PACK_INGEST_REPORT_COMPARE_20260303.json \
--output-json docs/operations/CA_WELL_AUDIT_ALIGNMENT_REPORT.json \
--output-md docs/operations/CA_WELL_AUDIT_ALIGNMENT_REPORT.md
Default output:
data/external/ca_well_test_pack
Typical Drilling-Focused Run
python3 scripts/dev/fetch_ca_well_test_pack.py \
--county Kern \
--county Ventura \
--well-type OG \
--well-type DG \
--max-wells 12000 \
--max-wst 12000 \
--max-wcr-rows 15000 \
--max-geologic-rows 15000 \
--max-pdf-link-rows 15000
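Flags such as `--county` and `--well-type` are passed more than once in the command above. One common way to collect repeated flags in Python is `argparse` with `action="append"`. The parser below is a hypothetical sketch of that pattern, not the parser used by `fetch_ca_well_test_pack.py`, and the default value is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: flag names mirror the command above, defaults are assumed.
    parser = argparse.ArgumentParser(description="Sketch of repeatable filter flags")
    parser.add_argument("--county", action="append", help="repeat to select several counties")
    parser.add_argument("--well-type", action="append", help="repeat to select several well types")
    parser.add_argument("--max-wells", type=int, default=1000)
    return parser


args = build_parser().parse_args(
    ["--county", "Kern", "--county", "Ventura",
     "--well-type", "OG", "--well-type", "DG",
     "--max-wells", "12000"]
)
# Each repeated flag accumulates into a list, in the order given.
print(args.county, args.well_type, args.max_wells)
```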
Outputs
normalized/:
- calgem_wells_oil_gas.csv
- calgem_wells_oil_gas.jsonl
- calgem_wst_stimulation.csv
- calgem_wst_stimulation.jsonl
- dwr_well_completion_reports.csv
- dwr_well_completion_reports.jsonl
- dwr_geologic_log_intervals.csv
- dwr_geologic_log_intervals.jsonl
- dwr_well_report_links.csv
- dwr_well_report_links.jsonl
- usgs_groundwater_sites.csv
- usgs_groundwater_sites.jsonl
- usgs_groundwater_levels_sample.json
audit/:
- manifest.json (row counts, checks, hashes, source URLs)
- summary.md (quick human-readable status)
- poc_ingest_manifest.json (suggested ingest order for POC workflow)
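Because the manifest records hashes and sizes, an output file can be re-verified independently of the pack tooling. The sketch below shows one way to do that. The manifest layout used here (an `artifacts` list with `path`, `sha256`, and `size_bytes` fields) is an assumption made for illustration; check the real `manifest.json` for its actual keys.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifacts(pack_root: Path, manifest: dict) -> list[str]:
    """Return human-readable mismatches; an empty list means everything matched."""
    problems = []
    for entry in manifest.get("artifacts", []):  # "artifacts" is an assumed key
        target = pack_root / entry["path"]
        if not target.is_file():
            problems.append(f"missing: {entry['path']}")
            continue
        if target.stat().st_size != entry["size_bytes"]:
            problems.append(f"size mismatch: {entry['path']}")
        if sha256_of(target) != entry["sha256"]:
            problems.append(f"sha256 mismatch: {entry['path']}")
    return problems


# Self-contained demo on a temporary directory (no real pack data involved).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "normalized").mkdir()
    sample = root / "normalized" / "sample.csv"
    sample.write_bytes(b"id,lat,lon\n1,35.4,-119.0\n")
    manifest = {"artifacts": [{
        "path": "normalized/sample.csv",
        "sha256": sha256_of(sample),
        "size_bytes": sample.stat().st_size,
    }]}
    clean = verify_artifacts(root, manifest)
    sample.write_bytes(b"id,lat,lon\n1,35.4,-119.9\n")  # same size, different content
    tampered = verify_artifacts(root, manifest)
```

The demo edits the file without changing its length, which is why the size check alone would not catch the change and the hash check is still needed.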
stage2/:
- pdf_ocr_queue.csv
- pdf_ocr_queue.jsonl
- pdf_ocr_queue_summary.json
- pdf_ocr_status_latest.json
- pdf_ocr_ocr_latest.json
- status_runs/download_status_<timestamp>.json
- status_runs/ocr_status_<timestamp>.json
docs/operations/:
- CA_WELL_TEST_PACK_INGEST_REPORT.json (API workflow execution report)
- WELL_POC_EXPECTED_BASELINE_CA_RUN.json (expected baseline for comparison runs)
- WELL_POC_SELF_VALIDATION_RESULT*.json|.md (captured-vs-processed comparison outputs)
- CA_WELL_AUDIT_ALIGNMENT_REPORT.json|.md (audit manifest alignment report)
Validation Included
- Required field completeness checks by dataset.
- Coordinate checks:
  - coordinate presence
  - California bounds plausibility (lat 31..43, lon -125..-113)
- Geologic log interval sanity: INTERVALSTART <= INTERVALEND
- File integrity:
  - SHA256 + size for each output artifact.
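The coordinate and interval checks listed above can be expressed as small predicates. This is a minimal sketch rather than the pack's own validation code: the bounds and the `INTERVALSTART`/`INTERVALEND` field names come from this guide, while the inclusive bounds, the handling of unparseable values, and the function names are assumptions.

```python
CA_LAT_RANGE = (31.0, 43.0)      # from the bounds listed above
CA_LON_RANGE = (-125.0, -113.0)


def check_coordinates(lat, lon) -> str:
    """Classify one record as 'missing', 'out_of_bounds', or 'ok'."""
    if lat in (None, "") or lon in (None, ""):
        return "missing"
    try:
        lat_f, lon_f = float(lat), float(lon)
    except (TypeError, ValueError):
        return "missing"  # assumption: unparseable values count as missing
    in_lat = CA_LAT_RANGE[0] <= lat_f <= CA_LAT_RANGE[1]
    in_lon = CA_LON_RANGE[0] <= lon_f <= CA_LON_RANGE[1]
    return "ok" if in_lat and in_lon else "out_of_bounds"


def interval_is_sane(row: dict) -> bool:
    """True when INTERVALSTART <= INTERVALEND and both parse as numbers."""
    try:
        return float(row["INTERVALSTART"]) <= float(row["INTERVALEND"])
    except (KeyError, TypeError, ValueError):
        return False
```

A point near Bakersfield such as `check_coordinates(35.37, -119.02)` falls inside both ranges, while a New York coordinate pair fails the longitude range.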
Notes
- This tool intentionally pulls real public records and writes sampled datasets for rapid POC iteration.
- Increase limits when you need larger coverage for model training or stress tests.
- In the client UI (http://localhost:3101), open **Well Data Processing Visibility** to inspect stage1/stage2 datasets and profile rows/columns stored by register jobs.
- Live workflow reflection in UI:
  - leave source tag empty to watch all runs, or set a specific run tag.
  - keep **Live API sync** enabled and set interval (2-120s) for near real-time updates.
  - summary cards show stage split, register-job completion, and stage2 download/OCR status totals.
- Audit validation in UI:
  - click **Run Audit Validation** to compare processed stage1 data to audit manifest hashes/row counts.
  - Docker default should use **Audit manifest object key** (S3/MinIO), not host file path.
  - output formats are generated as both JSON and Markdown and can be downloaded with **Download Audit JSON/MD**.
  - generated report artifacts are stored under demo_tenant/reports/audit_validation/<source_tag>/.
- Status meanings:
  - pending: queued but not yet checked/downloaded
  - successful: link check/download probe succeeded
  - failed: request failed or HTTP status not in success/redirect range
- OCR status meanings:
  - pending: queued but OCR not attempted yet
  - successful: PDF text extracted (minimum character threshold passed)
  - failed: OCR could not run (non-PDF response, download failure, or extraction produced no text)
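The status meanings above can be read as two small decision functions. The sketch below is a hedged illustration of them, not the logic of the stage-2 scripts. It assumes that "success/redirect range" means HTTP 200-399, and that extracted text shorter than the threshold is treated as failed. The function names and parameters are hypothetical. The default of 40 characters mirrors the `--min-chars 40` flag shown earlier.

```python
from typing import Optional


def download_status(checked: bool, http_status: Optional[int]) -> str:
    """Map a link probe result to pending / successful / failed."""
    if not checked:
        return "pending"
    # Assumption: success/redirect range is 200-399; None means the request failed.
    if http_status is not None and 200 <= http_status < 400:
        return "successful"
    return "failed"


def ocr_status(attempted: bool, is_pdf: bool, extracted_text: str,
               min_chars: int = 40) -> str:
    """Map an OCR attempt to pending / successful / failed."""
    if not attempted:
        return "pending"
    # Assumption: text below the character threshold counts as failed.
    if is_pdf and len(extracted_text.strip()) >= min_chars:
        return "successful"
    return "failed"
```

Keeping `pending` separate from `failed` means a queue entry that was never reached by `--max-check` is not counted as an error.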