Full Build Explanation And AI Decision Log
Purpose
This document explains the current Earthbond build as it exists in this repository and live stack.
It is meant to answer four questions:
- what the platform currently does,
- which parts were directly requested by the user,
- where the assistant made engineering choices,
- what still needs to be built for deeper semantic processing.
This is a build-provenance document, not a marketing summary.
Authorship Record
For build-provenance purposes, this document records:
- Johann F.R. Wilhelm as the author and creator of the project direction, objective, and narrowing decisions that shaped the build,
- assistant contributions as implementation, documentation, and engineering support provided inside that frame.
That means the build should be described as user-directed and assistant-implemented, not as a product concept that originated independently from the assistant.
Current Live State
As of 2026-03-30, the live demo_tenant Volve stack reports:
- canonical_wells = 40
- well_locations = 34
- formation_tops = 409
- completion_intervals = 12
- production_records = 16160
- structural_surfaces = 92
- seismic_surveys = 14
- seismic_artifacts = 98
- reservoir_bodies = 78
- reservoir_penetrations = 78
- witsml_wellbores = 7
- witsml_trajectories = 14
- witsml_bha_runs = 15
- witsml_messages = 582
- witsml_support_wells = 8
- technical_reports = 1555
- coverage_rows = 18
- log_artifacts = 5344
- report_documents = 2
- model_artifacts = 5454
- package_artifacts = 5483
- semantic_wells = 40
- well_aliases = 496
- source_links = 1048
Coverage audit currently reports:
18 of 18 staged Volve packages are canonical_loaded.
That means every staged package is represented in canonical tables and available to the application.
It does not mean every binary format has been fully semantically decoded down to native engineering meaning.
What Was Derived From The User
These requirements came directly from the user and should be treated as the primary product intent:
- build the system around real Volve data rather than hypothetical schemas,
- stage data for WAN-safe use through MinIO,
- validate before normalization,
- normalize all major data classes into canonical tables,
- rank reopening targets and estimate remaining barrels,
- keep the workflow visible in the UI,
- separate field-package workflow from legacy upload-session workflow,
- make outputs downloadable and documented,
- stop treating "minimum good enough" as the final standard,
- process all data, not only the highest-priority subsets.
Those requests are why the repository now contains field-package profiling, validation, phase-based normalization, canonical coverage audit, MinIO staging, scoring, volumetrics, and expanded artifact ingestion.
The user-directed brief was also broader than this list suggests.
It was an investigation into:
- whether open data alone could support a serious proof-of-concept,
- how difficult full canonicalization and cross-domain linkage would be in practice,
- what parts of the data would actually be useful for defendable payload-screening,
- what surfaced during development that changed the draft project structure,
- and what would still separate this POC from a production monetizable application.
Several important findings came out of that investigation and were then folded into the build:
- provenance and package coverage needed to become first-class features rather than background implementation detail,
- CRS/WGS84/ECEF normalization had to be treated as part of source truth,
- well identity and alias resolution became a core engineering problem,
- full artifact representation mattered even where full native semantic decode was not yet practical,
- ranking and barrel outputs had to be presented explicitly as heuristic decision-support layers.
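One of the findings above, alias resolution, can be illustrated with a minimal sketch. The normalization rules below (uppercasing, separator unification) are invented for illustration and are not the repository's actual resolver:

```python
# Hedged sketch of alias normalization: collapse spelling variants of one
# well name to a single comparison key. Rules here are illustrative only.
import re

def alias_key(name: str) -> str:
    """Normalize a well alias: uppercase, trim, unify separators."""
    key = name.upper().strip()
    key = re.sub(r"[\s_]+", "-", key)   # spaces/underscores -> hyphen
    key = re.sub(r"-{2,}", "-", key)    # collapse repeated hyphens
    return key

# Variants of the same (illustrative) well name map to one key:
assert alias_key("15/9 F 12") == alias_key("15/9-F-12")
```

A real resolver would also need source-specific rules (padding, prefix stripping, operator conventions), which is what makes alias resolution a core engineering problem rather than a one-liner.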
Where The Assistant Made Engineering Choices
The assistant made implementation decisions in places where the user specified the goal but not the mechanism.
1. Storage model
Choice:
- stage Volve into MinIO and run the field-package workflow off bucket + prefix
Reason:
- WAN-safe
- browser-independent
- reproducible
- avoids dependence on one host-local path
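The bucket + prefix addressing can be sketched as follows. The bucket and prefix names are assumptions, not the stack's actual configuration; a real staging step would then upload each file with an S3-compatible client (for example, the MinIO Python SDK's `fput_object`):

```python
# Illustrative sketch: map local Volve files into WAN-safe bucket/prefix
# object keys. BUCKET and PREFIX are hypothetical names, not repo config.
from pathlib import PurePosixPath

BUCKET = "volve-staging"            # hypothetical bucket
PREFIX = "field_packages/volve"     # hypothetical prefix

def object_key(local_path: str, package: str) -> str:
    """Derive a deterministic object key from a local file and package name."""
    name = PurePosixPath(local_path).name
    return str(PurePosixPath(PREFIX) / package / name)

key = object_key("/data/volve/production/volumes.csv", "production")
# "field_packages/volve/production/volumes.csv"
```

Because the key is derived only from package and file name, re-staging is reproducible and independent of any one host-local path.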
2. Database layout
Choice:
- one PostgreSQL cluster with schema separation instead of several separate PostgreSQL databases
Reason:
- lower operational complexity
- easier joins across raw, canonical, audit, semantic, and scoring layers
- sufficient for current scale
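The one-cluster, many-schema layout can be sketched as DDL. The schema names below are guesses at the layers named in this document, not the actual migration contents:

```python
# Hedged sketch: schema separation inside one PostgreSQL cluster.
# Schema names are illustrative labels for the layers this document names.
SCHEMAS = ["raw", "canonical", "audit", "semantic", "scoring"]

def schema_ddl(schemas):
    """Emit CREATE SCHEMA statements for a one-cluster, many-schema layout."""
    return [f"CREATE SCHEMA IF NOT EXISTS {s};" for s in schemas]

ddl = schema_ddl(SCHEMAS)
```

Keeping all layers in one cluster is what makes cross-layer joins (raw provenance against canonical facts against scoring output) a plain SQL join rather than a cross-database federation problem.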
3. Canonical-first workflow
Choice:
- land each domain in canonical tables before downstream scoring
Reason:
- keeps traceability explicit
- avoids scoring directly from raw archives
- makes UI exploration and audits possible
4. Phase-based ingestion
Choice:
- split the field-package pipeline into validation, phase 1 normalization, phase 2 enrichments, seismic support, reservoir bodies, scoring, and volumetrics
Reason:
- lets the system run on incomplete packages
- isolates parser debt from core canonical load
- makes readiness visible
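The phase isolation described above can be sketched as a runner where each phase declares its own applicability, so a package missing one domain still completes the others. Phase names mirror this document; the callables are placeholders, not repository code:

```python
# Hedged sketch of a phase-based pipeline runner. A phase that does not
# apply is "skipped" (missing inputs, not a failure), and a phase that
# throws is "failed" without blocking the other phases.

def run_phases(package, phases):
    """Run each applicable phase and record per-phase readiness."""
    results = {}
    for name, applies, run in phases:
        if not applies(package):
            results[name] = "skipped"
            continue
        try:
            run(package)
            results[name] = "loaded"
        except Exception:
            results[name] = "failed"   # isolate parser debt per phase
    return results

phases = [
    ("validation", lambda p: True, lambda p: None),
    ("phase1_normalization", lambda p: True, lambda p: None),
    ("seismic_support", lambda p: "seismic" in p["domains"], lambda p: None),
]
status = run_phases({"domains": ["production"]}, phases)
# seismic_support is skipped because the package has no seismic domain
```

This is what "lets the system run on incomplete packages" means in practice: readiness is a per-phase status, not a single pass/fail.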
5. Coverage audit standard
Choice:
- define canonical_loaded at the package/artifact level first
Reason:
- user required all data to be processed and represented
- some archives contain binaries that cannot be fully semantically decoded in the same pass
- package/artifact canonicalization is the minimum honest threshold for "represented and queryable"
6. Scoring and volumetric methods
Choice:
- use relative cross-well scoring and proxy low/mid/high barrel ranges
Reason:
- ranking and scoping were possible before full cell-level reservoir simulation parsing
- avoids false certainty
- keeps the system decision-support grade rather than pretending to be booked reserves
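The heuristic character of these layers can be made concrete with a sketch. The weights, the low/high factors, and the scores are invented for illustration and are not the repository's actual model:

```python
# Hedged sketch: relative cross-well ranking plus a proxy low/mid/high
# barrel range. Both are decision-support heuristics, not booked reserves.

def relative_rank(scores):
    """Rank wells by score, highest first (purely relative comparison)."""
    return sorted(scores, key=scores.get, reverse=True)

def barrel_range(mid_estimate, low_factor=0.5, high_factor=1.5):
    """Proxy low/mid/high range around a mid estimate to signal uncertainty."""
    return (mid_estimate * low_factor, mid_estimate, mid_estimate * high_factor)

# Illustrative scores only:
ranked = relative_rank({"well_a": 0.72, "well_b": 0.41, "well_c": 0.88})
low, mid, high = barrel_range(120_000.0)
```

Publishing a range rather than a single number is the mechanism that "avoids false certainty": the spread is an explicit statement that the estimate is a proxy.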
Build Chronology By Capability
A. Foundation
Core schemas, auth, upload/session flow, raw registry, CRS, audit, and semantic spine were added first.
Main files:
- 0001_extensions_and_base_schemas.py
- 0003_crs_tables.py
- 0004_audit_tables_and_immutability.py
- 0016_raw_layer_source_registry.py
- 0017_semantic_ontology_spine.py
B. Field-package profiling and validation
This layer inventories a real field package and decides what can be loaded.
Main files:
- field_package/__init__.py
- field_validation/__init__.py
- profile_volve_field_package.py
- validate_volve_field_package.py
C. Canonical field-package normalization
Phase 1 canonicalized wells, aliases, locations, production, tops, completions, and structural context.
Main files:
D. Phase-2 log and reservoir support
This added interpreted log outputs, pay events, bypassed candidates, and dynamic reservoir support.
Main files:
E. Technical, WITSML, seismic, reservoir-body, and full artifact ingest
This expanded processing from selective domains into full package coverage.
Main files:
- 0023_field_package_seismic_support.py
- 0024_reservoir_bodies_and_penetrations.py
- 0025_canonical_witsml_context.py
- 0026_field_package_coverage_and_technical_reports.py
- 0027_canonical_artifact_ingest.py
- 0028_canonical_package_artifacts.py
- field_technical/__init__.py
- field_witsml/__init__.py
- field_seismic/__init__.py
- field_reservoir/__init__.py
- field_logs_full/__init__.py
- field_model_artifacts/__init__.py
- field_reports/__init__.py
- field_misc/__init__.py
- field_coverage/__init__.py
F. Scoring and remaining barrels
These layers convert canonical evidence into ranked candidates and low/mid/high barrel ranges.
Main files:
- 0018_reopening_target_scoring.py
- 0022_remaining_barrel_estimates.py
- field_scoring/__init__.py
- field_volumetrics/__init__.py
G. Application and deployment path
The workflow is exposed through the data-plane API, client upload page, WAN proxy, and MinIO staging utilities.
Main files:
- main.py
- upload.html
- upload.js
- field_package_storage/__init__.py
- stage_volve_to_minio.py
- docker-compose.yml
How To Read "Full Processing"
There are four different meanings of "fully processed." They should not be conflated.
1. Coverage complete
Every package is known, validated, staged, and represented in canonical tables.
Current state:
- yes
2. Canonical domain complete
The important structured facts from a domain are loaded into typed canonical tables.
Current state:
- mostly yes across production, geophysical interpretations, WITSML subsets, technical daily reports, seismic support, reservoir support, logs, model artifacts, reports, and misc artifacts
3. Semantic decode complete
The native meaning of every binary/text format is deeply extracted.
Current state:
- no
Examples still not fully decoded:
- all DLIS/LIS curves at curve/sample level
- all SEG-Y traces and headers
- all Eclipse cell/property/schedule semantics
- all RMS property realizations
- all technical HTML/PDF report semantics
4. Decision-complete
The system can answer the business question with strong confidence and low manual cleanup.
Current state:
- partially yes for ranking and screening
- not yes for reserve-grade or full engineering sign-off
What Was Computed Versus What Was Interpreted
Directly loaded from source data
- production dates and volumes
- well names and aliases from structured files
- tops and perforations
- structural surfaces
- WITSML XML objects
- report and artifact inventories
- archive member-level metadata
Deterministically derived
- alias normalization
- WGS84 and ECEF coordinates
- pay event intervals from interpreted logs
- bypassed-pay candidate flags
- package/domain coverage status
- per-well source links
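One of the deterministic derivations above, WGS84 to ECEF, is worth showing because it is exactly reproducible from published ellipsoid constants. This is the standard WGS84 geodetic-to-ECEF conversion, not necessarily the literal code in this repository:

```python
# Deterministic derivation example: WGS84 geodetic lat/lon/height -> ECEF.
# Constants are the published WGS84 ellipsoid parameters.
import math

WGS84_A = 6378137.0                  # semi-major axis (m)
WGS84_F = 1 / 298.257223563          # flattening
WGS84_E2 = WGS84_F * (2 - WGS84_F)   # first eccentricity squared

def wgs84_to_ecef(lat_deg, lon_deg, h=0.0):
    """Convert geodetic coordinates to Earth-centered Earth-fixed XYZ (m)."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    n = WGS84_A / math.sqrt(1 - WGS84_E2 * math.sin(lat) ** 2)  # prime vertical radius
    x = (n + h) * math.cos(lat) * math.cos(lon)
    y = (n + h) * math.cos(lat) * math.sin(lon)
    z = (n * (1 - WGS84_E2) + h) * math.sin(lat)
    return x, y, z

# On the equator at the prime meridian, X equals the semi-major axis:
x, y, z = wgs84_to_ecef(0.0, 0.0)
```

Because the output depends only on the inputs and fixed constants, derived coordinates belong in the deterministic layer, not the heuristic one.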
Heuristic or model-based
- reopening score
- relative cross-well rank
- reservoir body support from mixed sources
- accessible-area proxies
- low/mid/high remaining barrel estimates
Those heuristic layers are deliberately kept separate from raw and canonical factual layers.
What Still Needs To Be Built
If the target is "deep full semantic processing," these remain:
- full DLIS/LIS curve extraction and mnemonic/unit normalization,
- full SEG-Y header/sample decode into canonical seismic trace objects,
- deeper Eclipse parsing for schedules, vectors, and cell-level semantics,
- deeper RMS parsing for realization/property/cell semantics,
- structured parsing of technical HTML/PDF reports beyond XML daily reports,
- stronger well-to-compartment intersection logic using more model geometry and fewer proxies,
- explicit rights/attribution enforcement in download/export flows,
- formal repo licensing and third-party notice handling.
Teaching Standard For Future Work
When adding new capabilities, use this order:
- prove the source exists,
- validate domain coverage,
- define the canonical target table,
- preserve raw provenance,
- separate deterministic facts from heuristics,
- expose the result in the UI and API,
- add coverage audit status,
- document legal/license boundaries.
If a new build step does not satisfy those eight points, it is not finished.
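The eight-point standard can be expressed as a trivial completion check. The step identifiers below paraphrase the list and are not repository code:

```python
# Sketch: the eight-point teaching standard as an all-or-nothing gate.
# Step names are paraphrases of the list above, chosen for illustration.
REQUIRED_STEPS = frozenset({
    "source_proven", "coverage_validated", "canonical_target_defined",
    "raw_provenance_preserved", "facts_vs_heuristics_separated",
    "exposed_in_ui_api", "coverage_audit_added", "legal_boundaries_documented",
})

def is_finished(done):
    """A build step is finished only when all eight points are satisfied."""
    return REQUIRED_STEPS.issubset(done)
```

The point of the all-or-nothing shape is that seven of eight is still "not finished"; there is no partial credit in the standard.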
Bottom Line
This build is no longer just a schema exercise.
It is now:
- a MinIO-backed field-package platform,
- with canonical coverage for all staged Volve packages,
- with scoring and volumetric outputs,
- and with explicit separation between facts, deterministic derivations, and heuristic decisions.
What it is not yet:
- a complete semantic decoder for every binary engineering format,
- or a reserve-booking system.