Full Build Explanation And AI Decision Log
Purpose
This document explains the current Earthbond build as it exists in this repository and live stack.
It is meant to answer four questions:
- what the platform currently does,
- which parts were directly requested by the user,
- where the assistant made engineering choices,
- what still needs to be built for deeper semantic processing.
This is a build-provenance document, not a marketing summary.
Authorship Record
For build-provenance purposes, this document records:
- Johann F.R. Wilhelm as the author and creator of the project direction, objective, and narrowing decisions that shaped the build,
- assistant contributions as implementation, documentation, and engineering support provided inside that frame.
That means the build should be described as user-directed and assistant-implemented, not as a product concept that originated independently from the assistant.
Current Live State
As of 2026-03-30, the live demo_tenant Volve stack reports:
- canonical_wells = 40
- well_locations = 34
- formation_tops = 409
- completion_intervals = 12
- production_records = 16160
- structural_surfaces = 92
- seismic_surveys = 14
- seismic_artifacts = 98
- reservoir_bodies = 78
- reservoir_penetrations = 78
- witsml_wellbores = 7
- witsml_trajectories = 14
- witsml_bha_runs = 15
- witsml_messages = 582
- witsml_support_wells = 8
- technical_reports = 1555
- coverage_rows = 18
- log_artifacts = 5344
- report_documents = 2
- model_artifacts = 5454
- package_artifacts = 5483
- semantic_wells = 40
- well_aliases = 496
- source_links = 1048
Coverage audit currently reports:
18 of 18 staged Volve packages are canonical_loaded.
That means every staged package is represented in canonical tables and available to the application.
It does not mean every binary format has been fully semantically decoded down to native engineering meaning.
What Was Derived From The User
These requirements came directly from the user and should be treated as the primary product intent:
- build the system around real Volve data rather than hypothetical schemas,
- stage data for WAN-safe use through MinIO,
- validate before normalization,
- normalize all major data classes into canonical tables,
- rank reopening targets and estimate remaining barrels,
- keep the workflow visible in the UI,
- separate field-package workflow from legacy upload-session workflow,
- make outputs downloadable and documented,
- stop treating "minimum good enough" as the final standard,
- process all data, not only the highest-priority subsets.
Those requests are why the repository now contains field-package profiling, validation, phase-based normalization, canonical coverage audit, MinIO staging, scoring, volumetrics, and expanded artifact ingestion.
The user-directed brief was also broader than this list suggests.
It was an investigation into:
- whether open data alone could support a serious proof-of-concept,
- how difficult full canonicalization and cross-domain linkage would be in practice,
- what parts of the data would actually be useful for defendable payload-screening,
- what surfaced during development that changed the draft project structure,
- and what would still separate this POC from a production monetizable application.
Several important findings came out of that investigation and were then folded into the build:
- provenance and package coverage needed to become first-class features rather than background implementation detail,
- CRS/WGS84/ECEF normalization had to be treated as part of source truth,
- well identity and alias resolution became a core engineering problem,
- full artifact representation mattered even where full native semantic decode was not yet practical,
- ranking and barrel outputs had to be presented explicitly as heuristic decision-support layers.
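One of the findings above, alias resolution, can be illustrated with a minimal sketch. The normalization rules below (uppercasing, separator unification) are invented for illustration and are not the repository's actual resolver:

```python
# Hedged sketch of alias normalization: collapse spelling variants of one
# well name to a single comparison key. Rules here are illustrative only.
import re

def alias_key(name: str) -> str:
    """Normalize a well alias: uppercase, trim, unify separators."""
    key = name.upper().strip()
    key = re.sub(r"[\s_]+", "-", key)   # spaces/underscores -> hyphen
    key = re.sub(r"-{2,}", "-", key)    # collapse repeated hyphens
    return key

# Variants of the same (illustrative) well name map to one key:
assert alias_key("15/9 F 12") == alias_key("15/9-F-12")
```

A real resolver would also need source-specific rules (padding, prefix stripping, operator conventions), which is what makes alias resolution a core engineering problem rather than a one-liner.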
Where The Assistant Made Engineering Choices
The assistant made implementation decisions in places where the user specified the goal but not the mechanism.
1. Storage model
Choice:
- stage Volve into MinIO and run the field-package workflow off bucket + prefix
Reason:
- WAN-safe
- browser-independent
- reproducible
- avoids dependence on one host-local path
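The bucket + prefix addressing can be sketched as follows. The bucket and prefix names are assumptions, not the stack's actual configuration; a real staging step would then upload each file with an S3-compatible client (for example, the MinIO Python SDK's `fput_object`):

```python
# Illustrative sketch: map local Volve files into WAN-safe bucket/prefix
# object keys. BUCKET and PREFIX are hypothetical names, not repo config.
from pathlib import PurePosixPath

BUCKET = "volve-staging"            # hypothetical bucket
PREFIX = "field_packages/volve"     # hypothetical prefix

def object_key(local_path: str, package: str) -> str:
    """Derive a deterministic object key from a local file and package name."""
    name = PurePosixPath(local_path).name
    return str(PurePosixPath(PREFIX) / package / name)

key = object_key("/data/volve/production/volumes.csv", "production")
# "field_packages/volve/production/volumes.csv"
```

Because the key is derived only from package and file name, re-staging is reproducible and independent of any one host-local path.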
2. Database layout
Choice:
- one PostgreSQL cluster with schema separation instead of several separate PostgreSQL databases
Reason:
- lower operational complexity
- easier joins across raw, canonical, audit, semantic, and scoring layers
- sufficient for current scale
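The one-cluster, many-schema layout can be sketched as DDL. The schema names below are guesses at the layers named in this document, not the actual migration contents:

```python
# Hedged sketch: schema separation inside one PostgreSQL cluster.
# Schema names are illustrative labels for the layers this document names.
SCHEMAS = ["raw", "canonical", "audit", "semantic", "scoring"]

def schema_ddl(schemas):
    """Emit CREATE SCHEMA statements for a one-cluster, many-schema layout."""
    return [f"CREATE SCHEMA IF NOT EXISTS {s};" for s in schemas]

ddl = schema_ddl(SCHEMAS)
```

Keeping all layers in one cluster is what makes cross-layer joins (raw provenance against canonical facts against scoring output) a plain SQL join rather than a cross-database federation problem.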
3. Canonical-first workflow
Choice:
- land each domain in canonical tables before downstream scoring
Reason:
- keeps traceability explicit
- avoids scoring directly from raw archives
- makes UI exploration and audits possible
4. Phase-based ingestion
Choice:
- split the field-package pipeline into validation, phase 1 normalization, phase 2 enrichments, seismic support, reservoir bodies, scoring, and volumetrics
Reason:
- lets the system run on incomplete packages
- isolates parser debt from core canonical load
- makes readiness visible
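The phase isolation described above can be sketched as a runner where each phase declares its own applicability, so a package missing one domain still completes the others. Phase names mirror this document; the callables are placeholders, not repository code:

```python
# Hedged sketch of a phase-based pipeline runner. A phase that does not
# apply is "skipped" (missing inputs, not a failure), and a phase that
# throws is "failed" without blocking the other phases.

def run_phases(package, phases):
    """Run each applicable phase and record per-phase readiness."""
    results = {}
    for name, applies, run in phases:
        if not applies(package):
            results[name] = "skipped"
            continue
        try:
            run(package)
            results[name] = "loaded"
        except Exception:
            results[name] = "failed"   # isolate parser debt per phase
    return results

phases = [
    ("validation", lambda p: True, lambda p: None),
    ("phase1_normalization", lambda p: True, lambda p: None),
    ("seismic_support", lambda p: "seismic" in p["domains"], lambda p: None),
]
status = run_phases({"domains": ["production"]}, phases)
# seismic_support is skipped because the package has no seismic domain
```

This is what "lets the system run on incomplete packages" means in practice: readiness is a per-phase status, not a single pass/fail.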
5. Coverage audit standard
Choice:
- define canonical_loaded at the package/artifact level first
Reason:
- user required all data to be processed and represented
- some archives contain binaries that cannot be fully semantically decoded in the same pass
- package/artifact canonicalization is the minimum honest threshold for "represented and queryable"
6. Scoring and volumetric methods
Choice:
- use relative cross-well scoring and proxy low/mid/high barrel ranges
Reason:
- ranking and scoping were possible before full cell-level reservoir simulation parsing
- avoids false certainty
- keeps the system decision-support grade rather than pretending to be booked reserves
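The heuristic character of these layers can be made concrete with a sketch. The weights, the low/high factors, and the scores are invented for illustration and are not the repository's actual model:

```python
# Hedged sketch: relative cross-well ranking plus a proxy low/mid/high
# barrel range. Both are decision-support heuristics, not booked reserves.

def relative_rank(scores):
    """Rank wells by score, highest first (purely relative comparison)."""
    return sorted(scores, key=scores.get, reverse=True)

def barrel_range(mid_estimate, low_factor=0.5, high_factor=1.5):
    """Proxy low/mid/high range around a mid estimate to signal uncertainty."""
    return (mid_estimate * low_factor, mid_estimate, mid_estimate * high_factor)

# Illustrative scores only:
ranked = relative_rank({"well_a": 0.72, "well_b": 0.41, "well_c": 0.88})
low, mid, high = barrel_range(120_000.0)
```

Publishing a range rather than a single number is the mechanism that "avoids false certainty": the spread is an explicit statement that the estimate is a proxy.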
Build Chronology By Capability
A. Foundation
Core schemas, auth, upload/session flow, raw registry, CRS, audit, and semantic spine were added first.
Main files:
- 0001_extensions_and_base_schemas.py
- 0003_crs_tables.py
- 0004_audit_tables_and_immutability.py
- 0016_raw_layer_source_registry.py
- 0017_semantic_ontology_spine.py
B. Field-package profiling and validation
This layer inventories a real field package and decides what can be loaded.
Main files:
- field_package/__init__.py
- field_validation/__init__.py
- profile_volve_field_package.py
- validate_volve_field_package.py
C. Canonical field-package normalization
Phase 1 canonicalized wells, aliases, locations, production, tops, completions, and structural context.
Main files:
D. Phase-2 log and reservoir support
This added interpreted log outputs, pay events, bypassed candidates, and dynamic reservoir support.
Main files:
E. Technical, WITSML, seismic, reservoir-body, and full artifact ingest
This expanded processing from selective domains into full package coverage.
Main files:
- 0023_field_package_seismic_support.py
- 0024_reservoir_bodies_and_penetrations.py
- 0025_canonical_witsml_context.py
- 0026_field_package_coverage_and_technical_reports.py
- 0027_canonical_artifact_ingest.py
- 0028_canonical_package_artifacts.py
- field_technical/__init__.py
- field_witsml/__init__.py
- field_seismic/__init__.py
- field_reservoir/__init__.py
- field_logs_full/__init__.py
- field_model_artifacts/__init__.py
- field_reports/__init__.py
- field_misc/__init__.py
- field_coverage/__init__.py
F. Scoring and remaining barrels
These layers convert canonical evidence into ranked candidates and low/mid/high barrel ranges.
Main files:
- 0018_reopening_target_scoring.py
- 0022_remaining_barrel_estimates.py
- field_scoring/__init__.py
- field_volumetrics/__init__.py
G. Application and deployment path
The workflow is exposed through the data-plane API, client upload page, WAN proxy, and MinIO staging utilities.
Main files:
- main.py
- upload.html
- upload.js
- field_package_storage/__init__.py
- stage_volve_to_minio.py
- docker-compose.yml
How To Read "Full Processing"
There are four different meanings of "fully processed." They should not be conflated.
1. Coverage complete
Every package is known, validated, staged, and represented in canonical tables.
Current state:
- yes
2. Canonical domain complete
The important structured facts from a domain are loaded into typed canonical tables.
Current state:
- mostly yes across production, geophysical interpretations, WITSML subsets, technical daily reports, seismic support, reservoir support, logs, model artifacts, reports, and misc artifacts
3. Semantic decode complete
The native meaning of every binary/text format is deeply extracted.
Current state:
- no
Examples still not fully decoded:
- all DLIS/LIS curves at curve/sample level
- all SEG-Y traces and headers
- all Eclipse cell/property/schedule semantics
- all RMS property realizations
- all technical HTML/PDF report semantics
4. Decision-complete
The system can answer the business question with strong confidence and low manual cleanup.
Current state:
- partially yes for ranking and screening
- not yes for reserve-grade or full engineering sign-off
What Was Computed Versus What Was Interpreted
Directly loaded from source data
- production dates and volumes
- well names and aliases from structured files
- tops and perforations
- structural surfaces
- WITSML XML objects
- report and artifact inventories
- archive member-level metadata
Deterministically derived
- alias normalization
- WGS84 and ECEF coordinates
- pay event intervals from interpreted logs
- bypassed-pay candidate flags
- package/domain coverage status
- per-well source links
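One of the deterministic derivations above, WGS84 to ECEF, is worth showing because it is exactly reproducible from published ellipsoid constants. This is the standard WGS84 geodetic-to-ECEF conversion, not necessarily the literal code in this repository:

```python
# Deterministic derivation example: WGS84 geodetic lat/lon/height -> ECEF.
# Constants are the published WGS84 ellipsoid parameters.
import math

WGS84_A = 6378137.0                  # semi-major axis (m)
WGS84_F = 1 / 298.257223563          # flattening
WGS84_E2 = WGS84_F * (2 - WGS84_F)   # first eccentricity squared

def wgs84_to_ecef(lat_deg, lon_deg, h=0.0):
    """Convert geodetic coordinates to Earth-centered Earth-fixed XYZ (m)."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    n = WGS84_A / math.sqrt(1 - WGS84_E2 * math.sin(lat) ** 2)  # prime vertical radius
    x = (n + h) * math.cos(lat) * math.cos(lon)
    y = (n + h) * math.cos(lat) * math.sin(lon)
    z = (n * (1 - WGS84_E2) + h) * math.sin(lat)
    return x, y, z

# On the equator at the prime meridian, X equals the semi-major axis:
x, y, z = wgs84_to_ecef(0.0, 0.0)
```

Because the output depends only on the inputs and fixed constants, derived coordinates belong in the deterministic layer, not the heuristic one.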
Heuristic or model-based
- reopening score
- relative cross-well rank
- reservoir body support from mixed sources
- accessible-area proxies
- low/mid/high remaining barrel estimates
Those heuristic layers are deliberately kept separate from raw and canonical factual layers.
What Still Needs To Be Built
If the target is "deep full semantic processing," these remain:
- full DLIS/LIS curve extraction and mnemonic/unit normalization,
- full SEG-Y header/sample decode into canonical seismic trace objects,
- deeper Eclipse parsing for schedules, vectors, and cell-level semantics,
- deeper RMS parsing for realization/property/cell semantics,
- structured parsing of technical HTML/PDF reports beyond XML daily reports,
- stronger well-to-compartment intersection logic using more model geometry and fewer proxies,
- explicit rights/attribution enforcement in download/export flows,
- formal repo licensing and third-party notice handling.
Teaching Standard For Future Work
When adding new capabilities, use this order:
- prove the source exists,
- validate domain coverage,
- define the canonical target table,
- preserve raw provenance,
- separate deterministic facts from heuristics,
- expose the result in the UI and API,
- add coverage audit status,
- document legal/license boundaries.
If a new build step does not satisfy those eight points, it is not finished.
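The eight-point standard can be expressed as a trivial completion check. The step identifiers below paraphrase the list and are not repository code:

```python
# Sketch: the eight-point teaching standard as an all-or-nothing gate.
# Step names are paraphrases of the list above, chosen for illustration.
REQUIRED_STEPS = frozenset({
    "source_proven", "coverage_validated", "canonical_target_defined",
    "raw_provenance_preserved", "facts_vs_heuristics_separated",
    "exposed_in_ui_api", "coverage_audit_added", "legal_boundaries_documented",
})

def is_finished(done):
    """A build step is finished only when all eight points are satisfied."""
    return REQUIRED_STEPS.issubset(done)
```

The point of the all-or-nothing shape is that seven of eight is still "not finished"; there is no partial credit in the standard.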
Bottom Line
This build is no longer just a schema exercise.
It is now:
- a MinIO-backed field-package platform,
- with canonical coverage for all staged Volve packages,
- with scoring and volumetric outputs,
- and with explicit separation between facts, deterministic derivations, and heuristic decisions.
What it is not yet:
- a complete semantic decoder for every binary engineering format,
- or a reserve-booking system.