Build Rationale, Schema, And IP Defense

Source: docs/architecture/BUILD_RATIONALE_SCHEMA_AND_IP_DEFENSE_MANUAL.html


Authorship And Rights Position

This documentation records Johann F.R Wilhelm as the author and creator of the project direction, problem framing, and intellectual build intent.

Assistant contributions in this repository are implementation assistance, documentation assistance, and engineering structure supplied within that user-directed frame.

This is an authorship and provenance statement for internal and colleague review. It does not replace formal legal registration or counsel review.

Build Position

This build is a real-data field-package processing system centered on Volve. It is not a toy schema exercise and it is not a reserve-booking engine.

It should be described explicitly as a user-directed proof-of-value / proof-of-concept built from open data, not as a production deployment and not as an integration into proprietary internal software.

The user defined the objective and kept narrowing the scope until the build had a concrete shape. The assistant implemented the mechanics needed to execute that scope.

The practical design rule was: preserve provenance, validate before normalization, canonicalize before scoring, and mark derived outputs honestly.

  • 40 canonical wells
  • 16,160 production records
  • 78 reservoir bodies
  • 18/18 packages canonical_loaded

What Was Determined By Human Direction

These were not merely isolated feature requests. They were the user-directed structure of the investigation and therefore count as human-directed build intent.

  • Use real Volve data instead of a hypothetical demo schema.
  • Validate all data before normalization.
  • Support WAN access through MinIO and the app UI.
  • Normalize all major data classes, not only high-priority subsets.
  • Keep the workflow explicit and visible in the UI.
  • Export outputs so the build can be inspected and defended.
  • Explain clearly what is human, what is derived, and what is AI.
  • Review IP and copyright exposure.

The user was also directing the build toward a harder question: whether open data could support a defendable payload-screening proof-of-concept, how difficult that would be, what could realistically be used, and what would still be missing before the result could become part of a production monetizable application.

These requirements therefore fixed the build boundary: use open data, keep secret software out of scope, make the result inspectable by colleagues, and optimize for defendability rather than novelty claims.

They also surfaced several draft-project conclusions that were added back into the build structure:

  • spatial truth needs explicit CRS/ECEF discipline,
  • identity resolution across wells and packages is a first-order problem,
  • “data present” is not the same as “decision-grade,”
  • artifact-level canonicalization matters even when deep native decode is incomplete,
  • ranking and volumetrics must be presented as heuristic outputs, not false certainties.
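
The CRS/ECEF discipline named above can be illustrated with the standard WGS84 geodetic-to-ECEF conversion. This is a generic sketch of the deterministic transform class the build relies on, not the repository's actual implementation; only the WGS84 ellipsoid constants are authoritative.

```python
import math

# WGS84 ellipsoid constants (standard published values)
WGS84_A = 6378137.0                    # semi-major axis, metres
WGS84_F = 1.0 / 298.257223563          # flattening
WGS84_E2 = WGS84_F * (2.0 - WGS84_F)   # first eccentricity squared

def wgs84_to_ecef(lat_deg: float, lon_deg: float, h_m: float = 0.0):
    """Convert geodetic WGS84 coordinates to ECEF metres (deterministic)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    # Prime vertical radius of curvature at this latitude
    n = WGS84_A / math.sqrt(1.0 - WGS84_E2 * math.sin(lat) ** 2)
    x = (n + h_m) * math.cos(lat) * math.cos(lon)
    y = (n + h_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1.0 - WGS84_E2) + h_m) * math.sin(lat)
    return x, y, z
```

Because the transform is closed-form and deterministic, its outputs can be stored as auditable fields rather than recomputed ad hoc, which is the point of the "explicit CRS/ECEF discipline" rule.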

What Was Determined By Assistant Engineering Decisions

The user set the goals. The assistant chose the implementation mechanics.

  • One PostgreSQL cluster with schema separation instead of several independent databases.
  • MinIO object storage as the WAN-safe field-package source.
  • Phase-based pipeline: validate, normalize, enrich, score, estimate.
  • Canonical-first architecture instead of scoring from raw archives.
  • Coverage audit as an explicit package completeness ledger.
  • Rule-based scoring and low/mid/high volumetrics instead of pretending to have reserve-grade certainty.
  • Semantic entity projection from canonical records.

These are engineering choices, not user-authored domain facts. Another competent team could have built a comparable proof-of-concept from the same requirements. The value here is explicitness and traceability, not exclusivity.
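
The phase ordering listed above (validate, normalize, enrich, score, estimate) can be sketched as a simple ordered pipeline. Every function body, field name, and weight below is invented for illustration; only the phase names and their order come from the document.

```python
from typing import Callable

def validate(pkg: dict) -> dict:
    # Stand-in for the readiness/coverage gate
    pkg["validated"] = bool(pkg.get("files"))
    return pkg

def normalize(pkg: dict) -> dict:
    # Stand-in for canonical normalization of source files
    pkg["canonical"] = [f.lower() for f in pkg.get("files", [])]
    return pkg

def enrich(pkg: dict) -> dict:
    # Stand-in for cross-record linking
    pkg["links"] = len(pkg.get("canonical", []))
    return pkg

def score(pkg: dict) -> dict:
    # Stand-in for rule-based scoring (illustrative weight)
    pkg["score"] = pkg.get("links", 0) * 0.5
    return pkg

def estimate(pkg: dict) -> dict:
    # Outputs are always labelled heuristic, never reserve-grade
    pkg["estimate_grade"] = "heuristic"
    return pkg

PHASES: list[Callable[[dict], dict]] = [validate, normalize, enrich, score, estimate]

def run_pipeline(pkg: dict) -> dict:
    for phase in PHASES:
        pkg = phase(pkg)
    return pkg
```

The design value is that each phase leaves an inspectable marker on the record, so a reviewer can see which gate a package passed and which outputs are derived.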

Architecture Schematic

  1. Source Packages: Volve files in local staging and MinIO.
  2. Validation Gate: package inventory, readiness, parser debt, coverage.
  3. Canonical Normalization: wells, tops, completions, production, seismic, WITSML, reports, artifacts.
  4. Derived Layers: QC, pay events, bypassed candidates, reservoir bodies, penetrations.
  5. Decision Outputs: reopening ranking and remaining-barrel estimates.

Cross-cutting planes:

  • UI / WAN Access: upload page, manuals, admin workflow, MinIO prefix mode.
  • Data Plane: validation, normalization, export, scoring, coverage audit.
  • Storage: MinIO for package staging, PostgreSQL for canonical and audit state.

How The Database Is Used

Each row gives the schema, its purpose, and why it exists:

  • raw: source bundle/object registry. Proves what entered the system and from where.
  • ops: canonical operational truth. Holds typed domain records used by the application and analysis logic.
  • semantic: ontology projection. Creates entity/relationship structure for later query and AI use.
  • audit: traceability and evidence. Allows the build to be defended and reproduced.
  • core: users, permissions, tenancy. Separates platform control state from subsurface domain state.

The database is therefore not just storage. It is the structure that preserves provenance, reproducibility, and separation between source facts and derived outputs.

It is also how CRS normalization and ECEF/WGS84 transforms become defendable source-of-truth fields rather than hidden script behavior.
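
One way to make a transform a "source-of-truth field rather than hidden script behavior" is to store the derived value alongside an explicit derivation label and a provenance pointer. The record shape below is a hypothetical sketch; the actual columns of ops.canonical_well_locations are not asserted here.

```python
from dataclasses import dataclass, asdict

@dataclass
class WellLocationRecord:
    """Illustrative canonical-location row with explicit provenance fields."""
    well_id: str
    source_crs: str      # CRS as declared by the source package (human fact)
    lat_wgs84: float     # normalized coordinate (deterministic transform)
    lon_wgs84: float
    derivation: str      # e.g. "deterministic:wgs84_to_ecef" (method label)
    source_ref: str      # pointer back to the raw object that supplied it

def as_row(rec: WellLocationRecord) -> dict:
    """Flatten the record for insertion into a canonical table."""
    return asdict(rec)
```

Storing the method label in the row means a reviewer can distinguish a source-authored coordinate from a transformed one without reading pipeline code.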

High-Level Schema Overview

Each row gives the layer, its main tables, and what they mean:

  • Canonical identity: ops.canonical_wells, ops.canonical_well_aliases, ops.canonical_well_locations. Stable well master, aliases, spatial anchors.
  • Static subsurface: ops.canonical_formation_tops, ops.canonical_completion_intervals, ops.canonical_structural_surfaces. Tops, perforations, and structural context.
  • Dynamic and technical: ops.canonical_production_records, ops.canonical_dynamic_reservoir_context, ops.canonical_technical_daily_reports, ops.canonical_witsml_*. Time-series production and technical operational context.
  • Logs and interpretation: ops.well_logs, ops.well_qc_cards, ops.well_interpretations, ops.well_pay_events, ops.well_bypassed_candidates. Log normalization, QC, pay intervals, and candidate logic.
  • Reservoir and seismic support: ops.canonical_seismic_*, ops.canonical_reservoir_bodies, ops.canonical_well_reservoir_penetrations, ops.canonical_reservoir_model_artifacts. Survey/model representation and reservoir-access logic.
  • Decision outputs: ops.well_reopening_targets, ops.remaining_barrel_estimates. Ranked decisions and low/mid/high barrel estimates.
  • Semantic projection: semantic.entities, semantic.entity_links, semantic.entity_source_links. Entity graph projected from canonical truth.
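
The semantic projection layer can be sketched as a pure function from canonical rows to entity records. The field names below (entity_type, entity_key, label, source_table) are hypothetical, chosen only to illustrate the traceability link back to canonical truth.

```python
def project_entities(canonical_wells: list[dict]) -> list[dict]:
    """Project canonical well rows into a flat entity list.

    Illustrative sketch: real semantic.entities columns may differ.
    """
    entities = []
    for well in canonical_wells:
        entities.append({
            "entity_type": "well",
            "entity_key": well["well_id"],
            "label": well.get("name", well["well_id"]),
            # Back-pointer so every entity is traceable to its canonical source
            "source_table": "ops.canonical_wells",
        })
    return entities
```

Keeping the projection one-directional (canonical to semantic, never the reverse) preserves the rule that derived layers never overwrite source-of-truth records.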

Human vs AI vs Derived Overview

Each row gives the class, examples, and how to defend it:

  • Human source data: production values, tops, completions, WITSML XML fields, technical reports. Directly extracted from source packages.
  • Human-directed scope: use Volve, use MinIO, process all packages, expose workflow, export results. Product intent came from the user.
  • Assistant-designed implementation: schema layout, migrations, workflow phases, scoring mechanics, coverage audit. Engineering design choices made to satisfy user goals.
  • Deterministic transform: WGS84/ECEF, counts, geometry envelopes, package inventories. Mechanically derived from source facts and used to create a defendable spatial source of truth.
  • Rule-based heuristic: scores, confidence labels, bypassed-pay candidates, remaining barrels. Authored rules and weights, not runtime opaque model inference.
  • Runtime AI/ML use: none found in the canonicalization/scoring path. Current live outputs are deterministic or heuristic, not live LLM decisions.
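
A rule-based heuristic in the sense used above is just authored rules and weights, inspectable in full. The feature names, weights, and thresholds below are invented for illustration; the point is that the entire decision surface is readable, unlike opaque model inference.

```python
# Authored evidence rules with explicit weights (illustrative values only)
RULES = [
    ("has_untested_pay",  0.4),
    ("late_life_decline", 0.3),
    ("good_log_coverage", 0.2),
    ("seismic_support",   0.1),
]

def reopening_score(features: dict) -> float:
    """Weighted sum of boolean evidence flags; range 0.0 to 1.0."""
    return sum(weight for name, weight in RULES if features.get(name))

def confidence_label(score: float) -> str:
    """Map a score to a coarse confidence band (thresholds are authored)."""
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "mid"
    return "low"
```

Because every weight and threshold is a named constant, a colleague can audit, dispute, or re-weight the ranking without reverse-engineering anything.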

How We Get To The Conclusions

  1. Inventory packages and confirm what domains exist.
  2. Validate readiness and identify parser debt or partial coverage.
  3. Normalize source data into typed canonical tables.
  4. Link records across wells, logs, WITSML, reports, seismic, and models.
  5. Compute derived outputs such as QC, pay events, penetrations, and support scores.
  6. Compute decision-support outputs such as reopening targets and remaining barrels.
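
For the remaining-barrel step, a screening-grade low/mid/high estimate can be sketched with the standard volumetric original-oil-in-place formula. All input values and recovery factors below are invented, not Volve-derived; only the 7,758 barrels-per-acre-foot conversion and the formula shape are standard.

```python
BBL_PER_ACRE_FT = 7758.0  # stock-tank barrels per acre-foot (standard constant)

def ooip_bbl(area_acres: float, thickness_ft: float,
             porosity: float, sw: float, bo: float) -> float:
    """Original oil in place, stock-tank barrels (volumetric method)."""
    return BBL_PER_ACRE_FT * area_acres * thickness_ft * porosity * (1.0 - sw) / bo

def remaining_bbl(ooip: float, cum_prod_bbl: float,
                  rf_low: float = 0.15, rf_mid: float = 0.30,
                  rf_high: float = 0.45) -> dict:
    """Low/mid/high remaining barrels under assumed recovery factors.

    Heuristic screening output only; never a booked reserve figure.
    """
    return {
        case: max(ooip * rf - cum_prod_bbl, 0.0)
        for case, rf in (("low", rf_low), ("mid", rf_mid), ("high", rf_high))
    }
```

Presenting the result as a three-case range with named recovery-factor assumptions is what keeps the output honest as a heuristic rather than a false certainty.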

The conclusion is therefore evidence-driven. It is not “AI said so.” It is “source data was normalized, linked, and then processed by authored deterministic and heuristic logic.”

The product-origin boundary is equally important: this is a user-directed POC executed with assistant-authored implementation, not a product concept that originated independently from the assistant.

IP And Copyright Boundary

Each row gives the area, the current position, and the remaining gap:

  • Repository code: traceable and documented. No top-level LICENSE or rights statement yet.
  • Third-party libraries: bundled assets identified. No central third-party notices file yet.
  • Dataset handling: provenance preserved and excerpts tightened. License propagation into exports/downloads is not fully explicit.
  • AI authorship posture: build provenance documented. No formal repo-level AI authorship policy yet.

Defense Summary For Colleagues

Use this exact line of argument:

  1. The build is grounded in real source data, not synthetic examples.
  2. The database separates raw source records from canonical truth and from derived outputs.
  3. Source-authored values are preserved with provenance.
  4. Derived values are explicitly marked as deterministic or heuristic.
  5. The system currently does not rely on runtime LLM/ML inference for canonicalization or ranking.
  6. The product direction came from Johann F.R Wilhelm; the assistant supplied implementation detail.
  7. The current barrel outputs are screening-grade heuristics, not booked reserves.
  8. IP posture is improving, but still requires repo license, third-party notices, and formal AI authorship policy.

Supporting Artifacts

Source-side artifacts for deeper validation:

  • docs/operations/BUILD_SCHEMA_PROVENANCE_AUDIT_20260330.json
  • docs/operations/BUILD_SCHEMA_PROVENANCE_AUDIT_20260330.md
  • docs/operations/BUILD_SCHEMA_PROVENANCE_AUDIT_20260330.csv