Earthbond Ontology Blueprint And AI Query Layer

Source: docs/architecture/EARTHBOND_ONTOLOGY_BLUEPRINT_AND_AI_QUERY_LAYER.md

Purpose

This document defines the minimum viable ontology spine for Earthbond.

It is not a universal geoscience ontology.

It is the governed semantic layer required to:

  1. normalize heterogeneous well, drilling, spatial, and evidence data,
  2. preserve source-specific meaning without flattening everything into one table,
  3. expose stable entities and relationships for workflow logic,
  4. support AI retrieval and model execution against canonical facts instead of raw file chaos.

The ontology must be good enough for production workflows before it is broad enough for research.

Design Position

The platform should not begin with a giant ontology project.

It should begin with a workflow ontology centered on:

  1. well identity,
  2. spatial truth,
  3. log normalization,
  4. completion context,
  5. pay-event and candidate outputs,
  6. evidence and gap tracking.

That is enough to drive:

  1. deterministic workflow routing,
  2. audit-ready outputs,
  3. AI query and ranking,
  4. future expansion into additional source classes.

High-Level Semantic Architecture

```mermaid
flowchart TD
    A["Raw Source Schemas"] --> B["Source Bundle Registry"]
    B --> C["Canonical Entity Layer"]
    C --> D["Relationship Layer"]
    D --> E["Truth Index / Semantic Query Layer"]
    E --> F["Rules Engines"]
    E --> G["AI Retrieval + Model Context"]
    F --> H["Operational Outputs"]
    G --> H
    H --> I["Evidence Pack + Audit"]
```

Why This Ontology Is Needed

Without ontology, the system has:

  1. multiple source formats,
  2. inconsistent field names,
  3. conflicting identifiers,
  4. ambiguous CRS and vertical semantics,
  5. weak links between source evidence and derived outputs.

That makes AI unreliable because the model cannot distinguish:

  1. raw observations from canonical facts,
  2. preferred records from alternates,
  3. evidence from inference,
  4. confidence from certainty.

The ontology resolves that by giving the platform:

  1. stable entity definitions,
  2. controlled relationships,
  3. preferred-record selection,
  4. provenance links,
  5. query contracts for workflows and AI.

Ontology Layers

1. Source Schema Layer

This preserves source-native structure.

Examples:

  1. LAS header fields,
  2. DLIS metadata,
  3. scanned PDF OCR tables,
  4. CalGEM well fields,
  5. completion spreadsheets,
  6. survey CSV exports,
  7. point-cloud sidecar metadata.

This layer should not be forced into canonical names too early.

It belongs in raw.

2. Canonical Entity Layer

This is the first real semantic contract.

Each canonical entity must have:

  1. a stable entity_type,
  2. a stable entity_key,
  3. project scoping,
  4. provenance,
  5. confidence,
  6. preferred/not-preferred state.

This layer belongs in semantic.entities.

3. Relationship Layer

This expresses how entities connect.

Examples:

  1. a source object describes a well,
  2. a well has a location,
  3. a location uses a spatial reference,
  4. a pay event is derived from a log run,
  5. a candidate is proved by an evidence pack.

This layer belongs in semantic.entity_links.

4. Evidence Binding Layer

Every semantic entity that matters operationally must be linked back to:

  1. source object,
  2. source bundle,
  3. upload,
  4. authority rank,
  5. extraction method.

This layer belongs in semantic.entity_source_links.

5. Query Profile Layer

The platform needs a stable semantic query contract for:

  1. workflow modules,
  2. dashboards,
  3. AI retrieval,
  4. downstream models.

This belongs in semantic.query_profiles.

Minimum Canonical Entity Model

The initial ontology spine should include these entity types.

Control / project

  1. project

Raw/source

  1. source_bundle
  2. source_object

Identity / well master

  1. well
  2. well_identifier

Spatial

  1. well_location
  2. spatial_reference
  3. transform_step

Subsurface geometry

  1. trajectory
  2. survey_station

Petrophysics

  1. log_run
  2. curve

Geology / completion / production

  1. formation_top
  2. completion_interval
  3. production_record

Interpretation

  1. pay_event
  2. bypassed_pay_candidate

Governance / audit

  1. data_gap
  2. evidence_pack

This set is intentionally narrow.

It supports the current Earthbond POV without overcommitting to a huge ontology program.

Entity Semantics

well

Represents a canonical well identity across multiple source records.

It is not one file and not one regulator row.

well_location

Represents a location claim or resolved preferred location for a well.

It may be surface or bottom-hole.

It must carry CRS/datum/epoch/vertical semantics.

spatial_reference

Represents the formal CRS and vertical reference definition used by a location or transform.

This can describe:

  1. source CRS,
  2. resolved EPSG,
  3. datum realization,
  4. vertical datum,
  5. epoch,
  6. unit system.

transform_step

Represents a single documented transformation or normalization step.

Examples:

  1. source projected CRS -> WGS84 geodetic,
  2. geodetic + ellipsoidal height -> ECEF,
  3. MD -> TVDSS using minimum curvature.

log_run

Represents a specific well-log run with its source context.

This is distinct from the well itself.

curve

Represents a curve within a log run after mnemonic and unit normalization.

pay_event

Represents a derived subsurface interval of interest.

It is deterministic output, not raw source.

bypassed_pay_candidate

Represents a ranked review candidate.

It is downstream of:

  1. normalized logs,
  2. completion reconciliation,
  3. gap assessment,
  4. confidence scoring.

data_gap

Represents a structured missing, conflicting, or insufficient-data signal.

It must be queryable, not hidden in narrative notes.

evidence_pack

Represents the reproducibility contract.

It should link:

  1. source inputs,
  2. transform chain,
  3. formulas/cutoffs,
  4. outputs,
  5. audit metadata.

Relationship Model

The first relationship set should remain small and high-value.

```mermaid
flowchart LR
    A["Source Bundle"] -->|"bundle_contains_object"| B["Source Object"]
    B -->|"object_describes_well"| C["Well"]
    C -->|"well_has_identifier"| D["Well Identifier"]
    C -->|"well_has_location"| E["Well Location"]
    E -->|"location_uses_spatial_reference"| F["Spatial Reference"]
    E -->|"location_transformed_by"| G["Transform Step"]
    C -->|"well_has_trajectory"| H["Trajectory"]
    H -->|"trajectory_has_station"| I["Survey Station"]
    C -->|"well_has_log_run"| J["Log Run"]
    J -->|"log_run_has_curve"| K["Curve"]
    C -->|"well_has_formation_top"| L["Formation Top"]
    C -->|"well_has_completion_interval"| M["Completion Interval"]
    C -->|"well_has_production_record"| N["Production Record"]
    O["Pay Event"] -->|"pay_event_derived_from_log_run"| J
    P["Bypassed Pay Candidate"] -->|"candidate_derived_from_pay_event"| O
    P -->|"candidate_proved_by_evidence_pack"| Q["Evidence Pack"]
    C -->|"well_has_data_gap"| R["Data Gap"]
    P -->|"candidate_blocked_by_gap"| R
```

PostgreSQL Semantic Schema Design

The semantic schema should remain relational first.

Do not introduce a separate graph database before the relational ontology spine is proven.

Why PostgreSQL first

PostgreSQL already gives the platform:

  1. transactional integrity,
  2. tenant/project scoping,
  3. JSON support,
  4. GIN indexing,
  5. compatibility with current migrations and APIs,
  6. easy joinability with raw, ops, and audit.

Required semantic tables

The semantic spine migration should create:

  1. semantic.entity_types
  2. semantic.relation_types
  3. semantic.entities
  4. semantic.entity_links
  5. semantic.entity_source_links
  6. semantic.query_profiles

Each table is described below.

semantic.entity_types

Purpose:

  1. controlled list of canonical entity types,
  2. domain grouping,
  3. human-readable descriptions.

semantic.relation_types

Purpose:

  1. controlled list of valid relationship types,
  2. expected source and target entity type keys,
  3. documentation for relationship meaning.
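A minimal DDL sketch of these two controlled-vocabulary tables follows. Column names beyond the stated purposes (such as `domain` and `description`) and the example rows are illustrative assumptions, not a confirmed migration.

```sql
CREATE SCHEMA IF NOT EXISTS semantic;

CREATE TABLE semantic.entity_types (
    entity_type_key text PRIMARY KEY,        -- e.g. 'well', 'log_run'
    domain          text NOT NULL,           -- grouping, e.g. 'spatial', 'petrophysics'
    description     text NOT NULL            -- human-readable meaning
);

CREATE TABLE semantic.relation_types (
    relation_type_key      text PRIMARY KEY, -- e.g. 'well_has_location'
    source_entity_type_key text NOT NULL REFERENCES semantic.entity_types,
    target_entity_type_key text NOT NULL REFERENCES semantic.entity_types,
    description            text NOT NULL
);

-- Example rows drawn from the relationship model above.
INSERT INTO semantic.entity_types VALUES
    ('well',          'identity', 'Canonical well identity across source records'),
    ('well_location', 'spatial',  'Location claim or resolved preferred location');

INSERT INTO semantic.relation_types VALUES
    ('well_has_location', 'well', 'well_location',
     'A canonical well has one or more location claims');
```

Keeping both vocabularies in tables (rather than enums) lets new entity and relation types land as data, without a migration.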

semantic.entities

Purpose:

  1. store project-scoped semantic entities,
  2. mark preferred records,
  3. preserve provenance and attributes,
  4. provide stable keys for joins and retrieval.

Important fields:

  1. project_id
  2. entity_type_key
  3. entity_key
  4. display_name
  5. confidence
  6. is_preferred
  7. canonical_ref
  8. attributes
  9. provenance
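The fields above can be sketched as a table definition. Types, the identity column, and the uniqueness constraint are assumptions layered on the listed field names.

```sql
CREATE TABLE semantic.entities (
    entity_id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    project_id      uuid   NOT NULL,                    -- scoping to core projects
    entity_type_key text   NOT NULL REFERENCES semantic.entity_types,
    entity_key      text   NOT NULL,                    -- stable business key
    display_name    text,
    confidence      numeric(4,3) CHECK (confidence BETWEEN 0 AND 1),
    is_preferred    boolean NOT NULL DEFAULT false,
    canonical_ref   text,                               -- pointer into ops/raw records
    attributes      jsonb  NOT NULL DEFAULT '{}'::jsonb,
    provenance      jsonb  NOT NULL DEFAULT '{}'::jsonb,
    UNIQUE (project_id, entity_type_key, entity_key)    -- stable join target
);

-- GIN index so attribute filters stay fast as sources accumulate.
CREATE INDEX entities_attributes_gin
    ON semantic.entities USING gin (attributes);
```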

semantic.entity_links

Purpose:

  1. store semantic relationships,
  2. preserve confidence and provenance on the relationship itself,
  3. allow graph-like queries within PostgreSQL.
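"Graph-like queries within PostgreSQL" in practice means recursive CTEs over the link table. A sketch, assuming `entity_links` carries `source_entity_id`, `target_entity_id`, and `relation_type_key` (`:well_entity_id` is a placeholder parameter):

```sql
-- Walk outward from one well to its relationship neighborhood, capped at 3 hops.
WITH RECURSIVE neighborhood AS (
    SELECT l.source_entity_id, l.target_entity_id, l.relation_type_key, 1 AS depth
    FROM semantic.entity_links l
    WHERE l.source_entity_id = :well_entity_id
    UNION ALL
    SELECT l.source_entity_id, l.target_entity_id, l.relation_type_key, n.depth + 1
    FROM semantic.entity_links l
    JOIN neighborhood n ON l.source_entity_id = n.target_entity_id
    WHERE n.depth < 3
)
SELECT * FROM neighborhood;
```

The depth cap keeps neighborhood retrieval bounded, which matters once AI context assembly calls this per candidate.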

semantic.entity_source_links

Purpose:

  1. trace each entity back to source objects and uploads,
  2. preserve authority rank,
  3. preserve extraction method,
  4. support evidence-driven AI retrieval.
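A DDL sketch consistent with these purposes; the foreign targets in `raw` and the exact column names are assumptions.

```sql
CREATE TABLE semantic.entity_source_links (
    entity_id         bigint NOT NULL REFERENCES semantic.entities,
    source_object_id  bigint NOT NULL,   -- raw source object (assumed key)
    source_bundle_id  bigint,            -- owning bundle (assumed key)
    upload_id         bigint,            -- originating upload (assumed key)
    authority_rank    int    NOT NULL,   -- lower = more authoritative source
    extraction_method text   NOT NULL,   -- e.g. 'las_header', 'ocr_table'
    PRIMARY KEY (entity_id, source_object_id)
);
```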

semantic.query_profiles

Purpose:

  1. define reusable semantic query contracts,
  2. specify required entity types and relation types,
  3. define expected output structure for workflows and AI.
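One way to express a profile is a single JSONB definition per row. The column names and JSON shape below are illustrative assumptions; the profile content mirrors the `well_payload_review` example later in this document.

```sql
INSERT INTO semantic.query_profiles (profile_key, target_domain, definition)
VALUES (
    'well_payload_review',
    'interpretation',
    '{
       "required_entity_types": ["well", "log_run", "pay_event",
                                 "bypassed_pay_candidate", "data_gap", "evidence_pack"],
       "required_relation_types": ["well_has_log_run",
                                   "pay_event_derived_from_log_run",
                                   "candidate_derived_from_pay_event",
                                   "candidate_proved_by_evidence_pack"],
       "output": ["ranked_candidates", "blocking_gaps",
                  "evidence_refs", "spatial_confidence"],
       "block_on": {"gap_severity": "critical"}
     }'::jsonb
);
```

Storing the contract as data means the AI retrieval layer can validate a request against the profile before building any context payload.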

Semantic Query Contract

The AI layer should not query raw file catalogs first.

It should query semantic views shaped by a query profile.

Each semantic query contract should define:

  1. target domain,
  2. required entity types,
  3. required relation types,
  4. output contract,
  5. blocking gap behavior.

Example: well_payload_review

Target domain:

  1. interpretation

Requires:

  1. well
  2. log_run
  3. pay_event
  4. bypassed_pay_candidate
  5. data_gap
  6. evidence_pack

Output contract:

  1. ranked candidates,
  2. blocking gaps,
  3. evidence references,
  4. spatial confidence.

Example: spatial_truth_audit

Target domain:

  1. spatial

Requires:

  1. well
  2. well_location
  3. spatial_reference
  4. transform_step
  5. data_gap

Output contract:

  1. preferred location,
  2. ECEF anchor,
  3. CRS chain,
  4. blocking gaps.

How AI Should Use This Layer

AI should retrieve:

  1. canonical entity summaries,
  2. relationship neighborhoods,
  3. confidence and gap states,
  4. evidence pack anchors,
  5. source snippets only when the canonical layer says they are relevant.

The AI should not:

  1. infer canonical well identity from unlinked raw rows,
  2. guess which CRS is authoritative when the semantic layer marks it unresolved,
  3. bypass gap severity or promotion rules,
  4. synthesize a final answer without evidence links.

Example AI Query Patterns

1. Payload ranking query

Goal:

Find wells in a project with reviewable bypassed-pay candidates.

Semantic intent:

  1. select preferred well,
  2. join to bypassed_pay_candidate,
  3. require supporting evidence_pack,
  4. exclude critical blocking data_gap,
  5. sort by candidate score and confidence.
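The intent above translates into a link-walking query. Relation keys follow the relationship model; the join shapes, the `score` attribute, and `:project_id` placeholder are assumptions.

```sql
SELECT w.entity_key    AS well_key,
       cand.entity_key AS candidate_key,
       (cand.attributes ->> 'score')::numeric AS score,
       cand.confidence
FROM semantic.entities w
JOIN semantic.entity_links l1
  ON l1.source_entity_id = w.entity_id
 AND l1.relation_type_key = 'well_has_log_run'
JOIN semantic.entity_links l2                       -- pay_event -> log_run
  ON l2.target_entity_id = l1.target_entity_id
 AND l2.relation_type_key = 'pay_event_derived_from_log_run'
JOIN semantic.entity_links l3                       -- candidate -> pay_event
  ON l3.target_entity_id = l2.source_entity_id
 AND l3.relation_type_key = 'candidate_derived_from_pay_event'
JOIN semantic.entities cand ON cand.entity_id = l3.source_entity_id
WHERE w.project_id = :project_id
  AND w.is_preferred
  AND EXISTS (SELECT 1 FROM semantic.entity_links e      -- require evidence
              WHERE e.source_entity_id = cand.entity_id
                AND e.relation_type_key = 'candidate_proved_by_evidence_pack')
  AND NOT EXISTS (SELECT 1 FROM semantic.entity_links g  -- exclude blocked
                  WHERE g.source_entity_id = cand.entity_id
                    AND g.relation_type_key = 'candidate_blocked_by_gap')
ORDER BY score DESC NULLS LAST, cand.confidence DESC;
```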

2. Spatial quarantine query

Goal:

Find wells that should not appear in 3D or cross-well analytics.

Semantic intent:

  1. select well,
  2. join to well_location,
  3. inspect transform_step,
  4. join data_gap,
  5. filter where gap severity is critical in crs or depth.
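A sketch of the quarantine filter, assuming gap severity and domain are stored in the `data_gap` entity's `attributes`:

```sql
SELECT DISTINCT
       w.entity_key                  AS well_key,
       gap.attributes ->> 'domain'   AS gap_domain,
       gap.attributes ->> 'severity' AS gap_severity
FROM semantic.entities w
JOIN semantic.entity_links lg
  ON lg.source_entity_id = w.entity_id
 AND lg.relation_type_key = 'well_has_data_gap'
JOIN semantic.entities gap ON gap.entity_id = lg.target_entity_id
WHERE w.project_id = :project_id
  AND gap.attributes ->> 'severity' = 'critical'
  AND gap.attributes ->> 'domain' IN ('crs', 'depth');
```

Any well returned here is excluded from 3D rendering and cross-well analytics until the gap is resolved or explicitly downgraded.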

3. Missing-completion review query

Goal:

Find wells with strong petrophysical intervals but weak completion context.

Semantic intent:

  1. select pay_event,
  2. join to well,
  3. join to data_gap,
  4. filter where the gap domain is completion context.

4. AI context assembly query

Goal:

Build context for an LLM asking why a candidate is review-worthy.

Required context:

  1. preferred well identity,
  2. location summary and spatial confidence,
  3. log-run summary,
  4. pay-event metrics,
  5. candidate score and recommendation,
  6. blocking/non-blocking gaps,
  7. evidence pack references.
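The required context can be assembled as one JSON payload per candidate. This sketch assumes a hypothetical Phase 3 view, `semantic.v_candidate_review`, that has already joined preferred well, candidate, gaps, and evidence references into a single row; every column name is illustrative.

```sql
SELECT jsonb_build_object(
    'well',               v.well_display_name,
    'location_summary',   v.location_summary,
    'spatial_confidence', v.spatial_confidence,
    'log_run_summary',    v.log_run_summary,
    'pay_event_metrics',  v.pay_event_metrics,
    'candidate_score',    v.candidate_score,
    'recommendation',     v.recommendation,
    'blocking_gaps',      v.blocking_gaps,
    'evidence_refs',      v.evidence_refs
) AS ai_context
FROM semantic.v_candidate_review v
WHERE v.candidate_entity_key = :candidate_key;
```

Handing the model one governed payload, rather than letting it join tables itself, is what keeps retrieval gap-aware and evidence-aware.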

Operational Rules

Rule 1: Raw is never canonical

No raw source row becomes a semantic entity automatically without an explicit mapping or creation step.

Rule 2: Preferred records are explicit

The semantic layer must distinguish:

  1. alternate claims,
  2. preferred canonical record,
  3. unresolved conflict.
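Preferred-record exclusivity can be enforced in the database itself with a partial unique index. This sketch assumes alternate claims for the same real-world object share a `canonical_ref`; alternates and unresolved conflicts simply stay `is_preferred = false`.

```sql
-- At most one preferred claim per canonical object, per project and type.
CREATE UNIQUE INDEX one_preferred_claim
    ON semantic.entities (project_id, entity_type_key, canonical_ref)
    WHERE is_preferred AND canonical_ref IS NOT NULL;
```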

Rule 3: Gaps are first-class

data_gap is not documentation.

It is an operational entity that blocks promotion and AI confidence.

Rule 4: Every promoted candidate needs evidence

No bypassed_pay_candidate should be considered decision-grade without an evidence_pack link.

Rule 5: Spatial truth is not implied

A well is not spatially valid because it has coordinates.

It is spatially valid when:

  1. CRS is known,
  2. vertical reference is acceptable,
  3. transform chain is documented,
  4. blocking gaps are absent or downgraded explicitly.

How This Connects To Existing Schemas

raw

Holds source-native objects and extracted fields.

The semantic layer binds canonical entities back to these sources.

ops

Holds canonical operational records and derived outputs.

The semantic layer gives them a stable cross-domain graph.

audit

Holds the evidence and reproducibility contract.

The semantic layer should point operational outputs into audit, not replace it.

core

Holds tenants, projects, users, and policies.

The semantic layer should be project-aware and permission-compatible with core.

Build Order

The ontology should be implemented in phases.

Phase 1: ontology spine

  1. entity types,
  2. relation types,
  3. entities,
  4. entity links,
  5. source links,
  6. query profiles.

Phase 2: canonical population

  1. create semantic entities during ingest and normalization,
  2. create source links for raw files and uploads,
  3. mark preferred entities.
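A Phase 2 population helper should be idempotent so re-ingesting the same source never duplicates a canonical entity. A sketch, assuming `(project_id, entity_type_key, entity_key)` is unique and using placeholder parameters (`:api_number` stands in for whatever stable well identifier ingest produces); the merge policy shown is an assumption.

```sql
INSERT INTO semantic.entities
    (project_id, entity_type_key, entity_key, display_name,
     confidence, attributes, provenance)
VALUES
    (:project_id, 'well', :api_number, :well_name,
     0.800, :attributes, :provenance)
ON CONFLICT (project_id, entity_type_key, entity_key)
DO UPDATE SET
    -- merge new attributes over old, keep the best confidence seen so far
    attributes = semantic.entities.attributes || EXCLUDED.attributes,
    confidence = GREATEST(semantic.entities.confidence, EXCLUDED.confidence);
```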

Phase 3: semantic views

  1. add materialized or logical views for preferred wells, candidate review, and spatial truth.

Phase 4: AI retrieval contract

  1. use query profiles to build AI context payloads,
  2. expose model-safe retrieval endpoints,
  3. enforce gap-aware and evidence-aware prompt assembly.

What Success Looks Like

The ontology is working when:

  1. multiple source files resolve to one canonical well,
  2. one candidate can be traced to exact curves, transforms, and evidence,
  3. the UI can show preferred records and blocking gaps clearly,
  4. AI can retrieve decision-grade context without scanning raw chaos,
  5. new source schemas can be mapped into the same canonical graph without redesigning the platform.

Recommended Next Implementation Step

After the ontology spine migration, the next code work should be:

  1. semantic population helpers in the data plane,
  2. preferred-well semantic views,
  3. candidate review semantic views,
  4. AI retrieval endpoints built on semantic.query_profiles.