Earthbond Ontology Blueprint And AI Query Layer

Source: docs/architecture/EARTHBOND_ONTOLOGY_BLUEPRINT_AND_AI_QUERY_LAYER.md

Purpose

This document defines the minimum viable ontology spine for Earthbond.

It is not a universal geoscience ontology.

It is the governed semantic layer required to:

  1. normalize heterogeneous well, drilling, spatial, and evidence data,
  2. preserve source-specific meaning without flattening everything into one table,
  3. expose stable entities and relationships for workflow logic,
  4. support AI retrieval and model execution against canonical facts instead of raw file chaos.

The ontology must be good enough for production workflows before it is broad enough for research.

Design Position

The platform should not begin with a giant ontology project.

It should begin with a workflow ontology centered on:

  1. well identity,
  2. spatial truth,
  3. log normalization,
  4. completion context,
  5. pay-event and candidate outputs,
  6. evidence and gap tracking.

That is enough to drive:

  1. deterministic workflow routing,
  2. audit-ready outputs,
  3. AI query and ranking,
  4. future expansion into additional source classes.

High-Level Semantic Architecture

```mermaid
flowchart TD
    A["Raw Source Schemas"] --> B["Source Bundle Registry"]
    B --> C["Canonical Entity Layer"]
    C --> D["Relationship Layer"]
    D --> E["Truth Index / Semantic Query Layer"]
    E --> F["Rules Engines"]
    E --> G["AI Retrieval + Model Context"]
    F --> H["Operational Outputs"]
    G --> H
    H --> I["Evidence Pack + Audit"]
```

Why This Ontology Is Needed

Without ontology, the system has:

  1. multiple source formats,
  2. inconsistent field names,
  3. conflicting identifiers,
  4. ambiguous CRS and vertical semantics,
  5. weak links between source evidence and derived outputs.

That makes AI unreliable because the model cannot distinguish:

  1. raw observations from canonical facts,
  2. preferred records from alternates,
  3. evidence from inference,
  4. confidence from certainty.

The ontology resolves that by giving the platform:

  1. stable entity definitions,
  2. controlled relationships,
  3. preferred-record selection,
  4. provenance links,
  5. query contracts for workflows and AI.

Ontology Layers

1. Source Schema Layer

This preserves source-native structure.

Examples:

  1. LAS header fields,
  2. DLIS metadata,
  3. scanned PDF OCR tables,
  4. CalGEM well fields,
  5. completion spreadsheets,
  6. survey CSV exports,
  7. point-cloud sidecar metadata.

This layer should not be forced into canonical names too early.

It belongs in raw.

2. Canonical Entity Layer

This is the first real semantic contract.

Each canonical entity must have:

  1. a stable entity_type,
  2. a stable entity_key,
  3. project scoping,
  4. provenance,
  5. confidence,
  6. preferred/not-preferred state.

This layer belongs in semantic.entities.

3. Relationship Layer

This expresses how entities connect.

Examples:

  1. a source object describes a well,
  2. a well has a location,
  3. a location uses a spatial reference,
  4. a pay event is derived from a log run,
  5. a candidate is proved by an evidence pack.

This layer belongs in semantic.entity_links.

4. Evidence Binding Layer

Every semantic entity that matters operationally must be linked back to:

  1. source object,
  2. source bundle,
  3. upload,
  4. authority rank,
  5. extraction method.

This layer belongs in semantic.entity_source_links.

5. Query Profile Layer

The platform needs a stable semantic query contract for:

  1. workflow modules,
  2. dashboards,
  3. AI retrieval,
  4. downstream models.

This belongs in semantic.query_profiles.

Minimum Canonical Entity Model

The initial ontology spine should include these entity types.

Control / project

  1. project

Raw/source

  1. source_bundle
  2. source_object

Identity / well master

  1. well
  2. well_identifier

Spatial

  1. well_location
  2. spatial_reference
  3. transform_step

Subsurface geometry

  1. trajectory
  2. survey_station

Petrophysics

  1. log_run
  2. curve

Geology / completion / production

  1. formation_top
  2. completion_interval
  3. production_record

Interpretation

  1. pay_event
  2. bypassed_pay_candidate

Governance / audit

  1. data_gap
  2. evidence_pack

This set is intentionally narrow.

It supports the current Earthbond POV without overcommitting to a huge ontology program.

Entity Semantics

well

Represents a canonical well identity across multiple source records.

It is not one file and not one regulator row.

well_location

Represents a location claim or resolved preferred location for a well.

It may be surface or bottom-hole.

It must carry CRS/datum/epoch/vertical semantics.

spatial_reference

Represents the formal CRS and vertical reference definition used by a location or transform.

This can describe:

  1. source CRS,
  2. resolved EPSG,
  3. datum realization,
  4. vertical datum,
  5. epoch,
  6. unit system.

transform_step

Represents a single documented transformation or normalization step.

Examples:

  1. source projected CRS -> WGS84 geodetic,
  2. geodetic + ellipsoidal height -> ECEF,
  3. MD -> TVDSS using minimum curvature.

log_run

Represents a specific well-log run with its source context.

This is distinct from the well itself.

curve

Represents a curve within a log run after mnemonic and unit normalization.

pay_event

Represents a derived subsurface interval of interest.

It is deterministic output, not raw source.

bypassed_pay_candidate

Represents a ranked review candidate.

It is downstream of:

  1. normalized logs,
  2. completion reconciliation,
  3. gap assessment,
  4. confidence scoring.

data_gap

Represents a structured missing, conflicting, or insufficient-data signal.

It must be queryable, not hidden in narrative notes.

evidence_pack

Represents the reproducibility contract.

It should link:

  1. source inputs,
  2. transform chain,
  3. formulas/cutoffs,
  4. outputs,
  5. audit metadata.

Relationship Model

The first relationship set should remain small and high-value.

```mermaid
flowchart LR
    A["Source Bundle"] -->|"bundle_contains_object"| B["Source Object"]
    B -->|"object_describes_well"| C["Well"]
    C -->|"well_has_identifier"| D["Well Identifier"]
    C -->|"well_has_location"| E["Well Location"]
    E -->|"location_uses_spatial_reference"| F["Spatial Reference"]
    E -->|"location_transformed_by"| G["Transform Step"]
    C -->|"well_has_trajectory"| H["Trajectory"]
    H -->|"trajectory_has_station"| I["Survey Station"]
    C -->|"well_has_log_run"| J["Log Run"]
    J -->|"log_run_has_curve"| K["Curve"]
    C -->|"well_has_formation_top"| L["Formation Top"]
    C -->|"well_has_completion_interval"| M["Completion Interval"]
    C -->|"well_has_production_record"| N["Production Record"]
    O["Pay Event"] -->|"pay_event_derived_from_log_run"| J
    P["Bypassed Pay Candidate"] -->|"candidate_derived_from_pay_event"| O
    P -->|"candidate_proved_by_evidence_pack"| Q["Evidence Pack"]
    C -->|"well_has_data_gap"| R["Data Gap"]
    P -->|"candidate_blocked_by_gap"| R
```

PostgreSQL Semantic Schema Design

The semantic schema should remain relational first.

Do not introduce a separate graph database before the relational ontology spine is proven.

Why PostgreSQL first

PostgreSQL already gives the platform:

  1. transactional integrity,
  2. tenant/project scoping,
  3. JSON support,
  4. GIN indexing,
  5. compatibility with current migrations and APIs,
  6. easy joinability with raw, ops, and audit.

Required semantic tables

The semantic spine migration should create:

  1. semantic.entity_types
  2. semantic.relation_types
  3. semantic.entities
  4. semantic.entity_links
  5. semantic.entity_source_links
  6. semantic.query_profiles

Each table is described below.

semantic.entity_types

Purpose:

  1. controlled list of canonical entity types,
  2. domain grouping,
  3. human-readable descriptions.

semantic.relation_types

Purpose:

  1. controlled list of valid relationship types,
  2. expected source and target entity type keys,
  3. documentation for relationship meaning.
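A minimal DDL sketch of these two controlled-vocabulary tables follows. Column names beyond the stated purposes (such as `domain` and `description`) and the example rows are illustrative assumptions, not a confirmed migration.

```sql
CREATE SCHEMA IF NOT EXISTS semantic;

CREATE TABLE semantic.entity_types (
    entity_type_key text PRIMARY KEY,        -- e.g. 'well', 'log_run'
    domain          text NOT NULL,           -- grouping, e.g. 'spatial', 'petrophysics'
    description     text NOT NULL            -- human-readable meaning
);

CREATE TABLE semantic.relation_types (
    relation_type_key      text PRIMARY KEY, -- e.g. 'well_has_location'
    source_entity_type_key text NOT NULL REFERENCES semantic.entity_types,
    target_entity_type_key text NOT NULL REFERENCES semantic.entity_types,
    description            text NOT NULL
);

-- Example rows drawn from the relationship model above.
INSERT INTO semantic.entity_types VALUES
    ('well',          'identity', 'Canonical well identity across source records'),
    ('well_location', 'spatial',  'Location claim or resolved preferred location');

INSERT INTO semantic.relation_types VALUES
    ('well_has_location', 'well', 'well_location',
     'A canonical well has one or more location claims');
```

Keeping both vocabularies in tables (rather than enums) lets new entity and relation types land as data, without a migration.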

semantic.entities

Purpose:

  1. store project-scoped semantic entities,
  2. mark preferred records,
  3. preserve provenance and attributes,
  4. provide stable keys for joins and retrieval.

Important fields:

  1. project_id
  2. entity_type_key
  3. entity_key
  4. display_name
  5. confidence
  6. is_preferred
  7. canonical_ref
  8. attributes
  9. provenance
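The fields above can be sketched as a table definition. Types, the identity column, and the uniqueness constraint are assumptions layered on the listed field names.

```sql
CREATE TABLE semantic.entities (
    entity_id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    project_id      uuid   NOT NULL,                    -- scoping to core projects
    entity_type_key text   NOT NULL REFERENCES semantic.entity_types,
    entity_key      text   NOT NULL,                    -- stable business key
    display_name    text,
    confidence      numeric(4,3) CHECK (confidence BETWEEN 0 AND 1),
    is_preferred    boolean NOT NULL DEFAULT false,
    canonical_ref   text,                               -- pointer into ops/raw records
    attributes      jsonb  NOT NULL DEFAULT '{}'::jsonb,
    provenance      jsonb  NOT NULL DEFAULT '{}'::jsonb,
    UNIQUE (project_id, entity_type_key, entity_key)    -- stable join target
);

-- GIN index so attribute filters stay fast as sources accumulate.
CREATE INDEX entities_attributes_gin
    ON semantic.entities USING gin (attributes);
```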

semantic.entity_links

Purpose:

  1. store semantic relationships,
  2. preserve confidence and provenance on the relationship itself,
  3. allow graph-like queries within PostgreSQL.
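"Graph-like queries within PostgreSQL" in practice means recursive CTEs over the link table. A sketch, assuming `entity_links` carries `source_entity_id`, `target_entity_id`, and `relation_type_key` (`:well_entity_id` is a placeholder parameter):

```sql
-- Walk outward from one well to its relationship neighborhood, capped at 3 hops.
WITH RECURSIVE neighborhood AS (
    SELECT l.source_entity_id, l.target_entity_id, l.relation_type_key, 1 AS depth
    FROM semantic.entity_links l
    WHERE l.source_entity_id = :well_entity_id
    UNION ALL
    SELECT l.source_entity_id, l.target_entity_id, l.relation_type_key, n.depth + 1
    FROM semantic.entity_links l
    JOIN neighborhood n ON l.source_entity_id = n.target_entity_id
    WHERE n.depth < 3
)
SELECT * FROM neighborhood;
```

The depth cap keeps neighborhood retrieval bounded, which matters once AI context assembly calls this per candidate.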

semantic.entity_source_links

Purpose:

  1. trace each entity back to source objects and uploads,
  2. preserve authority rank,
  3. preserve extraction method,
  4. support evidence-driven AI retrieval.
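A DDL sketch consistent with these purposes; the foreign targets in `raw` and the exact column names are assumptions.

```sql
CREATE TABLE semantic.entity_source_links (
    entity_id         bigint NOT NULL REFERENCES semantic.entities,
    source_object_id  bigint NOT NULL,   -- raw source object (assumed key)
    source_bundle_id  bigint,            -- owning bundle (assumed key)
    upload_id         bigint,            -- originating upload (assumed key)
    authority_rank    int    NOT NULL,   -- lower = more authoritative source
    extraction_method text   NOT NULL,   -- e.g. 'las_header', 'ocr_table'
    PRIMARY KEY (entity_id, source_object_id)
);
```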

semantic.query_profiles

Purpose:

  1. define reusable semantic query contracts,
  2. specify required entity types and relation types,
  3. define expected output structure for workflows and AI.
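One way to express a profile is a single JSONB definition per row. The column names and JSON shape below are illustrative assumptions; the profile content mirrors the `well_payload_review` example later in this document.

```sql
INSERT INTO semantic.query_profiles (profile_key, target_domain, definition)
VALUES (
    'well_payload_review',
    'interpretation',
    '{
       "required_entity_types": ["well", "log_run", "pay_event",
                                 "bypassed_pay_candidate", "data_gap", "evidence_pack"],
       "required_relation_types": ["well_has_log_run",
                                   "pay_event_derived_from_log_run",
                                   "candidate_derived_from_pay_event",
                                   "candidate_proved_by_evidence_pack"],
       "output": ["ranked_candidates", "blocking_gaps",
                  "evidence_refs", "spatial_confidence"],
       "block_on": {"gap_severity": "critical"}
     }'::jsonb
);
```

Storing the contract as data means the AI retrieval layer can validate a request against the profile before building any context payload.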

Semantic Query Contract

The AI layer should not query raw file catalogs first.

It should query semantic views shaped by a query profile.

Each semantic query contract should define:

  1. target domain,
  2. required entity types,
  3. required relation types,
  4. output contract,
  5. blocking gap behavior.

Example: well_payload_review

Target domain:

  1. interpretation

Requires:

  1. well
  2. log_run
  3. pay_event
  4. bypassed_pay_candidate
  5. data_gap
  6. evidence_pack

Output contract:

  1. ranked candidates,
  2. blocking gaps,
  3. evidence references,
  4. spatial confidence.

Example: spatial_truth_audit

Target domain:

  1. spatial

Requires:

  1. well
  2. well_location
  3. spatial_reference
  4. transform_step
  5. data_gap

Output contract:

  1. preferred location,
  2. ECEF anchor,
  3. CRS chain,
  4. blocking gaps.

How AI Should Use This Layer

AI should retrieve:

  1. canonical entity summaries,
  2. relationship neighborhoods,
  3. confidence and gap states,
  4. evidence pack anchors,
  5. source snippets only when the canonical layer says they are relevant.

The AI should not:

  1. infer canonical well identity from unlinked raw rows,
  2. guess which CRS is authoritative when the semantic layer marks it unresolved,
  3. bypass gap severity or promotion rules,
  4. synthesize a final answer without evidence links.

Example AI Query Patterns

1. Payload ranking query

Goal:

Find wells in a project with reviewable bypassed-pay candidates.

Semantic intent:

  1. select preferred well,
  2. join to bypassed_pay_candidate,
  3. require supporting evidence_pack,
  4. exclude critical blocking data_gap,
  5. sort by candidate score and confidence.
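The intent above translates into a link-walking query. Relation keys follow the relationship model; the join shapes, the `score` attribute, and `:project_id` placeholder are assumptions.

```sql
SELECT w.entity_key    AS well_key,
       cand.entity_key AS candidate_key,
       (cand.attributes ->> 'score')::numeric AS score,
       cand.confidence
FROM semantic.entities w
JOIN semantic.entity_links l1
  ON l1.source_entity_id = w.entity_id
 AND l1.relation_type_key = 'well_has_log_run'
JOIN semantic.entity_links l2                       -- pay_event -> log_run
  ON l2.target_entity_id = l1.target_entity_id
 AND l2.relation_type_key = 'pay_event_derived_from_log_run'
JOIN semantic.entity_links l3                       -- candidate -> pay_event
  ON l3.target_entity_id = l2.source_entity_id
 AND l3.relation_type_key = 'candidate_derived_from_pay_event'
JOIN semantic.entities cand ON cand.entity_id = l3.source_entity_id
WHERE w.project_id = :project_id
  AND w.is_preferred
  AND EXISTS (SELECT 1 FROM semantic.entity_links e      -- require evidence
              WHERE e.source_entity_id = cand.entity_id
                AND e.relation_type_key = 'candidate_proved_by_evidence_pack')
  AND NOT EXISTS (SELECT 1 FROM semantic.entity_links g  -- exclude blocked
                  WHERE g.source_entity_id = cand.entity_id
                    AND g.relation_type_key = 'candidate_blocked_by_gap')
ORDER BY score DESC NULLS LAST, cand.confidence DESC;
```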

2. Spatial quarantine query

Goal:

Find wells that should not appear in 3D or cross-well analytics.

Semantic intent:

  1. select well,
  2. join to well_location,
  3. inspect transform_step,
  4. join data_gap,
  5. filter where gap severity is critical in crs or depth.
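A sketch of the quarantine filter, assuming gap severity and domain are stored in the `data_gap` entity's `attributes`:

```sql
SELECT DISTINCT
       w.entity_key                  AS well_key,
       gap.attributes ->> 'domain'   AS gap_domain,
       gap.attributes ->> 'severity' AS gap_severity
FROM semantic.entities w
JOIN semantic.entity_links lg
  ON lg.source_entity_id = w.entity_id
 AND lg.relation_type_key = 'well_has_data_gap'
JOIN semantic.entities gap ON gap.entity_id = lg.target_entity_id
WHERE w.project_id = :project_id
  AND gap.attributes ->> 'severity' = 'critical'
  AND gap.attributes ->> 'domain' IN ('crs', 'depth');
```

Any well returned here is excluded from 3D rendering and cross-well analytics until the gap is resolved or explicitly downgraded.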

3. Missing-completion review query

Goal:

Find wells with strong petrophysical intervals but weak completion context.

Semantic intent:

  1. select pay_event,
  2. join to well,
  3. join to data_gap,
  4. filter where the gap domain is completion context.

4. AI context assembly query

Goal:

Build context for an LLM asking why a candidate is review-worthy.

Required context:

  1. preferred well identity,
  2. location summary and spatial confidence,
  3. log-run summary,
  4. pay-event metrics,
  5. candidate score and recommendation,
  6. blocking/non-blocking gaps,
  7. evidence pack references.
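The required context can be assembled as one JSON payload per candidate. This sketch assumes a hypothetical Phase 3 view, `semantic.v_candidate_review`, that has already joined preferred well, candidate, gaps, and evidence references into a single row; every column name is illustrative.

```sql
SELECT jsonb_build_object(
    'well',               v.well_display_name,
    'location_summary',   v.location_summary,
    'spatial_confidence', v.spatial_confidence,
    'log_run_summary',    v.log_run_summary,
    'pay_event_metrics',  v.pay_event_metrics,
    'candidate_score',    v.candidate_score,
    'recommendation',     v.recommendation,
    'blocking_gaps',      v.blocking_gaps,
    'evidence_refs',      v.evidence_refs
) AS ai_context
FROM semantic.v_candidate_review v
WHERE v.candidate_entity_key = :candidate_key;
```

Handing the model one governed payload, rather than letting it join tables itself, is what keeps retrieval gap-aware and evidence-aware.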

Operational Rules

Rule 1: Raw is never canonical

No raw source row becomes a semantic entity automatically without an explicit mapping or creation step.

Rule 2: Preferred records are explicit

The semantic layer must distinguish:

  1. alternate claims,
  2. preferred canonical record,
  3. unresolved conflict.
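Preferred-record exclusivity can be enforced in the database itself with a partial unique index. This sketch assumes alternate claims for the same real-world object share a `canonical_ref`; alternates and unresolved conflicts simply stay `is_preferred = false`.

```sql
-- At most one preferred claim per canonical object, per project and type.
CREATE UNIQUE INDEX one_preferred_claim
    ON semantic.entities (project_id, entity_type_key, canonical_ref)
    WHERE is_preferred AND canonical_ref IS NOT NULL;
```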

Rule 3: Gaps are first-class

data_gap is not documentation.

It is an operational entity that blocks promotion and AI confidence.

Rule 4: Every promoted candidate needs evidence

No bypassed_pay_candidate should be considered decision-grade without an evidence_pack link.

Rule 5: Spatial truth is not implied

A well is not spatially valid because it has coordinates.

It is spatially valid when:

  1. CRS is known,
  2. vertical reference is acceptable,
  3. transform chain is documented,
  4. blocking gaps are absent or downgraded explicitly.

How This Connects To Existing Schemas

raw

Holds source-native objects and extracted fields.

The semantic layer binds canonical entities back to these sources.

ops

Holds canonical operational records and derived outputs.

The semantic layer gives them a stable cross-domain graph.

audit

Holds the evidence and reproducibility contract.

The semantic layer should point operational outputs into audit, not replace it.

core

Holds tenants, projects, users, and policies.

The semantic layer should be project-aware and permission-compatible with core.

Build Order

The ontology should be implemented in phases.

Phase 1: ontology spine

  1. entity types,
  2. relation types,
  3. entities,
  4. entity links,
  5. source links,
  6. query profiles.

Phase 2: canonical population

  1. create semantic entities during ingest and normalization,
  2. create source links for raw files and uploads,
  3. mark preferred entities.
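A Phase 2 population helper should be idempotent so re-ingesting the same source never duplicates a canonical entity. A sketch, assuming `(project_id, entity_type_key, entity_key)` is unique and using placeholder parameters (`:api_number` stands in for whatever stable well identifier ingest produces); the merge policy shown is an assumption.

```sql
INSERT INTO semantic.entities
    (project_id, entity_type_key, entity_key, display_name,
     confidence, attributes, provenance)
VALUES
    (:project_id, 'well', :api_number, :well_name,
     0.800, :attributes, :provenance)
ON CONFLICT (project_id, entity_type_key, entity_key)
DO UPDATE SET
    -- merge new attributes over old, keep the best confidence seen so far
    attributes = semantic.entities.attributes || EXCLUDED.attributes,
    confidence = GREATEST(semantic.entities.confidence, EXCLUDED.confidence);
```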

Phase 3: semantic views

  1. add materialized or logical views for preferred wells, candidate review, and spatial truth.

Phase 4: AI retrieval contract

  1. use query profiles to build AI context payloads,
  2. expose model-safe retrieval endpoints,
  3. enforce gap-aware and evidence-aware prompt assembly.

What Success Looks Like

The ontology is working when:

  1. multiple source files resolve to one canonical well,
  2. one candidate can be traced to exact curves, transforms, and evidence,
  3. the UI can show preferred records and blocking gaps clearly,
  4. AI can retrieve decision-grade context without scanning raw chaos,
  5. new source schemas can be mapped into the same canonical graph without redesigning the platform.

Recommended Next Implementation Step

After the ontology spine migration, the next code work should be:

  1. semantic population helpers in the data plane,
  2. preferred-well semantic views,
  3. candidate review semantic views,
  4. AI retrieval endpoints built on semantic.query_profiles.