Earthbond Ontology Blueprint And AI Query Layer
Purpose
This document defines the minimum viable ontology spine for Earthbond.
It is not a universal geoscience ontology.
It is the governed semantic layer required to:
- normalize heterogeneous well, drilling, spatial, and evidence data,
- preserve source-specific meaning without flattening everything into one table,
- expose stable entities and relationships for workflow logic,
- support AI retrieval and model execution against canonical facts instead of raw file chaos.
The ontology must be good enough for production workflows before it is broad enough for research.
Design Position
The platform should not begin with a giant ontology project.
It should begin with a workflow ontology centered on:
- well identity,
- spatial truth,
- log normalization,
- completion context,
- pay-event and candidate outputs,
- evidence and gap tracking.
That is enough to drive:
- deterministic workflow routing,
- audit-ready outputs,
- AI query and ranking,
- future expansion into additional source classes.
High-Level Semantic Architecture
Why This Ontology Is Needed
Without an ontology, the system has:
- multiple source formats,
- inconsistent field names,
- conflicting identifiers,
- ambiguous CRS and vertical semantics,
- weak links between source evidence and derived outputs.
That makes AI unreliable because the model cannot distinguish:
- raw observations from canonical facts,
- preferred records from alternates,
- evidence from inference,
- confidence from certainty.
The ontology resolves that by giving the platform:
- stable entity definitions,
- controlled relationships,
- preferred-record selection,
- provenance links,
- query contracts for workflows and AI.
Ontology Layers
1. Source Schema Layer
This preserves source-native structure.
Examples:
- LAS header fields,
- DLIS metadata,
- scanned PDF OCR tables,
- CalGEM well fields,
- completion spreadsheets,
- survey CSV exports,
- point-cloud sidecar metadata.
This layer should not be forced into canonical names too early.
It belongs in raw.
2. Canonical Entity Layer
This is the first real semantic contract.
Each canonical entity must have:
- a stable entity_type,
- a stable entity_key,
- project scoping,
- provenance,
- confidence,
- preferred/not-preferred state.
This layer belongs in semantic.entities.
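As a rough illustration of the contract above, the required properties could be modeled in memory as follows. This is a minimal sketch, not the actual table definition: the class name, the `project_id` field name, and the assumption that confidence is a score between 0 and 1 are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class SemanticEntity:
    """Illustrative in-memory shape of one canonical entity (hypothetical)."""
    project_id: str        # project scoping
    entity_type: str       # stable type, e.g. "well"
    entity_key: str        # stable key within the project and type
    confidence: float      # assumption: a score in [0, 1]
    is_preferred: bool     # preferred / not-preferred state
    provenance: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject entities that cannot satisfy the minimum contract.
        if not self.entity_type or not self.entity_key:
            raise ValueError("entity_type and entity_key are required")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")

    @property
    def identity(self) -> tuple[str, str, str]:
        """Composite identity used for joins in this sketch."""
        return (self.project_id, self.entity_type, self.entity_key)


well = SemanticEntity("proj-1", "well", "well-0001", 0.9, True,
                      {"method": "manual_mapping"})
```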
3. Relationship Layer
This expresses how entities connect.
Examples:
- a source object describes a well,
- a well has a location,
- a location uses a spatial reference,
- a pay event is derived from a log run,
- a candidate is proved by an evidence pack.
This layer belongs in semantic.entity_links.
4. Evidence Binding Layer
Every semantic entity that matters operationally must be linked back to:
- source object,
- source bundle,
- upload,
- authority rank,
- extraction method.
This layer belongs in semantic.entity_source_links.
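A source link can be pictured as one record per (entity, source object) pair. The sketch below is illustrative only: the field names are hypothetical, and it assumes that a lower authority rank means a more authoritative source, which the document does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySourceLink:
    """Hypothetical shape of one evidence binding."""
    entity_key: str
    source_object_id: str
    source_bundle_id: str
    upload_id: str
    authority_rank: int      # assumption: 1 is the most authoritative
    extraction_method: str   # e.g. "las_header_parse", "ocr_table"


def most_authoritative(links):
    """Return the link with the best (lowest) authority rank, or None if empty."""
    return min(links, key=lambda link: link.authority_rank, default=None)


links = [
    EntitySourceLink("well-0001", "obj-9", "bundle-2", "upload-5", 3, "ocr_table"),
    EntitySourceLink("well-0001", "obj-4", "bundle-1", "upload-2", 1, "las_header_parse"),
]
best = most_authoritative(links)
```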
5. Query Profile Layer
The platform needs a stable semantic query contract for:
- workflow modules,
- dashboards,
- AI retrieval,
- downstream models.
This belongs in semantic.query_profiles.
Minimum Canonical Entity Model
The initial ontology spine should include these entity types.
Control / project
- project
Raw/source
- source_bundle
- source_object
Identity / well master
- well
- well_identifier
Spatial
- well_location
- spatial_reference
- transform_step
Subsurface geometry
- trajectory
- survey_station
Petrophysics
- log_run
- curve
Geology / completion / production
- formation_top
- completion_interval
- production_record
Interpretation
- pay_event
- bypassed_pay_candidate
Governance / audit
- data_gap
- evidence_pack
This set is intentionally narrow.
It supports the current Earthbond POV without overcommitting to a huge ontology program.
Entity Semantics
well
Represents a canonical well identity across multiple source records.
It is not one file and not one regulator row.
well_location
Represents a location claim or resolved preferred location for a well.
It may be surface or bottom-hole.
It must carry CRS/datum/epoch/vertical semantics.
spatial_reference
Represents the formal CRS and vertical reference definition used by a location or transform.
This can describe:
- source CRS,
- resolved EPSG,
- datum realization,
- vertical datum,
- epoch,
- unit system.
transform_step
Represents a single documented transformation or normalization step.
Examples:
- source projected CRS -> WGS84 geodetic,
- geodetic + ellipsoidal height -> ECEF,
- MD -> TVDSS using minimum curvature.
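Two of the example steps can be sketched with standard formulas. The code below is an illustration, not the platform's transform code: function names are hypothetical, the ellipsoid is assumed to be WGS84, and the minimum-curvature function returns TVD relative to the first station. Converting TVD to TVDSS additionally needs the reference elevation and a sign convention, which are omitted here.

```python
import math

WGS84_A = 6378137.0                    # semi-major axis in metres
WGS84_F = 1.0 / 298.257223563          # flattening
WGS84_E2 = WGS84_F * (2.0 - WGS84_F)   # first eccentricity squared


def geodetic_to_ecef(lat_deg, lon_deg, h_m):
    """Geodetic latitude/longitude (degrees) plus ellipsoidal height (m) to ECEF (m)."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    # Prime vertical radius of curvature at this latitude.
    n = WGS84_A / math.sqrt(1.0 - WGS84_E2 * math.sin(lat) ** 2)
    x = (n + h_m) * math.cos(lat) * math.cos(lon)
    y = (n + h_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1.0 - WGS84_E2) + h_m) * math.sin(lat)
    return x, y, z


def min_curvature_tvd(stations):
    """stations: list of (md, inclination_deg, azimuth_deg), sorted by md.

    Returns TVD at each station, taking the first station as TVD = 0.
    """
    tvd = [0.0]
    for (md1, i1, a1), (md2, i2, a2) in zip(stations, stations[1:]):
        i1r, i2r = math.radians(i1), math.radians(i2)
        da = math.radians(a2 - a1)
        # Dogleg angle between the two survey stations.
        cos_dl = math.cos(i2r - i1r) - math.sin(i1r) * math.sin(i2r) * (1.0 - math.cos(da))
        dogleg = math.acos(max(-1.0, min(1.0, cos_dl)))
        # Ratio factor; tends to 1 as the dogleg tends to 0.
        rf = 1.0 if dogleg < 1e-9 else (2.0 / dogleg) * math.tan(dogleg / 2.0)
        tvd.append(tvd[-1] + 0.5 * (md2 - md1) * (math.cos(i1r) + math.cos(i2r)) * rf)
    return tvd
```

Each call would correspond to one transform_step record, so the chain from source CRS to ECEF, or from MD to a vertical depth, stays documented step by step.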
log_run
Represents a specific well-log run with its source context.
This is distinct from the well itself.
curve
Represents a curve within a log run after mnemonic and unit normalization.
pay_event
Represents a derived subsurface interval of interest.
It is a deterministic output, not raw source data.
bypassed_pay_candidate
Represents a ranked review candidate.
It is downstream of:
- normalized logs,
- completion reconciliation,
- gap assessment,
- confidence scoring.
data_gap
Represents a structured missing, conflicting, or insufficient-data signal.
It must be queryable, not hidden in narrative notes.
evidence_pack
Represents the reproducibility contract.
It should link:
- source inputs,
- transform chain,
- formulas/cutoffs,
- outputs,
- audit metadata.
Relationship Model
The first relationship set should remain small and high-value.
PostgreSQL Semantic Schema Design
The semantic schema should remain relational first.
Do not introduce a separate graph database before the relational ontology spine is proven.
Why PostgreSQL first
PostgreSQL already gives the platform:
- transactional integrity,
- tenant/project scoping,
- JSON support,
- GIN indexing,
- compatibility with current migrations and APIs,
- easy joinability with raw, ops, and audit.
Required semantic tables
The semantic spine migration should create:
- semantic.entity_types
- semantic.relation_types
- semantic.entities
- semantic.entity_links
- semantic.entity_source_links
- semantic.query_profiles
These are implemented in:
db/migrations/versions/0017_semantic_ontology_spine.py
semantic.entity_types
Purpose:
- controlled list of canonical entity types,
- domain grouping,
- human-readable descriptions.
semantic.relation_types
Purpose:
- controlled list of valid relationship types,
- expected source and target entity type keys,
- documentation for relationship meaning.
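A controlled relation list makes it possible to reject links whose endpoints have the wrong entity types. The sketch below assumes hypothetical relation keys derived from the relationship examples in the Relationship Layer section; the real keys in semantic.relation_types may differ.

```python
# Hypothetical relation keys mapped to (expected source type, expected target type).
RELATION_TYPES = {
    "describes": ("source_object", "well"),
    "has_location": ("well", "well_location"),
    "uses_spatial_reference": ("well_location", "spatial_reference"),
    "derived_from": ("pay_event", "log_run"),
    "proved_by": ("bypassed_pay_candidate", "evidence_pack"),
}


def validate_link(relation_key, source_type, target_type):
    """Return True when the link matches the controlled relation definition."""
    expected = RELATION_TYPES.get(relation_key)
    if expected is None:
        raise KeyError(f"unknown relation type: {relation_key}")
    return expected == (source_type, target_type)
```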
semantic.entities
Purpose:
- store project-scoped semantic entities,
- mark preferred records,
- preserve provenance and attributes,
- provide stable keys for joins and retrieval.
Important fields:
- project_id
- entity_type_key
- entity_key
- display_name
- confidence
- is_preferred
- canonical_ref
- attributes
- provenance
semantic.entity_links
Purpose:
- store semantic relationships,
- preserve confidence and provenance on the relationship itself,
- allow graph-like queries within PostgreSQL.
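A graph-like query over a link table is essentially a bounded traversal. In PostgreSQL this would typically be written as a recursive CTE; the Python below only illustrates the logic on in-memory tuples, with hypothetical keys and relation names.

```python
from collections import deque


def neighborhood(links, start, max_hops=2):
    """Breadth-first traversal over (source_key, relation, target_key) tuples.

    Returns a dict mapping each reachable entity key to its hop distance.
    Links are followed in both directions.
    """
    adjacency = {}
    for source, _relation, target in links:
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, []).append(source)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


links = [
    ("well-1", "has_location", "loc-1"),
    ("loc-1", "uses_spatial_reference", "sref-1"),
    ("sref-1", "input_to", "step-1"),
]
```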
semantic.entity_source_links
Purpose:
- trace each entity back to source objects and uploads,
- preserve authority rank,
- preserve extraction method,
- support evidence-driven AI retrieval.
semantic.query_profiles
Purpose:
- define reusable semantic query contracts,
- specify required entity types and relation types,
- define expected output structure for workflows and AI.
Semantic Query Contract
The AI layer should not query raw file catalogs first.
It should query semantic views shaped by a query profile.
Each semantic query contract should define:
- target domain,
- required entity types,
- required relation types,
- output contract,
- blocking gap behavior.
Example: well_payload_review
Target domain:
- interpretation
Requires:
- well
- log_run
- pay_event
- bypassed_pay_candidate
- data_gap
- evidence_pack
Output contract:
- ranked candidates,
- blocking gaps,
- evidence references,
- spatial confidence.
Example: spatial_truth_audit
Target domain:
- spatial
Requires:
- well
- well_location
- spatial_reference
- transform_step
- data_gap
Output contract:
- preferred location,
- ECEF anchor,
- CRS chain,
- blocking gaps.
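The two example profiles can be expressed as plain data plus a completeness check. This is a sketch under assumptions: the class and field names are hypothetical, the output contract is reduced to a tuple of labels, and required relation types are left empty because the examples above do not list them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryProfile:
    """Hypothetical in-memory form of one semantic query contract."""
    key: str
    target_domain: str
    required_entity_types: frozenset
    output_contract: tuple
    required_relation_types: frozenset = frozenset()


WELL_PAYLOAD_REVIEW = QueryProfile(
    key="well_payload_review",
    target_domain="interpretation",
    required_entity_types=frozenset({
        "well", "log_run", "pay_event",
        "bypassed_pay_candidate", "data_gap", "evidence_pack",
    }),
    output_contract=("ranked_candidates", "blocking_gaps",
                     "evidence_references", "spatial_confidence"),
)

SPATIAL_TRUTH_AUDIT = QueryProfile(
    key="spatial_truth_audit",
    target_domain="spatial",
    required_entity_types=frozenset({
        "well", "well_location", "spatial_reference", "transform_step", "data_gap",
    }),
    output_contract=("preferred_location", "ecef_anchor", "crs_chain", "blocking_gaps"),
)


def missing_entity_types(profile, available_types):
    """List the required entity types a project cannot yet supply."""
    return sorted(profile.required_entity_types - set(available_types))
```

A caller could refuse to run a profile, or return a blocking gap, whenever `missing_entity_types` is non-empty.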
How AI Should Use This Layer
AI should retrieve:
- canonical entity summaries,
- relationship neighborhoods,
- confidence and gap states,
- evidence pack anchors,
- source snippets only when the canonical layer says they are relevant.
The AI should not:
- infer canonical well identity from unlinked raw rows,
- guess which CRS is authoritative when the semantic layer marks it unresolved,
- bypass gap severity or promotion rules,
- synthesize a final answer without evidence links.
Example AI Query Patterns
1. Payload ranking query
Goal:
Find wells in a project with reviewable bypassed-pay candidates.
Semantic intent:
- select preferred well,
- join to bypassed_pay_candidate,
- require supporting evidence_pack,
- exclude critical blocking data_gap,
- sort by candidate score and confidence.
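The semantic intent for this query can be sketched over already-joined rows. The row keys below (`well_is_preferred`, `evidence_pack_key`, `score`, and so on) are hypothetical and stand in for whatever a semantic view would actually return.

```python
def rank_reviewable_candidates(candidates, blocked_wells):
    """Filter and rank candidate rows.

    candidates: list of dicts, one per bypassed_pay_candidate joined to its well.
    blocked_wells: set of well keys that have a critical blocking data_gap.
    """
    reviewable = [
        c for c in candidates
        if c["well_is_preferred"]                  # preferred well only
        and c["evidence_pack_key"]                 # supporting evidence required
        and c["well_key"] not in blocked_wells     # no critical blocking gap
    ]
    # Highest score first; confidence breaks ties.
    return sorted(reviewable, key=lambda c: (c["score"], c["confidence"]), reverse=True)


rows = [
    {"candidate_key": "c1", "well_key": "w1", "well_is_preferred": True,
     "evidence_pack_key": "ep1", "score": 0.8, "confidence": 0.7},
    {"candidate_key": "c2", "well_key": "w2", "well_is_preferred": True,
     "evidence_pack_key": None, "score": 0.9, "confidence": 0.9},
    {"candidate_key": "c3", "well_key": "w3", "well_is_preferred": True,
     "evidence_pack_key": "ep3", "score": 0.8, "confidence": 0.9},
    {"candidate_key": "c4", "well_key": "w4", "well_is_preferred": True,
     "evidence_pack_key": "ep4", "score": 0.95, "confidence": 0.9},
]
ranked = rank_reviewable_candidates(rows, blocked_wells={"w4"})
```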
2. Spatial quarantine query
Goal:
Find wells that should not appear in 3D or cross-well analytics.
Semantic intent:
- select well,
- join to well_location,
- inspect transform_step,
- join data_gap,
- filter where gap severity is critical in crs or depth.
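Reduced to its filter, the quarantine query selects wells that carry a critical gap in the crs or depth category. The gap row shape below, including the `category` key, is an assumption made for illustration.

```python
QUARANTINE_CATEGORIES = {"crs", "depth"}


def quarantined_wells(gaps):
    """Return sorted well keys with at least one critical crs or depth gap.

    gaps: iterable of dicts with hypothetical keys well_key, severity, category.
    """
    return sorted({
        g["well_key"] for g in gaps
        if g["severity"] == "critical" and g["category"] in QUARANTINE_CATEGORIES
    })


gaps = [
    {"well_key": "w1", "severity": "critical", "category": "crs"},
    {"well_key": "w2", "severity": "warning", "category": "depth"},
    {"well_key": "w3", "severity": "critical", "category": "completion"},
    {"well_key": "w1", "severity": "critical", "category": "depth"},
]
```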
3. Missing-completion review query
Goal:
Find wells with strong petrophysical intervals but weak completion context.
Semantic intent:
- select pay_event,
- join to well,
- join to data_gap,
- filter on:
  - pay-event confidence high enough,
  - completion gap severity high,
  - no evidence pack or incomplete evidence.
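The three filter conditions can be combined in a single predicate. The document does not define what "high enough" or "high" mean, so the 0.7 confidence threshold and the set of severities treated as high are placeholders, and all row keys are hypothetical.

```python
HIGH_SEVERITIES = {"high", "critical"}  # assumption: severities treated as "high"


def missing_completion_review(pay_events, completion_gap_severity,
                              evidence_complete, min_confidence=0.7):
    """Return pay-event keys with strong confidence but weak completion context.

    pay_events: list of dicts with pay_event_key, well_key, confidence.
    completion_gap_severity: dict mapping well_key to its completion gap severity.
    evidence_complete: dict mapping pay_event_key to True when evidence is complete.
    """
    return [
        p["pay_event_key"] for p in pay_events
        if p["confidence"] >= min_confidence
        and completion_gap_severity.get(p["well_key"]) in HIGH_SEVERITIES
        and not evidence_complete.get(p["pay_event_key"], False)
    ]


events = [
    {"pay_event_key": "pe1", "well_key": "w1", "confidence": 0.9},
    {"pay_event_key": "pe2", "well_key": "w2", "confidence": 0.9},
    {"pay_event_key": "pe3", "well_key": "w1", "confidence": 0.4},
    {"pay_event_key": "pe4", "well_key": "w1", "confidence": 0.8},
]
result = missing_completion_review(
    events,
    completion_gap_severity={"w1": "high", "w2": "low"},
    evidence_complete={"pe4": True},
)
```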
4. AI context assembly query
Goal:
Build context for an LLM asking why a candidate is review-worthy.
Required context:
- preferred well identity,
- location summary and spatial confidence,
- log-run summary,
- pay-event metrics,
- candidate score and recommendation,
- blocking/non-blocking gaps,
- evidence pack references.
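A context builder can enforce the required pieces before anything reaches the model. The function and key names below are hypothetical, and the payload shape is only one possible layout; the point of the sketch is that assembly fails loudly when the well is not preferred or evidence references are missing.

```python
def build_candidate_context(well, location, log_runs, pay_event,
                            candidate, gaps, evidence_refs):
    """Assemble a gap-aware, evidence-backed context payload for one candidate."""
    if not well.get("is_preferred"):
        raise ValueError("context requires a preferred well identity")
    if not evidence_refs:
        raise ValueError("context requires at least one evidence pack reference")
    return {
        "well": {"key": well["entity_key"], "name": well.get("display_name")},
        "location": location,
        "log_runs": list(log_runs),
        "pay_event": pay_event,
        "candidate": candidate,
        "gaps": {
            "blocking": [g for g in gaps if g.get("blocking")],
            "non_blocking": [g for g in gaps if not g.get("blocking")],
        },
        "evidence_refs": list(evidence_refs),
    }


context = build_candidate_context(
    well={"entity_key": "w1", "display_name": "Example 1", "is_preferred": True},
    location={"summary": "surface", "spatial_confidence": 0.8},
    log_runs=[{"log_run_key": "lr1"}],
    pay_event={"pay_event_key": "pe1"},
    candidate={"score": 0.8, "recommendation": "review"},
    gaps=[{"gap_key": "g1", "blocking": True}, {"gap_key": "g2", "blocking": False}],
    evidence_refs=["ep1"],
)
```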
Operational Rules
Rule 1: Raw is never canonical
No raw source row becomes a semantic entity automatically without an explicit mapping or creation step.
Rule 2: Preferred records are explicit
The semantic layer must distinguish:
- alternate claims,
- preferred canonical record,
- unresolved conflict.
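One way to make the three states explicit is to derive them from the claims for a single entity. This sketch assumes that exactly one claim flagged as preferred means the conflict is resolved and that anything else is unresolved; the state labels and the `is_preferred` key are hypothetical.

```python
def resolve_preferred(claims):
    """Classify a list of competing claims for one entity.

    Returns (state, preferred_claim, alternates), where state is either
    "preferred" or "unresolved_conflict".
    """
    preferred = [c for c in claims if c.get("is_preferred")]
    if len(preferred) == 1:
        alternates = [c for c in claims if c is not preferred[0]]
        return "preferred", preferred[0], alternates
    # Zero or several preferred flags: nothing may be treated as canonical.
    return "unresolved_conflict", None, list(claims)


state, chosen, alternates = resolve_preferred([
    {"claim": "loc-a", "is_preferred": True},
    {"claim": "loc-b", "is_preferred": False},
])
```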
Rule 3: Gaps are first-class
data_gap is not documentation.
It is an operational entity that blocks promotion and AI confidence.
Rule 4: Every promoted candidate needs evidence
No bypassed_pay_candidate should be considered decision-grade without an evidence_pack link.
Rule 5: Spatial truth is not implied
A well is not spatially valid because it has coordinates.
It is spatially valid when:
- CRS is known,
- vertical reference is acceptable,
- transform chain is documented,
- blocking gaps are absent or downgraded explicitly.
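The four conditions translate directly into a predicate. The class and field names below are hypothetical; `open_blocking_gaps` is assumed to count only blocking gaps that have been neither resolved nor explicitly downgraded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialStatus:
    """Hypothetical summary of a well's spatial state."""
    has_coordinates: bool
    crs_known: bool
    vertical_reference_ok: bool
    transform_chain_documented: bool
    open_blocking_gaps: int


def is_spatially_valid(status):
    """Coordinates alone are not enough; every condition must hold."""
    return (
        status.has_coordinates
        and status.crs_known
        and status.vertical_reference_ok
        and status.transform_chain_documented
        and status.open_blocking_gaps == 0
    )


coords_only = SpatialStatus(True, False, False, False, 0)
fully_resolved = SpatialStatus(True, True, True, True, 0)
```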
How This Connects To Existing Schemas
raw
Holds source-native objects and extracted fields.
The semantic layer binds canonical entities back to these sources.
ops
Holds canonical operational records and derived outputs.
The semantic layer gives them a stable cross-domain graph.
audit
Holds the evidence and reproducibility contract.
The semantic layer should point operational outputs into audit, not replace it.
core
Holds tenants, projects, users, and policies.
The semantic layer should be project-aware and permission-compatible with core.
Build Order
The ontology should be implemented in phases.
Phase 1: ontology spine
- entity types,
- relation types,
- entities,
- entity links,
- source links,
- query profiles.
Phase 2: canonical population
- create semantic entities during ingest and normalization,
- create source links for raw files and uploads,
- mark preferred entities.
Phase 3: semantic views
- add materialized or logical views for:
  - preferred wells,
  - spatial truth status,
  - candidate review queue,
  - evidence-backed outputs.
Phase 4: AI retrieval contract
- use query profiles to build AI context payloads,
- expose model-safe retrieval endpoints,
- enforce gap-aware and evidence-aware prompt assembly.
What Success Looks Like
The ontology is working when:
- multiple source files resolve to one canonical well,
- one candidate can be traced to exact curves, transforms, and evidence,
- the UI can show preferred records and blocking gaps clearly,
- AI can retrieve decision-grade context without scanning raw chaos,
- new source schemas can be mapped into the same canonical graph without redesigning the platform.
Recommended Next Implementation Step
After the ontology spine migration, the next code work should be:
- semantic population helpers in the data plane,
- preferred-well semantic views,
- candidate review semantic views,
- AI retrieval endpoints built on semantic.query_profiles.