Lessons Learned from Building a Well-Matching Intelligence Layer

Introduction

In the energy industry, a well is a drilled borehole used to extract oil, gas, or other resources. Each well generates data across many stages of its lifecycle—drilling, fluids, subsurface analysis, cementing, and more. Different functional groups, often known as service lines, store this data in their own systems and formats. Because of this, a single physical well can show up many times with different names and details across the company.

Well matching is the process of determining which records across these disparate systems actually refer to the same physical well. It may seem simple, but it is actually one of the toughest problems in Oil & Gas data management. Solving it requires deep Exploratory Data Analysis (EDA) and a deliberate and structured approach to Ground Truth generation. Simply put, teams need to carefully study the data and create clear rules for what counts as a match.

This post shares important learnings from the EDA and Ground Truth journey—two areas that strongly influence the success of any well-matching initiative.

Why This Problem Matters

Well identity resolution is essential for digital workflows in the energy industry. When wells are matched correctly, teams can:

Analyze data from different service lines
Plan and engineer well construction more efficiently
Automate repetitive tasks
Report accurately and meet regulations
Build reliable AI systems

Without a unified understanding of what constitutes “the same well,” organizations struggle to scale digital transformation, reduce operational inefficiencies, or build reusable AI/ML solutions. Solving well matching unlocks all these benefits.

1. The Core Challenge: A Single Well, Many Identities

Across the industry, there is no universal standard for how wells are named or tracked. This causes confusion and makes matching wells difficult. Common issues include:

Key contextual knowledge living in the heads of engineers
Missing or inconsistent metadata
Identical attributes stored in different fields across systems
Some service lines storing information in SQL while others use JSON or spreadsheets
Optional fields left blank depending on service line workflow priorities

For example, consider these common variations of the same well found across different service line systems:

Common Naming Pattern Variations:

Downhole Services: Full project code with regional suffix
Fluids Management: Simplified well name with separate geographic fields
Field Operations: Prospect-focused naming with job context
Portfolio Management: Full lease block designation with sublocation details

Regional Identifier Variations:

Formal region names vs common abbreviations vs alternative capitalizations
Embedded location codes within well names
Varying levels of geographic specificity across systems

Structural Pattern Examples:

Well names with/without lease block: "[ASSET] [BLOCK][NUMBER] [SUBLOCATION]" vs "[ASSET]"
Project vs well focus: "[PROJECT]_[WELL]" vs "[WELL]_[SERVICE]" vs "[OPERATOR]_[LOCATION]"
Abbreviated vs full names: "[BLOCK]-[NUMBER]" vs "[Full Block Name] Well [Number]"
Geographic variations: "REGION SITE" vs "RGN SITE" vs "R-SITE"
Pilot well patterns: "PILOT WELL" vs "PLT_WL" vs "PLT-01"
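
As a rough illustration of how some of these variants can be brought closer together before any matching is attempted, here is a minimal normalization sketch. The abbreviation map and the function name are hypothetical; in practice the expansions would come from SME review rather than a hard-coded dictionary.

```python
import re

# Hypothetical abbreviation map; real expansions would come from SME review.
ABBREVIATIONS = {"RGN": "REGION", "PLT": "PILOT", "WL": "WELL"}


def normalize_well_name(raw: str) -> str:
    """Uppercase, split on spaces/underscores/hyphens, expand known abbreviations."""
    tokens = re.split(r"[\s_\-]+", raw.strip().upper())
    expanded = [ABBREVIATIONS.get(tok, tok) for tok in tokens if tok]
    return " ".join(expanded)


print(normalize_well_name("PLT_WL"))      # PILOT WELL
print(normalize_well_name("Pilot Well"))  # PILOT WELL
print(normalize_well_name("RGN SITE"))    # REGION SITE
```

Normalization like this only removes surface noise. A variant such as "R-SITE" becomes "R SITE" and still does not match "REGION SITE", which is why the domain context discussed next matters.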

Before using AI to match wells, teams must first understand how data is collected and how domain experts have historically identified wells.

2. Exploratory Data Analysis (EDA): Capturing the Real-World Knowledge

In this setting, EDA is as much a knowledge-collection effort as a data-profiling effort. It reveals how wells are documented in practice and surfaces inconsistencies, gaps, and hidden assumptions.

A. Industry Knowledge Was Tribal and Distributed

Knowledge about well identity was held by individuals rather than in shared systems:

Engineers maintained personal mental models of how wells were organized
Documentation was incomplete, inconsistent, or scattered
Engineers frequently had to search for legacy notes or email colleagues to reconstruct context

Documenting this knowledge was key to building a good matching system.

B. Data Fragmentation Across Service Lines

Different service lines recorded the same conceptual information in different ways:

SQL tables vs. JSON
Different field names for similar attributes
Missing fields depending on service line priorities
Different interpretations of what constitutes a “well identifier”

Illustrative Examples of Fragmentation:

Different service lines structured data with varying schemas:

Downhole Services: May include operator and project fields but leave geographic fields empty
Fluids Management: May duplicate information across job and well name fields while including detailed geographic context
Field Operations: May use prospect-based naming with asset and basin categorization

Common fragmentation patterns include:

The same operator information appearing in different fields across systems
Geographic data with varying completeness (empty in some systems, detailed in others)
Well identifiers using entirely different naming schemes
Systems prioritizing different naming conventions (job, well, or asset)

This fragmentation made simple matching rules unreliable.
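
One way to make this fragmentation manageable is to map each service line's fields onto a small canonical record before comparing anything. The sketch below assumes hypothetical source field names (project_code, well_nm, customer, basin) and a three-field canonical schema; none of these come from the actual systems.

```python
from typing import Any

# Hypothetical per-service-line mappings: source field -> canonical field.
FIELD_MAPS: dict[str, dict[str, str]] = {
    "downhole_services": {"project_code": "well_name", "operator": "operator"},
    "fluids_management": {
        "well_nm": "well_name",
        "customer": "operator",
        "basin": "region",
    },
}

CANONICAL_FIELDS = ("well_name", "operator", "region")


def to_canonical(service_line: str, record: dict[str, Any]) -> dict[str, Any]:
    """Rename source fields to canonical ones; missing or blank values become None."""
    mapping = FIELD_MAPS[service_line]
    canonical: dict[str, Any] = {name: None for name in CANONICAL_FIELDS}
    for source_field, target_field in mapping.items():
        value = record.get(source_field)
        if value not in (None, ""):
            canonical[target_field] = value
    canonical["source"] = service_line  # keep provenance for later review
    return canonical


print(to_canonical("fluids_management", {"well_nm": "A-1", "customer": "OpCo", "basin": ""}))
```

Keeping blanks as explicit None values, instead of dropping the field, makes it visible which systems never populate a given attribute.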

C. What EDA Helped Clarify

EDA helped teams:

Understand data relationships across service lines
Identify fields suitable for correlation
Compare schema similarities and differences
Document manual correlation patterns used by engineers
Capture naming conventions and heuristics used across workflows

This created a centralized knowledge base for engineers, data teams, and model developers.
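
A simple check that supports the "fields suitable for correlation" question is to measure how often each field is actually populated in each service line. The following is a minimal sketch under the assumption that records are plain dictionaries of text values; the field names in the example are illustrative only.

```python
def fill_rates(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of records in which each field holds a non-blank value."""
    total = len(records)
    rates: dict[str, float] = {}
    for name in fields:
        # Treat None, empty strings, and whitespace-only strings as blank.
        filled = sum(1 for rec in records if str(rec.get(name) or "").strip())
        rates[name] = filled / total if total else 0.0
    return rates


sample = [
    {"operator": "OpCo", "region": ""},
    {"operator": "OpCo", "region": "North"},
    {"operator": None},
]
print(fill_rates(sample, ["operator", "region"]))
```

A field that is populated in most records of one service line but rarely in another is a weak candidate for correlation, however well it matches when present.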

D. How AI Tools Supported EDA

AI assistants helped accelerate EDA by:

Searching and summarizing heterogeneous datasets
Surfacing schema inconsistencies
Organizing subject matter expert (SME) knowledge
Producing appendices, schema maps, and correlation pattern summaries

This helped teams turn scattered information into useful insights more quickly.

3. Ground Truth Generation: The Hardest and Most Critical Step

Ground truth (GT) underpins both model evaluation and the matching algorithms themselves, yet creating it is one of the most complex tasks in a well-matching effort.

A. Challenges in Ground Truth Creation

1. Manual Curation Was Difficult

SMEs struggled to determine with confidence whether “Well A” in one system was the same as “Well B” in another, because names often differed and metadata was frequently missing.

2. Ambiguity Affected Evaluation Metrics

Uncertain matches raised open questions about evaluation:

Whether accuracy should be operator-specific or overall
How to treat one-to-many or many-to-one matches
How naming variations impacted similarity scores

GT needed to reflect real‑world complexity.
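
To make the operator-specific versus overall question concrete, the sketch below scores the same predictions both ways. It uses pairwise precision and recall as stand-ins for "accuracy", which is an assumption; the pair and operator labels are made up for illustration.

```python
from collections import defaultdict

Pair = tuple[str, str]  # (record id in system 1, record id in system 2)


def precision_recall(predicted: set[Pair], truth: set[Pair]) -> tuple[float, float]:
    """Pairwise precision and recall; an empty denominator yields 0.0."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def per_operator(
    predicted: dict[Pair, str], truth: dict[Pair, str]
) -> dict[str, tuple[float, float]]:
    """Group pairs by their operator label, then score each group separately."""
    pred_by_op: dict[str, set[Pair]] = defaultdict(set)
    truth_by_op: dict[str, set[Pair]] = defaultdict(set)
    for pair, operator in predicted.items():
        pred_by_op[operator].add(pair)
    for pair, operator in truth.items():
        truth_by_op[operator].add(pair)
    operators = sorted(set(pred_by_op) | set(truth_by_op))
    return {op: precision_recall(pred_by_op[op], truth_by_op[op]) for op in operators}


# Illustrative data: one wrong prediction for OpX, none for OpY.
truth = {("A1", "B1"): "OpX", ("A2", "B2"): "OpX", ("A3", "B3"): "OpY"}
predicted = {("A1", "B1"): "OpX", ("A2", "B9"): "OpX", ("A3", "B3"): "OpY"}

print(precision_recall(set(predicted), set(truth)))
print(per_operator(predicted, truth))
```

In this toy case the overall score hides the fact that every error belongs to a single operator, which is the kind of difference the two reporting choices expose.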

3. Schema Uncertainty: Pairwise or Multi-Well?

Teams had to decide whether a test case should represent:

A ↔ B (pairwise)

X ↔ {Y1, Y2, Y3} (cluster‑based)

Different operators and service lines naturally produced different structures.

Ground Truth Pattern Categories:

The complexity becomes clear when examining the types of ground truth scenarios that must be represented:

No correlations: Wells that exist in only one system and have no corresponding records elsewhere
Simple pairwise (A ↔ B): A target well with one or more direct correlations to records in other service lines
Multi-well clusters (X ↔ {Y1, Y2, Y3}): A target well that correlates to multiple records, potentially representing different operational phases or duplicate entries

Schema Decision Points:

No correlations: How to represent wells that exist in only one system?
Pairwise vs clusters: Should correlations be simple 1:1 relationships or allow complex many-to-many groupings?
Operational phases: Are a base well name and its workover/sidetrack variants the same entity or different phases?

This forced decisions about whether operational phases constitute different entities or variations of the same well.
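
One possible way to hold all three pattern categories in a single structure is sketched below. The class and field names are hypothetical, and the rule that exactly one match counts as "pairwise" while more than one counts as a "cluster" is an assumption made for the example, not the schema the team adopted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WellRef:
    """A record in one service line's system (names are illustrative)."""
    service_line: str
    record_id: str


@dataclass
class GroundTruthEntry:
    target: WellRef
    matches: list[WellRef] = field(default_factory=list)
    note: str = ""  # e.g. an SME comment on a sidetrack or workover

    @property
    def kind(self) -> str:
        # Assumed rule: 0 matches, exactly 1 match, or more than 1 match.
        if not self.matches:
            return "no_correlation"
        return "pairwise" if len(self.matches) == 1 else "cluster"

    def as_pairs(self) -> list[tuple[WellRef, WellRef]]:
        """Flatten the entry to (target, match) pairs for pairwise scoring."""
        return [(self.target, match) for match in self.matches]


entry = GroundTruthEntry(
    target=WellRef("downhole_services", "d-001"),
    matches=[WellRef("fluids_management", "f-101"), WellRef("field_operations", "o-7")],
    note="second record may be a sidetrack; flagged for SME review",
)
print(entry.kind, len(entry.as_pairs()))
```

Storing clusters and flattening them to pairs on demand keeps both evaluation styles available, and the note field gives SMEs a place to record the operational-phase judgment instead of forcing it into the structure.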

B. Approaches That Worked Well

1. Grouping by Operator

Operators differ in naming practices, data quality, and service line participation. Grouping data by operator simplified SME review and improved consistency.
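
A minimal sketch of what operator grouping can look like when preparing review batches: records are bucketed by a normalized operator key, and candidate pairs are only formed within a bucket and across different service lines. The record fields (record_id, service_line, operator) are assumed for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs_by_operator(records: list[dict]) -> dict[str, list[tuple[str, str]]]:
    """Group records by operator, then pair records from different service lines."""
    by_operator: dict[str, list[dict]] = defaultdict(list)
    for rec in records:
        # Records with no operator go to a separate UNKNOWN batch.
        key = (rec.get("operator") or "UNKNOWN").strip().upper()
        by_operator[key].append(rec)

    batches: dict[str, list[tuple[str, str]]] = {}
    for operator, group in by_operator.items():
        batches[operator] = [
            (a["record_id"], b["record_id"])
            for a, b in combinations(group, 2)
            if a["service_line"] != b["service_line"]
        ]
    return batches


records = [
    {"record_id": "d1", "service_line": "downhole", "operator": "OpCo"},
    {"record_id": "f1", "service_line": "fluids", "operator": "opco "},
    {"record_id": "f2", "service_line": "fluids", "operator": "Other"},
    {"record_id": "o1", "service_line": "field_ops", "operator": None},
]
print(candidate_pairs_by_operator(records))
```

Restricting comparisons to one operator at a time shrinks the number of pairs an SME has to look at and keeps each batch within a single naming convention.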

2. Defining a Clear Correlation Schema

A consistent schema helped teams define:

What counts as a match
Whether clusters are acceptable
How ambiguous or multi-record cases should be documented

3. Iterative SME Collaboration

SMEs validated matches only after:

Data was cleaned
Inconsistencies were surfaced
Ambiguous cases were contextualized

Iterative cycles ensured alignment across engineering and data teams.

4. Using EDA as the Foundation

EDA outputs—schema relationships, naming conventions, service line logic—became the rulebook for generating GT. Ground truth reflected actual operational practice rather than theoretical assumptions.

How These Learnings Apply to Other Domains

The challenges described here (tribal knowledge, inconsistent identifiers, schema fragmentation, ambiguous correlations, and messy ground truth) appear in many industries where operational data is distributed across multiple systems. The methods used in this project offer practical, reusable patterns for similar problems such as asset matching, patient identity, product catalog unification, counterparty deduplication, and other large-scale entity resolution work. Teams facing these use cases can adapt the approaches to their own contexts.

Key Learnings

EDA is just as important as building models
Tribal knowledge must be captured before automation can work
Data fragmentation is the main reason matching is hard
The structure of the data matters most for ground truth; expertise comes next
Ambiguity is normal and must be modeled
Operator‑specific grouping reduces complexity
Schema definition drives all downstream components
AI accelerates synthesis, but SMEs validate truth

Conclusion

Well matching is deceptively simple on the surface but deeply complex in practice. The biggest progress came from:

Capturing tribal and expert knowledge through EDA
Mapping real-world data relationships across service lines
Formalizing human correlation logic
Building Ground Truth based on real data
Consolidating scattered information into shared, reusable artifacts

These lessons extend beyond the Energy industry and apply to any domain where operational data is fragmented across systems and shaped by decades of legacy practices.

A disciplined EDA and ground truth process provides the strongest foundation for scalable, accurate, and trustworthy entity-matching systems.

The feature image was generated using Bing Image Creator.

Acknowledgements

This work would not have been possible without the contributions of our talented team: