Introduction: When Valid JSON Is Not Enough
100% of our outputs were valid JSON. 0% met the data contract.
We prototyped an LLM-powered system that converts operational shift logs into structured summaries. The output schema required specific operational sections because downstream consumers depend on them: the next shift’s incoming crew, the compliance dashboard, and the audit trail.
The prototype was fast. Working demo in days. The JSON parsed every time. A stakeholder could read the summaries and nod along. Then we ran our evaluation pipeline.
Format validity: 100%. Every output was well-formed JSON. Schema compliance: 0%. Not a single output contained all required operational sections with actual content.
The LLM was producing JSON in which required sections were missing their subfields entirely or contained only empty arrays. An incoming crew member opening such a summary would find no record of events, no pending tasks, and no expected conditions. That was not because the shift was uneventful; the model had silently dropped them.
An important caveat: all evaluation ran against synthetic logs designed to cover structural diversity, not against real operational data. The field classification analysis was based on schema definitions and has not been validated against the expectations of subject-matter experts.
Why the Prototype Failed: The LLM as Serializer
Our first approach was the natural one: a single prompt that receives the entire shift log and produces the entire JSON output. Give the LLM the input, the schema, and ask it to produce the full document.
This makes the LLM a lossy serializer: it reads, decides, and generates everything.
A schema validation component, built alongside the prototype, checked every output: Does each required section exist? Are arrays non-empty and values non-null?
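The two checks described above can be sketched as follows. This is a minimal illustration, not the validator used in the project: the section names in `REQUIRED_SECTIONS` and the function names are assumptions, since the real contract is not reproduced in this post.

```python
import json

# Hypothetical section names, for illustration only.
REQUIRED_SECTIONS = ("events", "pending_tasks", "expected_conditions")


def is_valid_json(raw: str) -> bool:
    """Format validity: does the output parse as JSON at all?"""
    try:
        json.loads(raw)
    except json.JSONDecodeError:
        return False
    return True


def contract_violations(summary: dict) -> list[str]:
    """Schema compliance: each required section exists and has real content."""
    problems = []
    for name in REQUIRED_SECTIONS:
        if name not in summary:
            problems.append(f"{name}: missing")
            continue
        records = summary[name]
        if not isinstance(records, list) or not records:
            problems.append(f"{name}: empty")
            continue
        for i, record in enumerate(records):
            if not isinstance(record, dict):
                continue
            for key, value in record.items():
                if value is None:
                    problems.append(f"{name}[{i}].{key}: null")
    return problems


# An output that parses cleanly but still breaks the contract.
raw_output = '{"events": [], "pending_tasks": [{"task": null}]}'
print(is_valid_json(raw_output))  # True
for problem in contract_violations(json.loads(raw_output)):
    print(problem)
```

Keeping the two functions separate mirrors the two metrics: an output can pass the first and fail the second.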
The results were clear. We were checking something more basic than factual accuracy: did the required keys consistently show up? They did not. Required sections appeared in some outputs and vanished in others with no diagnosable pattern.
Format validity was 100%; schema compliance was 0%. The outputs parsed, but they did not reliably contain the required content.
The single-prompt approach also fell short on three Responsible AI principles:
- Transparency: there was no way to know which parts came from source data and which from model inference.
- Accountability: when a field was wrong, there was no way to trace it to a specific decision point.
- Reliability: the model silently dropped required content with no detection mechanism.
HVE: Using Evaluation to Determine Where AI Belongs
Once schema validation showed that the single-prompt approach was unreliable, we needed to understand why, and at the level of individual fields rather than the prompt. This is where we applied HVE (Hypervelocity Engineering), an engineering philosophy in which continuous evaluation, human judgment, and automated feedback loops work together as instruments for making architectural decisions rather than as quality gates after the fact.
We started with something straightforward: we mapped every field in the output schema back to its source in the input and asked a simple question for each one:
Does this field have a 1:1 mapping to a value that already exists in the input?
The exercise was deliberately low-tech: a spreadsheet with every output field in one column and its input source in another. For each row, we traced the field back to the raw data and made a binary call. The whole classification took hours, not days, and over half the fields fell on the deterministic side.
The answer split cleanly. One group of fields had a direct source of truth in the input: timestamps pulled from event logs, log entry IDs that serve as audit anchors, severity levels assigned by the operator, and raw descriptions written during the shift. These fields have exactly one correct value, and that value is already sitting in the input data. Asking an LLM to “generate” them is asking it to be a copy machine that sometimes makes mistakes.
The other group genuinely required interpretation: classifying whether a condition is one type of event or another, determining whether an issue was still active by reasoning across the timeline, extracting structured identifiers from free-text descriptions, and parsing handover notes into follow-up items. These fields have no single ground truth in the input. They require the model to read, reason, and decide.
This field-by-field analysis is the human judgment component of HVE. It is the analytical work that determines what the system should and should not delegate to AI.
We codified it into a YAML field classification registry categorizing every output field as deterministic (copy from source, no LLM permitted) or llm_required (requires AI interpretation). That registry became the architectural contract for the entire system and the basis for the 4-pass pipeline design.
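The registry file itself is not shown in this post. As a rough sketch, the same information can be pictured as a mapping from field name to class, here written as a Python dict for illustration. The field names follow the examples in this post; `fields_by_class`, `unclassified`, and `shift_id` are hypothetical.

```python
# Hypothetical registry contents; in the post this lives in a YAML file.
FIELD_REGISTRY = {
    "log_entry_id": "deterministic",
    "timestamp": "deterministic",
    "description": "deterministic",
    "severity": "deterministic",
    "classification": "llm_required",
    "identifier": "llm_required",
}

ALLOWED_CLASSES = ("deterministic", "llm_required")


def fields_by_class(registry: dict[str, str]) -> dict[str, list[str]]:
    """Group field names by class, rejecting any unknown class label."""
    grouped = {cls: [] for cls in ALLOWED_CLASSES}
    for field, cls in registry.items():
        if cls not in grouped:
            raise ValueError(f"{field}: unknown class {cls!r}")
        grouped[cls].append(field)
    return grouped


def unclassified(schema_fields: set[str], registry: dict[str, str]) -> set[str]:
    """Schema fields that nobody has classified yet."""
    return schema_fields - set(registry)


grouped = fields_by_class(FIELD_REGISTRY)
print(grouped["llm_required"])
print(unclassified({"severity", "shift_id"}, FIELD_REGISTRY))
```

A check like `unclassified` is one way to keep the registry binding: a new schema field cannot ship until someone has made the binary call for it.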
The prompt was fine. The problem was asking the LLM to do work that software should own.
Deterministic facts should be guaranteed by software. Probabilistic judgments should be delegated to the model, explicitly and narrowly. This principle aligns directly with Responsible AI practices: transparency (making clear which components own which decisions) and accountability (ensuring every output traces to either a verified source or an explicit AI judgment).
The 4-Pass Pipeline
The field classification registry dictated the architecture. The evaluation framework’s findings drove each pass; the architecture emerged from the data, not from upfront design.
Pass 1: Deterministic Extraction (No LLM)
Software builds the contract scaffold: copies all ground-truth fields directly from the input, merges and chronologically orders timelines from source arrays, and preserves raw source records for auditability. These fields are provably correct because they are identical to the source data. A unit test can assert output.severity == input.severity for every record, and it does, across the full test suite.
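A minimal sketch of such a pass, together with the kind of equality assertion described above, might look like this. The `events` array, the helper names, and the sample log are assumptions made for illustration; the sketch also assumes ISO 8601 timestamps in a single time zone so that string ordering matches chronological ordering.

```python
import copy

# Source keys copied verbatim; "events" is a hypothetical second source array.
DETERMINISTIC_KEYS = ("log_entry_id", "event_time", "description", "severity")
SOURCE_ARRAYS = ("conditions", "events")


def build_scaffold(log_data: dict) -> dict:
    """Pass 1 sketch: copy ground-truth fields and order them by time."""
    timeline = []
    for array_name in SOURCE_ARRAYS:
        for record in log_data.get(array_name, []):
            entry = {key: record.get(key) for key in DETERMINISTIC_KEYS}
            entry["source_array"] = array_name
            timeline.append(entry)
    # Assumes ISO 8601 timestamps in one time zone, which sort as plain strings.
    timeline.sort(key=lambda entry: entry["event_time"] or "")
    return {"timeline": timeline, "raw_source": copy.deepcopy(log_data)}


def assert_severity_matches_source(log_data: dict) -> None:
    """Every scaffold record must carry the severity found in the input."""
    scaffold = build_scaffold(log_data)
    source = {
        r["log_entry_id"]: r
        for name in SOURCE_ARRAYS
        for r in log_data.get(name, [])
    }
    for entry in scaffold["timeline"]:
        assert entry["severity"] == source[entry["log_entry_id"]].get("severity")


# Made-up sample log, for illustration only.
sample_log = {
    "conditions": [
        {"log_entry_id": "c-2", "event_time": "2024-01-01T10:00:00",
         "description": "Pump vibration", "severity": "high"},
    ],
    "events": [
        {"log_entry_id": "e-1", "event_time": "2024-01-01T08:00:00",
         "description": "Start-of-shift check", "severity": "low"},
    ],
}
assert_severity_matches_source(sample_log)
scaffold = build_scaffold(sample_log)
print([entry["log_entry_id"] for entry in scaffold["timeline"]])
```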
Pass 2: Bounded Model Judgment
The model receives raw input arrays plus the Pass 1 timeline and is asked only for decisions the registry marked as llm_required:
- Routing: which output section does each input record belong in?
- Classification: what are the AI-required field values for each record?
Every answer is keyed by log_entry_id, the same stable identifier from the original log. This makes every AI decision traceable to a specific input record. The model answers specific questions rather than generating the document.
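Because every answer is keyed by `log_entry_id`, the model's response can be checked mechanically before it is merged. The sketch below assumes a response with `routing` and `fields` keys; that shape, the function name, and the sample response are hypothetical, not the project's actual format.

```python
# Fields the registry marks as llm_required (names follow this post's examples).
LLM_REQUIRED_FIELDS = {"classification", "identifier"}


def check_pass2_response(response: dict, known_ids: set[str]) -> list[str]:
    """Flag answers that cite unknown records or unrequested fields."""
    problems = []
    for section, entry_ids in response.get("routing", {}).items():
        for entry_id in entry_ids:
            if entry_id not in known_ids:
                problems.append(f"routing.{section}: unknown id {entry_id}")
    for entry_id, fields in response.get("fields", {}).items():
        if entry_id not in known_ids:
            problems.append(f"fields: unknown id {entry_id}")
        for name in sorted(set(fields) - LLM_REQUIRED_FIELDS):
            problems.append(f"fields.{entry_id}: {name} was not requested")
    return problems


# Made-up response: one invented ID and one attempt to answer a deterministic field.
response = {
    "routing": {"conditions": ["c-1", "c-9"]},
    "fields": {"c-1": {"classification": "equipment", "severity": "low"}},
}
problems = check_pass2_response(response, known_ids={"c-1"})
for problem in problems:
    print(problem)
```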
Pass 3: Guarded Merge (No LLM)
This is the most architecturally significant pass. AI outputs from Pass 2 are merged into the deterministic scaffold using log_entry_id as the join key. The merge always builds each output record from the original source, then adds only the fields the model was asked to produce:
```python
# log_data is the parsed shift log; routing_decisions and ai_fields are the
# Pass 2 outputs; section_name is the output section being assembled.

# Index source records by log_entry_id for O(1) lookup
source_by_id = {
    r["log_entry_id"]: r
    for r in log_data.get("conditions", [])
}

section_records = []

# For each routed record, build output from SOURCE, not from LLM
for entry_id in routing_decisions.get(section_name, []):
    source = source_by_id[entry_id]
    record = {
        # Deterministic fields: always from source
        "log_entry_id": source["log_entry_id"],
        "timestamp": source.get("event_time"),
        "description": source.get("description"),
        "severity": source.get("severity"),
        # AI-required fields: from Pass 2 only
        "classification": ai_fields.get(entry_id, {}).get("classification"),
        "identifier": ai_fields.get(entry_id, {}).get("identifier"),
    }
    section_records.append(record)
```
Deterministic fields always win by construction: the merge code that populates them never reads LLM output. If the LLM nevertheless returned a value for a deterministic field that differs from the source, the system keeps the source value and logs a violation that records the LLM’s attempted value alongside the source value.
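The violation check can live beside the merge rather than inside it. The following is a sketch under assumptions: the function name, the logger name, and the shape of the violation record are hypothetical, and the field-to-source-key mapping simply follows the merge snippet above.

```python
import logging

logger = logging.getLogger("guarded_merge")

# Output field -> source key, following the merge snippet above.
DETERMINISTIC_SOURCE_KEYS = {
    "timestamp": "event_time",
    "description": "description",
    "severity": "severity",
}


def find_violations(entry_id: str, source: dict, llm_fields: dict) -> list[dict]:
    """Report deterministic fields the LLM tried to set to a different value."""
    violations = []
    for out_field, source_key in DETERMINISTIC_SOURCE_KEYS.items():
        if out_field not in llm_fields:
            continue
        attempted = llm_fields[out_field]
        truth = source.get(source_key)
        if attempted != truth:
            violation = {
                "log_entry_id": entry_id,
                "field": out_field,
                "llm_value": attempted,
                "source_value": truth,
            }
            logger.warning("deterministic field mismatch: %s", violation)
            violations.append(violation)
    return violations


# Made-up example: the model tried to downgrade a severity.
violations = find_violations(
    "c-1",
    source={"event_time": "2024-01-01T10:00:00", "severity": "high"},
    llm_fields={"severity": "low", "classification": "equipment"},
)
print(violations)
```

The function only reports. It never returns a value that the merge could use, so the source value stays authoritative.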
The output includes a data provenance section recording the population status of every section, allowing downstream consumers to distinguish “no findings because nothing happened” from “section missing due to a bug.”
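One way to picture such a provenance section is a per-section status computed after the merge. The status labels, the `absence_documented` flag, and the section names below are assumptions made for this sketch, not the project's actual vocabulary.

```python
def build_provenance(summary: dict, required_sections: tuple[str, ...]) -> dict:
    """Record, for each required section, how it came to be populated."""
    provenance = {}
    for name in required_sections:
        records = summary.get(name)
        # Entries flagged absence_documented describe a checked, empty result.
        real = [r for r in records or [] if not r.get("absence_documented")]
        if records is None:
            status = "missing"
        elif not records:
            status = "empty"
        elif not real:
            status = "nothing_to_report"
        else:
            status = "populated"
        provenance[name] = {"status": status, "record_count": len(real)}
    return provenance


# Made-up summary covering all four outcomes.
summary = {
    "events": [{"log_entry_id": "e-1"}],
    "pending_tasks": [{"absence_documented": True}],
    "follow_ups": [],
}
provenance = build_provenance(
    summary, ("events", "pending_tasks", "follow_ups", "expected_conditions")
)
for section, info in provenance.items():
    print(section, info["status"], info["record_count"])
```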
Pass 4: Evidence Mapping
Only once the summary is assembled does the model generate traceability, linking each conclusion back to specific source records with confidence assessments. This runs last because it needs the completed summary to reference.
This pass is the newest and least validated. Future work will evaluate whether the model’s mappings and confidence assessments hold up under rigorous human review.
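Part of that evaluation can be automated before any human review: every evidence link should cite at least one real source record. The sketch below assumes each link carries `log_entry_ids` and a `confidence` label on a three-level scale; both the shape and the scale are hypothetical.

```python
CONFIDENCE_LEVELS = {"low", "medium", "high"}  # hypothetical scale


def check_evidence(evidence: list[dict], known_ids: set[str]) -> list[str]:
    """Flag evidence links that cite nothing, cite unknown IDs, or lack a valid confidence."""
    problems = []
    for i, link in enumerate(evidence):
        cited = link.get("log_entry_ids", [])
        if not cited:
            problems.append(f"evidence[{i}]: cites no source records")
        for entry_id in cited:
            if entry_id not in known_ids:
                problems.append(f"evidence[{i}]: unknown id {entry_id}")
        if link.get("confidence") not in CONFIDENCE_LEVELS:
            problems.append(
                f"evidence[{i}]: invalid confidence {link.get('confidence')!r}"
            )
    return problems


# Made-up evidence links, for illustration only.
evidence = [
    {"conclusion": "Pump issue still active", "log_entry_ids": ["c-1"], "confidence": "high"},
    {"conclusion": "Valve replaced", "log_entry_ids": ["c-7"], "confidence": "certain"},
]
evidence_problems = check_evidence(evidence, known_ids={"c-1"})
for problem in evidence_problems:
    print(problem)
```

A check like this says nothing about whether a cited record actually supports the conclusion. That question still needs human review.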
Evaluation: How 0% Became 100%
The modular pipeline brought schema compliance from 0% to 47%. The evaluation framework told us where it fell short, and the diagnostic method mattered as much as the findings.
We ran 15 synthetic shift logs through the evaluation pipeline. Rather than reading individual outputs, we correlated each schema gap against properties of the corresponding input. Every failing shift had zero system conditions logged by the operator. The correlation was unambiguous.
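The correlation step can be as simple as a cross-tabulation of one input property against the pass/fail result. In this sketch the run records, the `compliant` flag, and the helper names are assumptions, and the data is a toy example rather than the evaluation results reported here.

```python
from collections import Counter


def crosstab(runs: list[dict], input_property) -> Counter:
    """Count runs by (input property value, schema-compliant?)."""
    return Counter(
        (input_property(run["input"]), run["compliant"]) for run in runs
    )


def has_conditions(log_data: dict) -> bool:
    """Input property under test: did the operator log any conditions?"""
    return len(log_data.get("conditions", [])) > 0


# Toy data for illustration only, not the evaluation results from this post.
runs = [
    {"input": {"conditions": [{"log_entry_id": "c-1"}]}, "compliant": True},
    {"input": {"conditions": []}, "compliant": False},
    {"input": {}, "compliant": False},
]
table = crosstab(runs, has_conditions)
for (prop, compliant), count in sorted(table.items()):
    print(f"has_conditions={prop} compliant={compliant}: {count}")
```

If every failure lands in a single cell of the table, the input property is a strong candidate for the root cause. Swapping in a different `input_property` function tests a different hypothesis against the same runs.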
That root cause manifested as three architecturally distinct failure modes:
- Routing gaps: the LLM correctly routed no records to a section because no matching conditions existed, but Pass 3’s guard clause required source records to confirm “nothing to report.” With zero records, the guard fell through and the sections were left silently empty.
- Zero-input gaps: the operator logged nothing at all, so there was nothing to route. The system needed to distinguish “nothing happened” from “section missing due to a bug.”
- Empty follow-up arrays: the LLM returned nothing when asked about follow-ups. The symptom matched the routing gaps, but the mechanism was entirely different.
Each fix followed one principle: document the absence with real metadata, never fabricate content. When no source conditions exist, the system creates an entry stating what was examined and what was not found, using the shift start timestamp. When follow-ups are empty, it records what was reviewed (condition count, handover note length) and that nothing was identified. Downstream readers can always distinguish “nothing to report” from “the system failed to populate this section.” These fixes brought compliance from 47% to 73%, then to 100%.
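The principle can be sketched as a small helper that fills an empty section with a record of what was checked. The entry shape, the `absence_documented` flag, and the function names are assumptions; the examined metadata (condition count, handover note length) follows the description above.

```python
def absence_entry(section: str, shift_start: str, examined: dict) -> dict:
    """Describe what was checked and not found, without inventing content."""
    return {
        "absence_documented": True,
        "timestamp": shift_start,
        "description": f"No entries identified for {section}.",
        "examined": examined,
    }


def ensure_documented(summary: dict, section: str, shift_start: str,
                      examined: dict) -> dict:
    """Fill a missing or empty section with a documented-absence entry."""
    if not summary.get(section):
        summary[section] = [absence_entry(section, shift_start, examined)]
    return summary


# Made-up shift with no conditions and an empty handover note.
summary = ensure_documented(
    {"follow_ups": []},
    section="follow_ups",
    shift_start="2024-01-01T06:00:00",
    examined={"condition_count": 0, "handover_note_length": 0},
)
print(summary["follow_ups"][0]["examined"])
```

Sections that already contain records are left untouched, so the helper can run unconditionally at the end of the merge.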
| Metric | Monolithic | Modular (initial) | Modular (current) |
|---|---|---|---|
| Format valid | 100% | 100% | 100% |
| Schema compliance | 0% | 47% | 100% |
| Deterministic fields match source | Unverified | 100% (tested) | 100% (tested) |
| LLM calls per summary | 1 (large) | 2 (targeted) | 2 (targeted) |
A final evaluation run checked model portability: schema compliance held at 100% after switching foundation models, which suggests that the deterministic/AI split does not depend on a particular model.
The transferable insights are the method (correlate gaps against input properties rather than guessing from output) and the principle (document absence rather than fabricate content). Each evaluation cycle tightened the system through architectural refinement guided by evidence.
Conclusion: From Prototype to System Guarantees
The most common objection to this kind of work is: “We are just prototyping. We will harden later.” That framing assumes speed and rigor are mutually exclusive. HVE is what breaks that tradeoff.
The single-prompt prototype took days. The modular pipeline took about the same, because the schema validation, field mapping analysis, and evaluation pipeline were not hardening steps bolted on after shipping. They were instruments used while building to understand what the system was actually doing.
The Responsible AI properties we identified early became verifiable system guarantees. Required sections exist by construction. Every AI judgment traces to a specific input record via log_entry_id. Failures are explicit, never silent. These are engineering properties that emerged from designing clear boundaries between deterministic and probabilistic components from the start.
The question was never “should we use AI?” We used it extensively: for routing, for classification, for evidence mapping, for reasoning under ambiguity. The question was where it belongs in the system. Evaluation told us. And the system we built reflects that answer.
Attribution
The featured image was generated using Copilot.