
Building a Self-Healing AI Inference Pipeline

2026-02-16 · Omar Trejo

A production AI platform ran inference across multiple model providers for each incoming record. The pipeline worked under normal conditions, but when a model call failed (transient network error, provider timeout, partial response), it did not fail cleanly. Records stayed stuck in "Processing" indefinitely. Retry logic called the model again, the billing system charged a second time, and two conflicting results existed for the same input. Operators spent hours each week manually reconciling stuck records and investigating duplicate charges.

ML LABS engineered reliability into the inference layer so that every failure mode recovered automatically — no duplicate billing, no stuck records, no manual reconciliation.

The Failure Modes

The cascading failures traced back to a single architectural gap: the system had no recovery semantics for model inference. The chain looked like this:

A model returned a valid result, but the record's status update failed
The record appeared stuck even though the work was done
Retries re-invoked the model, generating a duplicate charge
Two conflicting results then existed for the same input, with no clear authority

Eliminating Duplicate Billing

The core fix decoupled billing from processing. Previously, every model call generated a charge — whether the result was ultimately used or discarded. ML LABS redesigned billing to be tracked at the record level, so the system knows whether a record has already been charged before any model call executes. Retries that detect an existing charge skip the billing event entirely.

graph TD
    A1["Record enters<br/>processing"]
    B1["Check billing<br/>status"]
    C1{"Already<br/>charged?"}
    D1["Return existing<br/>result"]
    E1["Process and<br/>record charge"]
    F1["Write result<br/>atomically"]

    A1 --> B1
    B1 --> C1
    C1 -->|"Yes"| D1
    C1 -->|"No"| E1
    E1 --> F1

    style A1 fill:#1a1a2e,stroke:#0f3460,color:#fff
    style B1 fill:#1a1a2e,stroke:#ffd700,color:#fff
    style C1 fill:#1a1a2e,stroke:#ffd700,color:#fff
    style D1 fill:#1a1a2e,stroke:#16c79a,color:#fff
    style E1 fill:#1a1a2e,stroke:#0f3460,color:#fff
    style F1 fill:#1a1a2e,stroke:#16c79a,color:#fff
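
A minimal Python sketch of the flow in the diagram. The case study does not show the actual implementation, so `Ledger`, `RecordState`, and `process` are hypothetical names, and an in-memory dictionary stands in for whatever durable store the platform uses. The point it illustrates is that the charge check happens before the model call and that the charge and the result are written together.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class RecordState:
    charged: bool = False
    result: Optional[str] = None


@dataclass
class Ledger:
    """Hypothetical in-memory stand-in for a durable store keyed by record ID."""
    records: Dict[str, RecordState] = field(default_factory=dict)
    charge_count: int = 0

    def process(self, record_id: str, call_model: Callable[[str], str]) -> str:
        state = self.records.setdefault(record_id, RecordState())

        # "Already charged?" -> return the existing result, skip model and billing.
        if state.charged and state.result is not None:
            return state.result

        # Not charged yet: run the model. If this raises, nothing is charged.
        result = call_model(record_id)

        # Record the charge and the result together. In a real database this
        # would be a single transaction (assumption, not shown in the source).
        state.charged = True
        state.result = result
        self.charge_count += 1
        return result
```

Calling `process` twice for the same record ID invokes the model once and records one charge; the second call returns the stored result.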

The retry behavior was also unified across all model providers. Previously, each provider implemented its own retry assumptions. ML LABS built a shared processing layer with consistent error handling and safeguards against concurrent retries racing into duplicate states.
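
One way such a shared layer might look, sketched in Python under stated assumptions: `SharedProcessor` and `TransientError` are illustrative names rather than the platform's actual classes, the per-record lock is in-process only (a multi-worker deployment would need a distributed lock or a database constraint), and the backoff policy is a guess.

```python
import threading
import time
from collections import defaultdict
from typing import Callable, Dict


class TransientError(Exception):
    """Hypothetical marker for retryable provider failures (timeouts, resets)."""


class SharedProcessor:
    """One retry policy for every provider, with a per-record lock."""

    def __init__(self, max_attempts: int = 3, base_delay: float = 0.1):
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self._guard = threading.Lock()
        self._locks: Dict[str, threading.Lock] = defaultdict(threading.Lock)
        self._results: Dict[str, str] = {}

    def _lock_for(self, record_id: str) -> threading.Lock:
        with self._guard:  # protect the lock table itself
            return self._locks[record_id]

    def run(self, record_id: str, provider_call: Callable[[str], str]) -> str:
        # Only one caller at a time may work on a given record.
        with self._lock_for(record_id):
            # A concurrent retry that lost the race finds the result here.
            if record_id in self._results:
                return self._results[record_id]
            for attempt in range(1, self.max_attempts + 1):
                try:
                    result = provider_call(record_id)
                except TransientError:
                    if attempt == self.max_attempts:
                        raise
                    # Exponential backoff between attempts.
                    time.sleep(self.base_delay * 2 ** (attempt - 1))
                else:
                    self._results[record_id] = result
                    return result
            raise ValueError("max_attempts must be at least 1")
```

Every provider adapter passes its call through `run`, so the retry count, backoff, and race protection are defined in one place instead of once per provider.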

The most expensive assumption in model inference is that a failed status update means the work was not done. In practice, the model often completed successfully — and retrying creates a second charge and a state contradiction that only manual investigation can resolve.

Testing Failure Recovery in CI

ML LABS built failure simulation directly into the platform so that every failure mode could be triggered on demand in any environment. The simulation infrastructure exercises the same code paths that production retries follow — not mocks, but controlled failures through the real pipeline.

This caught a class of bugs that unit tests could not reach. The ambiguous failure mode — model succeeds, status update fails, record appears stuck — required real infrastructure interactions to reproduce. With simulation harnesses, this scenario ran in the CI pipeline on every commit.
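
The ambiguous failure can be illustrated with a small fault-injection sketch in Python. This is not ML LABS's harness: `FaultInjector`, `Pipeline`, and the `"status_update"` injection point are hypothetical, and the pipeline is reduced to in-memory dictionaries. What it shows is the shape of the test: arm a failure at the status update, let the model step succeed, then retry and check that no second model call or charge occurs.

```python
from typing import Dict, Set


class InjectedFailure(Exception):
    """Raised when an armed injection point is reached."""


class FaultInjector:
    def __init__(self) -> None:
        self._armed: Set[str] = set()

    def arm(self, point: str) -> None:
        self._armed.add(point)

    def check(self, point: str) -> None:
        if point in self._armed:
            self._armed.discard(point)  # fire once, then disarm
            raise InjectedFailure(point)


class Pipeline:
    """Toy pipeline: the same process() path runs with or without faults."""

    def __init__(self, faults: FaultInjector) -> None:
        self.faults = faults
        self.results: Dict[str, str] = {}
        self.status: Dict[str, str] = {}
        self.model_calls = 0
        self.charges = 0

    def process(self, record_id: str) -> None:
        self.status.setdefault(record_id, "processing")
        # Only call (and charge for) the model if no result is stored yet.
        if record_id not in self.results:
            self.model_calls += 1
            self.charges += 1
            self.results[record_id] = f"output:{record_id}"
        # Injection point: the status update may fail after the model succeeded.
        self.faults.check("status_update")
        self.status[record_id] = "done"
```

A test arms `"status_update"`, calls `process` and expects `InjectedFailure`, then calls `process` again and asserts that the status is `"done"` while `model_calls` and `charges` are both still 1.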

Reliable Async Processing

The synchronous architecture was the root fragility. A slow model call blocked the processing thread. A mid-chain failure left records in ambiguous states with no recovery path. ML LABS replaced this with an async architecture that provided:

Durability — work survives process crashes
Isolation — a slow model does not block new work
Observability — processing health is directly measurable

Records that exhaust retries are routed to investigation rather than being silently lost. Every record now reaches a terminal status regardless of whether processing succeeds, fails, or cannot run because a provider is unavailable.
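
The routing rule can be sketched with Python's `asyncio`. The names (`worker`, `run_pipeline`, `MAX_ATTEMPTS`, the status strings) are assumptions made for illustration, and an in-memory `asyncio.Queue` does not provide the durability described above; a production system would use a persistent queue. The sketch only shows isolation between workers and the guarantee that every record ends in a terminal status.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

MAX_ATTEMPTS = 3  # hypothetical retry budget


async def worker(
    queue: "asyncio.Queue[Tuple[str, int]]",
    infer: Callable[[str], Awaitable[None]],
    status: Dict[str, str],
    investigation: List[str],
) -> None:
    while True:
        record_id, attempt = await queue.get()
        try:
            await infer(record_id)
            status[record_id] = "succeeded"
        except Exception:
            if attempt < MAX_ATTEMPTS:
                # Re-enqueue before task_done() so queue.join() keeps waiting.
                await queue.put((record_id, attempt + 1))
            else:
                # Retries exhausted: terminal status plus investigation list.
                status[record_id] = "needs_investigation"
                investigation.append(record_id)
        finally:
            queue.task_done()


async def run_pipeline(
    record_ids: List[str],
    infer: Callable[[str], Awaitable[None]],
    workers: int = 2,
) -> Tuple[Dict[str, str], List[str]]:
    queue: "asyncio.Queue[Tuple[str, int]]" = asyncio.Queue()
    status = {rid: "queued" for rid in record_ids}
    investigation: List[str] = []
    for rid in record_ids:
        queue.put_nowait((rid, 1))
    tasks = [
        asyncio.create_task(worker(queue, infer, status, investigation))
        for _ in range(workers)
    ]
    await queue.join()  # returns once every record is terminal
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return status, investigation
```

Because several workers pull from the same queue, one slow `infer` call occupies a single worker while the others continue with new records.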

Unified Ingestion

The platform accepted records through multiple paths: web uploads, network ingestion from clinic systems, and API integrations. Each path had its own error handling and its own assumptions about failure behavior.

ML LABS unified all ingestion paths so that every record, regardless of how it entered the system, flows through identical processing logic with identical reliability guarantees. This eliminated the class of bugs where "it works from one input path but fails from another."
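
A rough Python sketch of this idea, assuming each ingestion path can be reduced to a thin adapter that normalizes its input into one shared record type. The `Record` fields, the adapter names, and the input keys (`id`, `file`, `accession`, `body`, `record_id`, `data`) are all invented for illustration; the case study does not describe the real schemas.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Record:
    record_id: str
    payload: bytes
    source: str


# Thin adapters: the only path-specific code is the normalization step.
def from_web_upload(form: dict) -> Record:
    return Record(form["id"], form["file"], "web")


def from_clinic_message(msg: dict) -> Record:
    return Record(msg["accession"], msg["body"], "network")


def from_api(body: dict) -> Record:
    return Record(body["record_id"], body["data"].encode("utf-8"), "api")


ADAPTERS: Dict[str, Callable[[dict], Record]] = {
    "web": from_web_upload,
    "network": from_clinic_message,
    "api": from_api,
}


def process(record: Record, processed: List[Record]) -> None:
    """Single processing entry point shared by every ingestion path."""
    processed.append(record)


def ingest(source: str, raw: dict, processed: List[Record]) -> Record:
    record = ADAPTERS[source](raw)
    process(record, processed)
    return record
```

Retry, billing, and status handling would all live behind `process`, so a new ingestion path only adds an adapter and inherits the same guarantees.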

When This Is Overkill

This level of reliability engineering adds infrastructure complexity. The overhead is not justified for:

Internal experimentation pipelines where reprocessing is acceptable
Stateless, cheap model calls with no billing implications
Batch jobs that can tolerate occasional duplicates
Development or staging environments

The investment pays off when model calls are expensive, results feed stateful workflows, and duplicate processing creates real-dollar consequences.

First Steps

Instrument your failure rate. Measure how often model calls fail, how many records are stuck, and how many duplicate charges exist today.
Decouple billing from processing. Start with your highest-cost model. Ensure the system checks charge status before any model call executes.
Simulate the ambiguous failure. Trigger a model success with a failed status update and verify that the system recovers without duplicate charges.
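
For the first step, a baseline can be computed from whatever per-record data already exists. The Python sketch below assumes a hypothetical `Snapshot` row per record and a one-hour threshold for "stuck"; both are placeholders to adapt to the actual schema and SLA.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Snapshot:
    """Hypothetical per-record row exported from the pipeline's store."""
    record_id: str
    status: str
    seconds_in_status: float
    charge_count: int
    failed_calls: int
    total_calls: int


def baseline(snapshots: Iterable[Snapshot], stuck_after_s: float = 3600.0) -> Dict[str, float]:
    rows = list(snapshots)
    total_calls = sum(r.total_calls for r in rows)
    failed_calls = sum(r.failed_calls for r in rows)
    return {
        # Share of model calls that failed.
        "call_failure_rate": failed_calls / total_calls if total_calls else 0.0,
        # Records sitting in "processing" longer than the threshold.
        "stuck_records": sum(
            1 for r in rows
            if r.status == "processing" and r.seconds_in_status > stuck_after_s
        ),
        # Every charge beyond the first on a record counts as a duplicate.
        "duplicate_charges": sum(max(0, r.charge_count - 1) for r in rows),
    }
```

Running this before and after the billing change gives a direct comparison of stuck records and duplicate charges.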

Practical Solution Pattern

Decouple billing from processing so retries never generate duplicate charges. Replace synchronous inference with async processing that recovers automatically from failures. Unify all ingestion paths so reliability guarantees apply identically regardless of how records enter the system. Build failure simulation that exercises ambiguous failure modes in CI on every commit.

This works because reliability failures in AI inference are not model failures — they are integration failures in the gap between the model call and the state update. The architecture closes the double-charge gap, makes every processing step independently recoverable, and ensures no work is silently lost. If one defined AI workflow has reliability problems causing operational damage, AI Workflow Integration is the direct build path.