Skip to main content

Why Most RAG Pipelines Fail in Production (and the Fix)

· 2 min read

Over the past few months, I had the opportunity to refine a RAG ingestion pipeline that governs how our AI assistant’s knowledge base evolves. It involves syncing content from S3 into OpenSearch, updating embeddings, and promoting new knowledge safely into production.

The core challenge wasn’t ingestion itself, but ensuring retrieval consistency against cross-query temporal hallucination due to stale context. In early iterations, the system would occasionally serve a mix of stale and fresh data, leading to subtle hallucinations where responses looked correct but contradicted each other across queries. This made failures hard to detect and even harder to debug.

My initial approach used a cron-triggered Lambda. It broke under real-world conditions: concurrent runs collided, partial failures corrupted the index, and transient errors required manual retries. A teammate spent 2–3 hours weekly recovering failed syncs and auditing responses. More importantly, traditional backend reliability metrics (job success/failure) didn’t capture the real problem, that is, semantic correctness.

I redesigned the workflow as an AWS Step Functions state machine that enforces mutual exclusion, queues overlapping runs, and performs atomic index promotion. Instead of incrementally updating the live index, the system builds a complete new snapshot and only switches the agent once the entire pipeline of [ ingestion, chunking, embedding ] succeeds. This trades off some freshness latency and compute cost for strong consistency, which proved to be the more critical lever for correctness.

To measure impact, we tracked stale-answer incidents using a mix of regression queries and manual audits, rather than relying solely on pipeline health metrics. This reduced inconsistent-answer cases by ~93% and eliminated routine manual intervention, allowing the team to shift focus from operational recovery to improving model quality.

If I were to rebuild it today, I’d extend the system with AI-native safeguards: an automated evaluation harness to detect semantic drift before promotion, versioned rollbacks for instant recovery, and embedding quality monitoring to catch degradation over time.

The key insight from this work is that in production RAG systems, data consistency often matters more than data freshness, and the ingestion pipeline itself becomes part of the model’s behavior surface, so it needs to be designed and evaluated with the same rigor as the model.