From Demo to Production: RAG Pipelines That Actually Scale

Most RAG tutorials end where the real problems begin. You load some docs, embed them, retrieve top-k, stuff them into a prompt, and ship. The demo works. The production system, three months later, is generating subtly wrong answers and nobody can tell you why.

Chunking is the foundation

Bad chunks produce bad answers. There is no model good enough to compensate for retrieval that grabs the wrong context.

What works: chunk by semantic unit (a section, a function, a step), not by character count. Preserve enough structure that a chunk is interpretable in isolation. Include the heading the chunk lives under. Include the document title.

A chunk that says "Press the red button twice" is useless on its own: which button, in which procedure? The same chunk with "From: Pipeline Restart Procedure / Step 3" attached is actionable.
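Here is a minimal sketch of that idea, assuming Python 3.9+ and plain-text documents in which a section heading is any line starting with "#". The names Chunk and chunk_by_heading are hypothetical, and the heading rule is an assumption you would replace with whatever marks a section boundary in your own documents.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str
    heading: str
    body: str

    def to_embedding_text(self) -> str:
        # Prefix the body with its provenance so the chunk is readable alone.
        return f"From: {self.doc_title} / {self.heading}\n{self.body}"


def chunk_by_heading(doc_title: str, text: str) -> list[Chunk]:
    """Split text into one chunk per heading (assumed: lines starting with '#')."""
    chunks = []
    heading = "(no heading)"
    body_lines = []

    def flush() -> None:
        # Emit the section collected so far, skipping empty ones.
        body = "\n".join(body_lines).strip()
        if body:
            chunks.append(Chunk(doc_title, heading, body))

    for line in text.splitlines():
        if line.startswith("#"):
            flush()
            heading = line.lstrip("#").strip()
            body_lines = []
        else:
            body_lines.append(line)
    flush()
    return chunks
```

The string returned by to_embedding_text is what you would embed and index, so the title and heading travel with the chunk into both retrieval and the prompt.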

Retrieval quality vs generation quality

These are different problems. You debug them differently.

If the retrieval is wrong, no model will save you. If the retrieval is right but the answer is wrong, you have a prompt problem or a model problem.

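Always measure them separately. Recall@k for retrieval: did the chunks that contain the answer show up in the top k? Faithfulness for generation: is every claim in the answer supported by the retrieved context? If you only measure end-to-end correctness, you can't tell which stage to fix.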
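To make the split concrete, here is a rough sketch of recall@k plus a triage helper that points at the stage to debug first. It assumes you have hand-labeled the relevant chunk ids for each question, and that answer_is_faithful comes from a human reviewer or a judge of your choosing; both function names are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunk ids that appear in the top-k retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def triage(retrieved_ids, relevant_ids, answer_is_faithful, k=5):
    """Name the stage to look at first for a single example."""
    # If the needed chunks never reached the prompt, generation is not to blame.
    if recall_at_k(retrieved_ids, relevant_ids, k) < 1.0:
        return "retrieval"
    # Right context, unsupported answer: a prompt or model problem.
    if not answer_is_faithful:
        return "generation"
    return "ok"
```

Running triage over every failing example gives you a count per stage, which tells you where to spend the next week.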

Hybrid search

Pure vector search loses to hybrid (vector plus BM25 plus rerank) for almost every enterprise use case I've shipped. Vector catches semantic similarity. BM25 catches exact terms (product codes, error IDs, names). The reranker re-scores the merged candidates against the query, so the most relevant ones end up on top.

The downside is more moving parts. The upside is you stop seeing answers that say "I don't know" when the document literally has the exact phrase the user typed.
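One way to merge the vector and BM25 result lists before reranking is reciprocal rank fusion (RRF). That is my choice for this sketch, not the only option, and the function name and k=60 default are assumptions (60 is a commonly used constant, not a tuned value). The inputs are ranked lists of document ids, best first, from whatever vector and keyword backends you run.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids into a single ranked list.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector search output
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # hypothetical BM25 output
candidates = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

A document that shows up in both lists outranks one that tops only a single list, which is the behavior you want for exact-term queries. The fused candidates are what you hand to the reranker.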

The eval pipeline you need

Build it before you need it. The eval pipeline runs every time a prompt changes, a model changes, or a chunking strategy changes. It compares output against a golden set you curate by hand.

If your eval set is small (50 examples), that's fine. A small eval you actually run beats a large eval you never get around to running. Grow it as failure modes appear.
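A golden-set runner does not need to be elaborate. The sketch below assumes Python 3.9+ and that each golden example lists phrases a correct answer must contain; substring matching is a deliberately crude stand-in for whatever comparison you settle on. GoldenExample, run_eval, and answer_fn are hypothetical names, where answer_fn is your whole pipeline wrapped as a function from question to answer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenExample:
    question: str
    must_contain: list[str]  # phrases a correct answer has to include


def run_eval(answer_fn: Callable[[str], str], golden: list[GoldenExample]) -> dict:
    """Run every golden example through the pipeline and report failures."""
    failures = []
    for ex in golden:
        answer = answer_fn(ex.question).lower()
        # Case-insensitive substring check (assumption: good enough to start).
        missing = [s for s in ex.must_contain if s.lower() not in answer]
        if missing:
            failures.append((ex.question, missing))
    passed = len(golden) - len(failures)
    pass_rate = passed / len(golden) if golden else 0.0
    return {"pass_rate": pass_rate, "failures": failures}
```

Wire this into whatever runs on each change to the prompt, model, or chunking, and treat a drop in pass_rate as a blocker. The failures list tells you which questions regressed and what went missing.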

Hard lessons

Most RAG quality problems are retrieval problems. Most retrieval problems are chunking problems. Most chunking problems are people not wanting to read their own documents.

Read the documents. Look at the chunks. Verify the retrieval. Then look at the generation.