Evaluating GenAI Systems: Beyond Vibes-Based Testing

If you can't measure quality, you can't ship quality. This is the boring truth about GenAI systems. Most teams ship on vibes because writing evals is work and vibes feel fast.

Vibes also fail silently.

The golden dataset

Build it before you need it. Fifty input/expected-output pairs are enough to start. Curate them by hand. Include edge cases. Include things the system used to get wrong.

The golden set should grow. Every production bug becomes a new example. Every user complaint becomes a new example. Over six months you'll have 500 examples that represent your actual problem space, not your imagined one.
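A minimal sketch of what such a golden set and its regression run might look like. The names here (`GoldenExample`, `add_example`, `run_golden_set`, the `source` field) are hypothetical, and exact string match is an assumed pass criterion; most real systems need a looser comparison.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GoldenExample:
    example_id: str
    input_text: str
    expected_output: str
    # Where the example came from, e.g. "hand-curated", "production-bug",
    # "user-complaint". The label set is an assumption for illustration.
    source: str = "hand-curated"


def add_example(dataset: List[GoldenExample], example: GoldenExample) -> None:
    """Append an example, rejecting duplicate ids so the set stays clean as it grows."""
    if any(e.example_id == example.example_id for e in dataset):
        raise ValueError(f"duplicate example id: {example.example_id}")
    dataset.append(example)


def run_golden_set(
    dataset: List[GoldenExample], system: Callable[[str], str]
) -> Dict[str, object]:
    """Run the system on every example and report which ones fail.

    Pass criterion (an assumption): whitespace-trimmed exact match.
    """
    failed_ids = [
        e.example_id
        for e in dataset
        if system(e.input_text).strip() != e.expected_output.strip()
    ]
    total = len(dataset)
    return {"total": total, "passed": total - len(failed_ids), "failed_ids": failed_ids}
```

Running this on every change turns each past bug into a permanent regression check: once an example is in the set, a later change that breaks it shows up in `failed_ids`.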

Retrieval metrics

For RAG: recall at k. Did the right document appear in the top k retrieved? When a query has several relevant documents, it is the fraction of them that appear in the top k. This is the single most important number for retrieval quality.

Precision matters less when k is small. With k=5, even one relevant document in the top 5 is usually enough.
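These retrieval metrics can be sketched in a few lines. The function names and document ids are illustrative. `hit_at_k` is the yes/no form of the question above (did any relevant document make the top k?), and it gives the same answer as recall when each query has exactly one relevant document.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> bool:
    """True if at least one relevant document appears in the top k results."""
    return any(doc_id in relevant for doc_id in retrieved[:k])


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots occupied by relevant documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


# Hypothetical ranked result list and relevance labels for one query.
retrieved_ids = ["d3", "d7", "d1", "d9", "d2"]
relevant_ids = {"d1", "d4"}

print(recall_at_k(retrieved_ids, relevant_ids, k=5))     # 0.5
print(hit_at_k(retrieved_ids, relevant_ids, k=5))        # True
print(precision_at_k(retrieved_ids, relevant_ids, k=5))  # 0.2
```

In this example precision is only 0.2, yet the query still counts as a hit, which is the sense in which precision matters less at small k.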

Generation metrics

Three things to measure: faithfulness (does the answer match the retrieved context?), relevance (does the answer match the question?), and groundedness (is every claim in the answer supported by context?).

Ragas implements these. The library has rough edges but the metrics are sound. We use it as a baseline and supplement with task-specific evals.

Custom evals

The metrics that matter most are usually custom. For Text-to-SQL: does the generated query execute? Does it return non-empty results? Does the result match the expected result?
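The three Text-to-SQL checks could be sketched like this, using the standard-library `sqlite3` module as a stand-in database. The function name `eval_generated_sql` is hypothetical, and comparing rows without regard to order is an assumption that would be wrong for queries where `ORDER BY` matters.

```python
import sqlite3
from collections import Counter
from typing import Dict, List, Tuple


def eval_generated_sql(
    conn: sqlite3.Connection, generated_sql: str, expected_rows: List[Tuple]
) -> Dict[str, bool]:
    """Check that a generated query executes, returns rows, and matches expectations.

    Run this only against a throwaway or read-only database: the generated
    query is executed as-is.
    """
    result = {"executes": False, "non_empty": False, "matches_expected": False}
    try:
        rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        # Syntax errors, missing tables or columns, and similar failures.
        return result
    result["executes"] = True
    result["non_empty"] = len(rows) > 0
    # Order-insensitive comparison that still respects duplicate rows.
    result["matches_expected"] = Counter(rows) == Counter(expected_rows)
    return result
```

Keeping the three checks as separate booleans, rather than one pass/fail flag, shows where a query went wrong: a query that executes but returns nothing is a different failure from one that does not parse.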

For agents: did the agent finish the task? Did it use the right tools? Did it stay within latency and cost budgets?
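The agent checks can be written as a small scoring function over a recorded run. This is a sketch under assumptions: `AgentRun`, `AgentExpectation`, and their fields are hypothetical, and "right tools" is taken to mean that every required tool was called and nothing outside an allow-list was.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class AgentRun:
    """What was recorded from one agent execution (hypothetical trace format)."""
    task_completed: bool
    tools_called: List[str]
    latency_seconds: float
    cost_usd: float


@dataclass
class AgentExpectation:
    """What a passing run looks like for one task."""
    required_tools: Set[str]
    allowed_tools: Set[str]
    max_latency_seconds: float
    max_cost_usd: float


def eval_agent_run(run: AgentRun, expected: AgentExpectation) -> Dict[str, bool]:
    """Score one run against task completion, tool use, and budgets."""
    called = set(run.tools_called)
    checks = {
        "finished": run.task_completed,
        # Every required tool was used, and no tool outside the allow-list.
        "used_right_tools": expected.required_tools <= called <= expected.allowed_tools,
        "within_latency": run.latency_seconds <= expected.max_latency_seconds,
        "within_cost": run.cost_usd <= expected.max_cost_usd,
    }
    checks["passed"] = all(checks.values())
    return checks
```

Reporting each check separately makes it possible to tell a run that finished but blew its cost budget from one that never finished at all.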

Generic metrics get you 70% of the way. The last 30% is custom code.