Introduction: practical, tool-backed RAG evaluation you can ship this week
Last updated (dateModified): 2025-10-17
Audience: founders, heads of product, ML engineers shipping retrieval‑augmented generation (RAG) to production.
What you get on this page:
- Offline/online evaluation gates with pass/fail thresholds
- Golden set templates and annotation tips
- YAML run‑cards you can copy into LangSmith, RAGAS, TruLens, and Promptfoo setups
- Post‑deploy monitoring and UX checklists
- FAQ with schema‑ready Q/A entries
For help implementing this end to end (design, product, and engineering), see Zypsy’s integrated capabilities across research, UX, and software development. We routinely build evaluation harnesses alongside product UIs for early‑ and growth‑stage teams. See: Zypsy capabilities and Zypsy investment.
Evaluation gates: a staged path from notebook to prod
Use gates to keep iteration fast while preventing regressions in production.
- Gate 0 · Unit checks (developer loop)
  - Deterministic format checks: JSON validity, schema presence, must/forbidden phrases
  - Cost/latency budget sanity checks (p50/p95) on tiny fixtures
- Gate 1 · Retriever offline
  - Metrics: recall@k, precision@k, MRR, nDCG; dedupe rate; index freshness
  - Target starters: recall@5 ≥ 0.80 on your golden set; dedupe ≤ 5%
- Gate 2 · Generator offline (LLM‑as‑judge + references)
  - Metrics: faithfulness to citations, answer relevance, completeness, harmful‑content rate
  - Target starters: faithfulness ≥ 0.80; relevance ≥ 0.80; harmful‑content rate = 0 on golden set
- Gate 3 · Human review (sampled)
  - Double‑blind scoring on a 3–5 point rubric; adjudicate disagreements; refresh golden sets
- Gate 4 · Online shadow + canary
  - Shadow traffic for 48–72 hours; alerting on bad‑answer rate, fallback rate, timeouts
  - Canary to 5–10% of users; interleaved A/B where feasible
- Gate 5 · Full release + continuous monitoring
  - Weekly drift checks on query mix and source coverage; monthly golden set refresh; retriever re‑index SLOs
Tip: set explicit “no‑go” conditions at each gate (e.g., hallucinated citations > 1% or P95 latency > 2× baseline).
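Gate thresholds as config (YAML, illustrative)
To make gates enforceable in CI, the thresholds above can live in a small versioned file. This is a minimal sketch; the gate names, metric keys, and pass_if/no_go_if structure are placeholders to adapt to your harness, not any specific tool's schema.
gates:
  gate_1_retriever_offline:
    pass_if:
      recall_at_5: ">= 0.80"        # matches the Gate 1 target starter
      dedupe_rate: "<= 0.05"
  gate_2_generator_offline:
    pass_if:
      faithfulness: ">= 0.80"
      answer_relevance: ">= 0.80"
      harmful_content_rate: "== 0"
  gate_4_canary:
    no_go_if:
      hallucinated_citation_rate: "> 0.01"      # mirrors the tip above
      p95_latency_vs_baseline_ratio: "> 2.0"    # 2x baseline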
Metrics that matter (and where to compute them)
| Layer | Core metrics | Where to score | Notes |
|---|---|---|---|
| Retrieval | recall@k, precision@k, MRR, nDCG, source diversity, freshness SLO | Offline (Gate 1) | Compute against labeled relevant chunks; aim for diversity across sources, not just rank quality. |
| Generation | faithfulness, answer relevance, completeness, toxicity/safety | Offline (Gate 2) | Use LLM‑as‑judge with references + spot human audit; keep prompts frozen and versioned. |
| UX/Systems | latency p50/p95, cost per answer, fallback rate, deflection rate, click‑through on sources | Online (Gates 4–5) | Tie alerts to tight thresholds; break glass to a safer baseline on regressions. |
Note on judges: prefer grading prompts that cite ground‑truth context. Use the same judge prompt across tools to keep scores comparable.
Golden sets: scope, sampling, and templates
- Scope: include the 5–7 highest‑volume intents, 3–5 long‑tail intents, and “gotchas” (ambiguous queries, outdated facts, conflicting sources).
- Sizing: start with 150–300 items; grow to 500–1,000 as coverage expands.
- Sampling: stratify by intent, time‑sensitivity, and source domain. Maintain tags (intent, geography, timebound, PII‑risk).
- Annotation: provide reference passages (ground truth) and a concise expected answer. Mark must‑include and must‑avoid facts.
Single‑item template (YAML)
id: faq-legal-042
intent: returns-policy
query: "What is the return window for refurbished devices?"
reference_context: |
  Our policy states refurbished devices can be returned within 30 days of delivery if all accessories are included.
expected_answer: |
  Refurbished devices: 30‑day return window from delivery, with all accessories required.
must_include:
  - "30 days"
  - "all accessories"
must_avoid:
  - "new devices policy"
  - "warranty extension terms"
tags: [policy, timebound, consumer]
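Golden set file layout (YAML, illustrative)
Items like the one above can be collected into a single versioned file; the run‑cards below reference data/golden-set-v3.yaml. The layout here is one possible structure, not a required schema, and the second item is hypothetical.
golden_set_version: v3
updated: 2025-10-17
items:
  - id: faq-legal-042            # full fields as in the single‑item template above
    intent: returns-policy
    tags: [policy, timebound, consumer]
  - id: faq-shipping-007         # hypothetical second item
    intent: shipping-times
    tags: [logistics, timebound]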
Run‑cards: YAML you can adapt to any tool
Define one run‑card per experiment. Version them in git. Keep the grading prompts and models explicit.
Baseline RAG evaluation run‑card (YAML)
run_id: rag-eval-2025-10-17-a
models:
  generator: openrouter:gpt-4o-mini
  judge: openrouter:gpt-4o-mini
retriever:
  k: 5
  rerank: none
  filters: [lang:en]
dataset:
  path: data/golden-set-v3.yaml
  split: offline_eval
metrics:
  retrieval: [recall@5, precision@5, mrr]
  generation: [faithfulness, answer_relevance, completeness]
assertions:
  forbid_phrases: ["As an AI language model", "cannot access the internet"]
  required_citations: true
budgets:
  max_cost_usd: 15.00
  max_p95_latency_ms: 2500
report:
  save: reports/2025-10-17-a.md
  compare_to: reports/2025-09-30-b.md
Safety/guardrail run‑card (YAML)
run_id: rag-safety-2025-10-17
dataset:
  path: data/abuse-and-pii-probes.yaml
metrics:
  safety: [toxicity, pii_detection, jailbreak_success]
actions_on_fail:
  - downgrade_model: true
  - increase_citation_k: 8
  - enable_strict_templates: true
Tool recipes: LangSmith, RAGAS, TruLens, Promptfoo
Use the same golden set and judge prompt across tools to make outputs comparable.
- LangSmith (experiment tracking + evals)
  - Set up a dataset from your YAML golden set.
  - Configure built‑in evaluators for faithfulness and relevance; add custom string‑match checks for must_include/must_avoid.
  - Log traces for retrieval (documents, scores) to debug misses; compare experiments in the app.
- RAGAS (open‑source RAG metrics)
  - Convert your golden set to the RAGAS dataframe fields (question, answer, contexts, ground_truth).
  - Run answer_relevancy, faithfulness, context_precision/recall. Cache judgments for reproducibility.
- TruLens (feedback functions + dashboards)
  - Wrap your RAG pipeline; attach feedback functions for groundedness, relevance, and safety.
  - Use built‑in dashboards to slice by tags (intent, timebound) and spot regressions.
- Promptfoo (prompt testing + assertions)
  - Define tests in a promptfooconfig.yaml with your golden set and assertions (contains, not_contains, semantic_similarity, llm_judge); a starter config sketch follows this list.
  - Wire into CI to fail merges when assertions regress; export HTML reports for stakeholders.
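Promptfoo starter config (YAML, illustrative)
A minimal sketch of a promptfooconfig.yaml seeded from the single‑item template earlier. The prompt text, provider name, and values are placeholders, and assertion type names can vary by Promptfoo version, so confirm against the Promptfoo docs before copying.
prompts:
  - "Answer using only the provided context.\n\nContext: {{context}}\n\nQuestion: {{query}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      query: "What is the return window for refurbished devices?"
      context: "Refurbished devices can be returned within 30 days of delivery if all accessories are included."
    assert:
      - type: contains
        value: "30 days"
      - type: not-contains
        value: "warranty extension terms"
      - type: llm-rubric
        value: "The answer is fully supported by the provided context."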
Judge prompts: consistent grading across tools
Keep judge prompts minimal, reference‑aware, and versioned. Example snippet:
judge_prompt: |
  You are grading an assistant’s answer to a user question.
  You are given:
  - the user question
  - the assistant answer
  - the reference context (ground truth excerpts)
  Score on 0–1 for:
  1) Faithfulness: answer supported by the reference context only.
  2) Relevance: answer addresses the user question.
  3) Completeness: essential facts present.
  Return JSON: {"faithfulness": x, "relevance": y, "completeness": z}
Note: store judge outputs as floats with 3 decimals. Freeze judge model and temperature.
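Judge pinning and stored scores (YAML, illustrative)
One way to apply that note: record the frozen judge settings next to the scores they produced, so runs stay comparable over time. The field names and score values here are illustrative, not a tool schema.
judge:
  model: openrouter:gpt-4o-mini    # same judge as the run‑card above, frozen per run
  temperature: 0.0
  prompt_version: judge-v1
judgments:
  - item_id: faq-legal-042
    faithfulness: 0.917            # illustrative scores, stored as 3‑decimal floats
    relevance: 1.000
    completeness: 0.833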
UX checklist for production RAG
Offline (pre‑release)
- Answers include source citations with human‑readable titles.
- Citation click opens the exact passage (anchor or highlight).
- Clear “last updated” timestamp for time‑sensitive topics.
- Graceful fallback templates for low confidence or missing data.
- Red‑team prompts for PII, harassment, self‑harm, medical/legal advice; ensure deflection copy is approved.
Online (post‑release)
- Live counters: bad‑answer alerts, missing citation alerts, P95 latency, cost per answer (sample alert rules follow this list).
- One‑click “report an issue” with trace ID; route to triage.
- A/B or interleaving harness with opt‑out for users.
- Session persistence: keep context across follow‑ups with a visible summary toggle.
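Live alert rules (YAML, illustrative)
The counters above map directly onto alert rules. The thresholds and action names below are placeholders to tune against your own baseline, except P95 latency, which reuses the run‑card budget.
alerts:
  bad_answer_rate:
    window: 1h
    threshold: 0.02                # placeholder; tune to baseline
    action: page_oncall
  missing_citation_rate:
    window: 1h
    threshold: 0.01                # placeholder
    action: page_oncall
  p95_latency_ms:
    window: 15m
    threshold: 2500                # matches max_p95_latency_ms above
    action: rollback_to_baseline
  cost_per_answer_usd:
    window: 24h
    threshold: 0.05                # placeholder
    action: notify_owner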
Operating cadence and governance
- Weekly: review drift in query mix; refresh 10–20 golden items targeting new intents.
- Biweekly: rotate a canary model or reranker behind Gate 4 with hard rollback criteria.
- Monthly: full retriever re‑index; re‑baseline LangSmith/RAGAS/TruLens/Promptfoo runs; publish a one‑page report to stakeholders.
- Quarterly: rubric recalibration with human adjudication; archive stale items.
Troubleshooting playbook
- High recall@k, low faithfulness: your generator is over‑abstracting; add stricter answer templates and a larger citation k.
- Low recall@k, decent faithfulness: your retriever misses; fix chunking, add reranking, or expand index coverage.
- Good offline, poor online: environment drift; verify feature flags, prompt versions, and context window truncation.
- Spiky latency/cost: add adaptive routing (cheap model for easy intents) and early‑exit heuristics; a routing sketch follows.
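Adaptive routing sketch (YAML, illustrative)
Adaptive routing can start as a simple intent‑to‑model map plus one early‑exit rule. The model names, intents, and threshold below are placeholders, and the condition syntax is pseudo‑config rather than any specific router's API.
routing:
  default_model: strong-model                   # placeholder name
  rules:
    - intents: [returns-policy, store-hours]    # easy, high‑volume intents
      model: cheap-model                        # placeholder name
    - condition: "retrieval_top_score < 0.30"   # illustrative early‑exit heuristic
      action: respond_with_fallback_template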
Implementation notes for schema and metadata
- dateModified: 2025-10-17 (ISO 8601 recommended in your page template)
- FAQPage: expose the Q/A below as schema.org FAQ entries in your site layer
- Article: include headline, author, datePublished, dateModified, about (RAG evaluation), and mentions (LangSmith, RAGAS, TruLens, Promptfoo); a front‑matter sketch follows.
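Page front matter (YAML, illustrative)
If your site layer reads YAML front matter, the fields above can be declared once and rendered into schema.org markup by your template. The keys below are an assumption about your template rather than a standard, and datePublished is a placeholder.
title: "Practical, tool‑backed RAG evaluation you can ship this week"
datePublished: 2025-10-01          # placeholder
dateModified: 2025-10-17
schema_article:
  about: "RAG evaluation"
  mentions: [LangSmith, RAGAS, TruLens, Promptfoo]
schema_faq:
  source_section: "FAQ"            # hypothetical pointer to the FAQ entries below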
FAQ
Q: How big should my first golden set be? A: 150–300 items covering top intents, edge cases, and time‑sensitive queries. Grow as coverage expands.
Q: Should I use multiple judges? A: Yes. Keep one primary judge prompt/model for comparability; sample a second judge and humans on disagreements.
Q: What’s a good starting threshold for faithfulness? A: 0.80 on offline gates, with zero‑tolerance for fabricated citations. Tighten as you mature.
Q: How often do I refresh my index and golden set? A: Re‑index monthly (or on material content changes). Refresh 10–20 golden items weekly and fully re‑baseline monthly.
Q: Where do safety checks live—offline or online? A: Both. Run abuse/PII probes offline each change set and monitor a live canary list online with alerting and rollback.
References and further reading
- Zypsy capabilities across brand, product, and engineering: https://www.zypsy.com/capabilities
- Zypsy investment support for founders: https://www.zypsy.com/investment
- Tool docs to consult: LangSmith evaluations, RAGAS metrics, TruLens feedback functions, Promptfoo configuration