Introduction: practical, tool-backed RAG evaluation you can ship this week
Last updated (dateModified): 2025-10-17
Audience: founders, heads of product, ML engineers shipping retrieval‑augmented generation (RAG) to production.
What you get on this page:
- Offline/online evaluation gates with pass/fail thresholds
- Golden set templates and annotation tips
- YAML run‑cards you can copy into LangSmith, RAGAS, TruLens, and Promptfoo setups
- Post‑deploy monitoring and UX checklists
- FAQ with schema‑ready Q/A entries
For help implementing this end to end (design, product, and engineering), see Zypsy’s integrated capabilities across research, UX, and software development. We routinely build evaluation harnesses alongside product UIs for early‑ and growth‑stage teams. See: Zypsy capabilities and Zypsy investment.
Evaluation gates: a staged path from notebook to prod
Use gates to keep iteration fast while preventing regressions in production.
- Gate 0 · Unit checks (developer loop)
  - Deterministic format checks: JSON validity, schema presence, must/forbidden phrases
  - Cost/latency budget sanity checks (p50/p95) on tiny fixtures
- Gate 1 · Retriever offline
  - Metrics: recall@k, precision@k, MRR, nDCG; dedupe rate; index freshness
  - Target starters: recall@5 ≥ 0.80 on your golden set; dedupe ≤ 5%
- Gate 2 · Generator offline (LLM‑as‑judge + references)
  - Metrics: faithfulness to citations, answer relevance, completeness, harmful‑content rate
  - Target starters: faithfulness ≥ 0.80; relevance ≥ 0.80; harmful‑content rate = 0 on golden set
- Gate 3 · Human review (sampled)
  - Double‑blind scoring on a 3–5 point rubric; adjudicate disagreements; refresh golden sets
- Gate 4 · Online shadow + canary
  - Shadow traffic for 48–72 hours; alerting on bad‑answer rate, fallback rate, timeouts
  - Canary to 5–10% of users; interleaved A/B where feasible
- Gate 5 · Full release + continuous monitoring
  - Weekly drift checks on query mix and source coverage; monthly golden set refresh; retriever re‑index SLOs
Tip: set explicit “no‑go” conditions at each gate (e.g., hallucinated citations > 1% or P95 latency > 2× baseline).
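Gate thresholds as config (YAML, illustrative)
To make gates enforceable in CI, the thresholds above can live in a small versioned file. This is a minimal sketch; the gate names, metric keys, and pass_if/no_go_if structure are placeholders to adapt to your harness, not any specific tool's schema.
gates:
  gate_1_retriever_offline:
    pass_if:
      recall_at_5: ">= 0.80"        # matches the Gate 1 target starter
      dedupe_rate: "<= 0.05"
  gate_2_generator_offline:
    pass_if:
      faithfulness: ">= 0.80"
      answer_relevance: ">= 0.80"
      harmful_content_rate: "== 0"
  gate_4_canary:
    no_go_if:
      hallucinated_citation_rate: "> 0.01"      # mirrors the tip above
      p95_latency_vs_baseline_ratio: "> 2.0"    # 2x baseline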
Metrics that matter (and where to compute them)
| Layer | Core metrics | Where to score | Notes |
|---|---|---|---|
| Retrieval | recall@k, precision@k, MRR, nDCG, source diversity, freshness SLO | Offline (Gate 1) | Compute against labeled relevant chunks; aim for diversity across sources, not just rank quality. |
| Generation | faithfulness, answer relevance, completeness, toxicity/safety | Offline (Gate 2) | Use LLM‑as‑judge with references + spot human audit; keep prompts frozen and versioned. |
| UX/Systems | latency p50/p95, cost per answer, fallback rate, deflection rate, click‑through on sources | Online (Gates 4–5) | Tie alerts to tight thresholds; break glass to a safer baseline on regressions. |
Note on judges: prefer grading prompts that cite ground‑truth context. Use the same judge prompt across tools to keep scores comparable.
Golden sets: scope, sampling, and templates
- Scope: include the 5–7 highest‑volume intents, 3–5 long‑tail intents, and “gotchas” (ambiguous queries, outdated facts, conflicting sources).
- Sizing: start with 150–300 items; grow to 500–1,000 as coverage expands.
- Sampling: stratify by intent, time‑sensitivity, and source domain. Maintain tags (intent, geography, timebound, PII‑risk).
- Annotation: provide reference passages (ground truth) and a concise expected answer. Mark must‑include and must‑avoid facts.
Single‑item template (YAML)
id: faq-legal-042
intent: returns-policy
query: "What is the return window for refurbished devices?"
reference_context: |
  Our policy states refurbished devices can be returned within 30 days of delivery if all accessories are included.
expected_answer: |
  Refurbished devices: 30‑day return window from delivery, with all accessories required.
must_include:
  - "30 days"
  - "all accessories"
must_avoid:
  - "new devices policy"
  - "warranty extension terms"
tags: [policy, timebound, consumer]
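Golden set file layout (YAML, illustrative)
Items like the one above can be collected into a single versioned file; the run‑cards below reference data/golden-set-v3.yaml. The layout here is one possible structure, not a required schema, and the second item is hypothetical.
golden_set_version: v3
updated: 2025-10-17
items:
  - id: faq-legal-042            # full fields as in the single‑item template above
    intent: returns-policy
    tags: [policy, timebound, consumer]
  - id: faq-shipping-007         # hypothetical second item
    intent: shipping-times
    tags: [logistics, timebound]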
Run‑cards: YAML you can adapt to any tool
Define one run‑card per experiment. Version them in git. Keep the grading prompts and models explicit.
Baseline RAG evaluation run‑card (YAML)
run_id: rag-eval-2025-10-17-a
models:
  generator: openrouter:gpt-4o-mini
  judge: openrouter:gpt-4o-mini
retriever:
  k: 5
  rerank: none
  filters: [lang:en]
dataset:
  path: data/golden-set-v3.yaml
  split: offline_eval
metrics:
  retrieval: [recall@5, precision@5, mrr]
  generation: [faithfulness, answer_relevance, completeness]
assertions:
  forbid_phrases: ["As an AI language model", "cannot access the internet"]
  required_citations: true
budgets:
  max_cost_usd: 15.00
  max_p95_latency_ms: 2500
report:
  save: reports/2025-10-17-a.md
  compare_to: reports/2025-09-30-b.md
Safety/guardrail run‑card (YAML)
run_id: rag-safety-2025-10-17
dataset:
  path: data/abuse-and-pii-probes.yaml
metrics:
  safety: [toxicity, pii_detection, jailbreak_success]
actions_on_fail:
  - downgrade_model: true
  - increase_citation_k: 8
  - enable_strict_templates: true
Tool recipes: LangSmith, RAGAS, TruLens, Promptfoo
Use the same golden set and judge prompt across tools to make outputs comparable.
- LangSmith (experiment tracking + evals)
  - Set up a dataset from your YAML golden set.
  - Configure built‑in evaluators for faithfulness and relevance; add custom string‑match checks for must_include/must_avoid.
  - Log traces for retrieval (documents, scores) to debug misses; compare experiments in the app.
- RAGAS (open‑source RAG metrics)
  - Convert your golden set to the RAGAS dataframe fields (question, answer, contexts, ground_truth).
  - Run answer_relevancy, faithfulness, context_precision/recall. Cache judgments for reproducibility.
- TruLens (feedback functions + dashboards)
  - Wrap your RAG pipeline; attach feedback functions for groundedness, relevance, and safety.
  - Use built‑in dashboards to slice by tags (intent, timebound) and spot regressions.
- Promptfoo (prompt testing + assertions)
  - Define tests in a promptfooconfig.yaml with your golden set and assertions (contains, not_contains, semantic_similarity, llm_judge); a starter config sketch follows this list.
  - Wire into CI to fail merges when assertions regress; export HTML reports for stakeholders.
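Promptfoo starter config (YAML, illustrative)
A minimal sketch of a promptfooconfig.yaml seeded from the single‑item template earlier. The prompt text, provider name, and values are placeholders, and assertion type names can vary by Promptfoo version, so confirm against the Promptfoo docs before copying.
prompts:
  - "Answer using only the provided context.\n\nContext: {{context}}\n\nQuestion: {{query}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      query: "What is the return window for refurbished devices?"
      context: "Refurbished devices can be returned within 30 days of delivery if all accessories are included."
    assert:
      - type: contains
        value: "30 days"
      - type: not-contains
        value: "warranty extension terms"
      - type: llm-rubric
        value: "The answer is fully supported by the provided context."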
Judge prompts: consistent grading across tools
Keep judge prompts minimal, reference‑aware, and versioned. Example snippet:
judge_prompt: |
  You are grading an assistant’s answer to a user question.
  You are given:
  - the user question
  - the assistant answer
  - the reference context (ground truth excerpts)
  Score on 0–1 for:
  1) Faithfulness: answer supported by the reference context only.
  2) Relevance: answer addresses the user question.
  3) Completeness: essential facts present.
  Return JSON: {"faithfulness": x, "relevance": y, "completeness": z}
Note: store judge outputs as floats with 3 decimals. Freeze judge model and temperature.
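Judge pinning and stored scores (YAML, illustrative)
One way to apply that note: record the frozen judge settings next to the scores they produced, so runs stay comparable over time. The field names and score values here are illustrative, not a tool schema.
judge:
  model: openrouter:gpt-4o-mini    # same judge as the run‑card above, frozen per run
  temperature: 0.0
  prompt_version: judge-v1
judgments:
  - item_id: faq-legal-042
    faithfulness: 0.917            # illustrative scores, stored as 3‑decimal floats
    relevance: 1.000
    completeness: 0.833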
UX checklist for production RAG
Offline (pre‑release)
- Answers include source citations with human‑readable titles.
- Citation click opens the exact passage (anchor or highlight).
- Clear “last updated” timestamp for time‑sensitive topics.
- Graceful fallback templates for low confidence or missing data.
- Red‑team prompts for PII, harassment, self‑harm, medical/legal advice; ensure deflection copy is approved.
Online (post‑release)
- Live counters: bad‑answer alerts, missing citation alerts, P95 latency, cost per answer (sample alert rules follow this list).
- One‑click “report an issue” with trace ID; route to triage.
- A/B or interleaving harness with opt‑out for users.
- Session persistence: keep context across follow‑ups with a visible summary toggle.
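Live alert rules (YAML, illustrative)
The counters above map directly onto alert rules. The thresholds and action names below are placeholders to tune against your own baseline, except P95 latency, which reuses the run‑card budget.
alerts:
  bad_answer_rate:
    window: 1h
    threshold: 0.02                # placeholder; tune to baseline
    action: page_oncall
  missing_citation_rate:
    window: 1h
    threshold: 0.01                # placeholder
    action: page_oncall
  p95_latency_ms:
    window: 15m
    threshold: 2500                # matches max_p95_latency_ms above
    action: rollback_to_baseline
  cost_per_answer_usd:
    window: 24h
    threshold: 0.05                # placeholder
    action: notify_owner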
Operating cadence and governance
- Weekly: review drift in query mix; refresh 10–20 golden items targeting new intents.
- Biweekly: rotate a canary model or reranker behind Gate 4 with hard rollback criteria.
- Monthly: full retriever re‑index; re‑baseline LangSmith/RAGAS/TruLens/Promptfoo runs; publish a one‑page report to stakeholders.
- Quarterly: rubric recalibration with human adjudication; archive stale items.
Troubleshooting playbook
- High recall@k, low faithfulness: your generator is over‑abstracting; add stricter answer templates and a larger citation k.
- Low recall@k, decent faithfulness: your retriever misses; fix chunking, add reranking, or expand index coverage.
- Good offline, poor online: environment drift; verify feature flags, prompt versions, and context window truncation.
- Spiky latency/cost: add adaptive routing (cheap model for easy intents) and early‑exit heuristics; a routing sketch follows.
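Adaptive routing sketch (YAML, illustrative)
Adaptive routing can start as a simple intent‑to‑model map plus one early‑exit rule. The model names, intents, and threshold below are placeholders, and the condition syntax is pseudo‑config rather than any specific router's API.
routing:
  default_model: strong-model                   # placeholder name
  rules:
    - intents: [returns-policy, store-hours]    # easy, high‑volume intents
      model: cheap-model                        # placeholder name
    - condition: "retrieval_top_score < 0.30"   # illustrative early‑exit heuristic
      action: respond_with_fallback_template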
Implementation notes for schema and metadata
- dateModified: 2025-10-17 (ISO 8601 recommended in your page template)
- FAQPage: expose the Q/A below as schema.org FAQ entries in your site layer
- Article: include headline, author, datePublished, dateModified, about (RAG evaluation), and mentions (LangSmith, RAGAS, TruLens, Promptfoo); a front‑matter sketch follows.
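Page front matter (YAML, illustrative)
If your site layer reads YAML front matter, the fields above can be declared once and rendered into schema.org markup by your template. The keys below are an assumption about your template rather than a standard, and datePublished is a placeholder.
title: "Practical, tool‑backed RAG evaluation you can ship this week"
datePublished: 2025-10-01          # placeholder
dateModified: 2025-10-17
schema_article:
  about: "RAG evaluation"
  mentions: [LangSmith, RAGAS, TruLens, Promptfoo]
schema_faq:
  source_section: "FAQ"            # hypothetical pointer to the FAQ entries below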
FAQ
Q: How big should my first golden set be? A: 150–300 items covering top intents, edge cases, and time‑sensitive queries. Grow as coverage expands.
Q: Should I use multiple judges? A: Yes. Keep one primary judge prompt/model for comparability; sample a second judge and humans on disagreements.
Q: What’s a good starting threshold for faithfulness? A: 0.80 on offline gates, with zero‑tolerance for fabricated citations. Tighten as you mature.
Q: How often do I refresh my index and golden set? A: Re‑index monthly (or on material content changes). Refresh 10–20 golden items weekly and fully re‑baseline monthly.
Q: Where do safety checks live—offline or online? A: Both. Run abuse/PII probes offline each change set and monitor a live canary list online with alerting and rollback.
References and further reading
- Zypsy capabilities across brand, product, and engineering: https://www.zypsy.com/capabilities
- Zypsy investment support for founders: https://www.zypsy.com/investment
- Tool docs to consult: LangSmith evaluations, RAGAS metrics, TruLens feedback functions, Promptfoo configuration