A Three-Layer Benchmarking Framework for AI Knowledge Assistants - Part 1: Methodology
Most AI benchmarks test whether a system can answer questions. Ours tests whether it can be trusted to answer them in environments where a wrong answer has real consequences.
We built a three-layer benchmarking framework for our AI knowledge assistant, which is powered by a proprietary retrieval architecture designed specifically for regulated industries.
Step 1 : Curate a benchmark question set that comprehensively tests information throughout the knowledge base
Generic QA benchmarks don't reflect the complexity of real-world regulated documents. We curated questions that mirror the types of lookups users actually perform in production, including the ones most systems struggle with:
- Answers buried inside multi-column tables
- Codes and identifiers that appear exactly once in dense policy manuals spanning hundreds of pages
- Questions requiring synthesis across multiple sections
- Acronym-heavy queries where general-purpose AI models have no context
- Policy rules hidden inside forms, nested tables, and unstructured layouts
- General summarization questions
Step 2 : Measure retrieval separately from generation
This is where most benchmarks fall short. They only measure the final answer. But in regulated environments, correctness alone isn't sufficient. A right answer pulled from the wrong source can't be audited, can't be trusted to hold up as content changes, and masks the retrieval failures that will eventually surface as visible errors. Measuring retrieval and generation as separate layers is how you tell the difference between a system that works and one that appears to.
We split the evaluation into two layers:
- Retrieval: Did the system find the correct page? Did it rank it first? What happens as you retrieve more or fewer pages? Key performance metrics include:
  - Multi-label Precision: What proportion of the predicted pages appear in the ground truth
  - Multi-label Recall: What proportion of the ground truth pages appeared in the prediction
  - Multi-label F1: The harmonic mean of multi-label precision and recall
These metrics are computed for each individual inference and aggregated across examples to evaluate overall generalizability.
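The retrieval metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's actual implementation: the function names, the sample page numbers, and the choice of a simple mean across questions are all assumptions made for the example. It also shows how the scores move as more or fewer pages are retrieved.

```python
from statistics import mean

def page_metrics(predicted, truth):
    """Multi-label precision, recall, and F1 for a single question."""
    predicted, truth = set(predicted), set(truth)
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def aggregate(examples, k):
    """Average the per-question metrics, keeping only the top-k ranked pages."""
    rows = [page_metrics(ranked[:k], truth) for ranked, truth in examples]
    return tuple(mean(column) for column in zip(*rows))

# Hypothetical data: (ranked predicted pages, ground-truth pages) per question.
examples = [
    ([12, 13, 40], [12, 13]),
    ([7, 98, 99], [98]),
]

for k in (1, 2, 3):
    p, r, f1 = aggregate(examples, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy data, widening retrieval from one page to two raises recall, while going to three pages lowers precision without adding recall, which is the kind of tradeoff the sweep over retrieved pages is meant to expose.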
- Generation: Given what was retrieved, how accurate, complete, and well-formatted was the answer? We use an independent AI model in an LLM-as-a-judge scheme, scoring each response against verified ground truth answers. The key metrics are:
  - Accuracy: Does the response contain the correct factual answer, including exact codes, modifiers, and identifiers where applicable
  - Completeness: Does it fully address all parts of the question, or only part of it
  - Hallucination Safety: Are all claims traceable to the source document, or does the response introduce information not supported by the source
  - Conciseness: Is the response length appropriate for the complexity of the question
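One way to make the judge's output auditable is to have it return a structured score per metric. The sketch below is an assumption-heavy illustration: the 1-to-5 scale, the JSON reply format, the prompt wording, and every name in it are hypothetical, and the call to the judge model itself is left out (a canned reply stands in for it).

```python
import json
from dataclasses import dataclass, fields
from statistics import mean

@dataclass
class JudgeScores:
    # The four generation metrics; the 1-5 scale is an assumption.
    accuracy: float
    completeness: float
    hallucination_safety: float
    conciseness: float

def build_judge_prompt(question, ground_truth, response):
    """Assemble a hypothetical grading prompt for the independent judge model."""
    names = ", ".join(f.name for f in fields(JudgeScores))
    return (
        "Score the response against the verified answer.\n"
        f"Question: {question}\n"
        f"Verified answer: {ground_truth}\n"
        f"Response: {response}\n"
        f"Reply with a JSON object containing 1-5 scores for: {names}."
    )

def parse_judge_reply(reply):
    """Turn the judge's JSON reply into typed scores; raises if a metric is missing."""
    data = json.loads(reply)
    return JudgeScores(**{f.name: float(data[f.name]) for f in fields(JudgeScores)})

def average_scores(scores):
    """Mean of each metric across all judged responses."""
    return {f.name: mean(getattr(s, f.name) for s in scores) for f in fields(JudgeScores)}

# Canned replies standing in for real judge output.
replies = [
    '{"accuracy": 5, "completeness": 4, "hallucination_safety": 5, "conciseness": 3}',
    '{"accuracy": 3, "completeness": 4, "hallucination_safety": 5, "conciseness": 5}',
]
print(average_scores([parse_judge_reply(r) for r in replies]))
```

Keeping the per-question scores alongside the averages preserves the link from any aggregate number back to a specific question and judge reply.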
This separation matters because a system can retrieve the wrong page and still generate a plausible-sounding answer. Worse, with a large enough context window, the correct answer might appear somewhere in the retrieved material even when the retrieval itself was inaccurate. Both scenarios mask retrieval failures. The first produces confident, well-formatted, completely wrong responses. The second appears to work but breaks unpredictably as content changes or scales. In regulated environments, that’s the dangerous failure mode, and it’s invisible without layered evaluation.
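To make the point concrete, the sketch below crosses the two layers for each question and labels the outcome. The labels, the function name, and the sample outcomes are hypothetical, and what counts as a retrieval hit (for example, the top-ranked page being in the ground truth) is an assumption left to the reader.

```python
from collections import Counter

def diagnose(retrieval_hit: bool, answer_correct: bool) -> str:
    """Label one question by combining the retrieval and generation results."""
    if retrieval_hit and answer_correct:
        return "ok"
    if retrieval_hit:
        return "generation failure"
    if answer_correct:
        # Right answer, wrong source: invisible to answer-only benchmarks.
        return "masked retrieval failure"
    return "retrieval failure"

# Hypothetical per-question outcomes: (retrieval_hit, answer_correct).
outcomes = [(True, True), (False, True), (False, False), (True, False), (False, True)]
summary = Counter(diagnose(hit, correct) for hit, correct in outcomes)
print(summary)
```

An answer-only benchmark would report three of these five questions as correct, while the layered view shows that two of those three came from the wrong source.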
Step 3 : Evaluate user experience metrics
Retrieval and generation quality tell you if the system works. These metrics tell you if it works in production.
- Latency: How long does the end-to-end response take? We measure time-to-first-token separately from total generation time to distinguish retrieval latency from model latency.
- Formatting Quality: Does the response structure match the complexity of the question? Clean one-line answers for simple lookups. Structured tables and bullet points for multi-part policy questions.
- Cost: Good retrieval pays compounding dividends here. When the retrieval layer consistently surfaces the right source material, the LLM doesn't need to compensate with size. That means smaller models become viable, which in turn opens up self-hosting on reasonable GPU instances, or API-based deployments that consume significantly fewer tokens per query. Better retrieval directly expands your deployment options and reduces cost at scale.
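For the latency metric, time-to-first-token and total time can be captured by wrapping any streamed response in a timer. This is a minimal sketch under stated assumptions: `time_stream` and `fake_stream` are hypothetical names, and the simulated stream with fixed sleeps stands in for a real retrieval-plus-generation pipeline.

```python
import time

def time_stream(token_stream):
    """Measure time-to-first-token and total time for an iterable of tokens."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            # Everything before the first token: retrieval, prompt processing.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "tokens": count}

def fake_stream():
    """Simulated pipeline: a pause before the first token, then steady output."""
    time.sleep(0.05)  # stands in for retrieval and prompt processing
    for token in ["The", " answer", "."]:
        time.sleep(0.01)  # stands in for per-token generation time
        yield token

result = time_stream(fake_stream())
print(result)
```

Note that time-to-first-token as measured here includes everything that happens before the first token arrives, so it is an upper bound on retrieval latency rather than a pure measurement of it.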
Takeaways
- Modular troubleshooting: Knowing where a system breaks is as important as knowing that it breaks. This modular evaluation approach gives you that clarity.
- Transparent evaluation: Every score is traceable to a specific question, retrieved source, and judge rationale. Stakeholders can audit the results, not just trust them.
- Swap-and-tune flexibility: Evaluate retrieval and generation independently; swap models or tune parameters and measure the impact in isolation.
- Informed deployment decisions: If a smaller model matches frontier model accuracy on your domain and your benchmark proves it, you have the evidence to justify self-hosting or right-sizing your API spend.
In Part 2, we share the full results - what the numbers actually looked like, where each system broke down, and what the benchmark revealed about the tradeoffs between retrieval breadth, answer quality, and cost.
