A Three-Layer Benchmarking Framework for AI Knowledge Assistants - Part 1: Methodology

Most AI benchmarks test whether a system can answer questions. Ours tests whether it can be trusted to answer them in environments where a wrong answer has real consequences.

We built a three-layer benchmarking framework for our AI knowledge assistant, powered by a proprietary retrieval architecture designed specifically for regulated industries.

Step 1: Curate a benchmark question set that comprehensively tests information throughout the knowledge base

Generic QA benchmarks don't reflect the complexity of real-world regulated documents. We curated questions that mirror the types of lookups users actually perform in production, including the ones most systems struggle with:

  • Answers buried inside multi-column tables
  • Codes and identifiers that appear exactly once in dense policy manuals spanning hundreds of pages
  • Questions requiring synthesis across multiple sections
  • Acronym-heavy queries where general-purpose AI models have no context
  • Policy rules hidden inside forms, nested tables, and unstructured layouts
  • General summarization questions
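
One way to keep a curated set honest about its coverage is to tag every question with its type and check that no type is empty. The sketch below is illustrative only: the `QuestionType` names, the `BenchmarkQuestion` fields, and the helper functions are assumptions made for this example, not the schema used in our benchmark.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, Iterable, List


class QuestionType(Enum):
    # Hypothetical labels mirroring the six question categories listed above.
    TABLE_LOOKUP = "table_lookup"
    RARE_IDENTIFIER = "rare_identifier"
    CROSS_SECTION = "cross_section"
    ACRONYM = "acronym"
    EMBEDDED_POLICY = "embedded_policy"
    SUMMARIZATION = "summarization"


@dataclass(frozen=True)
class BenchmarkQuestion:
    # Field names are assumptions; the point is that each question carries
    # both its ground-truth pages (for retrieval) and answer (for generation).
    question_id: str
    text: str
    question_type: QuestionType
    ground_truth_pages: FrozenSet[int]
    ground_truth_answer: str


def coverage_by_type(questions: Iterable[BenchmarkQuestion]) -> Dict[QuestionType, int]:
    """Count questions per type, including types with zero questions."""
    counts = Counter(q.question_type for q in questions)
    return {qt: counts.get(qt, 0) for qt in QuestionType}


def missing_types(questions: Iterable[BenchmarkQuestion]) -> List[QuestionType]:
    """Return the question types that the curated set does not cover yet."""
    return [qt for qt, n in coverage_by_type(questions).items() if n == 0]
```

Running `missing_types` on a draft question set gives a quick signal of which categories still need curated examples before the benchmark is run.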

Step 2: Measure retrieval separately from generation

This is where most benchmarks fall short. They only measure the final answer. But in regulated environments, correctness alone isn't sufficient. A right answer pulled from the wrong source can't be audited, can't be trusted to hold up as content changes, and masks the retrieval failures that will eventually surface as visible errors. Measuring retrieval and generation as separate layers is how you tell the difference between a system that works and one that appears to.

We split the evaluation into two layers:

  1. Retrieval: Did the system find the correct page? Did it rank it first? What happens as you retrieve more or fewer pages? Key performance metrics include:
    • Multi-label Precision: What proportion of the predicted pages appear in the ground truth pages?
    • Multi-label Recall: What proportion of the ground truth pages appeared in the prediction?
    • Multi-label F1: The harmonic mean of multi-label precision and recall

      These metrics are computed for each individual inference and aggregated across examples to evaluate overall generalizability.
  2. Generation: Given what was retrieved, how accurate, complete, and well-formatted was the answer? We use an independent AI model in an LLM-as-a-judge scheme, scoring each response against verified ground truth answers. The key metrics are:
    • Accuracy: Does the response contain the correct factual answer, including exact codes, modifiers, and identifiers where applicable?
    • Completeness: Does it fully address all parts of the question, or only part of it?
    • Hallucination Safety: Are all claims traceable to the source document, or does the response introduce information not supported by the source?
    • Conciseness: Is the response length appropriate for the complexity of the question?

      This separation matters because a system can retrieve the wrong page and still generate a plausible-sounding answer. Worse, with a large enough context window, the correct answer might appear somewhere in the retrieved material even when the retrieval itself was inaccurate. Both scenarios mask retrieval failures. The first produces confident, well-formatted, completely wrong responses. The second appears to work but breaks unpredictably as content changes or scales. In regulated environments, that’s the dangerous failure mode, and it’s invisible without layered evaluation.
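
The retrieval metrics above can be sketched in a few lines. This is a minimal illustration, assuming pages are identified by integer page numbers, that per-question scores are aggregated with a simple mean, and that a "masked" failure means a judged-correct answer with zero ground-truth pages retrieved. The function names and that threshold are assumptions for the example, not a description of our production scoring code.

```python
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def page_metrics(predicted: Iterable[int], truth: Iterable[int]) -> Dict[str, float]:
    """Set-based precision, recall, and F1 for one question's retrieved pages."""
    predicted_set, truth_set = set(predicted), set(truth)
    hits = len(predicted_set & truth_set)
    precision = hits / len(predicted_set) if predicted_set else 0.0
    recall = hits / len(truth_set) if truth_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def aggregate(examples: List[Tuple[Iterable[int], Iterable[int]]]) -> Dict[str, float]:
    """Mean of each per-question metric; expects a non-empty list of
    (predicted_pages, ground_truth_pages) pairs."""
    per_question = [page_metrics(p, t) for p, t in examples]
    return {k: mean(m[k] for m in per_question) for k in ("precision", "recall", "f1")}


def masked_retrieval_failure(metrics: Dict[str, float], answer_judged_correct: bool) -> bool:
    """Flag a right-looking answer that was produced without any ground-truth
    page being retrieved (assumed definition: recall of exactly zero)."""
    return answer_judged_correct and metrics["recall"] == 0.0
```

For example, `page_metrics([12, 13, 40], [12, 13])` yields a precision of 2/3, a recall of 1.0, and an F1 of 0.8: both ground-truth pages were found, but one of the three retrieved pages was noise. Only a layered evaluation can raise the `masked_retrieval_failure` flag, because it needs the retrieval score and the judge's verdict side by side.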

Step 3: Evaluate user experience metrics

Retrieval and generation quality tell you if the system works. These metrics tell you if it works in production.

  • Latency: How long does the end-to-end response take? We measure time-to-first-token separately from total generation time to distinguish retrieval latency from model latency.
  • Formatting Quality: Does the response structure match the complexity of the question? Clean one-line answers for simple lookups. Structured tables and bullet points for multi-part policy questions.
  • Cost: Good retrieval pays compounding dividends here. When the retrieval layer consistently surfaces the right source material, the LLM doesn't need to compensate with size. That means smaller models become viable, which in turn opens up self-hosting on reasonable GPU instances, or API-based deployments that consume significantly fewer tokens per query. Better retrieval directly expands your deployment options and reduces cost at scale.
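
The latency split described above can be sketched as a small timing wrapper around a streamed response. This is a rough illustration under stated assumptions: `start_stream` is a hypothetical zero-argument callable that sends the query and returns an iterable of text chunks, and time-to-first-token is treated as an approximation of everything that happens before the model starts emitting output (retrieval plus prompt processing).

```python
import time
from typing import Callable, Dict, Iterable, Optional, Union


def measure_latency(
    start_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Union[str, float, None]]:
    """Time a streamed response: time-to-first-token, generation time, total."""
    start = clock()
    first_token_at: Optional[float] = None
    chunks = []
    # The request is started after the clock so its setup cost is included.
    for chunk in start_stream():
        if first_token_at is None:
            first_token_at = clock()
        chunks.append(chunk)
    end = clock()
    got_output = first_token_at is not None
    return {
        "text": "".join(chunks),
        "time_to_first_token_s": first_token_at - start if got_output else None,
        "generation_s": end - first_token_at if got_output else None,
        "total_s": end - start,
    }
```

Passing the clock as a parameter keeps the function testable with a fake clock. In a benchmark run it would be called once per question and the timings stored alongside the retrieval and generation scores.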

Takeaways
  • Modular troubleshooting: Knowing where a system breaks is as important as knowing that it breaks. This modular evaluation approach gives you that clarity.
  • Transparent evaluation: Every score is traceable to a specific question, retrieved source, and judge rationale. Stakeholders can audit the results, not just trust them.
  • Swap-and-tune flexibility: Evaluate retrieval and generation independently; swap models or tune parameters and measure the impact in isolation.
  • Informed deployment decisions: If a smaller model matches frontier model accuracy on your domain and your benchmark proves it, you have the evidence to justify self-hosting or to right-size your API spend.
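
To make the traceability and swap-and-tune points concrete, here is a hedged sketch of how judged results might be stored and compared. The `JudgedResponse` fields, the 0-to-1 score scale, and the helper names are all assumptions made for illustration; they are not the schema or the judge output format used in our benchmark.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple

# The four generation metrics scored by the judge (assumed scale: 0.0 to 1.0).
METRICS = ("accuracy", "completeness", "hallucination_safety", "conciseness")


@dataclass(frozen=True)
class JudgedResponse:
    # Each score stays attached to its question, sources, and rationale,
    # so a reviewer can audit why a number is what it is.
    question_id: str
    retrieved_pages: Tuple[int, ...]
    accuracy: float
    completeness: float
    hallucination_safety: float
    conciseness: float
    judge_rationale: str


def summarize(results: List[JudgedResponse]) -> Dict[str, float]:
    """Mean of each judged metric over a non-empty list of results."""
    return {m: mean(getattr(r, m) for r in results) for m in METRICS}


def compare(baseline: List[JudgedResponse], candidate: List[JudgedResponse]) -> Dict[str, float]:
    """Per-metric change when one component (model, parameter) is swapped."""
    base, cand = summarize(baseline), summarize(candidate)
    return {m: cand[m] - base[m] for m in METRICS}


def lowest_scoring(results: List[JudgedResponse], metric: str, n: int = 3) -> List[JudgedResponse]:
    """The n weakest responses on one metric, with rationale, for manual review."""
    return sorted(results, key=lambda r: getattr(r, metric))[:n]
```

A call such as `compare(frontier_results, small_model_results)` (both names hypothetical) would show, metric by metric, what is gained or lost by moving to a smaller model, while `lowest_scoring` points reviewers to the specific questions and judge rationales behind the weakest scores.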

In Part 2, we share the full results: what the numbers actually looked like, where each system broke down, and what the benchmark revealed about the tradeoffs between retrieval breadth, answer quality, and cost.
