A Practical RAG Evaluation Workflow

This project started from a practical question: how do chunking, retrieval, reranking, generation, and their variants affect final answer quality?

That question is harder than it sounds. A retrieval metric can improve while the final answer stays flat. A chunking change can help one type of question and hurt another. A better-looking run can still leave it unclear which part of the pipeline actually changed the outcome.

The goal was not only to assemble a working pipeline, but to understand how different parts of that pipeline shape the final result. That led to a modular, configuration-driven pipeline where component variants could be swapped. It also led to request capture, so that each processed request could be stored for later evaluation.

I also built an offline eval engine that reads captured requests from the database, runs evaluation suites on them, aggregates the resulting metrics, and produces a run-level report for the whole batch. Together, that pipeline and eval setup made it possible to compare component variants across the same question set and get an initial picture of how the main parts of the system affected final answer quality.

Once that foundation was in place, it became possible to narrow the scope and investigate a more specific question in more detail. This article follows that path: from broad component comparison to a narrower look at how final answer quality varies by question type under different chunking strategies.

What this article is not

This is not a statistically conclusive benchmark. It is a practical eval workflow and a small repeated-run study used to make pipeline behavior more interpretable.

The Eval Surface

The evaluation setup in this project works at three levels: judge suites for request-level answer and chunk quality, retrieval metrics computed against golden targets, and run-level aggregates built from both.

Judge suites

suite	question it answers	scoring
`answer_completeness`	Did the answer cover the full substance of the question?	`1.0 / 0.5 / 0.0`
`groundedness`	Is the answer supported by the context selected for generation?	`1.0 / 0.5 / 0.0`
`answer_relevance`	Did the answer actually address the user’s question?	`1.0 / 0.5 / 0.0`
`correct_refusal`	If the system refused, was that refusal justified?	`1.0 / 0.0`
`retrieval_relevance`	Is this retrieved chunk useful for answering the question?	`1.0 / 0.5 / 0.0`

The first four suites judge the final answer. `retrieval_relevance` is different: it is applied to each retrieved chunk separately, so it gives a chunk-level view of retrieval quality.

Run-level metrics

metric family	examples	what it captures
answer-quality aggregates	`answer_completeness_mean`, `groundedness_mean`, `answer_relevance_mean`, `correct_refusal_rate`	Average answer quality across the run
chunk-quality aggregates	`retrieval_relevance_mean`, `retrieval_relevance_selected_mean`, `retrieval_relevance_weighted_topk_mean`	Average judged usefulness of retrieved chunks: across all retrieved chunks, only the selected context, or with extra weight on higher ranks
golden-target retrieval metrics	recall, `MRR`, `nDCG` for retrieved chunks; recall, `MRR`, `nDCG` for selected chunks	The same retrieval-quality metrics computed against the golden retrieval targets at two stages: first on the retrieved candidate set, then on the smaller chunk set actually passed to the generator
transition metric	`retrieval_context_loss`	The recall drop between retrieval output and final generation context
conditional aggregates	`groundedness_given_relevant_context`, `answer_completeness_given_relevant_context`, `success_rate_when_at_least_one_relevant_in_topk`	Answer quality measured only on requests where retrieval did supply relevant evidence

A few distinctions matter. `retrieval_relevance_mean` averages chunk-level judgments across all retrieved chunks, while `retrieval_relevance_selected_mean` keeps only the chunks that actually reached generation. `retrieval_relevance_weighted_topk_mean` emphasizes higher-ranked chunks, so it captures whether useful evidence appeared near the top, not just somewhere in the list.
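The difference between the three chunk-quality aggregates can be sketched for a single request as follows. This is a minimal illustration, not the project's implementation: the `JudgedChunk` structure, the cutoff `k`, and the `1 / log2(rank + 1)` weighting are all assumptions, since the exact weighting scheme is not specified here. Run-level values would then average these per-request numbers.

```python
import math
from dataclasses import dataclass


@dataclass
class JudgedChunk:
    rank: int        # 1-based position in the retrieved list
    score: float     # retrieval_relevance judgment: 1.0, 0.5, or 0.0
    selected: bool   # True if the chunk was passed to the generator


def relevance_mean(chunks):
    """Plain average over every retrieved chunk."""
    return sum(c.score for c in chunks) / len(chunks) if chunks else 0.0


def relevance_selected_mean(chunks):
    """Average over only the chunks that reached generation."""
    return relevance_mean([c for c in chunks if c.selected])


def relevance_weighted_topk_mean(chunks, k=5):
    """Rank-weighted average over the top k chunks.

    The 1 / log2(rank + 1) weights are an assumed scheme for illustration.
    """
    top = [c for c in chunks if c.rank <= k]
    weights = [1.0 / math.log2(c.rank + 1) for c in top]
    total = sum(weights)
    if not total:
        return 0.0
    return sum(w * c.score for w, c in zip(weights, top)) / total


# Hypothetical judgments for one request.
chunks = [
    JudgedChunk(rank=1, score=1.0, selected=True),
    JudgedChunk(rank=2, score=0.0, selected=True),
    JudgedChunk(rank=3, score=0.5, selected=False),
    JudgedChunk(rank=4, score=0.0, selected=False),
]

all_mean = relevance_mean(chunks)
selected_mean = relevance_selected_mean(chunks)
weighted_mean = relevance_weighted_topk_mean(chunks)
```

In this example the weighted value is higher than the plain mean because the only fully relevant chunk sits at rank 1.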

The retrieval metrics in this section are all computed against the golden retrieval targets. They are applied twice: first to the retrieved candidate set, then to the smaller selected context that actually reaches the model. `retrieval_context_loss` summarizes the gap between those two stages as the drop in recall.
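A minimal sketch of the two-stage computation, assuming binary relevance (a chunk either is or is not a golden target) and chunk ids without duplicates. The function names and the example ids are hypothetical; the project's own formulas may differ in detail, for example in how nDCG is normalized.

```python
import math


def recall(ranked_ids, golden_ids):
    """Fraction of golden chunks that appear anywhere in the list."""
    if not golden_ids:
        return 0.0
    return len(set(ranked_ids) & set(golden_ids)) / len(set(golden_ids))


def mrr(ranked_ids, golden_ids):
    """Reciprocal rank of the first golden chunk, or 0.0 if none appears."""
    for pos, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in golden_ids:
            return 1.0 / pos
    return 0.0


def ndcg(ranked_ids, golden_ids):
    """Binary-gain nDCG: DCG of the list divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, chunk_id in enumerate(ranked_ids, start=1)
        if chunk_id in golden_ids
    )
    ideal_hits = min(len(golden_ids), len(ranked_ids))
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def retrieval_context_loss(retrieved_ids, selected_ids, golden_ids):
    """Recall drop between the retrieved set and the selected context."""
    return recall(retrieved_ids, golden_ids) - recall(selected_ids, golden_ids)


# Hypothetical ids for one request.
golden = {"c2", "c7"}
retrieved = ["c1", "c2", "c5", "c7", "c9"]   # candidate set from retrieval
selected = ["c1", "c2", "c5"]                # context passed to the generator

loss = retrieval_context_loss(retrieved, selected, golden)
```

Here retrieval found both golden chunks, but only one survived selection, so half of the recall was lost before generation.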

The conditional aggregates help separate retrieval failure from downstream failure. For example, `groundedness_given_relevant_context` asks how grounded the answers were when retrieval had already supplied relevant evidence. `success_rate_when_at_least_one_relevant_in_topk` is stricter: a request counts as a success only when the answer is both complete and grounded.
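The conditioning step can be sketched like this. The `RequestSummary` fields and the thresholds are assumptions made for the example: a chunk counts as relevant at a judged score of `1.0`, and an answer counts as complete and grounded only at `1.0` on both suites. The actual cutoffs used in the project are not stated here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestSummary:
    answer_completeness: float
    groundedness: float
    topk_chunk_scores: List[float]  # retrieval_relevance score per top-k chunk


def has_relevant_context(row, threshold=1.0):
    """Assumed condition: at least one top-k chunk judged at or above threshold."""
    return any(score >= threshold for score in row.topk_chunk_scores)


def groundedness_given_relevant_context(rows) -> Optional[float]:
    """Mean groundedness over requests where retrieval supplied relevant evidence."""
    eligible = [r for r in rows if has_relevant_context(r)]
    if not eligible:
        return None  # the condition never held, so the metric is undefined
    return sum(r.groundedness for r in eligible) / len(eligible)


def success_rate_when_at_least_one_relevant_in_topk(rows) -> Optional[float]:
    """Share of eligible requests whose answer is both complete and grounded."""
    eligible = [r for r in rows if has_relevant_context(r)]
    if not eligible:
        return None
    successes = [
        r for r in eligible
        if r.answer_completeness == 1.0 and r.groundedness == 1.0
    ]
    return len(successes) / len(eligible)


# Hypothetical request summaries.
rows = [
    RequestSummary(1.0, 1.0, [1.0, 0.0]),  # relevant context, full success
    RequestSummary(0.5, 1.0, [1.0, 0.5]),  # relevant context, incomplete answer
    RequestSummary(0.0, 0.0, [0.0, 0.5]),  # no relevant context: excluded
]

grounded_rate = groundedness_given_relevant_context(rows)
success_rate = success_rate_when_at_least_one_relevant_in_topk(rows)
```

The third request does not drag the conditional metrics down, because its failure is attributable to retrieval rather than to generation.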

This separation helps move the comparison beyond “config A beat config B” toward more specific diagnoses, such as:

retrieval indicators improved, but answer quality did not move much
the model had enough relevant context, but still failed to synthesize the answer
a chunker improved one category of questions and harmed another

Swappable Parts of the Pipeline

The main components could be varied independently across evaluation runs. The comparison surface is summarized below.

The model names in this table reflect the serving setup used in the project. The remote generation and judge models were open-weight models served through Together AI, where the `openai/` prefix is part of the provider’s routing convention rather than a reference to OpenAI’s proprietary API models. The local Qwen model was used as a low-cost baseline, while the 20b and 120b remote variants made it possible to compare answer quality across model sizes.

pipeline part	variant	description
`chunking`	`structural`	Each chunk is a section or subsection from the source.
	`fixed_in_structural`	Structural chunks split into fixed-size chunks on sentence boundaries, with `350` tokens and `15%` overlap.
	`fixed`	The full source split directly into fixed-size chunks on sentence boundaries, with `350` tokens and `15%` overlap.
`retriever`	`Dense`	Dense retrieval branch.
	`Hybrid bm25`	Hybrid branch with a BM25-style sparse signal.
	`Hybrid bow`	Hybrid branch with a bag-of-words sparse signal.
`reranker`	`PassThrough`	No learned reranking; retrieved order is preserved.
	`Heuristic`	Rule-based reranking with lexical and structure-aware signals.
	`CrossEncoder local`	`mixedbread-ai/mxbai-rerank-base-v2` served locally.
	`CrossEncoder remote`	Voyage AI `rerank-2.5`.
`generation`	local	`qwen2.5:1.5b-instruct`, rebuilt locally to accept a larger context window.
	remote	`openai/gpt-oss-20b` via Together.
	remote	`openai/gpt-oss-120b` via Together.
`judge`	remote	`openai/gpt-oss-20b` via Together.
	remote	`openai/gpt-oss-120b` via Together.
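One way to express this comparison surface in configuration is sketched below. The `PipelineConfig` class, its field names, and the chosen base values are illustrative assumptions rather than the project's actual schema; only the variant names come from the table above.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class PipelineConfig:
    chunking: str
    retriever: str
    reranker: str
    generation: str
    judge: str


CHUNKING = ["structural", "fixed_in_structural", "fixed"]
RETRIEVERS = ["Dense", "Hybrid bm25", "Hybrid bow"]

# An arbitrary starting point for the example, not a recommended setup.
base = PipelineConfig(
    chunking="structural",
    retriever="Dense",
    reranker="PassThrough",
    generation="openai/gpt-oss-20b",
    judge="openai/gpt-oss-120b",
)

# Vary two components while holding the rest constant.
variants = [
    replace(base, chunking=chunking, retriever=retriever)
    for chunking, retriever in product(CHUNKING, RETRIEVERS)
]
```

Keeping the config immutable and deriving variants with `replace` makes it explicit which components changed between two runs and which stayed fixed.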

How Evaluation Runs Are Organized

Evaluation in this project is an offline workflow built on top of persisted request capture. The runtime stores each processed request in the database, and the eval engine later reads those captured requests instead of replaying the live pipeline.

Each eval run gets its own run_id and freezes its scope at the start. In practice, that means the engine first selects the eligible captured requests, writes that request set into the run manifest, and keeps it fixed for the life of the run. New requests that appear later are not silently absorbed into an already running evaluation.
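The scope freeze can be illustrated with a small in-memory SQLite sketch. The table names, columns, and the "every captured request is eligible" rule are assumptions for the example; the project's real schema and eligibility filter are not shown here.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE captured_requests (request_id TEXT PRIMARY KEY, question TEXT);
    CREATE TABLE run_manifest (
        run_id TEXT,
        request_id TEXT,
        PRIMARY KEY (run_id, request_id)
    );
    """
)


def start_run(conn):
    """Create a run_id and snapshot the currently eligible requests."""
    run_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO run_manifest (run_id, request_id) "
        "SELECT ?, request_id FROM captured_requests",
        (run_id,),
    )
    conn.commit()
    return run_id


def run_scope(conn, run_id):
    """Return the frozen request set for a run, read from the manifest only."""
    rows = conn.execute(
        "SELECT request_id FROM run_manifest WHERE run_id = ? ORDER BY request_id",
        (run_id,),
    )
    return [row[0] for row in rows]


conn.executemany(
    "INSERT INTO captured_requests VALUES (?, ?)",
    [("r1", "question one"), ("r2", "question two")],
)
run_id = start_run(conn)

# A request captured after the run started is not part of its scope.
conn.execute("INSERT INTO captured_requests VALUES (?, ?)", ("r3", "question three"))
scope = run_scope(conn, run_id)
```

Because later stages read the manifest rather than the live capture table, the run sees the same request set no matter when it is resumed.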

The run then moves through three stages. `judge_generation` evaluates the final answer with the answer-level suites. `judge_retrieval` evaluates the retrieved chunks with the retrieval-relevance suite. `build_request_summary` combines those results with the original request capture and materializes one request-level summary row.

The engine writes its state back to the database after each step. Judge outputs, processing state, and run-scoped summaries are persisted as structured tables rather than kept only in logs. This makes the workflow resumable: if a run fails partway through, it can be restarted with the same run_id and continue from the last completed state instead of starting from scratch. The same persisted tables later serve as Grafana data sources for dashboard views and comparative analysis.
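A minimal sketch of the resume behavior, assuming a hypothetical `stage_state` table that records each completed (run, request, stage) triple. The handlers are stand-ins for the real judge calls, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

STAGES = ["judge_generation", "judge_retrieval", "build_request_summary"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stage_state (run_id TEXT, request_id TEXT, stage TEXT, "
    "PRIMARY KEY (run_id, request_id, stage))"
)


def process_run(conn, run_id, request_ids, handlers):
    """Run each stage over the frozen scope, skipping work already recorded."""
    executed = []
    for stage in STAGES:
        for request_id in request_ids:
            done = conn.execute(
                "SELECT 1 FROM stage_state "
                "WHERE run_id = ? AND request_id = ? AND stage = ?",
                (run_id, request_id, stage),
            ).fetchone()
            if done:
                continue
            handlers[stage](request_id)  # may raise; earlier state stays persisted
            conn.execute(
                "INSERT INTO stage_state VALUES (?, ?, ?)",
                (run_id, request_id, stage),
            )
            conn.commit()
            executed.append((request_id, stage))
    return executed


calls = []


def recording_handler(stage):
    """Build a stand-in handler that only records that it was called."""
    return lambda request_id: calls.append((request_id, stage))


def failing_handler(request_id):
    raise RuntimeError("simulated judge outage")


handlers = {stage: recording_handler(stage) for stage in STAGES}
broken = dict(handlers, judge_retrieval=failing_handler)

try:
    process_run(conn, "run-1", ["r1", "r2"], broken)  # fails partway through
except RuntimeError:
    pass

# Restarting with the same run_id continues from the last completed state.
resumed = process_run(conn, "run-1", ["r1", "r2"], handlers)
```

The second call does not repeat `judge_generation`, because both requests were already marked complete for that stage before the failure.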

Once every request in the frozen run scope is completed, the eval engine builds a run-level report with aggregated metrics, label distributions, retrieval-quality summaries, and selected worst-case previews. A representative example is this run report.
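The shape of such a report can be sketched from request-level summary rows. The row fields, the report keys, and the rule for picking worst cases (lowest completeness first, then lowest groundedness) are assumptions for illustration; the values are placeholders, not results from this project.

```python
from collections import Counter
from statistics import mean


def build_run_report(rows, worst_n=2):
    """Aggregate request-level summary rows into one run-level report."""
    worst = sorted(rows, key=lambda r: (r["answer_completeness"], r["groundedness"]))
    return {
        "request_count": len(rows),
        "answer_completeness_mean": mean(r["answer_completeness"] for r in rows),
        "groundedness_mean": mean(r["groundedness"] for r in rows),
        # How often each score was assigned, not just the average.
        "answer_completeness_distribution": dict(
            Counter(r["answer_completeness"] for r in rows)
        ),
        "worst_cases": [r["request_id"] for r in worst[:worst_n]],
    }


# Placeholder rows for illustration only.
rows = [
    {"request_id": "r1", "answer_completeness": 1.0, "groundedness": 1.0},
    {"request_id": "r2", "answer_completeness": 0.5, "groundedness": 1.0},
    {"request_id": "r3", "answer_completeness": 0.0, "groundedness": 0.5},
    {"request_id": "r4", "answer_completeness": 1.0, "groundedness": 0.5},
]

report = build_run_report(rows)
```

Keeping the distribution next to the mean matters with a three-level score: the same average can come from uniformly partial answers or from a mix of full successes and outright failures.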

Stage 1: Broad Comparative Evaluation

The first evaluation stage was a broad comparison across the pipeline variants described above. Its purpose was not to identify a final winner, but to get an initial read on which branches looked promising and which parts of the system seemed to matter most. The fuller comparison is documented in the 20-question comparative report.

At the chunking layer, this stage compared only two variants: `structural` and `fixed`. The third variant, `fixed_in_structural`, was introduced later as a follow-up hypothesis rather than as part of the original comparison matrix.

Several patterns stood out. Hybrid bm25 looked like the strongest retrieval branch. PassThrough turned out to be a stronger baseline than expected, with reranking often improving ranking-oriented metrics more clearly than final answer quality. Dense retrieval remained competitive in several configurations, which made the retrieval story more interaction-dependent than a simple dense-versus-hybrid ranking. More broadly, retrieval-side improvements did not consistently translate into better final answers.

These findings were useful, but they were still only directional, for three reasons:

The benchmark used a 20-question golden set.
Many configurations had only one run.
Full-dataset averages were too coarse to explain why some variants succeeded or failed.

That became the main motivation for the next step: narrowing the scope and asking more specific questions about where the differences were actually coming from.

The Next Question: Do Different Question Types Behave Differently?

A follow-up tagged analysis showed one reason the full-dataset averages explained so little: the 20-question golden set was not uniform. It mixed several recurring question types that behaved differently.

The main tag families were:

tag	what it asks	example
`causal`	why something works or fails	Why are Lamport logical clocks not sufficient to capture causality precisely, and how do vector clocks improve on them?
`contrast`	how two concepts differ	What is the difference between flow control and congestion control in TCP, and what type of overload is each designed to prevent?
`tradeoff`	what is gained and what it costs	How does DNS TTL create a trade-off between fast propagation and lower lookup load?
`failure`	what breaks under failure and what mitigates it	Why can retrying a POST request after a timeout create inconsistent state, and how do idempotency keys help?

This mattered because aggregate averages across the whole dataset could hide category-specific behavior. A configuration that looked strong overall could still be weak on one type of question and strong on another.
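A small sketch of how a per-tag breakdown can expose what the overall mean hides. The row structure is hypothetical, a question is assumed to be allowed more than one tag, and the scores are placeholders rather than results from the golden set.

```python
from collections import defaultdict
from statistics import mean


def mean_by_tag(rows, metric):
    """Average one metric per tag; a multi-tag question counts in each of its tags."""
    buckets = defaultdict(list)
    for row in rows:
        for tag in row["tags"]:
            buckets[tag].append(row[metric])
    return {tag: mean(values) for tag, values in buckets.items()}


# Placeholder rows for illustration only.
rows = [
    {"tags": ["causal"], "answer_completeness": 1.0},
    {"tags": ["causal", "failure"], "answer_completeness": 0.5},
    {"tags": ["contrast"], "answer_completeness": 0.0},
    {"tags": ["contrast"], "answer_completeness": 1.0},
]

overall = mean(row["answer_completeness"] for row in rows)
per_tag = mean_by_tag(rows, "answer_completeness")
```

With only a handful of questions per tag, each per-tag mean rests on very few data points, so these slices are best read as hints about where to look rather than as stable estimates.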

Stage 2: A Narrower Diagnostic Study

Once the tagged analysis made category-level differences visible, the next step was to narrow the scope. Instead of extending the whole comparison matrix, I isolated one part of the pipeline and studied it more carefully.

Chunking became the focus for two reasons. First, the Stage 1 results suggested that chunking already mattered, but the comparison there covered only `structural` and `fixed`. Second, a third variant, `fixed_in_structural`, introduced a more specific hypothesis: preserve structural boundaries, but split more finely inside them.
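The difference between `fixed` and `fixed_in_structural` can be sketched as follows. This is a simplified illustration, not the project's chunker: tokens are approximated by whitespace-separated words, sentences are split with a naive regex, and the function names are hypothetical. The `350`-token size and `15%` overlap defaults come from the variant table.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def fixed_chunks(text, max_tokens=350, overlap=0.15):
    """Pack whole sentences into chunks of at most max_tokens, with overlap.

    A single sentence longer than max_tokens still becomes its own chunk.
    """
    budget = int(max_tokens * overlap)
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward, up to the overlap budget.
            carried, carried_len = [], 0
            for prev in reversed(current):
                m = len(prev.split())
                if carried_len + m > budget:
                    break
                carried.insert(0, prev)
                carried_len += m
            if carried_len + n > max_tokens:
                carried, carried_len = [], 0  # overlap would overflow the chunk
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def fixed_in_structural_chunks(sections, max_tokens=350, overlap=0.15):
    """Apply fixed-size chunking inside each section; never cross a boundary."""
    return [
        chunk
        for section in sections
        for chunk in fixed_chunks(section, max_tokens, overlap)
    ]


# Tiny limits so the behavior is visible on a toy document.
sections = ["A b c. D e f.", "G h i. J k l."]
flat = fixed_chunks(" ".join(sections), max_tokens=6, overlap=0.5)
nested = fixed_in_structural_chunks(sections, max_tokens=6, overlap=0.5)
```

On the toy input, `flat` contains a chunk that mixes the end of the first section with the start of the second, while `nested` keeps each section's sentences together.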

To keep the comparison interpretable, the retrieval branch was fixed to Hybrid bm25, which had looked strongest in the earlier broad comparison. The reranker was fixed to PassThrough, because the first-stage results did not show a stable answer-level advantage from reranking and I wanted to avoid adding another moving part.

The Stage 2 comparison therefore focused on three chunking variants:

structural
fixed
fixed_in_structural

All three were evaluated on the same 20-question golden set under the same retrieval branch and the same no-reranking setup. The `max_context_chunks` setting was not held constant across all three variants: `structural` used 4, while `fixed` and `fixed_in_structural` used 6. That was a deliberate choice, because comparing the same number of large structural chunks and much smaller fixed-size chunks would have distorted the effective context budget.

Each configuration was run three times. One run was too fragile to trust on its own, but a much larger repeated-run design would have increased cost without changing the scope of the question very much. Three runs were enough to get a coarse average, a basic sense of variability, and a directional stability check without pretending to provide strong statistical certainty.
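The kind of summary three runs can support is sketched below. The configuration names and scores are placeholders chosen to make a point, not results from this study.

```python
from statistics import mean, stdev


def summarize_runs(values):
    """Coarse summary of one metric across repeated runs of a configuration."""
    return {
        "mean": mean(values),
        "stdev": stdev(values) if len(values) > 1 else 0.0,  # sample std dev
        "min": min(values),
        "max": max(values),
    }


# Placeholder values for illustration only; not results from this study.
runs = {
    "config_a": [0.70, 0.75, 0.80],
    "config_b": [0.60, 0.85, 0.80],
}

summary = {name: summarize_runs(values) for name, values in runs.items()}
```

The two placeholder configurations have the same mean but very different spread, which is exactly the distinction a single run cannot show. With only three values, the standard deviation is itself a rough estimate and should be read as a stability hint rather than a confidence bound.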

Stage 2: Results

Across three runs per configuration, `fixed_in_structural` emerged as the strongest overall chunking variant on end-to-end answer quality. `fixed` remained strongest on chunk-level retrieval relevance, while `structural` stayed a solid baseline but showed the noisiest generation profile. Three runs were enough to support a directional conclusion about the broad ranking, even if the benchmark remained too small for stronger statistical claims. The detailed repeated-run record is collected in the Stage 2 chunking comparison notes.

The most useful overall result was not just which variant came out ahead, but how the profiles differed. `fixed` consistently looked strong when retrieved chunks were judged one by one, while `fixed_in_structural` more often produced the best final answers. `structural` remained competitive, but less stable across repeated runs.

By Question Type

Contrast questions responded best to `fixed`. That suggests that smaller local chunk boundaries can be especially helpful when the task depends on explicit lexical distinctions and tightly paired concepts.

Causal questions were strongest under `fixed_in_structural`. In this slice, preserving larger structural units while still splitting them more finely appeared to help the model assemble longer explanatory chains.

Failure questions also looked strongest under `fixed_in_structural`. This suggested that these questions were not only generation-sensitive; they also depended on how failure states, causes, and mitigations were packaged into chunks.

Tradeoff remained the most mixed category. It did not produce as clean a winner as contrast, and it did not point as clearly in one direction as causal and failure.

Taken together, these results made the Stage 2 conclusion more specific than “one chunker won.” Different chunking strategies appeared to help different question types, and `fixed_in_structural` looked strongest overall because it produced the best balance across the full set rather than dominating every slice. In practical terms, that was the main payoff of the narrower follow-up: it turned a broad preference into a more interpretable explanation of where that preference seemed to come from.

The point is not that one chunker is universally better. The point is that chunking strategy should be evaluated against the kinds of questions the system is expected to answer.

Conclusion

The main value of this evaluation setup was not that it produced a single leaderboard. It made it possible to compare pipeline variants on the same request set, preserve the results as stable offline eval runs, and then narrow the scope once the broad comparison stopped being informative enough.

A few practical lessons stood out:

Broad comparison is useful for orientation, but explanation usually comes from a narrower follow-up. The first stage identified promising branches; the second stage made one chunking question more interpretable.
Retrieval-side metrics are not enough on their own. In this project, `fixed` often looked stronger at the chunk level, while `fixed_in_structural` more often produced the best final answers.
Full-dataset averages can hide category-specific behavior. Once the question set was examined by tag, it became much easier to see that different chunking strategies helped different kinds of questions.
Repeated runs matter. Three runs were still small, but they gave a more trustworthy directional read than a single run would have.

Within the limits of a small 20-question golden set, that was enough to make the system easier to reason about. The most useful outcome was not just a preferred chunking variant, but a clearer picture of which parts of the pipeline changed final answer quality, and how those effects varied across question types.