RAG Engineering Playground

A modular environment for studying how different parts of a RAG pipeline affect final answer quality. Built to separate the effects of chunking, retrieval, reranking, and prompting and compare them cleanly.

RAG Engineering Playground started as an attempt to treat RAG engineering as an experimental problem, not just prompt tuning. I wanted to understand how each part of a RAG system shapes final answer quality, and how often an improvement in one layer fails to carry through to the answer the user actually sees.

That pushed the project toward a modular, almost plugin-like design. Most pipeline parts can be swapped through configuration: chunking strategy, retrieval mode, reranker, generation settings, and evaluation setup. The goal is not flexibility for its own sake, but the ability to run controlled comparisons without rewriting the system each time.
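To make the plugin-like idea concrete, here is a minimal sketch of config-driven component selection. It is an illustration under assumptions, not the project's actual code: the names `CHUNKERS`, `register_chunker`, `PipelineConfig`, and `build_chunker` are hypothetical, and the two chunkers are deliberately simplistic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical registry: maps a name used in configuration to a chunking function.
CHUNKERS: Dict[str, Callable[[str], List[str]]] = {}


def register_chunker(name: str):
    """Decorator that makes a chunking function selectable by name."""
    def decorator(fn: Callable[[str], List[str]]):
        CHUNKERS[name] = fn
        return fn
    return decorator


@register_chunker("fixed")
def fixed_chunks(text: str, size: int = 40) -> List[str]:
    # Split into consecutive windows of at most `size` characters.
    return [text[i:i + size] for i in range(0, len(text), size)]


@register_chunker("sentence")
def sentence_chunks(text: str) -> List[str]:
    # Naive sentence split on periods; real splitters are more careful.
    return [s.strip() + "." for s in text.split(".") if s.strip()]


@dataclass(frozen=True)
class PipelineConfig:
    # Field names are assumptions made for this sketch.
    chunker: str = "fixed"
    retrieval_mode: str = "dense"      # e.g. "dense" or "hybrid"
    reranker: Optional[str] = None     # None means reranking is skipped


def build_chunker(config: PipelineConfig) -> Callable[[str], List[str]]:
    """Resolve the configured chunker, failing loudly on unknown names."""
    try:
        return CHUNKERS[config.chunker]
    except KeyError:
        raise ValueError(f"unknown chunker: {config.chunker!r}") from None
```

With this shape, switching a comparison from one chunking strategy to another is a one-field change, for example `PipelineConfig(chunker="sentence")`, and the rest of the pipeline stays untouched.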

At the center of the project is a simple idea: request-level evidence should be captured once and reused later for offline evaluation and comparison. That makes it possible to study pipeline behavior more systematically, compare variants on the same evidence, and inspect where quality changes actually come from.
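One way to picture "capture once, reuse later" is an append-only log of per-request records that offline evaluation reads back. The sketch below assumes a JSON Lines file and a hypothetical `EvidenceRecord` schema; the field names are illustrative and not taken from the project.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class EvidenceRecord:
    # Hypothetical schema: everything needed to re-evaluate a request offline.
    request_id: str
    question: str
    config_name: str
    retrieved_ids: List[str]   # document ids in retrieval order
    reranked_ids: List[str]    # document ids after reranking (may equal retrieved_ids)
    answer: str


def append_evidence(path: Path, record: EvidenceRecord) -> None:
    """Append one record as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


def load_evidence(path: Path) -> List[EvidenceRecord]:
    """Read all records back for offline evaluation."""
    with path.open(encoding="utf-8") as f:
        return [EvidenceRecord(**json.loads(line)) for line in f if line.strip()]
```

Because the records are written at request time and never recomputed, different evaluation setups can be run over the same stored evidence without calling the pipeline again.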

What makes the project interesting to me is that it treats RAG as something you can investigate layer by layer. Instead of asking only “did it answer correctly?”, it makes it easier to ask more useful questions:

  • did retrieval improve?
  • did that improvement survive reranking?
  • did better context actually lead to a better answer?
  • and which configuration changes matter most for the final result?
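The questions above can be turned into per-layer numbers for a single request. This is a rough sketch that assumes recall@k as the retrieval metric and a boolean answer judgment; the function names and the choice of metric are mine for illustration, not necessarily what the project uses.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def layer_report(
    retrieved_ids: List[str],
    reranked_ids: List[str],
    relevant_ids: Set[str],
    answer_correct: bool,
    k: int = 3,
) -> Dict[str, float]:
    """Score each layer separately so gains and losses can be localized."""
    return {
        "retrieval_recall": recall_at_k(retrieved_ids, relevant_ids, k),
        "rerank_recall": recall_at_k(reranked_ids, relevant_ids, k),
        "answer_correct": 1.0 if answer_correct else 0.0,
    }


# Toy example: reranking pulls both relevant documents into the top 2,
# yet the answer is still judged incorrect.
report = layer_report(
    retrieved_ids=["d1", "d2", "d3", "d4"],
    reranked_ids=["d4", "d1", "d2", "d3"],
    relevant_ids={"d1", "d4"},
    answer_correct=False,
    k=2,
)
print(report)
```

Putting the three numbers side by side is what makes it visible when a retrieval or reranking gain does not reach the final answer.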

The current implementation includes multiple chunking strategies, dense and hybrid retrieval, optional reranking, offline eval runs, comparative reports, and observability for request-level diagnosis. The broader point is to make RAG behavior easier to inspect, compare, and reason about.

Some of the most interesting findings so far are not flashy ones. Retrieval gains do not always translate into answer-level gains. Different chunking strategies help different kinds of questions. And some design choices that look strong in isolation matter much less once you evaluate the whole pipeline end to end.

The evaluation approach is documented in detail in A Practical RAG Evaluation Workflow.