AI Lab: Evaluating your RAG - without ground truth reference data

RAGs, or in Pega-speak, Knowledge Agents (fka Knowledge Buddy), are awesome. Given a corpus (procedures, policies, service docs) you can spin up a domain-specific AI in record time. But how do you evaluate the quality of a RAG, even in the absence of ground truth Q&A?

When designing a RAG you may get paralyzed by the sheer number of choices you can make: how to chunk the content, how much context to retrieve, what models and system prompts to use. Or go-lives get delayed by all kinds of one-off questions where the answer might not be optimal.

But you should make design and configuration decisions based on a large volume of questions, not just a single one. Luckily there are automated evaluation methods, but these are mostly reference-based, so they require good ground truth reference questions and answers. And you don’t have many of these at the start of your project - back to square one. Even if you did, you’d probably want to add them to your corpus straight away, making them less suitable as a test for new questions.

So what if we generated synthetic ‘ground truth’ reference questions and answers? Would they rank differently configured RAGs in roughly the same order of quality as curated ground truth does? Can we remove having ground truth data from the critical path, and let Knowledge Agents truly go viral?

This is what Jonas van Elburg set out to investigate in Pega’s AI Lab, supported by me and Maarten Marx (University of Amsterdam). Granted, this research was done well over a year ago, and it was accepted for the SynDAiTE workshop at ECML PKDD 2025. But the wheels at Springer turn slowly, so the Springer proceedings only came out earlier in May '26.

Now I can imagine that most of you won’t have access to Springer, but we posted a preprint on arXiv.

So what is the answer? We experimented with different parameters, for instance the number of search results (top k chunks) Knowledge Buddy should base its answer on. Interestingly enough, increasing k to 10 generally gave good results for the 4 corpora we used, with both ground truth data and synthetic reference questions and answers.

You would need to run an eval on your own data to see whether that’s true for your use cases as well; we can imagine this depends on the corpus, chunking and model. For experiments that changed the generator model, there was less alignment between curated and synthetic evals.
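
If you want to check on your own data whether a curated and a synthetic eval agree, one simple option is a rank correlation over the configurations you tried. The sketch below computes Kendall's tau in plain Python. It is an illustration only and not necessarily the measure used in the paper; the function name, the configuration labels and all scores are made up for the example.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two evals that score the same RAG configurations.

    Both arguments map a configuration name to a quality score.
    Returns 1.0 for identical rankings and -1.0 for fully reversed rankings.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both evals must score the same configurations")
    configs = sorted(scores_a)
    if len(configs) < 2:
        raise ValueError("need at least two configurations to compare rankings")

    concordant = discordant = 0
    for x, y in combinations(configs, 2):
        # A pair is concordant when both evals prefer the same configuration.
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # Ties in either eval count as neither concordant nor discordant.

    n_pairs = len(configs) * (len(configs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up scores for four hypothetical top-k settings (NOT results from the paper).
curated = {"k=1": 0.52, "k=3": 0.61, "k=5": 0.66, "k=10": 0.70}
synthetic = {"k=1": 0.58, "k=3": 0.69, "k=5": 0.67, "k=10": 0.75}

tau = kendall_tau(curated, synthetic)
print(round(tau, 2))  # 0.67: the two evals disagree only on k=3 vs k=5
```

A tau close to 1 means the synthetic eval would have led you to the same configuration choices as the curated one, even if the absolute scores differ. With only a handful of configurations the number is coarse, so treat it as a sanity check rather than a verdict.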

The reality, though, is that automated evals are almost a must for Knowledge Buddies to go viral. As mentioned, ground truth reference data will always be limited, so complementing it with synthetic data is definitely a good idea - you can always compare experiments.

If you want to know more, give the paper a read. Feedback and ideas welcome!

PS: a previous version of this post appeared on my LinkedIn. Follow me if you like this kind of stuff.