DataMorgana Overview

DataMorgana is a tool for generating diverse, customizable synthetic benchmarks for Retrieval-Augmented Generation (RAG) systems. Its key contribution is the ability to create highly varied question-answer pairs that more realistically represent how different types of users might interact with a system.

The tool operates through two main stages: configuration and generation.

Configuration Stage

The configuration stage allows for the definition of detailed categorizations and associated categories for both questions and end-users, which provide high-level information on the expected traffic of the RAG application. A categorization is a list of mutually exclusive question or user categories along with their desired distribution within the generated benchmark.

For example, a question categorization might distinguish search queries from natural language questions, while a user categorization might distinguish novice from expert users. There can be as many categorizations of questions and users as needed, and they can be easily defined to address the specific requirements of the application scenario. For instance:

- In a healthcare RAG application, a user categorization could consist of patient, doctor, and public health authority.
- In a RAG-based embassy chatbot, a categorization might include diplomat, student, worker, and tourist.
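
To make the notion of a categorization concrete, here is a minimal Python sketch of how one could be represented and validated. It is not DataMorgana's actual configuration format: the class names (Category, Categorization), the field names, and the probabilities are all hypothetical, chosen only to mirror the healthcare example above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    """One category inside a categorization (hypothetical structure)."""
    name: str
    description: str
    probability: float  # desired share of this category in the benchmark


@dataclass(frozen=True)
class Categorization:
    """A list of mutually exclusive categories plus their distribution."""
    name: str
    categories: tuple[Category, ...]

    def __post_init__(self) -> None:
        # Category names must be distinct so the categories are mutually exclusive.
        names = [c.name for c in self.categories]
        if len(set(names)) != len(names):
            raise ValueError(f"duplicate category names in {self.name!r}")
        # The desired distribution must sum to 1.
        total = sum(c.probability for c in self.categories)
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"probabilities in {self.name!r} sum to {total}, not 1")


# Illustrative values only; the probabilities are assumptions, not from the source.
healthcare_users = Categorization(
    name="healthcare user",
    categories=(
        Category("patient", "a patient asking about their own condition", 0.6),
        Category("doctor", "a clinician looking for clinical detail", 0.3),
        Category("public health authority", "an official seeking population-level facts", 0.1),
    ),
)
```

Validating at construction time means a malformed distribution is caught before any generation call is made.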

Generation Stage

At the generation stage, DataMorgana leverages state-of-the-art LLMs (e.g., Claude 3.5 Sonnet) to incrementally build a benchmark of Q&A pairs. Each pair is generated by following the procedure depicted in Figure 1.

Fig. 1: DataMorgana generation stage. In the configuration, we provide an end-user categorization and two question categorizations, namely question formulations and question types.

More specifically, the DataMorgana generation process follows these steps:

Category Selection:
- It selects a user/question category for each categorization according to the probability distributions specified in the configuration file.
- These are automatically combined to create a unique prompt.
Document Selection:
- It randomly selects documents from the target corpus and adds them to the prompt.
Question-Answer Generation:
- The chosen LLM is invoked with the instantiated prompt to generate 𝑘 candidate question-answer pairs about the selected documents.
Filtering and Verification:
- A final filtering stage verifies that these candidate pairs:
  - Adhere to the specified categories.
  - Are faithful to the selected documents.
  - Satisfy general constraints (e.g., be context-free).
- If multiple pairs satisfy the quality requirements, one is sampled.
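
The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not DataMorgana's implementation: the configuration shape, the prompt wording, and every function name are hypothetical, and the LLM and the verifier are passed in as plain callables (stubbed here) rather than tied to any real API.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical config shape: categorization name -> {category: probability}.
Config = dict[str, dict[str, float]]
QAPair = tuple[str, str]


def select_categories(config: Config, rng: random.Random) -> dict[str, str]:
    """Step 1: draw one category per categorization from its distribution."""
    chosen = {}
    for name, distribution in config.items():
        labels = list(distribution)
        weights = [distribution[label] for label in labels]
        chosen[name] = rng.choices(labels, weights=weights, k=1)[0]
    return chosen


def build_prompt(categories: dict[str, str], documents: Sequence[str], k: int) -> str:
    """Combine the selected categories and documents into one prompt."""
    lines = [f"Generate {k} question-answer pairs about the documents below."]
    lines += [f"{name}: {category}" for name, category in categories.items()]
    lines += [f"Document {i}: {doc}" for i, doc in enumerate(documents, start=1)]
    return "\n".join(lines)


def generate_pair(
    config: Config,
    corpus: Sequence[str],
    llm: Callable[[str, int], list[QAPair]],
    is_valid: Callable[[QAPair, dict[str, str], Sequence[str]], bool],
    k: int = 3,
    n_docs: int = 1,
    rng: Optional[random.Random] = None,
) -> Optional[QAPair]:
    """Run one round: select, prompt, generate k candidates, filter, sample."""
    if rng is None:
        rng = random.Random()
    categories = select_categories(config, rng)           # category selection
    documents = rng.sample(list(corpus), n_docs)          # document selection
    candidates = llm(build_prompt(categories, documents, k), k)  # generation
    # Filtering: keep only candidates that pass the supplied checks.
    valid = [pair for pair in candidates if is_valid(pair, categories, documents)]
    return rng.choice(valid) if valid else None           # sample one survivor


# Stubs standing in for a real LLM call and a real verifier (assumptions).
def stub_llm(prompt: str, k: int) -> list[QAPair]:
    return [(f"question {i}", f"answer {i}") for i in range(k)]


def stub_is_valid(pair: QAPair, categories: dict[str, str], documents: Sequence[str]) -> bool:
    return bool(pair[0]) and bool(pair[1])


example_config: Config = {
    "user expertise": {"novice": 0.5, "expert": 0.5},
    "question formulation": {"search query": 0.3, "natural language question": 0.7},
}
example_corpus = ["first document text", "second document text"]
```

Calling generate_pair(example_config, example_corpus, stub_llm, stub_is_valid) repeatedly would build a benchmark one pair at a time; a round returns None when no candidate survives filtering, so a caller would simply retry.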

Key Advantages

The rich and easy-to-use configurability of DataMorgana allows for fine-grained control over question and user characteristics. Furthermore, by jointly using multiple categorizations, DataMorgana can achieve a combinatorial number of possibilities to define Q&A pairs. This leads to more diverse benchmarks compared to existing tools that typically use a predefined list of possible question types.
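
The combinatorial effect can be illustrated with a short Python sketch. The three categorizations below reuse category names from the examples in this overview, but grouping them into one configuration is an assumption made purely for illustration.

```python
import itertools
import math

# Hypothetical configuration: three categorizations used jointly.
categorizations = {
    "user expertise": ["novice", "expert"],
    "question formulation": ["search query", "natural language question"],
    "embassy user": ["diplomat", "student", "worker", "tourist"],
}

# Every combination picks exactly one category from each categorization.
combinations = list(itertools.product(*categorizations.values()))

# The number of combinations is the product of the categorization sizes.
n_combinations = math.prod(len(cats) for cats in categorizations.values())

print(n_combinations)  # 2 * 2 * 4 = 16
```

Adding a fourth categorization with three categories would raise the count to 48, which is why a handful of small categorizations can cover far more question profiles than a single fixed list of question types.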

Further details about DataMorgana, as well as experimental results demonstrating its superior diversity, are available in the DataMorgana paper.