# DataMorgana Overview

DataMorgana is a tool for generating diverse and customizable synthetic benchmarks for Retrieval-Augmented Generation (RAG) systems. Its key innovation lies in its ability to create highly varied question-answer pairs that more realistically represent how different types of users interact with a system. The tool operates in two main stages: **configuration** and **generation**.

---

## Configuration Stage

The configuration stage lets users define detailed categorizations, and their associated categories, for both questions and end-users; these provide high-level information on the expected traffic of the RAG application. A **categorization** is a list of mutually exclusive question or user categories along with their desired distribution within the generated benchmark. For example, a question categorization might distinguish **search queries vs. natural language questions**, while a user categorization might distinguish **novice vs. expert users**.

There can be as many categorizations of questions and users as needed, and they can easily be defined to address the specific requirements of the application scenario. For instance:

- In a **healthcare RAG application**, a user categorization could consist of **patient, doctor, and public health authority**.
- In a **RAG-based embassy chatbot**, a categorization might include **diplomat, student, worker, and tourist**.

---

## Generation Stage

At the generation stage, DataMorgana leverages state-of-the-art **LLMs** (e.g., Claude 3.5 Sonnet) to incrementally build a benchmark of Q&A pairs. Each pair is generated by following the procedure depicted in **Figure 1**.

![Fig. 1](DM_gen_proc_fig.png)
Fig. 1: DataMorgana Generation Stage

In the configuration, we provide an end-user categorization and two question categorizations, namely question formulations and question types.
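As a concrete illustration, a configuration along these lines could be expressed as categorizations mapping category labels to probabilities. The following Python sketch is purely illustrative — the field names and category labels are assumptions, not DataMorgana's actual configuration schema:

```python
# Hypothetical configuration sketch: one end-user categorization and two
# question categorizations. Names and labels are illustrative only.
configuration = {
    "user_categorizations": {
        "expertise": {"novice": 0.5, "expert": 0.5},
    },
    "question_categorizations": {
        "question_formulation": {"search_query": 0.4, "natural_language_question": 0.6},
        "question_type": {"factoid": 0.7, "open_ended": 0.3},
    },
}

# Within each categorization, the desired category probabilities must
# form a valid distribution (they sum to 1).
for group in configuration.values():
    for name, categories in group.items():
        total = sum(categories.values())
        assert abs(total - 1.0) < 1e-9, f"{name} probabilities sum to {total}"
```

Because categories within a categorization are mutually exclusive, each categorization is simply a discrete distribution from which the generator samples one category per Q&A pair.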
More specifically, the DataMorgana generation process follows these steps:

1. **Category Selection:**
   - It selects a **user/question category** from each categorization according to the probability distributions specified in the configuration file.
   - These selections are automatically combined to create a unique prompt.
2. **Document Selection:**
   - It randomly selects **documents** from the target corpus and adds them to the prompt.
3. **Question-Answer Generation:**
   - The chosen **LLM** is invoked with the instantiated prompt to generate **k candidate question-answer pairs** about the selected documents.
4. **Filtering and Verification:**
   - A final filtering stage verifies that the candidate pairs:
     - Adhere to the specified **categories**.
     - Are **faithful** to the selected documents.
     - Satisfy general constraints (e.g., are **context-free**).
   - If multiple pairs satisfy the quality requirements, **one is sampled**.

---

## Key Advantages

The rich yet easy-to-use configurability of DataMorgana allows **fine-grained control** over question and user characteristics. Furthermore, by jointly using multiple categorizations, DataMorgana covers a **combinatorial number of possibilities** for defining Q&A pairs. This leads to more **diverse benchmarks** than existing tools, which typically rely on a predefined list of possible question types.

Further details about DataMorgana, as well as **experimental results demonstrating its superior diversity**, are available in this [paper](Generating_Diverse_Q&A_Benchmarks_for_RAG_Evaluation_with_DataMorgana.pdf).
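The four-step generation procedure described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions: the configuration structure, `build_prompt`, `call_llm`, and `passes_filters` are all hypothetical stand-ins (the LLM call is stubbed out), not DataMorgana's actual API:

```python
import random

# Hypothetical configuration (illustrative labels and probabilities).
configuration = {
    "users": {"expertise": {"novice": 0.5, "expert": 0.5}},
    "questions": {"formulation": {"search_query": 0.4, "natural_language": 0.6}},
}

def sample_categories(configuration):
    """Step 1: pick one category per categorization, following its distribution."""
    chosen = {}
    for group in configuration.values():
        for name, categories in group.items():
            labels, weights = zip(*categories.items())
            chosen[name] = random.choices(labels, weights=weights, k=1)[0]
    return chosen

def build_prompt(chosen, documents):
    """Combine the selected categories and documents into a single prompt."""
    return "\n".join([f"{name}: {label}" for name, label in chosen.items()] + documents)

def call_llm(prompt, k):
    """Step 3 stub: a real system would invoke an LLM (e.g. Claude 3.5 Sonnet)."""
    return [{"question": f"Q{i}?", "answer": f"A{i}"} for i in range(k)]

def passes_filters(pair, chosen, documents):
    """Step 4 stub: category adherence, faithfulness, and context-freeness checks."""
    return bool(pair["question"]) and bool(pair["answer"])

def generate_pair(configuration, corpus, k=5):
    chosen = sample_categories(configuration)                  # Step 1
    documents = random.sample(corpus, min(2, len(corpus)))     # Step 2
    candidates = call_llm(build_prompt(chosen, documents), k)  # Step 3
    valid = [p for p in candidates if passes_filters(p, chosen, documents)]
    return random.choice(valid) if valid else None             # sample one survivor
```

Repeating `generate_pair` over the corpus incrementally builds the benchmark; because categories are re-sampled independently for each pair, joint categorizations yield the combinatorial variety of question profiles noted above.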