# DataMorgana Overview

DataMorgana is a tool for generating diverse and customizable synthetic benchmarks for Retrieval-Augmented Generation (RAG) systems. Its key innovation lies in its ability to create highly varied question-answer pairs that more realistically represent how different types of users interact with a system. The tool operates in two main stages: **configuration** and **generation**.

---

## Configuration Stage

The configuration stage lets users define detailed categorizations, and their associated categories, for both questions and end-users; these provide high-level information on the expected traffic of the RAG application. A **categorization** is a list of mutually exclusive question or user categories along with their desired distribution within the generated benchmark. For example, a question categorization might distinguish **search queries vs. natural language questions**, while a user categorization might distinguish **novice vs. expert users**.

There can be as many categorizations of questions and users as needed, and they can easily be defined to address the specific requirements of the application scenario. For instance:

- In a **healthcare RAG application**, a user categorization could consist of **patient, doctor, and public health authority**.
- In a **RAG-based embassy chatbot**, a categorization might include **diplomat, student, worker, and tourist**.

---

## Generation Stage

At the generation stage, DataMorgana leverages state-of-the-art **LLMs** (e.g., Claude 3.5 Sonnet) to incrementally build a benchmark of Q&A pairs. Each pair is generated by following the procedure depicted in **Figure 1**.

![Fig. 1](DM_gen_proc_fig.png)
Fig. 1: DataMorgana Generation Stage

In the configuration, we provide an end-user categorization and two question categorizations, namely question formulations and question types.
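As a concrete illustration, a configuration along these lines could be expressed as categorizations mapping category labels to probabilities. The following Python sketch is purely illustrative — the field names and category labels are assumptions, not DataMorgana's actual configuration schema:

```python
# Hypothetical configuration sketch: one end-user categorization and two
# question categorizations. Names and labels are illustrative only.
configuration = {
    "user_categorizations": {
        "expertise": {"novice": 0.5, "expert": 0.5},
    },
    "question_categorizations": {
        "question_formulation": {"search_query": 0.4, "natural_language_question": 0.6},
        "question_type": {"factoid": 0.7, "open_ended": 0.3},
    },
}

# Within each categorization, the desired category probabilities must
# form a valid distribution (they sum to 1).
for group in configuration.values():
    for name, categories in group.items():
        total = sum(categories.values())
        assert abs(total - 1.0) < 1e-9, f"{name} probabilities sum to {total}"
```

Because categories within a categorization are mutually exclusive, each categorization is simply a discrete distribution from which the generator samples one category per Q&A pair.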
More specifically, the DataMorgana generation process follows these steps:

1. **Category Selection:**
   - It selects a **user/question category** from each categorization according to the probability distributions specified in the configuration file.
   - These selections are automatically combined to create a unique prompt.
2. **Document Selection:**
   - It randomly selects **documents** from the target corpus and adds them to the prompt.
3. **Question-Answer Generation:**
   - The chosen **LLM** is invoked with the instantiated prompt to generate **k candidate question-answer pairs** about the selected documents.
4. **Filtering and Verification:**
   - A final filtering stage verifies that the candidate pairs:
     - Adhere to the specified **categories**.
     - Are **faithful** to the selected documents.
     - Satisfy general constraints (e.g., are **context-free**).
   - If multiple pairs satisfy the quality requirements, **one is sampled**.

---

## Key Advantages

The rich yet easy-to-use configurability of DataMorgana allows **fine-grained control** over question and user characteristics. Furthermore, by jointly using multiple categorizations, DataMorgana covers a **combinatorial number of possibilities** for defining Q&A pairs. This leads to more **diverse benchmarks** than existing tools, which typically rely on a predefined list of possible question types.

Further details about DataMorgana, as well as **experimental results demonstrating its superior diversity**, are available in this [paper](Generating_Diverse_Q&A_Benchmarks_for_RAG_Evaluation_with_DataMorgana.pdf).
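The four-step generation procedure described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions: the configuration structure, `build_prompt`, `call_llm`, and `passes_filters` are all hypothetical stand-ins (the LLM call is stubbed out), not DataMorgana's actual API:

```python
import random

# Hypothetical configuration (illustrative labels and probabilities).
configuration = {
    "users": {"expertise": {"novice": 0.5, "expert": 0.5}},
    "questions": {"formulation": {"search_query": 0.4, "natural_language": 0.6}},
}

def sample_categories(configuration):
    """Step 1: pick one category per categorization, following its distribution."""
    chosen = {}
    for group in configuration.values():
        for name, categories in group.items():
            labels, weights = zip(*categories.items())
            chosen[name] = random.choices(labels, weights=weights, k=1)[0]
    return chosen

def build_prompt(chosen, documents):
    """Combine the selected categories and documents into a single prompt."""
    return "\n".join([f"{name}: {label}" for name, label in chosen.items()] + documents)

def call_llm(prompt, k):
    """Step 3 stub: a real system would invoke an LLM (e.g. Claude 3.5 Sonnet)."""
    return [{"question": f"Q{i}?", "answer": f"A{i}"} for i in range(k)]

def passes_filters(pair, chosen, documents):
    """Step 4 stub: category adherence, faithfulness, and context-freeness checks."""
    return bool(pair["question"]) and bool(pair["answer"])

def generate_pair(configuration, corpus, k=5):
    chosen = sample_categories(configuration)                  # Step 1
    documents = random.sample(corpus, min(2, len(corpus)))     # Step 2
    candidates = call_llm(build_prompt(chosen, documents), k)  # Step 3
    valid = [p for p in candidates if passes_filters(p, chosen, documents)]
    return random.choice(valid) if valid else None             # sample one survivor
```

Repeating `generate_pair` over the corpus incrementally builds the benchmark; because categories are re-sampled independently for each pair, joint categorizations yield the combinatorial variety of question profiles noted above.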