# DataMorgana Overview
DataMorgana is an innovative tool designed to generate diverse and customizable synthetic benchmarks for Retrieval-Augmented Generation (RAG) systems. Its key innovation lies in its ability to create highly varied question-answer pairs that more realistically represent how different types of users might interact with a system.
The tool operates through two main stages: **configuration** and **generation**.
---
## Configuration Stage
The configuration stage allows users to define detailed categorizations, with their associated categories, for both questions and end-users; these provide high-level information about the expected traffic of the RAG application. A **categorization** is a list of mutually exclusive question or user categories, together with their desired distribution within the generated benchmark.
For example, a question categorization might distinguish **search queries vs. natural language questions**, while a user categorization might distinguish **novice vs. expert users**. There can be as many question and user categorizations as needed, and they can easily be defined to address the specific requirements of the application scenario (a configuration sketch follows the examples below). For instance:
- In a **healthcare RAG application**, a user categorization could consist of **patient, doctor, and public health authority**.
- In a **RAG-based embassy chatbot**, a categorization might include **diplomat, student, worker, and tourist**.
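In practice, such a configuration could be expressed as a structured file listing each categorization, its categories, and their target probabilities. The sketch below is purely illustrative Python; the layout and field names (`categories`, `probability`, etc.) are assumptions, not DataMorgana's actual schema.

```python
# Illustrative only: a hypothetical configuration with one user categorization
# and one question categorization. Field names are assumptions, not
# DataMorgana's actual schema.
config = {
    "user_categorizations": [
        {
            "name": "user_expertise",
            "categories": [
                {"name": "novice", "description": "asks in simple, everyday language", "probability": 0.7},
                {"name": "expert", "description": "uses precise domain terminology", "probability": 0.3},
            ],
        },
    ],
    "question_categorizations": [
        {
            "name": "question_formulation",
            "categories": [
                {"name": "search_query", "description": "short keyword-style query", "probability": 0.4},
                {"name": "natural_language", "description": "well-formed natural-language question", "probability": 0.6},
            ],
        },
    ],
}
```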
---
## Generation Stage
At the generation stage, DataMorgana leverages state-of-the-art **LLMs** (e.g., Claude 3.5 Sonnet) to incrementally build a benchmark of Q&A pairs. Each pair is generated by following the procedure depicted in **Figure 1**.
![Fig. 1](DM_gen_proc_fig.png)
<center><b>Fig. 1: DataMorgana Generation Stage.</b> In the configuration, we provide an end-user categorization and two question categorizations, namely question formulations and question types.</center>
More specifically, the DataMorgana generation process follows these steps (a sketch of the full loop appears after the list):
1. **Category Selection:**
   - It selects one **user/question category** from each categorization, according to the probability distributions specified in the configuration file.
   - The selected categories are automatically combined into a single prompt.
2. **Document Selection:**
- It randomly selects **documents** from the target corpus and adds them to the prompt.
3. **Question-Answer Generation:**
   - The chosen **LLM** is invoked with the instantiated prompt to generate **k candidate question-answer pairs** about the selected documents.
4. **Filtering and Verification:**
   - A final filtering stage verifies that these candidate pairs:
     - Adhere to the specified **categories**.
     - Are **faithful** to the selected documents.
     - Satisfy general constraints (e.g., be **context-free**, i.e., understandable without access to the selected documents).
   - If multiple pairs satisfy the quality requirements, **one is sampled**.
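Put together, the loop can be sketched in a few lines of Python. This is a minimal sketch, not DataMorgana's actual implementation: the helper functions (`build_prompt`, `call_llm`, `passes_filters`) are hypothetical placeholders for the corresponding components, and `config` follows the illustrative layout from the configuration section above.

```python
import random

def generate_pair(config, corpus, k=3, n_docs=2):
    """Sketch of one iteration of the generation loop; all helpers are hypothetical."""
    # 1. Category selection: sample one category per categorization,
    #    following the probability distributions from the configuration.
    selected = {}
    for cat in config["user_categorizations"] + config["question_categorizations"]:
        names = [c["name"] for c in cat["categories"]]
        probs = [c["probability"] for c in cat["categories"]]
        selected[cat["name"]] = random.choices(names, weights=probs)[0]

    # 2. Document selection: randomly pick documents from the target corpus.
    documents = random.sample(corpus, n_docs)

    # The selected categories and documents are combined into a single prompt.
    prompt = build_prompt(selected, documents)          # hypothetical helper

    # 3. Question-answer generation: ask the LLM for k candidate pairs.
    candidates = call_llm(prompt, num_candidates=k)     # hypothetical helper

    # 4. Filtering and verification: keep only candidates that match the selected
    #    categories, are faithful to the documents, and are context-free.
    valid = [qa for qa in candidates if passes_filters(qa, selected, documents)]

    # If several candidates pass, one is sampled at random.
    return random.choice(valid) if valid else None
```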
---
## Key Advantages
The rich and easy-to-use configurability of DataMorgana allows for **fine-grained control** over question and user characteristics. Furthermore, by jointly using multiple categorizations, DataMorgana can achieve a **combinatorial number of possibilities** to define Q&A pairs. This leads to more **diverse benchmarks** compared to existing tools that typically use a predefined list of possible question types.
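As a rough illustration of this combinatorial effect (reusing the hypothetical `config` sketched earlier): even two categorizations with two categories each already yield four distinct question profiles, and every additional categorization multiplies that count.

```python
from itertools import product

# Enumerate all joint category combinations from the illustrative config above.
categorizations = config["user_categorizations"] + config["question_categorizations"]
options = [[c["name"] for c in cat["categories"]] for cat in categorizations]
profiles = list(product(*options))
print(len(profiles))   # 4: every (user_expertise, question_formulation) combination
```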
Further details about DataMorgana, as well as **experimental results demonstrating its superior diversity**, are available in this [paper](Generating_Diverse_Q&A_Benchmarks_for_RAG_Evaluation_with_DataMorgana.pdf).