## PROGRAMMATIC ACCESS TO DATA MORGANA

**DataMorgana** is a powerful tool for generating synthetic question-answering data, useful for both evaluating and training question-answering systems.

If you're using DataMorgana for the first time, it's recommended to start with the [DataMorgana Sandbox](https://platform.ai71.ai/playground). The Sandbox provides an intuitive UI for generating individual question-answer pairs interactively.

In this notebook, we'll explore how to use the DataMorgana API to generate large-scale synthetic question-answering data on FineWeb.

For the full API documentation, refer to [this link](https://api.ai71.ai/redoc#tag/Synthetic-Conversations).

In [2]:
import json
import time
from typing import Dict, List

import requests

BASE_URL = "https://api.ai71.ai/v1/"

First, ensure that you have an API key for the AI71 platform.

In [None]:
API_KEY = ''

### How to know the remaining budget

Generating the data relies on LLMs, which is costly. You therefore have a limited number of credits, and each credit corresponds to a single generated question.

You can use the `check_budget` endpoint to see the remaining credits for your organization.

In [3]:
def check_budget():
    resp = requests.get(
        f"{BASE_URL}check_budget",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()
    print(json.dumps(resp.json(), indent=4))

In [4]:
check_budget()

{
 "remaining_budget": 9967
}
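
Since each credit corresponds to one generated question, it can be useful to compare the remaining budget with the size of a planned request before submitting it. The sketch below is only an illustration: `has_enough_budget` is a hypothetical helper, not part of the API, and it assumes the response carries the `remaining_budget` field shown in the output above.

```python
def has_enough_budget(budget_response: dict, n_questions: int) -> bool:
    """Return True if the remaining credits cover n_questions (one credit per question)."""
    # Treat a missing field as zero credits (an assumption made for safety).
    remaining = budget_response.get("remaining_budget", 0)
    return n_questions <= remaining


# Example using the response printed above.
sample_response = {"remaining_budget": 9967}
print(has_enough_budget(sample_response, 500))
print(has_enough_budget(sample_response, 10000))
```

With the sample response, the first call reports that 500 questions fit in the budget and the second reports that 10000 do not.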


### Bulk generation of QA pairs

Now, let's see how to generate questions using the `bulk_generation` endpoint.

This endpoint accepts three arguments: `n_questions`, `question_categorizations`, and `user_categorizations`.

Since the endpoint is **asynchronous**, it returns only a `request_id`. To retrieve the generated questions once they are ready, we need to use the `fetch_generation_results` endpoint with the corresponding `request_id`.

In [5]:
def bulk_generate(n_questions: int, question_categorizations: List[Dict], user_categorizations: List[Dict]):
    resp = requests.post(
        f"{BASE_URL}bulk_generation",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "n_questions": n_questions,
            "question_categorizations": question_categorizations,
            "user_categorizations": user_categorizations
        }
    )
    resp.raise_for_status()
    request_id = resp.json()["request_id"]
    print(json.dumps(resp.json(), indent=4))

    result = wait_for_generation_to_finish(request_id)
    return result


def wait_for_generation_to_finish(request_id: str):
    first_print = True
    while True:
        resp = requests.get(
            f"{BASE_URL}fetch_generation_results",
            headers={"Authorization": f"Bearer {API_KEY}"},
            params={"request_id": request_id},
        )
        resp.raise_for_status()
        if resp.json()["status"] == "completed":
            print('completed')
            print(json.dumps(resp.json(), indent=4))
            return resp.json()
        else:
            if first_print:
                first_print = False
                print("Waiting for generation to finish...", end='')
            else:
                print('.', end='')
            time.sleep(5)
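
The polling loop above keeps waiting until the status becomes `completed`. A possible variation, sketched below, also stops on failure or after a timeout. This is a hedged illustration and not the notebook's method: `poll_until_done` and its parameters are hypothetical, the `failed` status value is an assumption based on the statuses listed by `get_all_requests`, and the `pending` value in the usage example is made up. The status check is passed in as a function so the logic can be tried without calling the API.

```python
import time
from typing import Callable, Dict


def poll_until_done(fetch_status: Callable[[], Dict],
                    interval_s: float = 5.0,
                    timeout_s: float = 3600.0) -> Dict:
    """Call fetch_status() until it reports completion, failure, or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":  # assumed status value, see the note above
            raise RuntimeError(f"Generation failed: {result}")
        if time.monotonic() >= deadline:
            raise TimeoutError("Generation did not finish in time")
        time.sleep(interval_s)


# Usage with a stand-in for the real HTTP call (the responses are made up).
_responses = iter([{"status": "pending"}, {"status": "completed", "file": "<url>"}])
print(poll_until_done(lambda: next(_responses), interval_s=0.0))
```

In real use, `fetch_status` would wrap the `fetch_generation_results` request for a given `request_id` and return `resp.json()`.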

### Definition of User and Question Categorizations

To call the `bulk_generation` endpoint, we first need to specify the user and question categorizations we want to use. 

When defining categorizations, keep in mind: 

- You can create your own categorizations—these are just examples. 
- Each categorization can include as many categories as you like, as long as their probabilities sum to 1. 
- The **descriptions** of the categories are injected into the LLM prompt during question generation. To ensure high-quality outputs, it’s important to write them clearly and thoughtfully. 

We encourage you to first try your configurations in the Sandbox before using them to generate a large bulk of questions, to ensure you get the expected results.

For the competition, you’ll want to evaluate and train your system on a diverse set of questions, since you won’t know in advance what types of questions will appear in the test. 

Keep in mind that the categorizations used in this notebook are just examples and will not correspond to those used to generate the actual test set.

Let's start by defining a user categorization.

In [6]:
user_expertise_categorization = {
    "categorization_name": "user-expertise",
    "categories": [
        {
            "name": "expert",
            "description": "an expert on the subject discussed in the documents, therefore he asks complex questions.",
            "probability": 0.5
        },
        {
            "name": "novice",
            "description": "a person with very basic knowledge on the topic discussed in the documents. Therefore, he asks very simple questions.",
            "probability": 0.5
        }
    ]
}
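
Since the probabilities within a categorization must sum to 1, a quick local check before sending a request can catch configuration mistakes early. The following is a minimal sketch: `validate_categorization` is a hypothetical helper, and treating `name`, `description`, and `probability` as required fields is an assumption based on the examples in this notebook, not on the API specification.

```python
import math
from typing import Dict


def validate_categorization(categorization: Dict, tol: float = 1e-9) -> None:
    """Raise ValueError if a category lacks a field or the probabilities do not sum to 1."""
    categories = categorization["categories"]
    for category in categories:
        # Fields assumed to be required, based on the examples in this notebook.
        missing = {"name", "description", "probability"} - category.keys()
        if missing:
            raise ValueError(f"Category is missing fields: {sorted(missing)}")
    total = sum(category["probability"] for category in categories)
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"Probabilities sum to {total}, expected 1")


# A toy categorization used only to exercise the check.
toy_categorization = {
    "categorization_name": "toy-example",
    "categories": [
        {"name": "a", "description": "first toy category", "probability": 0.5},
        {"name": "b", "description": "second toy category", "probability": 0.5},
    ],
}
validate_categorization(toy_categorization)  # passes silently
```

The check raises instead of returning a flag, so a misconfigured categorization stops the notebook before any credits are spent.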

Similarly, we can define question categorizations.

In [7]:
question_formulation_categorization = {
    "categorization_name": "question-formulation",
    "categories": [
        {
            "name": "concise and natural",
            "description": "a concise direct natural question consisting of a few words.",
            "probability": 0.35
        },
        {
            "name": "verbose and natural",
            "description": "a relatively long question consisting of more than 9 words.",
            "probability": 0.35
        },
        {
            "name": "short search query",
            "description": ("phrased as a typed web query for search engines "
                            "(only keywords, without punctuation and without a natural-sounding structure)."
                            " It consists of fewer than 7 words."),
            "probability": 0.15
        },
        {
            "name": "long search query",
            "description": ("phrased as a typed web query for search engines "
                            "(only keywords, without punctuation and without a natural-sounding structure)."
                            " It consists of more than 6 words."),
            "probability": 0.15
        }
    ]
}

premise_categorization = {
    "categorization_name": "premise-categorization",
    "categories": [
        {
            "name": "without premise",
            "description": "a question that does not contain any premise or any information about the user.",
            "probability": 0.7
        },
        {
            "name": "with premise",
            "description": ("a question starting with a very short premise, where the users reveal "
                            "their needs or some information about themselves."),
            "probability": 0.3
        }
    ]
}

### Generating questions from **document pairs**
DataMorgana supports the generation of questions where the information required to answer them is split across two documents.

To enable this, set the `is_multi_doc` field on a question category.

The `is_multi_doc` field defaults to `false`. When explicitly set to `true`, it makes DataMorgana use two documents instead of one when generating a question-answer pair.

Note that the `is_multi_doc` field applies only to question categories, and not to user categories.

When writing the description for a multi-doc question category, it is important to clearly specify how the two documents are used to create the question.

Below is an illustrative example of a question categorization containing two question categories that are multi-doc, and one which is not.

In [8]:
answer_type_categorization = {
    "categorization_name": "answer-type",
    "categories": [
        {
            "name": "factoid",
            "description": "a question seeking a specific, concise piece of information or a short fact about a particular subject, such as a name, date, or number.",
            "probability": 0.2,
            "is_multi_doc": False
        },
        {
            "name": "multi-aspect",
            "description": ("A question about two different aspects of the same entity/concept. "
                            "For example: 'What are the advantages of AI-powered diagnostics, and what are the associated risks of bias in medical decision-making?', "
                            "'How do cryptocurrencies enable financial inclusion, and what are the security risks associated with them?'. "
                            "The information required to answer the question needs to come from two documents, "
                            "specifically, the first document must provide information about the first aspect, while the second must provide information about the second aspect."),
            "probability": 0.3,
            "is_multi_doc": True
        },
        {
            "name": "comparison",
            "description": ("a comparison question that requires comparing two related concepts or entities. "
                            "The comparison must be natural and reasonable, i.e., comparing two entities by a common attribute which is meaningful and relevant to both entities. "
                            "For example: 'Who is older, Glenn Hughes or Ross Lynch?', 'Are Pizhou and Jiujiang in the same province?', "
                            "'Pyotr Ilyich Tchaikovsky and Giuseppe Verdi have this profession in common'. "
                            "The information required to answer the question needs to come from two documents, specifically, "
                            "the first document must provide information about the first entity/concept, while the second must provide information about the second entity/concept."),
            "probability": 0.5,
            "is_multi_doc": True
        }
    ]
}
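
To see how much probability mass a categorization assigns to multi-doc categories, the flagged probabilities can simply be summed. The sketch below is an illustration only: `multi_doc_share` is a hypothetical helper, and the placeholder descriptions stand in for the real ones. It reports a property of the configuration itself and makes no claim about how DataMorgana samples categories internally.

```python
def multi_doc_share(categorization: dict) -> float:
    """Sum the probabilities of the categories flagged with is_multi_doc=True."""
    return sum(
        category["probability"]
        for category in categorization["categories"]
        if category.get("is_multi_doc", False)  # the field defaults to false
    )


# Same probabilities and flags as the answer-type example, with placeholder descriptions.
example_categorization = {
    "categorization_name": "answer-type",
    "categories": [
        {"name": "factoid", "description": "...", "probability": 0.2, "is_multi_doc": False},
        {"name": "multi-aspect", "description": "...", "probability": 0.3, "is_multi_doc": True},
        {"name": "comparison", "description": "...", "probability": 0.5, "is_multi_doc": True},
    ],
}
print(round(multi_doc_share(example_categorization), 2))
```

For the example configuration, the two multi-doc categories together account for 0.3 + 0.5 = 0.8 of the probability mass.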

### Calling the bulk_generation method and accessing the results

After defining the user and question categorizations we plan to use, we can call the `bulk_generation` endpoint.

For example, let's use the previously defined categorizations to generate 2 question-answer pairs.

In [9]:
results = bulk_generate(
    n_questions=2,
    question_categorizations=[question_formulation_categorization, premise_categorization, answer_type_categorization],
    user_categorizations=[user_expertise_categorization]
)

{
 "request_id": "5d27a4f3-4031-4952-9a86-937e767ad095",
 "type": "async"
}
Waiting for generation to finish..........completed
{
 "status": "completed",
 "file": "https://s3.amazonaws.com/data.aiir/data_morgana/web_api/results_id_a2376f40-3bdd-407e-8c76-0509d36d0629_user_id_430d2246-3067-4662-8ce5-0c29049adf42.jsonl?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2UC3AHBF3ZBGDG62%2F20250414%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250414T071348Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEIT%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJIMEYCIQC2vL1mHOpPS2ySz8T7WjQZ8X%2B%2FeqEV71GTmT6KUwto5AIhALOWNPfuk8zNBS8Fxt%2FzdpUnPOGbaIa9v4ZL4tnWRM%2FcKsMFCP3%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEQABoMNzMwMzM1Mjk1NTYzIgyMt92WGbeylBLY%2BuYqlwU9BRf0SY1xaBr%2Fs%2FNxsTMa9MfykI%2BhDRLsvXrmfuiuCXmZZol1ocW8recxFWibFqBouhYPsFAkfmj5mB5jmXWAVlsKQOkFpM7ejJxxIvIKaIMO991disiSy%2FxsxNrzFMcRh3iQxML%2BIx0z1AHVkD1x%2BAL%2BvIlABJiDMllC1j649jdbSlMzVvsZEXLWUrI2IJGk4Zw3HE6iZ

The API response includes a link to the file where the generated results are stored. 
This link is valid for 12 hours. If you need to access the file after that time, you can call `fetch_generation_results` again with the original `request_id` to receive a new link.

Let's retrieve the data from the file and print the results.

In [10]:
response = requests.get(results["file"])
qa_pairs = [json.loads(line) for line in response.text.splitlines()]

In [11]:
qa_pairs[0]

{'question': 'How do the heroes Aeneas and Jason compare in their relationships with foreign populations?',
 'answer': 'Aeneas and Jason represent different models of interaction with foreign populations. Aeneas arrives as a refugee seeking to establish a new civilization in Italy, where he must deal with local populations through both conflict and alliance, ultimately leading to a mixed Roman-Italian people as decreed by Jupiter and Juno. Jason and the Argonauts, on the other hand, represent Greek colonial interaction with native populations, as shown in their visits to places like Cyzicus, Heraclea Pontica, and Cyrene. While their actions create models of Greek-native interaction, these encounters often result in Greeks either absorbing local customs or suppressing them, reflecting a more colonial approach to foreign relations.',
 'context': ['by Robin Mitchell-Boyask, Temple University, with sections adapted from Jim O\'Hara, UNC Chapel Hill\nAs you read Vergil, try to notice which 

Each generated result includes: 

- The generated **question** 
- The generated **answer** 
- The **context** (FineWeb documents) the question is based on 
- The **IDs** of those documents 
- The **question categories** used during generation 
- The **user categories** used during generation 
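
The results file is in JSON Lines format, with one JSON object per line. As a hedged sketch, the parsing step can be wrapped in a small function that skips blank lines. `parse_jsonl` is a hypothetical helper, and the sample text is made up. It uses only the `question`, `answer`, and `context` field names visible in the output above, since the exact names of the remaining fields are not shown here.

```python
import json
from typing import Dict, List


def parse_jsonl(text: str) -> List[Dict]:
    """Parse JSON Lines text into a list of dicts, ignoring blank lines."""
    # Split on "\n" only: str.splitlines() also breaks on other Unicode line
    # separators, which may legitimately occur inside JSON string values.
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Made-up sample standing in for response.text.
sample_text = (
    '{"question": "q1", "answer": "a1", "context": ["doc text"]}\n'
    '\n'
    '{"question": "q2", "answer": "a2", "context": ["doc A", "doc B"]}\n'
)
for pair in parse_jsonl(sample_text):
    print(pair["question"], "->", len(pair["context"]), "context document(s)")
```

The second sample record carries two context documents, which is what a multi-doc question would be expected to look like.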

### How to find information about past requests

You can retrieve all information about your requests (such as request id, status, configuration, etc.) using the `get_all_requests` endpoint.

In [None]:
def get_all_requests():
    resp = requests.get(
        f"{BASE_URL}get_all_requests",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()
    return resp.json()

def print_request_summary(all_requests):
    # The parameter is not named `requests`, to avoid shadowing the requests module.
    if 'data' not in all_requests:
        print('There are no requests')
        return
    for request in all_requests['data']:
        print(f"{request['request_id']} : {request['status']}")

In [13]:
# Use a name other than `requests` so the requests module stays usable in later cells.
all_requests = get_all_requests()
print_request_summary(all_requests)

0fe377ae-2dd0-41ae-b3c3-680caa4b17f5 : completed
114a0e5d-3598-4e35-9105-0791fb542ef1 : completed
1e15a423-e38c-4f04-aa41-1768d55aa5f8 : completed
3d2f2208-acd1-4b38-bda7-e452532eef55 : completed
5d27a4f3-4031-4952-9a86-937e767ad095 : completed
c31818e9-795d-40ef-b0fc-b9b017ba0f80 : failed
c43e53e0-8baf-4a49-8eb0-1cebda7245e8 : completed
dbca2e71-d61d-4977-b0f3-ed6902ebfebf : completed
ed90803a-d3fc-4a16-94f7-51bdbc8fb8a2 : completed
ef2ddf55-0f4b-4604-8c74-5ca324362231 : completed
ef358e16-a95e-43aa-a491-fa9a63c873e0 : completed
f313a110-c596-4cda-a990-4a487aa3da2d : completed
f561077a-3bce-42d0-b143-0054eb0a5fd4 : completed
f747bac0-111d-4754-af22-58f99318a959 : completed
fc9a52b6-b7f2-4d3e-a9ef-441c603beb3f : completed
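
When the list of requests grows, a count per status can be easier to read than the full listing. The sketch below assumes the response has the shape used by `print_request_summary` above, that is, a `data` list whose items carry a `status` field. `count_statuses` is a hypothetical helper and the sample response is made up.

```python
from collections import Counter
from typing import Dict


def count_statuses(all_requests: Dict) -> Counter:
    """Count requests per status; a missing 'data' key is treated as no requests."""
    return Counter(request["status"] for request in all_requests.get("data", []))


# Made-up response with the same shape as the one used above.
sample_requests = {
    "data": [
        {"request_id": "id-1", "status": "completed"},
        {"request_id": "id-2", "status": "failed"},
        {"request_id": "id-3", "status": "completed"},
    ]
}
for status, count in count_statuses(sample_requests).items():
    print(f"{status}: {count}")
```

A nonzero `failed` count is a cue to look up the corresponding `request_id` values in the full listing.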
