pvn251 committed on
Commit e4bea57 · verified · 1 Parent(s): 0355e17

update answerability README

Files changed (1)
  1. answerability/lora/README.md +331 -116
answerability/lora/README.md CHANGED
@@ -1,162 +1,377 @@
1
  ---
2
- base_model: ibm-granite/granite-3.3-8b-instruct or ibm-granite/granite-3.3-2b-instruct
 
 
 
3
  library_name: peft
 
4
  ---
5
 
6
- # LoRA Adapter for Answerability Classification
7
- Welcome to Granite Experiments!
8
 
9
- Think of Experiments as a preview of what's to come. These projects are still under development, but we wanted to let the open-source community take them for a spin! Use them, break them, and help us build what's next for Granite – we'll keep an eye out for feedback and questions. Happy exploring!
 
 
 
 
10
 
11
- Just a heads-up: Experiments are forever evolving, so we can't commit to ongoing support or guarantee performance.
12
-
13
- # Model Summary
14
- This is a LoRA adapter for the binary answerability classification task. The model takes as input a multi-turn conversation and a set of documents, and classifies whether the user's final query is answerable or unanswerable based on the available information in the documents.
15
-
16
- We provide two variants of the LoRA adapter trained over Granite-3.3-2b-instruct and Granite-3.3-8b-instruct, respectively.
17
 
18
  - **Developer:** IBM Research
19
- - **Model type:** LoRA adapter for [ibm-granite/granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct) and [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct)
 
 
 
20
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
21
 
22
-
23
  ## Intended use
24
- This is a LoRA adapter that enables answerability classification for the final user query in a multi-turn conversation, with respect to a set of provided documents. The model is trained to determine whether the last user query is answerable or unanswerable, based solely on the information present in the documents. This makes it suitable for applications involving RAG and document-grounded chatbots, where knowing whether sufficient information exists to answer a query is crucial. The classification output from the answerability model can be used in several downstream applications, including but not limited to:
25
- - Filter out unanswerable questions before sending them to generation in a RAG setting. By classifying a query as unanswerable upfront, the system can prevent hallucinated or misleading responses.
26
- - Re-query the retriever to get more relevant documents. If a query is initially deemed unanswerable, the retriever can be re-invoked with alternate formulations to fetch more relevant documents.
27
-
28
- **Model input**: The input to the model is a list of conversational turns and a list of documents, converted to a string using the `apply_chat_template` function. These turns can alternate between the `user` and `assistant` roles; the last turn is from the `user`. Each document in the list is a dictionary with a `text` field, which contains the text of the corresponding document.
29
-
30
- The LoRA adapter is trained so that, using the standard assistant role `<|start_of_role|>assistant<|end_of_role|>`, it outputs the answerability classification label directly.
31
-
32
- **Model output**: When prompted with the above input, the model generates `answerable` or `unanswerable` as its output.
33
-
34
 
35
  ## Quickstart Example
36
 
37
- While you can invoke the adapter directly, as outlined below, we highly recommend calling it through [granite-common](https://github.com/ibm-granite/granite-common), which wraps the model with a tailored I/O processor. Please see an example usage notebook [here](https://github.com/ibm-granite/granite-common/blob/main/notebooks/intrinsics_openai.ipynb).
38
-
39
- If you prefer to invoke the LoRA adapter directly, you can use the following code. Before running the script, set the `LORA_NAME` parameter to the path of the directory to which you downloaded the LoRA adapter. The download process is explained [here](https://huggingface.co/ibm-granite/rag-intrinsics-lib).
40
-
41
- ```
42
- import torch
43
- from transformers import AutoTokenizer, AutoModelForCausalLM
44
- from peft import PeftModel
45
- from peft import PeftModelForCausalLM as lora_model
46
-
47
- device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
48
-
49
- ANSWERABILITY_PROMPT = "<|start_of_role|>assistant<|end_of_role|>"
50
- BASE_NAME = "ibm-granite/granite-3.3-8b-instruct"  # or "ibm-granite/granite-3.3-2b-instruct"
51
- LORA_NAME = "PATH_TO_DOWNLOADED_DIRECTORY"  # path to the downloaded LoRA adapter directory
52
 
53
- tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side='left',trust_remote_code=True)
54
- model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME,device_map="auto")
55
- model_answerability = PeftModel.from_pretrained(model_base, LORA_NAME)
56
 
 
 
57
 
58
- convo = [
59
- {"role": "user", "content": "name the viceroy at the time of quit india movement"},
60
- {"role": "assistant", "content": "The Viceroy at the time of the Quit India Movement was Lord Linlithgow."},
61
- {"role": "user", "content": "opposition leader for the movement"},
62
- {"role": "assistant", "content": "A movement was launched at the Bombay session of the All - India Congress Committee by Mahatma Gandhi on 8 August 1942. Many students paid attention to Subhas Chandra Bose as a leader, who was in exile. He organized the Indian Legion in Germany , reorganized the Indian National Army with Japanese assistance , and soliciting help from the Axis Powers , conducted a guerrilla war against the British authorities."},
63
- {"role": "user", "content": "who is Patel?"},
64
- {"role": "assistant", "content": "Vallabhbhai Patel, popularly known as Sardar Patel , was the first Deputy Prime Minister of India . He was an Indian barrister and statesman , a senior leader of the Indian National Congress and a founding father of the Republic of India who played a leading role in the country 's struggle for independence and guided its integration into a united , independent nation . He acted as de facto Supreme Commander - in - chief of Indian army during the political integration of India and the Indo - Pakistani War of 1947 . He is credited by historians with ensuring the success of the rebellion across India. He was arrested and was imprisoned with the entire Congress Working Committee from 1942 to 1945"},
65
- {"role": "user", "content": "how do you pronounce Vallabhbhai?"},
66
- ]
67
 
68
- documents = [
69
- {'doc_id': 0, 'text': "Vallabhbhai Patel\nAmong Patel 's surviving family , Maniben Patel lived in a flat in Mumbai for the rest of her life following her father 's death ; she often led the work of the Sardar Patel Memorial Trust , which organises the prestigious annual Sardar Patel Memorial Lectures , and other charitable organisations . Dahyabhai Patel was a businessman who was elected to serve in the Lok Sabha ( the lower house of the Indian Parliament ) as an MP in the 1960s ."},
70
- {'doc_id': 1, 'text': "Vallabhbhai Patel\nPatel 's date of birth was never officially recorded ; Patel entered it as 31 October on his matriculation examination papers . He belonged to the Leuva Patel Patidar community of Central Gujarat , although the Leuva Patels and Kadava Patels have also claimed him as one of their own ."},
71
- {'doc_id': 2, 'text': "Vallabhbhai Patel\nIn April 2015 the Government of India declassified surveillance reports suggesting that Patel , while Home Minister , and Nehru were among officials involved in alleged government - authorised spying on the family of Subhas Chandra Bose ."}
72
- ]
73
 
74
- string = tokenizer.apply_chat_template(convo,documents=documents, tokenize=False,add_generation_prompt=True)
75
- inputs = string
76
 
77
- print(inputs)
78
- inputT = tokenizer(inputs, return_tensors="pt")
79
-
80
- output = model_answerability.generate(inputT["input_ids"].to(device), attention_mask=inputT["attention_mask"].to(device), max_new_tokens=5)
81
- output_text = tokenizer.decode(output[0])
82
- answer = output_text.split(ANSWERABILITY_PROMPT)[-1]
83
- print(answer)
84
- ```
85
-
86
- ## Training Details
87
 
 
88
 
89
- ### Training Data
90
 
91
- The training data uses the publicly available Government corpus from [MT-RAG](https://arxiv.org/pdf/2501.03468) as the source of documents. Based on this corpus, we constructed a dataset consisting of a mix of human-created and synthetically generated multi-turn conversations. It includes two types of examples: (1) Answerable queries, where the final user question can be answered based on the provided documents. These examples teach the adapter to recognize when sufficient information is present to support an answer. (2) Unanswerable queries, where the documents lack the necessary information to answer the final user query. We used Mixtral as an automatic judge to validate the answerability labels and filter out noisy samples.
92
 
 
93
 
94
- #### Training Hyperparameters
95
- The LoRA adapter was fine-tuned using PEFT under the following regime: rank = 32, learning rate = 5e-6, number of epochs = 25, with early stopping based on the validation set, and a 90/10 split between training and validation.
96
 
97
- ## Evaluation
 
 
 
98
 
99
- ### Answerability Classification
 
100
 
 
 
 
101
 
102
- We evaluated the model against baselines on binary answerability classification using two separate benchmarks:
103
 
104
- - Single-turn Setting ([SQUADRun Benchmark](https://aclanthology.org/P18-2124.pdf)): In this setting, the user query and the supporting documents are provided. Our model was evaluated against standard baselines to measure its ability to determine whether a standalone question is answerable based on the document set.
 
 
105
 
 
106
 
107
- | | unanswerable | | | answerable | | | Classification Accuracy | Weighted F1 |
108
- |:------------------------------------------------------:|:-------------------:|:-------------:|:---------------:|:-----------------:|:-------------:|:---------------:|:----------------------------------:|:---------------------------------:|
109
- | | Precision | Recall | F1 | Precision | Recall | F1 | | |
110
- | BigBird (pre-trained embeddings) w/ MLP | 49.2 | 68.5 | 57.3 | 48 | 29.2 | 36.3 | 48.9 | 46.8 |
111
- | llama2-7b as classifier (Full SFT) | 72.2 | 71 | 71.6 | 71.4 | 72.6 | 72 | 71.8 | 71.8 |
112
- | Granite 3.3-2b LoRA | 78.5 | 69 | 73.4 | 72.3 | 81.1 | 76.4 | 75 | 74.9 |
113
- | Granite 3.3-8b LoRA | 88.1 | 59.3 | 70.9 | 69.3 | 92 | 79 | 75.6 | 75 |
114
115
 
116
- - Multi-turn Setting (MT-RAG Benchmark): In this setting, the model is given the full multi-turn conversation history along with the supporting documents. This benchmark evaluates the model's ability to assess answerability when the final user query can also depend on prior turns for context.
 
 
 
117
 
 
 
 
118
 
119
- | | unanswerable | | | answerable | | | Classification Accuracy | Weighted F1 Score |
120
- |:------------------------------------------------------:|:-------------------:|:-------------:|:---------------:|:-----------------:|:-------------:|:---------------:|:----------------------------------:|:---------------------------------:|
121
- | | Precision | Recall | F1 | Precision | Recall | F1 | | |
122
- | BigBird (pre-trained embeddings) w/ MLP | 69.6 | 77.6 | 73.4 | 70.1 | 60.8 | 65.2 | 69.8 | 69.6 |
123
- | llama2-7b as classifier (Full SFT) | 86.9 | 89.4 | 88.2 | 87.3 | 84.5 | 85.9 | 87.1 | 87.1 |
124
- | Granite 3.3-2b LoRA | 89.9 | 92.5 | 91.2 | 91.4 | 87.9 | 89.6 | 90.4 | 90.5 |
125
- | Granite 3.3-8b LoRA | 93.4 | 88.8 | 91.1 | 87.9 | 92.8 | 90.3 | 90.6 | 90.7 |
126
127
 
128
- - The following table presents results comparing frontier models with task-specific LoRAs on the answerability classification task on MT-RAG data. LoRAs consistently outperform frontier models, converging near ~90% accuracy regardless of base model size. Even small models like Granite 3.3-2B, once fine-tuned, match or surpass much larger models, including GPT-4o. The difference between LoRA and aLoRA is minimal, indicating both are effective fine-tuning strategies.
129
 
130
- | | Models | Accuracy |
131
- |:---------------------------------:|:----------------------------------------------------:|:---------------:|
132
- | Frontier Models out-of-the-box | Granite 3.3-2b-instruct | 62.4 |
133
- | | Granite 3.3-8b-instruct | 64.5 |
134
- | | GPT-OSS-20b | 70.7 |
135
- | | GPT-OSS-120b | 69.8 |
136
- | | GPT4o-mini | 80.8 |
137
- | | GPT4o | 82.5 |
138
- | Trained LoRAs | Granite 3.3-8b-instruct-answerability-LoRA | 90.6 |
139
- | | Granite 3.3-8b-instruct-answerability-aLoRA | 89.5 |
140
- | | Granite 3.3-2b-instruct-answerability-LoRA | 90.4 |
141
- | | Granite 3.3-2b-instruct-answerability-aLoRA | 89.1 |
142
- <!-- | | GPT-OSS-20b-answerability-LoRA | 90.8 |
143
- | | GPT-OSS-20b-answerability-aLoRA | 89.6 | -->
144
 
145
- ### Comparing LoRA Adapters vs. Vanilla Granite Models for Answer Quality
146
- We compare the performance of Granite 3.3-2b and Granite 3.3-8b Instruct vs. their LoRA adapters on a subset of the MT-RAG Benchmark. In this setup, each query is paired with only 5 retrieved passages as context.
147
 
148
- - Answerability Classification Performance: The LoRA adapter outperforms the vanilla model in overall F1 on both answerables and unanswerables. The LoRA adapter achieves higher recall on unanswerable queries, making it better at identifying questions that should not be answered. However, this comes at the cost of lower recall on answerable queries.
149
 
150
- - Joint Answerability-Faithfulness Score computed as:
151
- > = 1 (if model prediction = IDK/unanswerable ∩ ground truth = unanswerable)
 
152
 
153
- > = RAGAS Faithfulness (if model prediction = non-IDK/answerable ∩ ground truth = answerable)
154
-
155
- > = 0 (otherwise)
156
 
157
- This score rewards the model for correctly abstaining on unanswerable queries (full credit) and for providing faithful answers on answerable queries (partial credit based on RAGAS Faithfulness). No credit is given for incorrect or unfaithful predictions.
158
 
159
- The LoRA adapters for granite-2b and granite-8b achieve 8% and 13% lifts on this metric, respectively. This rewards the model for correctly abstaining on unanswerable queries and for being faithful when it chooses to answer.
 
160
 
161
 
162
  | | F1 Score Unanswerable | F1 Score Answerable | Recall Unanswerable | Recall Answerable | Joint Answerability-Faithfulness Score |
@@ -166,10 +381,10 @@ The LoRA adapters for granite-2b and granite-8b achieves 8\% and 13\% lifts on t
166
  | Granite 3.3-8b Instruct | 17 | 77 | 10 | 99 | 49 |
167
  | Granite 3.3-8b LoRA | 65 | 81 | 60 | 86 | 62 |
168
 
169
- ## Model Card Authors
170
 
171
  [Vraj Shah](mailto:vraj@ibm.com)
172
 
173
  ### Framework versions
174
 
175
- - PEFT 0.14.0
 
1
  ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ pipeline_tag: text-generation
6
  library_name: peft
7
+ library_name: transformers
8
  ---
9
 
10
+ # Intrinsics for Answerability Classification
 
11
 
12
+ ## Model Summary
13
+ This is a RAG-specific family of intrinsics fine-tuned for the binary answerability
14
+ classification task. The model takes as input a multi-turn conversation and a
15
+ set of documents, and classifies whether the user's final query is answerable or
16
+ unanswerable based on the available information in the documents.
17
 
18
+ We provide two versions of the intrinsic, implemented as LoRA and aLoRA adapters,
19
+ trained over Granite-3.3-2b-instruct, Granite-3.3-8b-instruct, and GPT-OSS-20b.
 
 
 
 
20
 
21
  - **Developer:** IBM Research
22
+ - **Model type:** LoRA and aLoRA adapter for
23
+ [ibm-granite/granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct),
24
+ [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct),
25
+ and [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
26
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
27
 
 
28
  ## Intended use
29
+ This is a family of intrinsics that enables answerability classification for
30
+ the final user query in a multi-turn conversation, with respect to a set of
31
+ provided documents. The model is trained to determine whether the last user
32
+ query is answerable or unanswerable, based solely on the information present in
33
+ the documents. This makes it suitable for applications involving RAG and
34
+ document-grounded chatbots, where knowing whether sufficient information exists
35
+ to answer a query is crucial. The classification output from the answerability
36
+ model can be used in several downstream applications, including but not limited
37
+ to:
38
+ - Filter out unanswerable questions before sending them to generation in a RAG
39
+ setting. By classifying a query as unanswerable upfront, the system can prevent
40
+ hallucinated or misleading responses.
41
+ - Re-query the retriever to get more
42
+ relevant documents. If a query is initially deemed unanswerable, the retriever
43
+ can be re-invoked with alternate formulations to fetch more relevant documents, as sketched below.
44
+
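To make these two downstream uses concrete, here is a minimal, hypothetical gating sketch. It assumes you already have a call that returns the `answerable`/`unanswerable` label (for example, the intrinsic invoked as in the Quickstart below) together with your own retrieval and generation functions; none of these callables are part of this repository.

```python
# Illustrative control flow only: gate RAG generation on the answerability label
# and fall back to re-retrieval. All callables are supplied by the caller.
from typing import Callable, Dict, List

Message = Dict[str, str]
Document = Dict[str, str]


def answer_with_gating(
    conversation: List[Message],
    documents: List[Document],
    classify: Callable[[List[Message], List[Document]], str],  # -> "answerable" | "unanswerable"
    generate: Callable[[List[Message], List[Document]], str],
    re_retrieve: Callable[[List[Message]], List[Document]],
    max_retries: int = 1,
) -> str:
    for _ in range(max_retries + 1):
        if classify(conversation, documents) == "answerable":
            return generate(conversation, documents)
        # Deemed unanswerable: re-query the retriever with a reformulated query.
        documents = re_retrieve(conversation)
    return "I don't know based on the provided documents."
```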
45
+ **Model input**: The input to the answerability intrinsic is an
46
+ OpenAI-compatible chat completion request, containing a list of conversation
47
+ turns, alternating between the `user` and `assistant` roles and ending with
48
+ a `user` turn, as well as a list of documents.
49
+
50
+ **Model output**: The output of the answerability intrinsic is the result of the
51
+ original chat completion request formatted as a JSON object containing the
52
+ answerability likelihood score.
53
+
54
+ Please see the code snippets in the Quickstart Example section below for
55
+ examples that illustrate the intrinsic's input/output.
56
 
57
  ## Quickstart Example
58
 
59
+ To run the answerability intrinsics through granite-common, you can either (a)
60
+ use an OpenAI-compatible inference backend, such as vLLM, or (b) use the Hugging
61
+ Face transformers library. We provide instructions below for each of the two
62
+ approaches. Note that running inference using vLLM or another scalable
63
+ OpenAI-compatible inference backend should be significantly faster than using
64
+ the Hugging Face transformers library directly.
 
65
 
66
+ ### Using an OpenAI-Compatible Inference Backend
 
 
67
 
68
+ To run the intrinsic using an OpenAI-compatible inference backend, such as vLLM,
69
+ follow the steps below.
70
 
71
+ 1. Install the granite-common library:
 
 
72
 
73
+ pip install git+https://github.com/ibm-granite/granite-common.git
74
+ pip install granite_common[nltk]
 
 
 
75
 
76
+ 2. Install the Hugging Face CLI:
 
77
 
78
+ pip install -U "huggingface_hub[cli]"
 
79
 
80
+ 3. Install vLLM:
81
 
82
+ pip install vllm
83
 
84
+ 4. Download the intrinsics library:
85
 
86
+ hf download ibm-granite/rag-intrinsics-lib --local-dir ./rag-intrinsics-lib
87
 
88
+ 5. Edit the vLLM startup script found in `./rag-intrinsics-lib/run_vllm.sh`
89
+ using your favorite editor:
90
 
91
+ Edit the constants `BASE_MODEL_NAME` and `BASE_MODEL_ORG` depending on the
92
+ base model on which the desired LoRA adapter has been trained. Optionally,
93
+ edit the constant `PORT` to change the port on which vLLM will run. Save the
94
+ modified file and exit the editor.
95
 
96
+ 6. Start vLLM through the startup script. The first time you run the script,
97
+ you may have to change the permissions to allow execution:
98
 
99
+ cd rag-intrinsics-lib
100
+ chmod u+x ./run_vllm.sh
101
+ ./run_vllm.sh &
102
 
103
+ 7. Run the following code snippet:
104
 
105
+ import json
106
+ import openai
107
+ import granite_common
108
 
109
+ intrinsic_name = "answerability"
110
 
111
+ # Change the following constant to select a different base model
112
+ base_model_name = "granite-3.3-8b-instruct"
 
113
 
114
+ # Change the following constants as needed to reflect the location of the vLLM server
115
+ # The selected port should be identical to the one you specified in the vLLM startup script
116
+ openai_base_url = "http://localhost:55555/v1"
117
+ openai_api_key = "rag_intrinsics_1234"
118
 
119
+ # Fetch IO configuration file from Hugging Face Hub
120
+ io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
121
+ intrinsic_name, base_model_name
122
+ )
123
 
124
+ # Instantiate input/output processors
125
+ rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
126
+ result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)
127
 
128
+ # Sample request
129
+ request_json = {
130
+ "messages": [
131
+ {
132
+ "role": "assistant",
133
+ "content": "Welcome to pet questions!"
134
+ },
135
+ {
136
+ "content": "What is the population of Australia?",
137
+ "role": "user"
138
+ }
139
+ ],
140
+ "extra_body": {
141
+ "documents": [
142
+ {
143
+ "doc_id": "1",
144
+ "text": "My dog has fleas."
145
+ },
146
+ {
147
+ "doc_id": "2",
148
+ "text": "My cat does not have fleas."
149
+ }
150
+ ]
151
+ }
152
+ }
153
+
154
+ # Add other parameters
155
+ request_json["model"] = intrinsic_name
156
+ request_json["temperature"] = 0.0
157
+
158
+ # Apply input processor
159
+ intrinsic_kwargs = {}
160
+ rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)
161
+
162
+ # Run inference
163
+ client = openai.OpenAI(base_url=openai_base_url, api_key=openai_api_key)
164
+ chat_completion = client.chat.completions.create(**rewritten_request.model_dump())
165
 
166
+ # Apply output processor
167
+ processed_chat_completion = result_processor.transform(
168
+ chat_completion, rewritten_request
169
+ )
170
+
171
+ # Verify that the content of the completion is valid JSON and pretty-print the JSON.
172
+ parsed_contents = json.loads(processed_chat_completion.choices[0].message.content)
173
+ print("JSON output:")
174
+ print(json.dumps(parsed_contents, indent=2))
175
+
176
+ ### Using the Hugging Face Transformers Library
177
+
178
+ To run the intrinsic using the Hugging Face transformers library directly,
179
+ follow the steps below.
180
+
181
+ 1. Install the granite-common library:
182
+
183
+ pip install git+https://github.com/ibm-granite/granite-common.git
184
+ pip install granite_common[nltk]
185
+
186
+ 2. Install the Hugging Face CLI:
187
+
188
+ pip install -U "huggingface_hub[cli]"
189
+
190
+ 3. Install PEFT:
191
+
192
+ pip install peft
193
+
194
+ 4. Install xgrammar:
195
+
196
+ pip install xgrammar
197
+
198
+ 5. Run the following code snippet:
199
+
200
+ import json
201
+ import granite_common.util
202
+ import peft
203
+
204
+ intrinsic_name = "answerability"
205
+
206
+ # Change the following constant to select a different base model
207
+ base_model_name = "granite-3.3-8b-instruct"
208
+
209
+ use_cuda = True # Set to False to use default PyTorch device for this machine + model
210
+
211
+ # Fetch IO configuration file from Hugging Face Hub
212
+ io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
213
+ intrinsic_name, base_model_name
214
+ )
215
+
216
+ # Fetch LoRA directory from Hugging Face Hub
217
+ lora_dir = granite_common.intrinsics.util.obtain_lora(
218
+ intrinsic_name, base_model_name
219
+ )
220
+
221
+ # Instantiate input/output processors
222
+ rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
223
+ result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)
224
+
225
+ # Sample request
226
+ request_json = {
227
+ "messages": [
228
+ {
229
+ "role": "assistant",
230
+ "content": "Welcome to pet questions!"
231
+ },
232
+ {
233
+ "content": "What is the population of Australia?",
234
+ "role": "user"
235
+ }
236
+ ],
237
+ "extra_body": {
238
+ "documents": [
239
+ {
240
+ "doc_id": "1",
241
+ "text": "My dog has fleas."
242
+ },
243
+ {
244
+ "doc_id": "2",
245
+ "text": "My cat does not have fleas."
246
+ }
247
+ ]
248
+ }
249
+ }
250
+
251
+ # Add additional parameters
252
+ request_json["model"] = intrinsic_name
253
+ request_json["temperature"] = 0.0
254
+
255
+ # Apply input processor
256
+ intrinsic_kwargs = {}
257
+ rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)
258
+
259
+ # Load the base model and merge LoRA weights
260
+ model, tokenizer = granite_common.util.load_transformers_lora(lora_dir)
261
+ if use_cuda:
262
+ model = model.cuda()
263
+
264
+ # Convert the chat completion request into the Transformers library's proprietary
265
+ # format.
266
+ generate_input, other_input = (
267
+ granite_common.util.chat_completion_request_to_transformers_inputs(
268
+ rewritten_request,
269
+ tokenizer,
270
+ model,
271
+ )
272
+ )
273
+
274
+ # Use the Transformers library's APIs to generate one or more completions,
275
+ # then convert those completions into OpenAI-compatible chat completion responses.
276
+ responses = granite_common.util.generate_with_transformers(
277
+ tokenizer, model, generate_input, other_input
278
+ )
279
+
280
+ # Apply output processor
281
+ transformed_responses = result_processor.transform(responses, rewritten_request)
282
+
283
+ # Verify that the content of the completion is valid JSON and pretty-print the JSON.
284
+ parsed_contents = json.loads(transformed_responses.choices[0].message.content)
285
+ print("JSON output:")
286
+ print(json.dumps(parsed_contents, indent=2))
287
 
288
+ ## Training Details
289
 
290
+ ### Training Data
 
291
 
292
+ The training data uses the publicly available Government corpus from
293
+ [MT-RAG](https://arxiv.org/pdf/2501.03468) as the source of documents. Based on
294
+ this corpus, we constructed a dataset consisting of a mix of human-created and
295
+ synthetically generated multi-turn conversations. It includes two types of
296
+ examples: (1) Answerable queries, where the final user question can be answered
297
+ based on the provided documents. These examples teach the adapter to recognize
298
+ when sufficient information is present to support an answer. (2) Unanswerable
299
+ queries, where the documents lack the necessary information to answer the final
300
+ user query. We used Mixtral as an automatic judge to validate the answerability
301
+ labels and filter out noisy samples.
302
 
303
+ #### Training Hyperparameters
304
 
305
+ The LoRA adapter was fine-tuned using PEFT under the following regime: rank =
306
+ 32, learning rate = 5e-6, number of epochs = 25, with early stopping based on
307
+ the validation set, and a 90/10 split between training and validation.
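For orientation, the stated hyperparameters map onto a PEFT/Transformers setup roughly as sketched below. This is not the authors' training script: dataset preparation, prompt formatting, target modules, batch size, and early-stopping patience are not stated above and are assumptions here.

```python
# Rough sketch of the stated regime: LoRA rank 32, learning rate 5e-6, up to 25
# epochs, early stopping on a held-out 10% validation split. Anything not stated
# in the text (target modules, patience, batch size) is an assumption.
from peft import LoraConfig
from transformers import EarlyStoppingCallback, TrainingArguments

lora_config = LoraConfig(r=32, task_type="CAUSAL_LM")

training_args = TrainingArguments(
    output_dir="answerability-lora",
    learning_rate=5e-6,
    num_train_epochs=25,
    eval_strategy="epoch",          # `evaluation_strategy` on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,    # required for early stopping
    metric_for_best_model="eval_loss",
)

# A Trainer built on the 90/10 train/validation split would then add, e.g.:
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```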
308
 
309
+ ## Evaluation
 
 
310
 
311
+ ### Answerability Classification
312
 
313
+ We evaluated the model on binary answerability classification using MT-RAG
314
+ Benchmark. In this setting, the model is given the full multi-turn conversation
315
+ history along with the supporting documents. This benchmark evaluates the
316
+ model's ability to assess answerability when the final user query can also
317
+ depend on prior turns for context. The following table presents results
318
+ comparing baselines and frontier models with task-specific answerability
319
+ intrinsics on the answerability classification task on MT-RAG data. The LoRAs
320
+ consistently outperform frontier models, converging near \~90% accuracy
321
+ regardless of base model size. Even small models like Granite 3.3-2B, once
322
+ fine-tuned, match or surpass much larger models, including GPT-4o. The
323
+ difference between LoRA and aLoRA is minimal, indicating both are effective
324
+ fine-tuning strategies.
325
+
326
+ | | Models | Unanswerable F1 | Answerable F1 | Classification Accuracy | Weighted F1 |
327
+ |:--------------------------------------------:|:----------------------------------------------:|:--------------------------:|:---------------------------:|:-------------------------------------:|:-------------------------:|
328
+ | Baselines | BigBird (pre-trained embeddings) w/ MLP | 73.4 | 65.2 | 69.8 | 69.6 |
329
+ | | llama2-7b as classifier (Full SFT) | 88.2 | 85.9 | 87.1 | 87.1 |
330
+ | Frontier Models out-of-the-box | Granite 3.3-2b-instruct | 48.7 | 70.4 | 62.4 | 58.7 |
331
+ | | Granite 3.3-8b-instruct | 62.8 | 65.2 | 64.5 | 63.9 |
332
+ | | GPT-OSS-20b | 77.3 | 58.3 | 70.7 | 68.5 |
333
+ | | GPT-OSS-120b | 70.2 | 68.9 | 69.8 | 69.6 |
334
+ | | GPT4o-mini | 82.7 | 78.1 | 80.8 | 80.6 |
335
+ | | GPT4o | 85.7 | 77.5 | 82.5 | 81.9 |
336
+ | Trained LoRAs/aLoRAs | Granite 3.3-2b LoRA | 91.2 | 89.6 | 90.4 | 90.5 |
337
+ | | Granite 3.3-8b LoRA | 91.1 | 90.3 | 90.6 | 90.7 |
338
+ | | GPT-OSS-20b LoRA | 91.6 | 89.8 | 90.8 | 90.8 |
339
+ | | Granite 3.3-2b aLoRA | 89.8 | 88.6 | 89.1 | 89.2 |
340
+ | | Granite 3.3-8b aLoRA | 90.1 | 89.6 | 89.5 | 89.9 |
341
+ | | GPT-OSS-20b aLoRA | 90.4 | 88.6 | 89.6 | 89.6 |
342
+
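For reference, the Classification Accuracy and Weighted F1 columns are the standard binary-classification summaries over the two labels; assuming the usual support-weighted definition, they can be reproduced with scikit-learn as sketched below (the label lists are toy placeholders, not benchmark outputs).

```python
# Illustrative metric computation for the table above; the prediction lists are
# placeholders and do not come from the MT-RAG benchmark.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["answerable", "unanswerable", "answerable", "unanswerable"]
y_pred = ["answerable", "unanswerable", "unanswerable", "unanswerable"]

labels = ["unanswerable", "answerable"]
unanswerable_f1, answerable_f1 = f1_score(y_true, y_pred, average=None, labels=labels)
accuracy = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", labels=labels)
print(unanswerable_f1, answerable_f1, accuracy, weighted_f1)
```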
343
+
344
+ ### Comparing the Answerability Intrinsics vs. Vanilla Granite Models for Answer Quality
345
+
346
+ We compare the performance of Granite 3.3-2b and Granite 3.3-8b Instruct
347
+ vs. the answerability intrinsics implemented as LoRA adapters on a subset of the
348
+ MT-RAG Benchmark. In this setup, each query is paired with only 5 retrieved passages as
349
+ context.
350
+
351
+ - Answerability Classification Performance: The answerability intrinsics
352
+ outperform the vanilla models in overall F1 on both answerables and
353
+ unanswerables. The answerability intrinsics achieve higher recall on
354
+ unanswerable queries, making them better at identifying questions that should
355
+ not be answered. However, this comes at the cost of lower recall on answerable
356
+ queries.
357
+
358
+ - Joint Answerability-Faithfulness Score, computed as:
359
+ > = 1 (if model prediction = IDK/unanswerable ∩ ground truth = unanswerable)
360
+
361
+ > = RAGAS Faithfulness (if model prediction = non-IDK/answerable ∩ ground
362
+ > truth = answerable)
363
+
364
+ > = 0 (otherwise)
365
+
366
+ This score rewards the model for correctly abstaining on unanswerable queries
367
+ (full credit) and for providing faithful answers on answerable queries
368
+ (partial credit based on RAGAS Faithfulness). No credit is given for incorrect
369
+ or unfaithful predictions.
370
+
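Read literally, the scoring rule above corresponds to a small function like the following sketch; `ragas_faithfulness` is a hypothetical callable standing in for a RAGAS faithfulness evaluation of the generated answer and is not part of this repository.

```python
# Direct transcription of the Joint Answerability-Faithfulness Score definition.
from typing import Callable


def joint_answerability_faithfulness(
    predicted_idk: bool,            # model abstained / predicted unanswerable
    ground_truth_answerable: bool,  # gold label says the query is answerable
    ragas_faithfulness: Callable[[], float],
) -> float:
    if predicted_idk and not ground_truth_answerable:
        return 1.0  # full credit: correct abstention on an unanswerable query
    if not predicted_idk and ground_truth_answerable:
        return ragas_faithfulness()  # partial credit: faithfulness of the answer given
    return 0.0  # no credit otherwise
```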
371
+ The answerability intrinsics for granite-2b and granite-8b achieve 8% and 13%
372
+ lifts on this metric, respectively. This rewards the model for correctly
373
+ abstaining on unanswerable queries and for being faithful when it chooses to
374
+ answer.
375
 
376
 
377
  | | F1 Score Unanswerable | F1 Score Answerable | Recall Unanswerable | Recall Answerable | Joint Answerability-Faithfulness Score |
 
381
  | Granite 3.3-8b Instruct | 17 | 77 | 10 | 99 | 49 |
382
  | Granite 3.3-8b LoRA | 65 | 81 | 60 | 86 | 62 |
383
 
384
+ ## Model Card Authors
385
 
386
  [Vraj Shah](mailto:vraj@ibm.com)
387
 
388
  ### Framework versions
389
 
390
+ - PEFT 0.14.0