pvn251 committed on
Commit e4bea57 · verified · 1 Parent(s): 0355e17

update answerability README

Files changed (1)
  1. answerability/lora/README.md +331 -116
answerability/lora/README.md CHANGED
@@ -1,162 +1,377 @@
1
  ---
2
- base_model: ibm-granite/granite-3.3-8b-instruct or ibm-granite/granite-3.3-2b-instruct
 
 
 
3
  library_name: peft
 
4
  ---
5
 
6
- # LoRA Adapter for Answerability Classification
7
- Welcome to Granite Experiments!
8
 
9
- Think of Experiments as a preview of what's to come. These projects are still under development, but we wanted to let the open-source community take them for a spin! Use them, break them, and help us build what's next for Granite – we'll keep an eye out for feedback and questions. Happy exploring!
 
 
 
 
10
 
11
- Just a heads-up: Experiments are forever evolving, so we can't commit to ongoing support or guarantee performance.
12
-
13
- # Model Summary
14
- This is a LoRA adapter for the binary answerability classification task. The model takes as input a multi-turn conversation and a set of documents, and classifies whether the user's final query is answerable or unanswerable based on the available information in the documents.
15
-
16
- We provide two variants of the LoRA adapter trained over Granite-3.3-2b-instruct and Granite-3.3-8b-instruct, respectively.
17
 
18
  - **Developer:** IBM Research
19
- - **Model type:** LoRA adapter for [ibm-granite/granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct) and [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct)
 
 
 
20
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
21
 
22
-
23
  ## Intended use
24
- This is a LoRA adapter that enables answerability classification for the final user query in a multi-turn conversation, with respect to a set of provided documents. The model is trained to determine whether the last user query is answerable or unanswerable, based solely on the information present in the documents. This makes it suitable for applications involving RAG and document-grounded chatbots, where knowing whether sufficient information exists to answer a query is crucial. The classification output from the answerability model can be used in several downstream applications, including but not limited to:
25
- - Filter out unanswerable questions before sending them to generation in a RAG setting. By classifying a query as unanswerable upfront, the system can prevent hallucinated or misleading responses.
26
- - Re-query the retriever to get more relevant documents. If a query is initially deemed unanswerable, the retriever can be re-invoked with alternate formulations to fetch more relevant documents.
27
-
28
- **Model input**: The input to the model is a list of conversational turns and a list of documents, converted to a string using the `apply_chat_template` function. These turns can alternate between the `user` and `assistant` roles; the last turn is from the `user`. Each document in the list is a dictionary with a `text` field, which contains the text of the corresponding document.
29
-
30
- The LoRA adapter is trained so that, using the standard assistant role `<|start_of_role|>assistant<|end_of_role|>`, it outputs the answerability classification label directly.
31
-
32
- **Model output**: When prompted with the above input, the model generates `answerable` or `unanswerable` as its output.
33
-
34
 
35
  ## Quickstart Example
36
 
37
- While you can invoke the adapter directly, as outlined below, we highly recommend calling it through [granite-common](https://github.com/ibm-granite/granite-common), which wraps the model with a tailored I/O processor. Please see an example usage notebook [here](https://github.com/ibm-granite/granite-common/blob/main/notebooks/intrinsics_openai.ipynb).
38
-
39
- If you prefer to invoke the LoRA adapter directly, you can use the following code. Before running the script, set the `LORA_NAME` parameter to the path of the directory to which you downloaded the LoRA adapter. The download process is explained [here](https://huggingface.co/ibm-granite/rag-intrinsics-lib).
40
-
41
- ```
42
- import torch
43
- from transformers import AutoTokenizer, AutoModelForCausalLM
44
- from peft import PeftModel
45
- from peft import PeftModelForCausalLM as lora_model
46
-
47
- device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
48
-
49
- ANSWERABILITY_PROMPT = "<|start_of_role|>assistant<|end_of_role|>"
50
- BASE_NAME = "ibm-granite/granite-3.3-8b-instruct"  # or "ibm-granite/granite-3.3-2b-instruct"
51
- LORA_NAME = "PATH_TO_DOWNLOADED_DIRECTORY"  # path to the downloaded LoRA adapter directory
52
 
53
- tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side='left',trust_remote_code=True)
54
- model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME,device_map="auto")
55
- model_answerability = PeftModel.from_pretrained(model_base, LORA_NAME)
56
 
 
 
57
 
58
- convo = [
59
- {"role": "user", "content": "name the viceroy at the time of quit india movement"},
60
- {"role": "assistant", "content": "The Viceroy at the time of the Quit India Movement was Lord Linlithgow."},
61
- {"role": "user", "content": "opposition leader for the movement"},
62
- {"role": "assistant", "content": "A movement was launched at the Bombay session of the All - India Congress Committee by Mahatma Gandhi on 8 August 1942. Many students paid attention to Subhas Chandra Bose as a leader, who was in exile. He organized the Indian Legion in Germany , reorganized the Indian National Army with Japanese assistance , and soliciting help from the Axis Powers , conducted a guerrilla war against the British authorities."},
63
- {"role": "user", "content": "who is Patel?"},
64
- {"role": "assistant", "content": "Vallabhbhai Patel, popularly known as Sardar Patel , was the first Deputy Prime Minister of India . He was an Indian barrister and statesman , a senior leader of the Indian National Congress and a founding father of the Republic of India who played a leading role in the country 's struggle for independence and guided its integration into a united , independent nation . He acted as de facto Supreme Commander - in - chief of Indian army during the political integration of India and the Indo - Pakistani War of 1947 . He is credited by historians with ensuring the success of the rebellion across India. He was arrested and was imprisoned with the entire Congress Working Committee from 1942 to 1945"},
65
- {"role": "user", "content": "how do you pronounce Vallabhbhai?"},
66
- ]
67
 
68
- documents = [
69
- {'doc_id': 0, 'text': "Vallabhbhai Patel\nAmong Patel 's surviving family , Maniben Patel lived in a flat in Mumbai for the rest of her life following her father 's death ; she often led the work of the Sardar Patel Memorial Trust , which organises the prestigious annual Sardar Patel Memorial Lectures , and other charitable organisations . Dahyabhai Patel was a businessman who was elected to serve in the Lok Sabha ( the lower house of the Indian Parliament ) as an MP in the 1960s ."},
70
- {'doc_id': 1, 'text': "Vallabhbhai Patel\nPatel 's date of birth was never officially recorded ; Patel entered it as 31 October on his matriculation examination papers . He belonged to the Leuva Patel Patidar community of Central Gujarat , although the Leuva Patels and Kadava Patels have also claimed him as one of their own ."},
71
- {'doc_id': 2, 'text': "Vallabhbhai Patel\nIn April 2015 the Government of India declassified surveillance reports suggesting that Patel , while Home Minister , and Nehru were among officials involved in alleged government - authorised spying on the family of Subhas Chandra Bose ."}
72
- ]
73
 
74
- string = tokenizer.apply_chat_template(convo,documents=documents, tokenize=False,add_generation_prompt=True)
75
- inputs = string
76
 
77
- print(inputs)
78
- inputT = tokenizer(inputs, return_tensors="pt")
79
-
80
- output = model_answerability.generate(inputT["input_ids"].to(device), attention_mask=inputT["attention_mask"].to(device), max_new_tokens=5)
81
- output_text = tokenizer.decode(output[0])
82
- answer = output_text.split(ANSWERABILITY_PROMPT)[-1]
83
- print(answer)
84
- ```
85
-
86
- ## Training Details
87
 
 
88
 
89
- ### Training Data
90
 
91
- The training data uses the publicly available Government corpus from [MT-RAG](https://arxiv.org/pdf/2501.03468) as the source of documents. Based on this corpus, we constructed a dataset consisting of a mix of human-created and synthetically generated multi-turn conversations. It includes two types of examples: (1) Answerable queries, where the final user question can be answered based on the provided documents. These examples teach the adapter to recognize when sufficient information is present to support an answer. (2) Unanswerable queries, where the documents lack the necessary information to answer the final user query. We used Mixtral as an automatic judge to validate the answerability labels and filter out noisy samples.
92
 
 
93
 
94
- #### Training Hyperparameters
95
- The LoRA adapter was fine-tuned using PEFT under the following regime: rank = 32, learning rate = 5e-6, number of epochs = 25, with early stopping based on the validation set, and a 90/10 split between training and validation.
96
 
97
- ## Evaluation
 
 
 
98
 
99
- ### Answerability Classification
 
100
 
 
 
 
101
 
102
- We evaluated the model against baselines on binary answerability classification using two separate benchmarks:
103
 
104
- - Single-turn Setting ([SQUADRun Benchmark](https://aclanthology.org/P18-2124.pdf)): In this setting, the user query and the supporting documents are provided. Our model was evaluated against standard baselines to measure its ability to determine whether a standalone question is answerable based on the document set.
 
 
105
 
 
106
 
107
- | | unanswerable | | | answerable | | | Classification Accuracy | Weighted F1 |
108
- |:------------------------------------------------------:|:-------------------:|:-------------:|:---------------:|:-----------------:|:-------------:|:---------------:|:----------------------------------:|:---------------------------------:|
109
- | | Precision | Recall | F1 | Precision | Recall | F1 | | |
110
- | BigBird (pre-trained embeddings) w/ MLP | 49.2 | 68.5 | 57.3 | 48 | 29.2 | 36.3 | 48.9 | 46.8 |
111
- | llama2-7b as classifier (Full SFT) | 72.2 | 71 | 71.6 | 71.4 | 72.6 | 72 | 71.8 | 71.8 |
112
- | Granite 3.3-2b LoRA | 78.5 | 69 | 73.4 | 72.3 | 81.1 | 76.4 | 75 | 74.9 |
113
- | Granite 3.3-8b LoRA | 88.1 | 59.3 | 70.9 | 69.3 | 92 | 79 | 75.6 | 75 |
114
115
 
116
- - Multi-turn Setting (MT-RAG Benchmark): In this setting, the model is given the full multi-turn conversation history along with the supporting documents. This benchmark evaluates the model's ability to assess answerability when the final user query can also depend on prior turns for context.
 
 
 
117
 
 
 
 
118
 
119
- | | unanswerable | | | answerable | | | Classification Accuracy | Weighted F1 Score |
120
- |:------------------------------------------------------:|:-------------------:|:-------------:|:---------------:|:-----------------:|:-------------:|:---------------:|:----------------------------------:|:---------------------------------:|
121
- | | Precision | Recall | F1 | Precision | Recall | F1 | | |
122
- | BigBird (pre-trained embeddings) w/ MLP | 69.6 | 77.6 | 73.4 | 70.1 | 60.8 | 65.2 | 69.8 | 69.6 |
123
- | llama2-7b as classifier (Full SFT) | 86.9 | 89.4 | 88.2 | 87.3 | 84.5 | 85.9 | 87.1 | 87.1 |
124
- | Granite 3.3-2b LoRA | 89.9 | 92.5 | 91.2 | 91.4 | 87.9 | 89.6 | 90.4 | 90.5 |
125
- | Granite 3.3-8b LoRA | 93.4 | 88.8 | 91.1 | 87.9 | 92.8 | 90.3 | 90.6 | 90.7 |
126
127
 
128
- - The following table presents results comparing frontier models with task-specific LoRAs on the answerability classification task on MT-RAG data. LoRAs consistently outperform frontier models, converging near ~90% accuracy regardless of base model size. Even small models like Granite 3.3-2B, once fine-tuned, match or surpass much larger models, including GPT-4o. The difference between LoRA and aLoRA is minimal, indicating both are effective fine-tuning strategies.
129
 
130
- | | Models | Accuracy |
131
- |:---------------------------------:|:----------------------------------------------------:|:---------------:|
132
- | Frontier Models out-of-the-box | Granite 3.3-2b-instruct | 62.4 |
133
- | | Granite 3.3-8b-instruct | 64.5 |
134
- | | GPT-OSS-20b | 70.7 |
135
- | | GPT-OSS-120b | 69.8 |
136
- | | GPT4o-mini | 80.8 |
137
- | | GPT4o | 82.5 |
138
- | Trained LoRAs | Granite 3.3-8b-instruct-answerability-LoRA | 90.6 |
139
- | | Granite 3.3-8b-instruct-answerability-aLoRA | 89.5 |
140
- | | Granite 3.3-2b-instruct-answerability-LoRA | 90.4 |
141
- | | Granite 3.3-2b-instruct-answerability-aLoRA | 89.1 |
142
- <!-- | | GPT-OSS-20b-answerability-LoRA | 90.8 |
143
- | | GPT-OSS-20b-answerability-aLoRA | 89.6 | -->
144
 
145
- ### Comparing LoRA Adapters vs. Vanilla Granite Models for Answer Quality
146
- We compare the performance of Granite 3.3-2b and Granite 3.3-8b Instruct vs. their LoRA adapters on a subset of the MT-RAG Benchmark. In this setup, each query is paired with only 5 retrieved passages as context.
147
 
148
- - Answerability Classification Performance: The LoRA adapter outperforms the vanilla model in overall F1 on both answerables and unanswerables. The LoRA adapter achieves higher recall on unanswerable queries, making it better at identifying questions that should not be answered. However, this comes at the cost of lower recall on answerable queries.
149
 
150
- - Joint Answerability-Faithfulness Score computed as:
151
- > = 1 (if model prediction = IDK/unanswerable ∩ ground truth = unanswerable)
 
152
 
153
- > = RAGAS Faithfulness (if model prediction = non-IDK/answerable ∩ ground truth = answerable)
154
-
155
- > = 0 (otherwise)
156
 
157
- This score rewards the model for correctly abstaining on unanswerable queries (full credit) and for providing faithful answers on answerable queries (partial credit based on RAGAS Faithfulness). No credit is given for incorrect or unfaithful predictions.
158
 
159
- The LoRA adapters for granite-2b and granite-8b achieve 8% and 13% lifts on this metric, respectively. This rewards the model for correctly abstaining on unanswerable queries and for being faithful when it chooses to answer.
 
160
 
161
 
162
  | | F1 Score Unanswerable | F1 Score Answerable | Recall Unanswerable | Recall Answerable | Joint Answerability-Faithfulness Score |
@@ -166,10 +381,10 @@ The LoRA adapters for granite-2b and granite-8b achieves 8\% and 13\% lifts on t
166
  | Granite 3.3-8b Instruct | 17 | 77 | 10 | 99 | 49 |
167
  | Granite 3.3-8b LoRA | 65 | 81 | 60 | 86 | 62 |
168
 
169
- ## Model Card Authors
170
 
171
  [Vraj Shah](mailto:vraj@ibm.com)
172
 
173
  ### Framework versions
174
 
175
- - PEFT 0.14.0
 
1
  ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ pipeline_tag: text-generation
6
  library_name: peft
7
+ library_name: transformers
8
  ---
9
 
10
+ # Intrinsics for Answerability Classification
 
11
 
12
+ ## Model Summary
13
+ This is a RAG-specific family of intrinsics fine-tuned for the binary answerability
14
+ classification task. The model takes as input a multi-turn conversation and a
15
+ set of documents, and classifies whether the user's final query is answerable or
16
+ unanswerable based on the available information in the documents.
17
 
18
+ We provide two versions of the intrinsic, implemented as LoRA and aLoRA adapters,
19
+ trained over Granite-3.3-2b-instruct, Granite-3.3-8b-instruct, and GPT-OSS-20b.
 
 
 
 
20
 
21
  - **Developer:** IBM Research
22
+ - **Model type:** LoRA and aLoRA adapter for
23
+ [ibm-granite/granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct),
24
+ [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct),
25
+ and [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
26
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
27
 
 
28
  ## Intended use
29
+ This is a family of intrinsics that enables answerability classification for
30
+ the final user query in a multi-turn conversation, with respect to a set of
31
+ provided documents. The model is trained to determine whether the last user
32
+ query is answerable or unanswerable, based solely on the information present in
33
+ the documents. This makes it suitable for applications involving RAG and
34
+ document-grounded chatbots, where knowing whether sufficient information exists
35
+ to answer a query is crucial. The classification output from the answerability
36
+ model can be used in several downstream applications, including but not limited
37
+ to:
38
+ - Filter out unanswerable questions before sending them to generation in a RAG
39
+ setting. By classifying a query as unanswerable upfront, the system can prevent
40
+ hallucinated or misleading responses.
41
+ - Re-query the retriever to get more
42
+ relevant documents. If a query is initially deemed unanswerable, the retriever
43
+ can be re-invoked with alternate formulations to fetch more relevant documents, as sketched below.
44
+
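To make these two downstream uses concrete, here is a minimal, hypothetical gating sketch. It assumes you already have a call that returns the `answerable`/`unanswerable` label (for example, the intrinsic invoked as in the Quickstart below) together with your own retrieval and generation functions; none of these callables are part of this repository.

```python
# Illustrative control flow only: gate RAG generation on the answerability label
# and fall back to re-retrieval. All callables are supplied by the caller.
from typing import Callable, Dict, List

Message = Dict[str, str]
Document = Dict[str, str]


def answer_with_gating(
    conversation: List[Message],
    documents: List[Document],
    classify: Callable[[List[Message], List[Document]], str],  # -> "answerable" | "unanswerable"
    generate: Callable[[List[Message], List[Document]], str],
    re_retrieve: Callable[[List[Message]], List[Document]],
    max_retries: int = 1,
) -> str:
    for _ in range(max_retries + 1):
        if classify(conversation, documents) == "answerable":
            return generate(conversation, documents)
        # Deemed unanswerable: re-query the retriever with a reformulated query.
        documents = re_retrieve(conversation)
    return "I don't know based on the provided documents."
```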
45
+ **Model input**: The input to the answerability intrinsic is an
46
+ OpenAI-compatible chat completion request, containing a list of conversation
47
+ turns, alternating between the `user` and `assistant` roles and ending with
48
+ a `user` turn, as well as a list of documents.
49
+
50
+ **Model output**: The output of the answerability intrinsic is the result of the
51
+ original chat completion request formatted as a JSON object containing the
52
+ answerability likelihood score.
53
+
54
+ Please see the code snippets in the Quickstart Example section below for
55
+ examples that illustrate the intrinsic's input/output.
56
 
57
  ## Quickstart Example
58
 
59
+ To run the answerability intrinsics through granite-common, you can either (a)
60
+ use an OpenAI-compatible inference backend, such as vLLM, or (b) use the Hugging
61
+ Face transformers library. We provide instructions below for each of the two
62
+ approaches. Note that running inference using vLLM or another scalable
63
+ OpenAI-compatible inference backend should be significantly faster than using
64
+ the Hugging Face transformers library directly.
 
65
 
66
+ ### Using an OpenAI-Compatible Inference Backend
 
 
67
 
68
+ To run the intrinsic using an OpenAI-compatible inference backend, such as vLLM,
69
+ follow the steps below.
70
 
71
+ 1. Install the granite-common library:
 
 
72
 
73
+ pip install git+https://github.com/ibm-granite/granite-common.git
74
+ pip install granite_common[nltk]
 
 
 
75
 
76
+ 2. Install the Hugging Face CLI:
 
77
 
78
+ pip install -U "huggingface_hub[cli]"
 
79
 
80
+ 3. Install vLLM:
81
 
82
+ pip install vllm
83
 
84
+ 4. Download the intrinsics library:
85
 
86
+ hf download ibm-granite/rag-intrinsics-lib --local-dir ./rag-intrinsics-lib
87
 
88
+ 5. Edit the vLLM startup script found in `./rag-intrinsics-lib/run_vllm.sh`
89
+ using your favorite editor:
90
 
91
+ Edit the constants `BASE_MODEL_NAME` and `BASE_MODEL_ORG` depending on the
92
+ base model on which the desired LoRA adapter has been trained. Optionally,
93
+ edit the constant `PORT` to change the port on which vLLM will run. Save the
94
+ modified file and exit the editor.
95
 
96
+ 6. Start vLLM through the startup script. The first time you run the script,
97
+ you may have to change the permissions to allow execution:
98
 
99
+ cd rag-intrinsics-lib
100
+ chmod u+x ./run_vllm.sh
101
+ ./run_vllm.sh &
102
 
103
+ 7. Run the following code snippet:
104
 
105
+ import json
106
+ import openai
107
+ import granite_common
108
 
109
+ intrinsic_name = "answerability"
110
 
111
+ # Change the following constant to select a different base model
112
+ base_model_name = "granite-3.3-8b-instruct"
 
113
 
114
+ # Change the following constants as needed to reflect the location of the vLLM server
115
+ # The selected port should be identical to the one you specified in the vLLM startup script
116
+ openai_base_url = "http://localhost:55555/v1"
117
+ openai_api_key = "rag_intrinsics_1234"
118
 
119
+ # Fetch IO configuration file from Hugging Face Hub
120
+ io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
121
+ intrinsic_name, base_model_name
122
+ )
123
 
124
+ # Instantiate input/output processors
125
+ rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
126
+ result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)
127
 
128
+ # Sample request
129
+ request_json = {
130
+ "messages": [
131
+ {
132
+ "role": "assistant",
133
+ "content": "Welcome to pet questions!"
134
+ },
135
+ {
136
+ "content": "What is the population of Australia?",
137
+ "role": "user"
138
+ }
139
+ ],
140
+ "extra_body": {
141
+ "documents": [
142
+ {
143
+ "doc_id": "1",
144
+ "text": "My dog has fleas."
145
+ },
146
+ {
147
+ "doc_id": "2",
148
+ "text": "My cat does not have fleas."
149
+ }
150
+ ]
151
+ }
152
+ }
153
+
154
+ # Add other parameters
155
+ request_json["model"] = intrinsic_name
156
+ request_json["temperature"] = 0.0
157
+
158
+ # Apply input processor
159
+ intrinsic_kwargs = {}
160
+ rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)
161
+
162
+ # Run inference
163
+ client = openai.OpenAI(base_url=openai_base_url, api_key=openai_api_key)
164
+ chat_completion = client.chat.completions.create(**rewritten_request.model_dump())
165
 
166
+ # Apply output processor
167
+ processed_chat_completion = result_processor.transform(
168
+ chat_completion, rewritten_request
169
+ )
170
+
171
+ # Verify that the content of the completion is valid JSON and pretty-print the JSON.
172
+ parsed_contents = json.loads(processed_chat_completion.choices[0].message.content)
173
+ print("JSON output:")
174
+ print(json.dumps(parsed_contents, indent=2))
175
+
176
+ ### Using the Hugging Face Transformers Library
177
+
178
+ To run the intrinsic using the Hugging Face transformers library directly,
179
+ follow the steps below.
180
+
181
+ 1. Install the granite-common library:
182
+
183
+ pip install git+https://github.com/ibm-granite/granite-common.git
184
+ pip install granite_common[nltk]
185
+
186
+ 2. Install the Hugging Face CLI:
187
+
188
+ pip install -U "huggingface_hub[cli]"
189
+
190
+ 3. Install PEFT:
191
+
192
+ pip install peft
193
+
194
+ 4. Install xgrammar:
195
+
196
+ pip install xgrammar
197
+
198
+ 5. Run the following code snippet:
199
+
200
+ import json
201
+ import granite_common.util
202
+ import peft
203
+
204
+ intrinsic_name = "answerability"
205
+
206
+ # Change the following constant to select a different base model
207
+ base_model_name = "granite-3.3-8b-instruct"
208
+
209
+ use_cuda = True # Set to False to use default PyTorch device for this machine + model
210
+
211
+ # Fetch IO configuration file from Hugging Face Hub
212
+ io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
213
+ intrinsic_name, base_model_name
214
+ )
215
+
216
+ # Fetch LoRA directory from Hugging Face Hub
217
+ lora_dir = granite_common.intrinsics.util.obtain_lora(
218
+ intrinsic_name, base_model_name
219
+ )
220
+
221
+ # Instantiate input/output processors
222
+ rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
223
+ result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)
224
+
225
+ # Sample request
226
+ request_json = {
227
+ "messages": [
228
+ {
229
+ "role": "assistant",
230
+ "content": "Welcome to pet questions!"
231
+ },
232
+ {
233
+ "content": "What is the population of Australia?",
234
+ "role": "user"
235
+ }
236
+ ],
237
+ "extra_body": {
238
+ "documents": [
239
+ {
240
+ "doc_id": "1",
241
+ "text": "My dog has fleas."
242
+ },
243
+ {
244
+ "doc_id": "2",
245
+ "text": "My cat does not have fleas."
246
+ }
247
+ ]
248
+ }
249
+ }
250
+
251
+ # Add additional parameters
252
+ request_json["model"] = intrinsic_name
253
+ request_json["temperature"] = 0.0
254
+
255
+ # Apply input processor
256
+ intrinsic_kwargs = {}
257
+ rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)
258
+
259
+ # Load the base model and merge LoRA weights
260
+ model, tokenizer = granite_common.util.load_transformers_lora(lora_dir)
261
+ if use_cuda:
262
+ model = model.cuda()
263
+
264
+ # Convert the chat completion request into the Transformers library's proprietary
265
+ # format.
266
+ generate_input, other_input = (
267
+ granite_common.util.chat_completion_request_to_transformers_inputs(
268
+ rewritten_request,
269
+ tokenizer,
270
+ model,
271
+ )
272
+ )
273
+
274
+ # Use the Transformers library's APIs to generate one or more completions,
275
+ # then convert those completions into OpenAI-compatible chat completion responses.
276
+ responses = granite_common.util.generate_with_transformers(
277
+ tokenizer, model, generate_input, other_input
278
+ )
279
+
280
+ # Apply output processor
281
+ transformed_responses = result_processor.transform(responses, rewritten_request)
282
+
283
+ # Verify that the content of the completion is valid JSON and pretty-print the JSON.
284
+ parsed_contents = json.loads(transformed_responses.choices[0].message.content)
285
+ print("JSON output:")
286
+ print(json.dumps(parsed_contents, indent=2))
287
 
288
+ ## Training Details
289
 
290
+ ### Training Data
 
291
 
292
+ The training data uses the publicly available Government corpus from
293
+ [MT-RAG](https://arxiv.org/pdf/2501.03468) as the source of documents. Based on
294
+ this corpus, we constructed a dataset consisting of a mix of human-created and
295
+ synthetically generated multi-turn conversations. It includes two types of
296
+ examples: (1) Answerable queries, where the final user question can be answered
297
+ based on the provided documents. These examples teach the adapter to recognize
298
+ when sufficient information is present to support an answer. (2) Unanswerable
299
+ queries, where the documents lack the necessary information to answer the final
300
+ user query. We used Mixtral as an automatic judge to validate the answerability
301
+ labels and filter out noisy samples.
302
 
303
+ #### Training Hyperparameters
304
 
305
+ The LoRA adapter was fine-tuned using PEFT under the following regime: rank =
306
+ 32, learning rate = 5e-6, number of epochs = 25, with early stopping based on
307
+ the validation set, and a 90/10 split between training and validation.
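For orientation, the stated hyperparameters map onto a PEFT/Transformers setup roughly as sketched below. This is not the authors' training script: dataset preparation, prompt formatting, target modules, batch size, and early-stopping patience are not stated above and are assumptions here.

```python
# Rough sketch of the stated regime: LoRA rank 32, learning rate 5e-6, up to 25
# epochs, early stopping on a held-out 10% validation split. Anything not stated
# in the text (target modules, patience, batch size) is an assumption.
from peft import LoraConfig
from transformers import EarlyStoppingCallback, TrainingArguments

lora_config = LoraConfig(r=32, task_type="CAUSAL_LM")

training_args = TrainingArguments(
    output_dir="answerability-lora",
    learning_rate=5e-6,
    num_train_epochs=25,
    eval_strategy="epoch",          # `evaluation_strategy` on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,    # required for early stopping
    metric_for_best_model="eval_loss",
)

# A Trainer built on the 90/10 train/validation split would then add, e.g.:
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```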
308
 
309
+ ## Evaluation
 
 
310
 
311
+ ### Answerability Classification
312
 
313
+ We evaluated the model on binary answerability classification using MT-RAG
314
+ Benchmark. In this setting, the model is given the full multi-turn conversation
315
+ history along with the supporting documents. This benchmark evaluates the
316
+ model's ability to assess answerability when the final user query can also
317
+ depend on prior turns for context. The following table presents results
318
+ comparing baselines and frontier models with task-specific answerability
319
+ intrinsics on the answerability classification task on MT-RAG data. The LoRAs
320
+ consistently outperform frontier models, converging near \~90% accuracy
321
+ regardless of base model size. Even small models like Granite 3.3-2B, once
322
+ fine-tuned, match or surpass much larger models, including GPT-4o. The
323
+ difference between LoRA and aLoRA is minimal, indicating both are effective
324
+ fine-tuning strategies.
325
+
326
+ | | Models | Unanswerable F1 | Answerable F1 | Classification Accuracy | Weighted F1 |
327
+ |:--------------------------------------------:|:----------------------------------------------:|:--------------------------:|:---------------------------:|:-------------------------------------:|:-------------------------:|
328
+ | Baselines | BigBird (pre-trained embeddings) w/ MLP | 73.4 | 65.2 | 69.8 | 69.6 |
329
+ | | llama2-7b as classifier (Full SFT) | 88.2 | 85.9 | 87.1 | 87.1 |
330
+ | Frontier Models out-of-the-box | Granite 3.3-2b-instruct | 48.7 | 70.4 | 62.4 | 58.7 |
331
+ | | Granite 3.3-8b-instruct | 62.8 | 65.2 | 64.5 | 63.9 |
332
+ | | GPT-OSS-20b | 77.3 | 58.3 | 70.7 | 68.5 |
333
+ | | GPT-OSS-120b | 70.2 | 68.9 | 69.8 | 69.6 |
334
+ | | GPT4o-mini | 82.7 | 78.1 | 80.8 | 80.6 |
335
+ | | GPT4o | 85.7 | 77.5 | 82.5 | 81.9 |
336
+ | Trained LoRAs/aLoRAs | Granite 3.3-2b LoRA | 91.2 | 89.6 | 90.4 | 90.5 |
337
+ | | Granite 3.3-8b LoRA | 91.1 | 90.3 | 90.6 | 90.7 |
338
+ | | GPT-OSS-20b LoRA | 91.6 | 89.8 | 90.8 | 90.8 |
339
+ | | Granite 3.3-2b aLoRA | 89.8 | 88.6 | 89.1 | 89.2 |
340
+ | | Granite 3.3-8b aLoRA | 90.1 | 89.6 | 89.5 | 89.9 |
341
+ | | GPT-OSS-20b aLoRA | 90.4 | 88.6 | 89.6 | 89.6 |
342
+
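For reference, the Classification Accuracy and Weighted F1 columns are the standard binary-classification summaries over the two labels; assuming the usual support-weighted definition, they can be reproduced with scikit-learn as sketched below (the label lists are toy placeholders, not benchmark outputs).

```python
# Illustrative metric computation for the table above; the prediction lists are
# placeholders and do not come from the MT-RAG benchmark.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["answerable", "unanswerable", "answerable", "unanswerable"]
y_pred = ["answerable", "unanswerable", "unanswerable", "unanswerable"]

labels = ["unanswerable", "answerable"]
unanswerable_f1, answerable_f1 = f1_score(y_true, y_pred, average=None, labels=labels)
accuracy = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", labels=labels)
print(unanswerable_f1, answerable_f1, accuracy, weighted_f1)
```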
343
+
344
+ ### Comparing the Answerability Intrinsics vs. Vanilla Granite Models for Answer Quality
345
+
346
+ We compare the performance of Granite 3.3-2b and Granite 3.3-8b Instruct
347
+ vs. the answerability intrinsics implemented as LoRA adapters on a subset of the
348
+ MT-RAG Benchmark. In this setup, each query is paired with only 5 retrieved passages as
349
+ context.
350
+
351
+ - Answerability Classification Performance: The answerability intrinsics
352
+ outperform the vanilla models in overall F1 on both answerables and
353
+ unanswerables. The answerability intrinsics achieve higher recall on
354
+ unanswerable queries, making them better at identifying questions that should
355
+ not be answered. However, this comes at the cost of lower recall on answerable
356
+ queries.
357
+
358
+ - Joint Answerability-Faithfulness Score, computed as:
359
+ > = 1 (if model prediction = IDK/unanswerable ∩ ground truth = unanswerable)
360
+
361
+ > = RAGAS Faithfulness (if model prediction = non-IDK/answerable ∩ ground
362
+ > truth = answerable)
363
+
364
+ > = 0 (otherwise)
365
+
366
+ This score rewards the model for correctly abstaining on unanswerable queries
367
+ (full credit) and for providing faithful answers on answerable queries
368
+ (partial credit based on RAGAS Faithfulness). No credit is given for incorrect
369
+ or unfaithful predictions.
370
+
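Read literally, the scoring rule above corresponds to a small function like the following sketch; `ragas_faithfulness` is a hypothetical callable standing in for a RAGAS faithfulness evaluation of the generated answer and is not part of this repository.

```python
# Direct transcription of the Joint Answerability-Faithfulness Score definition.
from typing import Callable


def joint_answerability_faithfulness(
    predicted_idk: bool,            # model abstained / predicted unanswerable
    ground_truth_answerable: bool,  # gold label says the query is answerable
    ragas_faithfulness: Callable[[], float],
) -> float:
    if predicted_idk and not ground_truth_answerable:
        return 1.0  # full credit: correct abstention on an unanswerable query
    if not predicted_idk and ground_truth_answerable:
        return ragas_faithfulness()  # partial credit: faithfulness of the answer given
    return 0.0  # no credit otherwise
```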
371
+ The answerability intrinsics for granite-2b and granite-8b achieve 8% and 13%
372
+ lifts on this metric, respectively. This rewards the model for correctly
373
+ abstaining on unanswerable queries and for being faithful when it chooses to
374
+ answer.
375
 
376
 
377
  | | F1 Score Unanswerable | F1 Score Answerable | Recall Unanswerable | Recall Answerable | Joint Answerability-Faithfulness Score |
 
381
  | Granite 3.3-8b Instruct | 17 | 77 | 10 | 99 | 49 |
382
  | Granite 3.3-8b LoRA | 65 | 81 | 60 | 86 | 62 |
383
 
384
+ ## Model Card Authors
385
 
386
  [Vraj Shah](mailto:vraj@ibm.com)
387
 
388
  ### Framework versions
389
 
390
+ - PEFT 0.14.0