---
library_name: transformers
license: cc-by-nc-sa-4.0
pipeline_tag: text-ranking
---

# Contextual AI Reranker v2 1B

## Highlights

Our reranker sits on the cost/performance Pareto frontier across five key areas:

- Instruction following (including the ability to rank more recent information higher)
- Question answering
- Multilinguality
- Product search / recommendation systems
- Real-world use cases

<p align="center">
<img src="main_benchmark.png" width="1200"/>
</p>

For more details on these and other benchmarks, please refer to our [blog post](https://contextual.ai/blog/rerank-v2).

## Overview

- Model Type: Text Reranking
- Supported Languages: 100+
- Number of Parameters: 1B
- Context Length: up to 32K tokens
- Blog post: https://contextual.ai/blog/rerank-v2

## Quickstart

### vLLM usage

Requires `vllm==0.10.0` for NVFP4 or `vllm>=0.8.5` for BF16.
```python
import os

os.environ['VLLM_USE_V1'] = '0'  # v1 engine doesn't support logits processors yet

import torch
from vllm import LLM, SamplingParams


def logits_processor(_, scores):
    """Custom logits processor for vLLM reranking.

    The relevance score is the logit at vocab index 0. Its bfloat16 bit
    pattern is reinterpreted as a uint16 token id, and that token is forced
    to be the only possible output, so the score is carried out of the
    engine inside the generated token id.
    """
    index = scores[0].view(torch.uint16)
    scores = torch.full_like(scores, float("-inf"))
    scores[index] = 1
    return scores


def format_prompts(query: str, instruction: str, documents: list[str]) -> list[str]:
    """Format the query and documents into reranking prompts."""
    if instruction:
        instruction = f" {instruction}"
    prompts = []
    for doc in documents:
        prompt = f"Check whether a given document contains information helpful to answer the query.\n<Document> {doc}\n<Query> {query}{instruction} ??"
        prompts.append(prompt)
    return prompts


def infer_w_vllm(model_path: str, query: str, instruction: str, documents: list[str]):
    model = LLM(
        model=model_path,
        gpu_memory_utilization=0.85,
        max_model_len=8192,
        dtype="bfloat16",
        max_logprobs=2,
        max_num_batched_tokens=262144,
    )
    sampling_params = SamplingParams(
        temperature=0,
        max_tokens=1,
        logits_processors=[logits_processor],
    )
    prompts = format_prompts(query, instruction, documents)

    outputs = model.generate(prompts, sampling_params, use_tqdm=False)

    # Decode each score: the emitted token id is the uint16 bit pattern
    # of the bfloat16 relevance logit encoded by the logits processor
    results = []
    for i, output in enumerate(outputs):
        score = (
            torch.tensor([output.outputs[0].token_ids[0]], dtype=torch.uint16)
            .view(torch.bfloat16)
            .item()
        )
        results.append((score, i, documents[i]))

    # Sort by score (descending)
    results = sorted(results, key=lambda x: x[0], reverse=True)

    print(f"Query: {query}")
    print(f"Instruction: {instruction}")
    for score, doc_id, doc in results:
        print(f"Score: {score:.4f} | Doc: {doc}")
```
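
The encoding trick above works because bfloat16 and uint16 are both 16-bit types, so `view` reinterprets the same bits losslessly in either direction. A minimal round-trip check, independent of vLLM (the example value is arbitrary):

```python
import torch

# An arbitrary bfloat16 logit standing in for a relevance score
logit = torch.tensor([2.40625], dtype=torch.bfloat16)

# Reinterpret its 16 raw bits as an unsigned-integer "token id"
token_id = logit.view(torch.uint16).item()

# Reinterpret the token id's bits back as bfloat16 to recover the score
decoded = torch.tensor([token_id], dtype=torch.uint16).view(torch.bfloat16)

assert decoded.item() == logit.item()  # lossless round trip
```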

### Transformers usage

Requires `transformers>=4.51.0` for BF16. Not supported for NVFP4.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


def format_prompts(query: str, instruction: str, documents: list[str]) -> list[str]:
    """Format the query and documents into reranking prompts."""
    if instruction:
        instruction = f" {instruction}"
    prompts = []
    for doc in documents:
        prompt = f"Check whether a given document contains information helpful to answer the query.\n<Document> {doc}\n<Query> {query}{instruction} ??"
        prompts.append(prompt)
    return prompts


def infer_w_hf(model_path: str, query: str, instruction: str, documents: list[str]):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"  # so -1 is the real last token for all prompts

    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=dtype).to(device)
    model.eval()

    prompts = format_prompts(query, instruction, documents)
    enc = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    input_ids = enc["input_ids"].to(device)
    attention_mask = enc["attention_mask"].to(device)

    with torch.no_grad():
        out = model(input_ids=input_ids, attention_mask=attention_mask)

    # Logits for the token that would follow each prompt
    next_logits = out.logits[:, -1, :]  # [batch, vocab]

    # The relevance score is the logit at vocab index 0, read in bfloat16
    scores_bf16 = next_logits[:, 0].to(torch.bfloat16)
    scores = scores_bf16.float().tolist()

    # Sort by score (descending)
    results = sorted(
        [(s, i, documents[i]) for i, s in enumerate(scores)],
        key=lambda x: x[0],
        reverse=True,
    )

    print(f"Query: {query}")
    print(f"Instruction: {instruction}")
    for score, doc_id, doc in results:
        print(f"Score: {score:.4f} | Doc: {doc}")
```
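
Both entry points share the same signature, so either can be called the same way. A minimal usage sketch; the model path below is a placeholder, substitute this model's repo id or a local checkpoint directory:

```python
query = "What is the capital of France?"
instruction = "Prefer the most recent and authoritative information."
documents = [
    "Paris is the capital and largest city of France.",
    "Berlin is the capital of Germany.",
    "France is famous for its cuisine and wine regions.",
]

model_path = "<this-model-repo-id-or-local-path>"  # placeholder

infer_w_hf(model_path, query, instruction, documents)
# or, on a GPU with vLLM installed:
# infer_w_vllm(model_path, query, instruction, documents)
```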

## Citation

If you use this model, please cite:

```bibtex
@misc{ctxl_rerank_v2_instruct_multilingual,
  title={Contextual AI Reranker v2},
  author={George Halal and Sheshansh Agrawal},
  year={2025},
  url={https://contextual.ai/blog/rerank-v2},
}
```

## License

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)

## Contact

For questions or issues, please open an issue on the model repository or contact george@contextual.ai.