|
# New Model Guide |
|
|
|
This guide may be of special interest to users who use the library outside of this repository, for example by installing it from PyPI and calling `lm_eval.evaluator.evaluate()` to evaluate an existing model.
|
|
|
In order to properly evaluate a given LM, we require a wrapper class subclassing the `lm_eval.api.model.LM` class that defines how the Evaluation Harness should interface with your model. This guide walks through how to write this `LM` subclass and add it to the library!
|
|
|
## Setup |
|
|
|
To get started contributing, go ahead and fork the main repo, clone it, create a branch with the name of your model, and install the project requirements in your environment: |
|
|
|
```sh
# After forking...
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout -b <model-type>
pip install -e ".[dev]"
```
|
|
|
Now, we'll create a new file where we'll be adding our model: |
|
|
|
```sh
touch lm_eval/models/<my_model_filename>.py
```
|
|
|
**Tip: this filename should not shadow package names! For example, naming your file `anthropic.py` is disallowed since the API's name on PyPI is `anthropic`, but naming it `anthropic_llms.py` works with no problems.**
|
|
|
## Interface |
|
|
|
All models must subclass the `lm_eval.api.model.LM` class. |
|
|
|
The `LM` class enforces a common interface through which we can extract responses from a model:
|
|
|
```python
class MyCustomLM(LM):
    #...
    def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        #...

    def loglikelihood_rolling(self, requests: list[Instance]) -> list[tuple[float]]:
        #...

    def generate_until(self, requests: list[Instance]) -> list[str]:
        #...
    #...
```
|
Here, `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/api/instance.py) with a property `args`, whose request-dependent type signature is described below.
|
|
|
We support three types of requests, consisting of different interactions / measurements with an autoregressive LM. |
|
|
|
All three request types take as input `requests` of type `list[Instance]` that have a matching `Instance.request_type` to the method name. |
|
|
|
- `generate_until` |
|
- Each request contains `Instance.args : Tuple[str, dict]` containing 1. an input string to the LM and 2. a dictionary of keyword arguments used to control generation parameters. |
|
  - Using this input and these generation parameters, text will be sampled from the language model (typically until reaching a maximum output length or one of the specified stop sequences, for example `{"until": ["\n\n", "."], "max_gen_toks": 128}`).
|
  - The text generated by the model in response will then be returned.
|
|
|
- `loglikelihood` |
|
  - Each request contains `Instance.args : Tuple[str, str]` containing 1. an input string to the LM and 2. a target string; the loglikelihood of the LM producing this target, conditioned on the input, will be returned.
|
  - Each request will have, as its result, `(ll, is_greedy): Tuple[float, bool]` returned, where `ll` is a floating point number representing the log probability of generating the target string conditioned on the input, and `is_greedy` is `True` if and only if the target string *would be generated by greedy sampling from the LM* (that is, if the target string is the *most likely* N-token string to be output by the LM given the input).
|
|
|
- `loglikelihood_rolling` |
|
  - Each request contains `Instance.args : Tuple[str]`, which is an input string to the model whose *entire* loglikelihood, conditioned purely on the EOT token, will be calculated.
|
- This is used to evaluate *perplexity* on a data distribution. |
|
  - It should return `(ll,) : Tuple[float]`, that is, solely the *loglikelihood* of producing each piece of text given no starting input.
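To make the `args` layouts above concrete, here is a sketch of how each method might unpack its requests. The loop structure, variable names, and the fallback defaults for `until` and `max_gen_toks` are illustrative assumptions, not part of the API:

```python
# Illustrative sketch: unpacking `Instance.args` inside each method.
def generate_until(self, requests):
    results = []
    for request in requests:
        context, gen_kwargs = request.args           # (str, dict)
        until = gen_kwargs.get("until", [])          # stop sequences (assumed default)
        max_gen_toks = gen_kwargs.get("max_gen_toks", 256)
        # ... generate from `context`, stopping at `until` or after `max_gen_toks` tokens ...
    return results

def loglikelihood(self, requests):
    results = []
    for request in requests:
        context, target = request.args               # (str, str)
        # ... score `target` conditioned on `context`, yielding (ll, is_greedy) ...
    return results

def loglikelihood_rolling(self, requests):
    results = []
    for request in requests:
        (text,) = request.args                       # (str,)
        # ... score all of `text` with no conditioning input, yielding (ll,) ...
    return results
```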
|
|
|
|
|
To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py`! Additionally, check out `lm_eval.api.model.TemplateLM` for a class that abstracts away some commonly used functions across LM subclasses, or see if your model would lend itself well to subclassing the `lm_eval.models.huggingface.HFLM` class and overriding just the initialization or a couple of methods!
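For orientation, here is a minimal end-to-end sketch of a subclass that satisfies this interface by returning placeholder values, loosely in the spirit of `lm_eval/models/dummy.py`; a real model would call its own backend instead of producing random numbers:

```python
# A placeholder implementation, for illustration only.
import random

from lm_eval.api.instance import Instance
from lm_eval.api.model import LM


class MyCustomLM(LM):
    def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        # One (log-probability, is_greedy) pair per request.
        return [(-random.random(), False) for _ in requests]

    def loglikelihood_rolling(self, requests: list[Instance]) -> list[tuple[float]]:
        # One (loglikelihood,) result per request.
        return [(-random.random(),) for _ in requests]

    def generate_until(self, requests: list[Instance]) -> list[str]:
        # One generated string per request; a real model would honor the
        # generation kwargs in `request.args[1]`.
        return ["placeholder output" for _ in requests]
```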
|
|
|
**Tip: be careful of indexing in loglikelihood!** |
|
|
|
|
|
LMs take in tokens at positions `[0 1 2 ... N]` and output a probability distribution for token position `N+1`. We provide a simplified graphic here, excerpted from `huggingface.py`:
|
|
|
```
# how this all works (illustrated on a causal decoder-only setup):
#          CTX      CONT
# inp    0 1 2 3|4 5 6 7 8 9   <- last token is deleted by inp[:, :-1]
# model  \               \
# logits   1 2 3|4 5 6 7 8 9   <- the ctx half gets tossed out by the
# cont_toks    4 5 6 7 8 9      [:, -len(continuation_enc):, :self.vocab_size] slice
```
|
|
|
The final token of the target is not passed into the LM, because we want the LM's predictions *up to but not past* that final target token. For more information, check out https://github.com/EleutherAI/lm-evaluation-harness/issues/942 . |
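As a code-level sketch of the same indexing (assuming a Hugging Face-style causal model and tokenizer stored on `self`; this is a simplification for illustration, not the harness's exact implementation):

```python
import torch
import torch.nn.functional as F

def _loglikelihood_one(self, context: str, continuation: str) -> tuple[float, bool]:
    context_enc = self.tokenizer.encode(context)
    continuation_enc = self.tokenizer.encode(continuation)

    # Feed every token except the last: we want predictions up to, but not past,
    # the final continuation token.
    inp = torch.tensor([(context_enc + continuation_enc)[:-1]])
    logits = self.model(inp).logits                    # [1, seq_len - 1, vocab_size]

    # Keep only the positions that predict continuation tokens.
    logits = logits[:, -len(continuation_enc):, :]
    logprobs = F.log_softmax(logits, dim=-1)

    cont_toks = torch.tensor([continuation_enc])
    # Greedy check: would argmax decoding reproduce the continuation exactly?
    is_greedy = bool((logprobs.argmax(dim=-1) == cont_toks).all())
    # Sum the log-probabilities assigned to the actual continuation tokens.
    ll = float(logprobs.gather(2, cont_toks.unsqueeze(-1)).sum())
    return ll, is_greedy
```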
|
|
|
## Registration |
|
|
|
Congrats on implementing your model! Now it's time to test it out. |
|
|
|
To make your model usable via `lm-eval`'s command-line interface (`python -m lm_eval`), you'll need to tell `lm-eval` what your model's name is.
|
|
|
This is done via a *decorator*, `lm_eval.api.registry.register_model`. Using `register_model()`, one can both alert `lm-eval` to the model's existence and tell the package which name(s) should select your model when invoking `python -m lm_eval --model <name>`.
|
|
|
```python
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("<name1>", "<name2>")
class MyCustomLM(LM):
    #...
```
|
|
|
Using this decorator results in the class being added to the registry of usable LM types maintained internally by the library at `lm_eval.api.registry.MODEL_REGISTRY`. See `lm_eval.api.registry` for more detail on what sorts of registries and decorators exist in the library!
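Once registered (and imported, per the tip below), your model should be selectable by name from the command line. For instance, assuming you registered the name `<name1>` and your model accepts keyword arguments via `--model_args` (the argument names and task here are placeholders):

```sh
python -m lm_eval --model <name1> \
    --model_args arg1=val1,arg2=val2 \
    --tasks hellaswag \
    --limit 10
```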
|
|
|
**Tip: be sure to import your model in `lm_eval/models/__init__.py`!**
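For example, a one-line module import is enough for the `register_model` decorator to run when the package is imported (the module name below is the hypothetical filename from earlier):

```python
# lm_eval/models/__init__.py
from . import my_model_filename  # replace with your module's actual name
```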
|
|
|
## Testing |
|
|
|
We also recommend that new model contributions be accompanied by short tests of their 3 core functionalities, at minimum. To see an example of such tests, look at https://github.com/EleutherAI/lm-evaluation-harness/blob/35bdecd379c0cefad6897e67db892f4a6026a128/tests/test_ggml.py . |
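A minimal smoke test might look like the following sketch. The module path, class name, and zero-argument constructor are hypothetical; adapt them to your model (the `Instance` fields mirror the dataclass in `lm_eval.api.instance`):

```python
# tests/test_my_model.py (hypothetical filename)
from lm_eval.api.instance import Instance
from lm_eval.models.my_model_filename import MyCustomLM  # hypothetical module name


def test_loglikelihood_returns_one_result_per_request():
    lm = MyCustomLM()  # assumes no required constructor arguments
    requests = [
        Instance(
            request_type="loglikelihood",
            doc={},
            arguments=("The capital of France is", " Paris"),
            idx=0,
        )
    ]
    results = lm.loglikelihood(requests)
    assert len(results) == len(requests)
    ll, is_greedy = results[0]
    assert isinstance(ll, float)
    assert isinstance(is_greedy, bool)
```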
|
|
|
## Chat Templating |
|
|
|
Many models are fine-tuned with a [Chat Template](https://huggingface.co/docs/transformers/main/en/chat_templating) in order to enable back-and-forth interaction between a "user"'s queries and the model's (often called "assistant") responses. It can be desirable to evaluate fine-tuned models on evaluation tasks while wrapped in the conversational format they expect.
|
|
|
In order to make your model optionally compatible with a chat format, three additional methods must be implemented: |
|
|
|
```python
class MyCustomLM(LM):
    #...
    @property
    def tokenizer_name(self) -> str:
        """
        Return the name of the model's tokenizer and/or the accompanying chat template.
        The returned string is used to cache requests.

        Returns:
            str: The name of the model's tokenizer and/or chat template.
        """

    def chat_template(self, chat_template: Union[bool, str] = False) -> str:
        """
        Get the appropriate chat template for the model based on the `chat_template` argument.

        This method returns the chat template string to build the prompt from a chat history.
        The chat template is saved in the evaluation results for reproducibility.
        Boolean arguments should be used with models that have only one chat template,
        while string arguments are used with models that have multiple chat templates.
        For the reference implementation, see HFLM class in `lm_eval.models.huggingface`.

        Args:
            chat_template (Union[bool, str]): Specifies whether to apply a chat template:
                - If False: Do not apply any chat template.
                - If True: Apply the default chat template.
                - If str: Apply the specified chat template by name.

        Returns:
            str: The selected chat template in Jinja format.
        """

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        """
        Process a chat history to create a string that can be tokenized and input into the model.

        Args:
            chat_history (List[Dict[str, str]]): A list of dictionaries representing the chat history,
                where each dictionary has "role" and "content" keys.

        Returns:
            str: A string representing the chat history that can be tokenized and fed into the model.
        """
```
|
|
|
- `apply_chat_template` |
|
- This method performs the bulk of the work required for chat-formatting. |
|
- As input, a `chat_history: List[Dict[str, str]]` is passed in. This is a transcript of a conversation of a form similar to |
|
``` |
|
[
  {"role": "system", "content": <user-provided system message such as "You are a helpful math-focused chatbot">},
  {"role": "user", "content": <task example - a few-shot example 'input'>},
  {"role": "assistant", "content": <correct response to the above example>},
  # ... more few-shot examples, potentially
  {"role": "user", "content": <test set query whose response we will evaluate>},
]
|
``` |
|
which can then be converted into a string input. |
|
- The output is a string representing this conversation that can be fed into the model. |
|
  - For `HFLM`, for example, this consists of simply calling `tokenizer.apply_chat_template`; see that implementation for reference, and the sketch following this list.
|
- `tokenizer_name` |
|
- LM Eval Harness supports [caching requests](https://github.com/EleutherAI/lm-evaluation-harness/blob/4902aaaf1f374682f95ac25fe2e13b23faddc91a/lm_eval/__main__.py#L140) that are sent to a model, for faster setup when repeating an already-performed evaluation. |
|
- However, we don't want to use the cache of chat transcripts rendered using one chat template or system prompt to send to a model with a different template! So, we use this `lm.tokenizer_name` string to distinguish caches for a given model (and chat template) from one another. |
|
- `chat_template` |
|
  - Chat templates are typically provided as a Jinja template string or a string formatted with `str.format` to include user and assistant messages in a single prompt. This template string is saved in the evaluation results to ensure reproducibility.
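As a rough sketch of how these three methods might look for a model that wraps a Hugging Face tokenizer shipping its own chat template (this only loosely mirrors the `HFLM` approach; the attribute names and the cache-key choice are assumptions):

```python
from typing import Dict, List, Union

class MyCustomLM(LM):
    #...
    @property
    def tokenizer_name(self) -> str:
        # Used to key the request cache, so include anything that changes the rendered prompt.
        return self.tokenizer.name_or_path.replace("/", "__")

    def chat_template(self, chat_template: Union[bool, str] = False) -> str:
        if not chat_template:
            return ""
        # True -> the tokenizer's default template; a string argument would instead
        # select a named template for tokenizers that ship several.
        return self.tokenizer.chat_template

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        # Render the conversation into one prompt string, ending where the
        # assistant's reply should begin.
        return self.tokenizer.apply_chat_template(
            chat_history, tokenize=False, add_generation_prompt=True
        )
```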
|
|
|
If not implemented for a given model type, the flags `--apply_chat_template` , `--fewshot_as_multiturn`, and `--system_instruction` cannot be used. |
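For example, once these methods exist, a chat-formatted run could be launched as follows (model and task names are placeholders):

```sh
python -m lm_eval --model <name1> \
    --tasks gsm8k \
    --num_fewshot 5 \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --system_instruction "You are a helpful assistant."
```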
|
|
|
## Other |
|
|
|
**Pro tip**: In order to make the Evaluation Harness overestimate total runtimes rather than underestimate them, HuggingFace models are built to provide responses on data points in *descending order by total input length* via `lm_eval.utils.Reorderer`. Take a look at `lm_eval.models.huggingface.HFLM` to see how this is done, and see if you can implement it in your own model!
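A hedged sketch of the same idea without relying on any particular utility class (`_generate_one` below is a hypothetical helper standing in for your model's single-request generation):

```python
# Sketch: run the longest inputs first so runtime estimates err on the high side,
# then restore the original request order before returning.
def generate_until(self, requests):
    order = sorted(
        range(len(requests)),
        key=lambda i: len(requests[i].args[0]),
        reverse=True,  # longest context first
    )
    results = [None] * len(requests)
    for i in order:
        context, gen_kwargs = requests[i].args
        results[i] = self._generate_one(context, gen_kwargs)  # hypothetical helper
    return results
```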
|
|
|
## Conclusion |
|
|
|
After reading this guide, you should be able to add new model APIs or implementations to the Eval Harness library! |
|
|