---
license: apache-2.0
base_model:
- mistralai/Mistral-7B-Instruct-v0.3
pipeline_tag: text-generation
---

# Elastic models

Elastic models are produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you trade off model size, latency, and quality by moving a single slider. For each model, ANNA produces a series of optimized variants:

* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler. 

* __L__: Near-lossless model, with less than 1% accuracy degradation on the corresponding benchmarks.

* __M__: Faster model, with accuracy degradation less than 1.5%.

* __S__: The fastest model, with accuracy degradation less than 2%.


__Goals of elastic models:__

* Provide flexibility in choosing between cost and quality for inference.
* Provide clear quality and latency benchmarks.
* Provide the interface of the HF libraries (`transformers` and `diffusers`) with a single line of code.
* Provide models that are supported on a wide range of hardware, pre-compiled, and require no JIT.
* Provide the best models and service for self-hosting.

> The actual quality degradation varies from model to model. For instance, an S model may show as little as 0.5% degradation.
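
The tier bounds above can be checked against your own benchmark numbers with a few lines of Python. This is a minimal sketch, not part of `elastic_models`: the names `TIER_LIMITS`, `relative_degradation`, and `within_tier` are hypothetical, and it assumes that "degradation" means the relative drop versus the original model's score, which this card does not state explicitly.

```python
# Upper bounds on accuracy degradation (percent) quoted for each tier above.
# XL is described as mathematically equivalent, so it is modelled here as 0%.
TIER_LIMITS = {"XL": 0.0, "L": 1.0, "M": 1.5, "S": 2.0}


def relative_degradation(original_score: float, optimized_score: float) -> float:
    """Relative drop of the optimized score versus the original, in percent.

    Assumption: degradation is relative, not absolute percentage points.
    """
    if original_score <= 0:
        raise ValueError("original_score must be positive")
    return (original_score - optimized_score) / original_score * 100.0


def within_tier(tier: str, original_score: float, optimized_score: float) -> bool:
    """Check a measured score against the stated bound for a tier."""
    limit = TIER_LIMITS[tier]
    drop = relative_degradation(original_score, optimized_score)
    # XL has an inclusive bound of 0; the other tiers are "less than".
    return drop <= limit if limit == 0.0 else drop < limit


# Hypothetical scores, for illustration only (not measured results).
print(within_tier("M", original_score=80.0, optimized_score=79.0))
print(within_tier("L", original_score=80.0, optimized_score=79.0))
```

If the bounds are meant as absolute percentage points instead, replace `relative_degradation` with a plain difference of the two scores.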

## Inference

To run inference with our models, replace the `transformers` import with `elastic_models.transformers`:

```python
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# An HF token is currently required, because the original weights
# are used for some of the layers, along with the original
# model configuration
model_name = "mistralai/Mistral-7B-Instruct-v0.3"
hf_token = ''
hf_cache_dir = ''
device = torch.device("cuda")

# Create the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    token=hf_token,
    cache_dir=hf_cache_dir,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa"
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference works the same way as in the transformers library
prompt = "Describe basics of DNNs quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

generate_ids = model.generate(**inputs, max_length=500)
input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=False
)[0]

# Print the question and the generated answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```

### Installation

__GPUs__: H100, L40s

__OS__: Linux

__Python__: 3.10-3.12
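
Before installing, you may want to confirm that your interpreter and OS match the requirements listed above. The following is a minimal sketch using only the standard library; `is_supported` and `SUPPORTED_PYTHON` are hypothetical names, the version range is read as inclusive, and the GPU requirement is not checked here.

```python
import platform
import sys

# Python versions listed above, read as an inclusive range (assumption).
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12)}


def is_supported(python_version: tuple, os_name: str) -> bool:
    """Return True if the major.minor version and the OS match the list above."""
    return os_name == "Linux" and tuple(python_version[:2]) in SUPPORTED_PYTHON


if __name__ == "__main__":
    ok = is_supported(sys.version_info, platform.system())
    print("supported" if ok else "unsupported")
```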

To work with our models, install the following packages:

```shell
pip install thestage
pip install elastic_models
```

Then go to app.thestage.ai, log in, and generate an API token from your profile page. Set up the API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
```

Congrats, now you can use accelerated models!

----

## Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8` column indicates that we applied W8A8 quantization with the int8 data type to all linear layers, using the same calibration data as for ANNA. The S model achieves practically identical speed but much higher quality, because ANNA knows how to improve quantization quality on sensitive layers.
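
For readers unfamiliar with the W8A8 baseline, the sketch below shows the general idea on a single dot product: both weights and activations are mapped to 8-bit integer codes, multiplied and accumulated as integers, and rescaled to float once at the end. It is a generic textbook illustration of symmetric per-tensor quantization in plain Python, not ANNA's algorithm and not the exact baseline used for the tables; `quantize_int8` and `w8a8_dot` are hypothetical names.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> (int8 codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def w8a8_dot(weights, activations):
    """Dot product with int8 weights (W8) and int8 activations (A8)."""
    w_codes, w_scale = quantize_int8(weights)
    a_codes, a_scale = quantize_int8(activations)
    acc = sum(w * a for w, a in zip(w_codes, a_codes))  # integer accumulator
    return acc * w_scale * a_scale  # single rescale back to float


# Toy values, for illustration only.
weights = [0.5, -1.0, 0.25]
activations = [2.0, 1.0, -4.0]
exact = sum(w * a for w, a in zip(weights, activations))
approx = w8a8_dot(weights, activations)
print(f"float: {exact}, W8A8: {approx}")
```

The gap between the two printed values is the quantization error; layers where this error hurts accuracy the most are the "sensitive layers" mentioned above.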

### Quality benchmarks

For quality evaluation we have used: #TODO link to github

| Metric/Model  | S | M | L | XL | Original | W8A8, int8 |
|---------------|---|---|---|----|----------|------------|
| MMLU          | 0 | 0 | 0 | 0  | 0        | 0          |
| PIQA          | 0 | 0 | 0 | 0  | 0        | 0          |
| Arc Challenge | 0 | 0 | 0 | 0  | 0        | 0          |
| Winogrande    | 0 | 0 | 0 | 0  | 0        | 0          |


> __MMLU__: Evaluates/shows {MMLU} 

> __Winogrande__: Evaluates/shows ... 

> __Arc Challenge__: Evaluates/shows ...

> __PIQA__: Evaluates/shows ... 

### Latency benchmarks

We have profiled models in different scenarios:

<table>
<tr><th> 100 input/300 output; tok/s </th><th> 1000 input/1000 output; tok/s </th></tr>
<tr><td>

| GPU/Model | S   | M | L | XL | Original | W8A8, int8 |
|-----------|-----|---|---|----|----------|------------|
| H100      | 189 | 0 | 0 | 0  | 48       | 0          |
| L40s      | 79  | 0 | 0 | 0  | 42       | 0          |



</td><td>

| GPU/Model | S   | M | L | XL | Original | W8A8, int8 |
|-----------|-----|---|---|----|----------|------------|
| H100      | 189 | 0 | 0 | 0  | 48       | 0          |
| L40s      | 79  | 0 | 0 | 0  | 42       | 0          |

</td></tr> </table>
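
To reproduce a tok/s figure on your own hardware, a simple timing harness is enough. The sketch below is an assumption about how such a number can be measured, not the procedure used for the tables above; `tokens_per_second` and `fake_generate` are hypothetical names. The callable you pass in should run one generation and return the number of newly generated tokens.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[], int], warmup: int = 1, runs: int = 3) -> float:
    """Average generated tokens per second over several timed runs.

    `generate` runs one generation and returns the number of new tokens.
    On a GPU, it should call torch.cuda.synchronize() before returning,
    so that the timer covers the full computation.
    """
    for _ in range(warmup):
        generate()  # untimed warm-up call
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens += generate()
        total_seconds += time.perf_counter() - start
    return total_tokens / total_seconds


def fake_generate() -> int:
    """Stand-in for model.generate: sleeps briefly and reports 300 new tokens."""
    time.sleep(0.01)
    return 300


print(f"{tokens_per_second(fake_generate):.0f} tok/s")
```

With the real model, replace `fake_generate` with a function that calls `model.generate(**inputs, ...)` and returns the output length minus the prompt length.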


## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Elastic models Github__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: contact@thestage.ai