README.md: use PyPI for installation instead of JFrog

#4
by hinairo - opened
Files changed (1)
  1. README.md +175 -179
README.md CHANGED
@@ -1,179 +1,175 @@
 
---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-7B-Instruct
base_model_relation: quantized
pipeline_tag: text2text-generation
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Elastic model: Qwen2.5-7B-Instruct. Fastest and most flexible models for self-serving.

Elastic models are models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models (selected at load time via the `mode` argument; see the sketch after this list):

* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

* __L__: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.

* __M__: Faster model, with accuracy degradation of less than 1.5%.

* __S__: The fastest model, with accuracy degradation of less than 2%.
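
As a minimal sketch of switching between the series: only `mode='S'` appears in the full example below, so treat the other values as an assumption based on the list above:

```python
from elastic_models.transformers import AutoModelForCausalLM

# Sketch only: 'S' is the mode used in the full example below;
# 'XL', 'L' and 'M' are assumed to be selected the same way.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    mode='M',  # one of: 'XL', 'L', 'M', 'S'
)
```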

__Goals of elastic models:__

* Provide flexibility in cost vs quality selection for inference
* Provide clear quality and latency benchmarks
* Provide the interface of HF libraries (transformers and diffusers) with a single line of code
* Provide models supported on a wide range of hardware, pre-compiled and requiring no JIT
* Provide the best models and service for self-hosting

> It's important to note that the specific quality degradation can vary from model to model. For instance, an S model can show as little as 0.5% degradation.

![Performance Graph](images/performance_graph.png)
-----

## Inference

To run inference with our models, you just need to replace the `transformers` import with `elastic_models.transformers`:

```python
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# Currently we require your HF token, since we use the original
# weights for part of the layers, as well as the model configuration
model_name = "Qwen/Qwen2.5-7B-Instruct"
hf_token = ''
device = torch.device("cuda")

# Create model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference is as simple as with the transformers library
prompt = "Describe basics of DNNs quantization."
messages = [
    {
        "role": "system",
        "content": "You are a search bot, answer user text queries."
    },
    {
        "role": "user",
        "content": prompt
    }
]

chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

inputs = tokenizer(chat_prompt, return_tensors="pt")
inputs = inputs.to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

# Strip the prompt tokens and decode only the generated answer
input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

# Validate answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```
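
Since the interface mirrors `transformers`, the usual generation utilities should carry over as well. Here is a hedged sketch of streaming token-by-token output with `transformers.TextStreamer`, reusing `model`, `tokenizer` and `inputs` from the example above; streamer support in `elastic_models` is an assumption on our part, not something this README states:

```python
from transformers import TextStreamer

# Assumption: elastic_models' generate() accepts the same `streamer`
# argument as transformers' generate(); tokens print as they arrive.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
with torch.inference_mode():
    model.generate(**inputs, max_length=500, streamer=streamer)
```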

__System requirements:__
* GPUs: H100, L40S
* CPU: AMD, Intel
* Python: 3.10-3.12

To work with our models, just run these lines in your terminal (this PR switches the install source from the JFrog index to PyPI):

- ```shell
- pip install thestage
- pip install elastic_models[nvidia] \
-   --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
-   --extra-index-url https://pypi.nvidia.com \
-   --extra-index-url https://pypi.org/simple
-
- pip install flash_attn==2.7.3 --no-build-isolation
- pip uninstall apex
- ```
+ ```shell
+ pip install thestage
+ pip install 'thestage-elastic-models[nvidia]'
+ pip install flash_attn==2.7.3 --no-build-isolation
+ pip uninstall apex
+ ```

Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token from your profile page. Set up the API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
```

Congrats, now you can use accelerated models!
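
As a quick sanity check after installation, you can verify the package is importable. This is a minimal smoke test using only the import shown in the README's own example:

```python
# Smoke test: succeeds only if elastic_models is installed and
# visible to the current interpreter.
from elastic_models.transformers import AutoModelForCausalLM

print("elastic_models import OK:", AutoModelForCausalLM.__name__)
```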

----

## Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8` column indicates that we applied W8A8 quantization with the int8 data type to all linear layers and used the same calibration data as for ANNA. The S model achieves practically identical speed but much higher quality, as ANNA knows how to improve quantization quality on sensitive layers!
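
To make the `W8A8, int8` baseline concrete, here is a generic, illustrative sketch of symmetric int8 quantization of weights (W8) and activations (A8); this is the textbook scheme, not ANNA's actual algorithm:

```python
import torch

def quantize_int8(x: torch.Tensor):
    # Symmetric per-tensor quantization: map x to int8 in [-127, 127]
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

W = torch.randn(256, 256)   # weights of a linear layer
A = torch.randn(8, 256)     # a batch of activations

Wq, w_scale = quantize_int8(W)
Aq, a_scale = quantize_int8(A)

# Integer matmul (int64 here for portability; real kernels accumulate
# in int32), then rescale back to float with the product of the scales.
Y_q = Aq.to(torch.int64) @ Wq.to(torch.int64).T
Y = Y_q.float() * (w_scale * a_scale)

print("mean abs error:", (Y - A @ W.T).abs().mean().item())
```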

### Quality benchmarks

| Metric/Model  | S     | M     | L     | XL    | Original | W8A8, int8 |
|---------------|-------|-------|-------|-------|----------|------------|
| arc_challenge | 49.10 | 50.10 | 53.20 | 52.60 | 52.60    | 41.70      |
| mmlu          | 71.70 | 73.00 | 74.10 | 73.50 | 73.50    | 64.60      |
| piqa          | 77.00 | 78.20 | 78.80 | 79.50 | 79.50    | 67.10      |
| winogrande    | 66.20 | 69.10 | 71.50 | 70.60 | 70.60    | 53.10      |

* **MMLU**: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows the model's ability to handle diverse academic topics.
* **PIQA**: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows the model's understanding of real-world physics concepts.
* **Arc Challenge**: Evaluates grade-school-level multiple-choice questions requiring reasoning. Shows the model's ability to solve complex reasoning tasks.
* **Winogrande**: Evaluates commonsense reasoning through sentence-completion tasks. Shows the model's capability to understand context and resolve ambiguity.

### Latency benchmarks

__100 input / 300 output tokens; tok/s:__

| GPU/Model | S   | M   | L   | XL  | Original | W8A8, int8 |
|-----------|-----|-----|-----|-----|----------|------------|
| H100      | 201 | 173 | 162 | 135 | 62       | 201        |
| L40S      | 76  | 67  | 61  | 47  | 43       | 78         |
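
For a rough sense of wall-clock time, throughput converts directly into generation time for the 300 output tokens of this benchmark. A small worked example using the S-mode numbers from the table above:

```python
# Decode time ≈ output_tokens / throughput (tok/s), using the
# S-mode throughputs reported in the table above.
output_tokens = 300
for gpu, tok_s in {"H100": 201, "L40S": 76}.items():
    print(f"{gpu}: ~{output_tokens / tok_s:.2f} s to generate {output_tokens} tokens")
```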

## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
<!-- * __Elastic models Github__: [app.thestage.ai](app.thestage.ai) -->
* __Contact email__: contact@thestage.ai