# Model Placeholder

This repository will host optimized model variants for the Unicorn Execution Engine; the files listed below are planned but not yet published.

## Planned Model Files

### Gemma 3n E2B Variants
- `gemma3n-e2b-fp16-npu.safetensors` (MatFormer FP16 optimized)
- `gemma3n-e2b-int8-npu.safetensors` (MatFormer INT8 quantized)
- `gemma3n-e2b-config.json` (Model configuration)
- `gemma3n-e2b-tokenizer.json` (Tokenizer configuration)

### Qwen2.5-7B Variants  
- `qwen25-7b-fp16-hybrid.safetensors` (Hybrid execution FP16)
- `qwen25-7b-int8-hybrid.safetensors` (Hybrid execution INT8)
- `qwen25-7b-config.json` (Model configuration)
- `qwen25-7b-tokenizer.json` (Tokenizer configuration)

### NPU Optimization Files
- `npu_attention_kernels.mlir` (MLIR-AIE kernels)
- `igpu_optimization_configs.json` (ROCm configurations)
- `performance_profiles.json` (Turbo mode profiles)
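
Once a variant is published, loading it should follow the standard safetensors/tokenizers layout. A minimal sketch, using the planned file names from the lists above (which do not exist in this repository yet):

```python
# Minimal sketch: inspecting a published variant. Assumes the standard
# safetensors + tokenizers file layout; the file names are the planned
# ones listed above and are not yet present in this repository.
import json

from safetensors.torch import load_file
from tokenizers import Tokenizer

# Load the weights into a {tensor_name: tensor} dict on CPU.
weights = load_file("gemma3n-e2b-int8-npu.safetensors", device="cpu")

# Load the matching model configuration and tokenizer.
with open("gemma3n-e2b-config.json") as f:
    config = json.load(f)
tokenizer = Tokenizer.from_file("gemma3n-e2b-tokenizer.json")

# Quick sanity check: total parameter count across all tensors.
total_params = sum(t.numel() for t in weights.values())
print(f"{total_params / 1e9:.2f}B parameters across {len(weights)} tensors")
```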

## Model Sizes (Estimated)
- **Gemma 3n E2B FP16**: ~4GB
- **Gemma 3n E2B INT8**: ~2GB  
- **Qwen2.5-7B FP16**: ~14GB
- **Qwen2.5-7B INT8**: ~7GB
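
These figures follow from bytes per parameter: FP16 stores 2 bytes per weight and INT8 stores 1, so halving precision roughly halves the checkpoint. A quick sanity check using the nominal parameter counts (~2B effective for Gemma 3n E2B, ~7B for Qwen2.5-7B; quantization-scale overhead ignored):

```python
# Back-of-envelope size check, assuming weight-only storage:
# FP16 = 2 bytes/parameter, INT8 = 1 byte/parameter.
BYTES_PER_PARAM = {"fp16": 2, "int8": 1}

def estimated_gb(params_billions: float, precision: str) -> float:
    """Approximate on-disk size of a weight-only checkpoint in GB."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

for name, params in [("gemma3n-e2b", 2.0), ("qwen25-7b", 7.0)]:
    for precision in ("fp16", "int8"):
        print(f"{name} {precision}: ~{estimated_gb(params, precision):.0f} GB")
# gemma3n-e2b fp16: ~4 GB   int8: ~2 GB
# qwen25-7b  fp16: ~14 GB  int8: ~7 GB
```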

## Performance Targets
- **Gemma 3n E2B**: 100+ tokens per second (TPS) in turbo mode
- **Qwen2.5-7B**: 60+ TPS with hybrid execution
- **Memory Usage**: <10GB total system budget
- **Latency**: <30ms time to first token (TTFT)
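
Both targets are straightforward to check against a streaming generate call. A minimal timing harness (`stream_tokens` is a hypothetical stand-in for whatever token stream the engine exposes; only the timing logic matters here):

```python
# Sketch of how the TPS and TTFT targets can be measured against any
# iterable of generated tokens. Assumes the stream yields at least one
# token.
import time
from typing import Iterable

def measure(stream_tokens: Iterable[str]) -> None:
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream_tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token
        count += 1
    elapsed = time.perf_counter() - start
    print(f"TTFT: {(first_token_at - start) * 1e3:.1f} ms")  # target: <30ms
    print(f"TPS:  {count / elapsed:.1f} tokens/s")           # targets: 100+/60+
```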

To produce the actual optimized model files, run the Unicorn Execution Engine quantization pipeline:

```bash
cd Unicorn-Execution-Engine
python quantization_engine.py --model gemma3n-e2b --precision fp16 --target npu
python quantization_engine.py --model qwen25-7b --precision int8 --target hybrid
```
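
After the pipeline finishes, a quick check that the artifacts were written and land in the estimated size range (the output directory is an assumption; adjust to wherever the pipeline writes):

```python
# List produced checkpoints and their on-disk sizes for comparison
# against the estimates above.
from pathlib import Path

for path in sorted(Path(".").glob("*.safetensors")):
    size_gb = path.stat().st_size / 1e9
    print(f"{path.name}: {size_gb:.1f} GB")
```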