# Model Placeholder

This repository is ready to host optimized model variants for the Unicorn Execution Engine.

## Planned Model Files:

### Gemma 3n E2B Variants

- `gemma3n-e2b-fp16-npu.safetensors` (MatFormer FP16 optimized)
- `gemma3n-e2b-int8-npu.safetensors` (MatFormer INT8 quantized)
- `gemma3n-e2b-config.json` (Model configuration)
- `gemma3n-e2b-tokenizer.json` (Tokenizer configuration)

### Qwen2.5-7B Variants

- `qwen25-7b-fp16-hybrid.safetensors` (Hybrid execution FP16)
- `qwen25-7b-int8-hybrid.safetensors` (Hybrid execution INT8)
- `qwen25-7b-config.json` (Model configuration)
- `qwen25-7b-tokenizer.json` (Tokenizer configuration)

### NPU Optimization Files

- `npu_attention_kernels.mlir` (MLIR-AIE kernels)
- `igpu_optimization_configs.json` (ROCm configurations)
- `performance_profiles.json` (Turbo mode profiles)

## Model Sizes (Estimated)

- **Gemma 3n E2B FP16**: ~4GB
- **Gemma 3n E2B INT8**: ~2GB
- **Qwen2.5-7B FP16**: ~14GB
- **Qwen2.5-7B INT8**: ~7GB

## Performance Targets

- **Gemma 3n E2B**: 100+ TPS with turbo mode
- **Qwen2.5-7B**: 60+ TPS with hybrid execution
- **Memory Usage**: <10GB total system budget
- **Latency**: <30ms time to first token

To create actual optimized models, run the Unicorn Execution Engine quantization pipeline:

```bash
cd Unicorn-Execution-Engine
python quantization_engine.py --model gemma3n-e2b --precision fp16 --target npu
python quantization_engine.py --model qwen25-7b --precision int8 --target hybrid
```
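
Once the pipeline has produced a checkpoint, the safetensors format makes a quick sanity check easy. The sketch below uses the `safetensors` library (an assumption; the engine may ship its own loader) to count parameters without loading whole tensors into memory:

```python
from safetensors import safe_open

PATH = "gemma3n-e2b-fp16-npu.safetensors"

total = 0
with safe_open(PATH, framework="np") as f:
    for name in f.keys():
        shape = f.get_slice(name).get_shape()  # lazy slice; no tensor load
        n = 1
        for dim in shape:
            n *= dim
        total += n
print(f"{PATH}: {total / 1e9:.2f}B parameters")
```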
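
The size estimates above follow directly from parameter count times bytes per weight. A minimal sketch of that arithmetic, assuming roughly 2B effective parameters for Gemma 3n E2B (the MatFormer E2B configuration) and 7B for Qwen2.5-7B; real checkpoints add some overhead for metadata and any layers kept in higher precision:

```python
# Rough checkpoint-size arithmetic: parameters x bytes per weight.
# Parameter counts are approximations taken from the model names above.
BYTES_PER_WEIGHT = {"fp16": 2, "int8": 1}

MODELS = {
    "gemma3n-e2b": 2e9,  # ~2B effective parameters (MatFormer E2B)
    "qwen25-7b": 7e9,    # ~7B parameters
}

for model, params in MODELS.items():
    for precision, nbytes in BYTES_PER_WEIGHT.items():
        size_gb = params * nbytes / 1e9
        print(f"{model} {precision}: ~{size_gb:.0f} GB")
```

Running this reproduces the table: ~4/~2 GB for Gemma 3n E2B and ~14/~7 GB for Qwen2.5-7B.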
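
The performance targets are stated as tokens per second (TPS) and time to first token (TTFT). A generic timing harness for checking them might look like the following sketch; `engine.generate_stream` is a hypothetical placeholder for whatever streaming generation API the engine exposes:

```python
import time
from typing import Iterable, Tuple


def measure(token_stream: Iterable[str]) -> Tuple[float, float]:
    """Return (time to first token in ms, tokens per second)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_stream:
        count += 1
        if first is None:
            first = time.perf_counter()
    end = time.perf_counter()
    ttft_ms = (first - start) * 1000 if first is not None else float("inf")
    tps = count / (end - start) if count else 0.0
    return ttft_ms, tps


# Usage (hypothetical streaming API):
# ttft, tps = measure(engine.generate_stream("Hello"))
# assert ttft < 30 and tps > 100  # Gemma 3n E2B turbo-mode targets
```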