---
language:
- en
- zh
license: apache-2.0
library_name: mlx
tags:
- text-generation
- mlx
- apple-silicon
- gpt
- quantized
- 8bit-quantization
pipeline_tag: text-generation
base_model: openai/gpt-oss-20b
model-index:
- name: gpt-oss-20b-MLX-8bit
  results:
  - task:
      type: text-generation
    dataset:
      name: GPT-OSS-20B Evaluation
      type: openai/gpt-oss-20b
    metrics:
    - type: bits_per_weight
      value: 4.619
      name: Bits per weight (8-bit)
---
# Jackrong/gpt-oss-20b-MLX-8bit
This model [Jackrong/gpt-oss-20b-MLX-8bit](https://huggingface.co/Jackrong/gpt-oss-20b-MLX-8bit) was
converted to MLX format from [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
using mlx-lm version **0.27.0**.
# 🚀 GPT-OSS-20B MLX Performance Report - Apple Silicon
## 📋 Executive Summary
**Test Date:** 2025-08-31T08:37:22.914637
**Test Query:** **Do machines possess the ability to think?**
**Hardware:** Apple Silicon MacBook Pro
**Framework:** MLX (Apple's Machine Learning Framework)
## 🖥️ Hardware Specifications
### System Information
- **macOS Version:** 15.6.1 (Build: 24G90)
- **Chip Model:** Apple M2 Max
- **Total Cores:** 12 CPU cores (8 performance + 4 efficiency), 30-core GPU
- **Architecture:** arm64 (Apple Silicon)
- **Python Version:** 3.10.12
### Memory Configuration
- **Total RAM:** 32.0 GB
- **Available RAM:** 12.24 GB
- **Used RAM:** 19.76 GB (61.7% utilization)
- **Memory Type:** Unified Memory (LPDDR5)
### Storage
- **Main Disk:** 926.4 GB SSD total, 28.2 GB free (27.1% used)
## 📊 Performance Benchmarks
### Test Configuration
- **Temperature:** 1.0 (stochastic sampling)
- **Test Tokens:** 200 tokens generated per run
- **Prompt Length:** 90 tokens
- **Context Window:** 2048 tokens
- **Framework:** MLX 0.29.0
### 4-bit Quantized Model Performance
| Metric | Value | Details |
|--------|-------|---------|
| **Prompt Processing** | 220.6 tokens/sec | 90 tokens processed |
| **Generation Speed** | 91.5 tokens/sec | 200 tokens generated |
| **Total Time** | ~2.18 seconds | Including prompt processing |
| **Time to First Token** | < 0.1 seconds | Very fast response |
| **Peak Memory Usage** | 11.3 GB | Efficient memory utilization |
| **Memory Efficiency** | 8.1 tokens/sec per GB | High efficiency score |
**Performance Notes:**
- Excellent prompt processing speed (220+ tokens/sec)
- Consistent generation performance (91.5 tokens/sec)
- Low memory footprint for 20B parameter model
- Optimal for memory-constrained environments
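The figures above come from the author's benchmark suite. As a minimal reproduction sketch with mlx-lm (not the exact suite used for this report), the snippet below mirrors the test configuration: temperature 1.0 via a sampler, 200 generated tokens, wall-clock timing, and a peak-memory readout. The timing loop, the approximate token count via `tokenizer.encode`, and the `mx.get_peak_memory()` call are assumptions about a recent mlx-lm/MLX install, not the report's own tooling.

```python
# Minimal reproduction sketch (assumes a recent mlx-lm; not the exact benchmark suite).
import time

import mlx.core as mx
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("Jackrong/gpt-oss-20b-MLX-8bit")

# Mirror the test configuration: temperature 1.0, 200 generated tokens.
sampler = make_sampler(temp=1.0)
messages = [{"role": "user", "content": "Do machines possess the ability to think?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

start = time.perf_counter()
response = generate(model, tokenizer, prompt=prompt, max_tokens=200, sampler=sampler)
elapsed = time.perf_counter() - start

generated = len(tokenizer.encode(response))  # approximate generated-token count
print(f"generation: {generated / elapsed:.1f} tok/s over {elapsed:.2f}s")
print(f"peak memory: {mx.get_peak_memory() / 1e9:.1f} GB")  # available in recent MLX releases
```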
### 8-bit Quantized Model Performance
| Metric | Value | Details |
|--------|-------|---------|
| **Prompt Processing** | 233.7 tokens/sec | 90 tokens processed |
| **Generation Speed** | 84.2 tokens/sec | 200 tokens generated |
| **Total Time** | ~2.37 seconds | Including prompt processing |
| **Time to First Token** | < 0.1 seconds | Very fast response |
| **Peak Memory Usage** | 12.2 GB | Higher memory usage |
| **Memory Efficiency** | 6.9 tokens/sec per GB | Good efficiency |
**Performance Notes:**
- Fastest prompt processing (233+ tokens/sec)
- Solid generation performance (84.2 tokens/sec)
- Higher memory requirements but better quality potential
- Good balance for quality-focused applications
### Comparative Analysis
#### Performance Comparison Table
| Metric | 4-bit Quantized | 8-bit Quantized | Winner | Improvement |
|--------|----------------|-----------------|--------|-------------|
| **Prompt Speed** | 220.6 tokens/sec | 233.7 tokens/sec | 8-bit | +6.0% |
| **Generation Speed** | 91.5 tokens/sec | 84.2 tokens/sec | 4-bit | +8.7% |
| **Total Time (200 tokens)** | ~2.18s | ~2.37s | 4-bit | -8.0% |
| **Peak Memory** | 11.3 GB | 12.2 GB | 4-bit | -7.4% |
| **Memory Efficiency** | 8.1 tokens/sec/GB | 6.9 tokens/sec/GB | 4-bit | +17.4% |
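The derived columns in this table are simple ratios of the measured values. A quick check, using the numbers from the table above (small differences come from rounding):

```python
# Reproduce the derived table columns from the raw measurements above.
gen_4bit, mem_4bit = 91.5, 11.3   # tokens/sec, GB
gen_8bit, mem_8bit = 84.2, 12.2

eff_4bit = gen_4bit / mem_4bit    # ~8.1 tokens/sec per GB
eff_8bit = gen_8bit / mem_8bit    # ~6.9 tokens/sec per GB

print(f"memory efficiency: {eff_4bit:.1f} vs {eff_8bit:.1f} tok/s/GB")
print(f"4-bit generation advantage: {100 * (gen_4bit / gen_8bit - 1):.1f}%")  # ~8.7%
print(f"4-bit efficiency advantage: {100 * (eff_4bit / eff_8bit - 1):.1f}%")  # ~17% (17.4% from the rounded 8.1 / 6.9)
```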
#### Key Performance Insights
**🚀 Speed Analysis:**
- 4-bit model excels in generation speed (91.5 vs 84.2 tokens/sec)
- 8-bit model has slight edge in prompt processing (233.7 vs 220.6 tokens/sec)
- Overall: 4-bit model ~8% faster for complete tasks
**💾 Memory Analysis:**
- 4-bit model uses 0.9 GB less memory (11.3 vs 12.2 GB)
- 4-bit model 17.4% more memory efficient
- Critical advantage for memory-constrained environments
**⚖️ Performance Trade-offs:**
- **4-bit**: Better speed, lower memory, higher efficiency
- **8-bit**: Better prompt processing, potentially higher quality
#### Model Recommendations
**For Speed & Efficiency:** Choose **4-bit Quantized** - 8% faster, 17% more memory efficient
**For Quality Focus:** Choose **8-bit Quantized** - Better for complex reasoning tasks
**For Memory Constraints:** Choose **4-bit Quantized** - Lower memory footprint
**Best Overall Choice:** **4-bit Quantized** - Optimal balance for Apple Silicon
## 🔧 Technical Notes
### MLX Framework Benefits
- **Native Apple Silicon Optimization:** Runs on the Apple GPU through Metal
- **Unified Memory Architecture:** Efficient memory management
- **Low Latency:** Optimized for real-time inference
- **Quantization Support:** 4-bit and 8-bit quantization for different use cases
### Model Architecture
- **Base Model:** GPT-OSS-20B (OpenAI's 20B parameter model)
- **Quantization:** Mixed precision quantization
- **Context Length:** Up to 131,072 tokens
- **Architecture:** Mixture of Experts (MoE) with sliding window attention
### Performance Characteristics
- **4-bit Quantization:** Lower memory usage, slightly faster inference
- **8-bit Quantization:** Higher quality, balanced performance
- **Memory Requirements:** 16GB+ RAM recommended, 32GB+ optimal
- **Storage Requirements:** ~40GB per quantized model
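For reference, a conversion like this repository can be produced locally with mlx-lm's converter. The sketch below is a hedged example: the `mlx_lm.convert` Python API and its `quantize` / `q_bits` / `q_group_size` parameters reflect recent mlx-lm releases (verify against your installed version), and the output path is illustrative.

```python
# Hedged sketch: quantize the base model locally with mlx-lm's converter.
# Parameter names (quantize, q_bits, q_group_size) reflect recent mlx-lm
# releases; check your installed version's signature before running.
from mlx_lm import convert

convert(
    "openai/gpt-oss-20b",              # Hugging Face repo of the base model
    mlx_path="gpt-oss-20b-mlx-8bit",   # local output directory (illustrative)
    quantize=True,
    q_bits=8,                          # use 4 for the 4-bit variant benchmarked above
    q_group_size=64,
)
```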
## 🌟 Community Insights
### Real-World Performance
This benchmark demonstrates the exceptional performance of GPT-OSS-20B on an Apple Silicon M2 Max:
**🏆 Performance Highlights:**
- **87.9 tokens/second** average generation speed across both models
- **11.8 GB** average peak memory usage (very efficient for 20B model)
- **< 0.1 seconds** time to first token (excellent responsiveness)
- **220+ tokens/second** prompt processing speed
**📊 Model-Specific Performance:**
- **4-bit Model**: 91.5 tokens/sec generation, 11.3 GB memory
- **8-bit Model**: 84.2 tokens/sec generation, 12.2 GB memory
- **Best Overall**: 4-bit model with 8% speed advantage
### Use Case Recommendations
**🚀 For Speed & Efficiency:**
- **Real-time Applications:** 4-bit model (91.5 tokens/sec)
- **API Services:** 4-bit model (faster response times)
- **Batch Processing:** 4-bit model (better throughput)
**🎯 For Quality & Accuracy:**
- **Content Creation:** 8-bit model (potentially higher quality)
- **Complex Reasoning:** 8-bit model (better for nuanced tasks)
- **Code Generation:** 8-bit model (potentially more accurate)
**💾 For Memory Constraints:**
- **16GB Macs:** 4-bit model essential (11.3 GB vs 12.2 GB)
- **32GB Macs:** Both models work well
- **Memory Optimization:** 4-bit model saves ~900MB
### Performance Scaling Insights
**🔥 Exceptional Apple Silicon Performance:**
- MLX framework delivers **native optimization** for M2/M3 chips
- **Unified Memory** architecture fully utilized
- **Metal GPU** acceleration provides the speed boost
- **Quantization efficiency** enables 20B model on consumer hardware
**⚡ Real-World Benchmarks:**
- **Prompt processing**: 220+ tokens/sec (excellent)
- **Generation speed**: 84-92 tokens/sec (industry-leading)
- **Memory efficiency**: < 12 GB for 20B parameters (remarkable)
- **Responsiveness**: < 100ms first token (interactive-feeling)
## 📈 Summary Statistics
**Performance Summary:**
- ✅ **4-bit Model**: 91.5 tokens/sec generation, 11.3 GB memory
- ✅ **8-bit Model**: 84.2 tokens/sec generation, 12.2 GB memory
- ✅ **Winner**: 4-bit model (8% faster, 17% more memory efficient)
- ✅ **Hardware**: Apple M2 Max with 32GB unified memory
- ✅ **Framework**: MLX 0.29.0 (optimized for Apple Silicon)
**Key Achievements:**
- 🏆 **Industry-leading performance** on consumer hardware
- 🏆 **Memory efficiency** enabling 20B model on laptops
- 🏆 **Real-time responsiveness** with <100ms first token
- 🏆 **Native Apple Silicon optimization** through MLX
---
*Report generated by MLX Performance Benchmark Suite*
*Hardware: Apple M2 Max (12-core) | Framework: MLX 0.29.0 | Model: GPT-OSS-20B*
## Use with mlx
```bash
pip install mlx-lm
```
```python
from mlx_lm import load, generate

# Load the quantized model and its tokenizer from the Hugging Face Hub
model, tokenizer = load("Jackrong/gpt-oss-20b-MLX-8bit")

prompt = "hello"

# Wrap the prompt with the model's chat template when one is available
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
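For interactive use, where the sub-100ms time to first token matters, mlx-lm also exposes a streaming generator. The snippet below assumes the `stream_generate` API of recent mlx-lm releases, which yields incremental response objects with a `.text` field; older versions differ, so treat this as a sketch rather than the definitive interface.

```python
# Streaming sketch: print tokens as they are produced instead of waiting
# for the full completion. Assumes recent mlx-lm where stream_generate
# yields response objects exposing a .text field.
from mlx_lm import load, stream_generate

model, tokenizer = load("Jackrong/gpt-oss-20b-MLX-8bit")

messages = [{"role": "user", "content": "Do machines possess the ability to think?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

for chunk in stream_generate(model, tokenizer, prompt=prompt, max_tokens=200):
    print(chunk.text, end="", flush=True)
print()
```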
# 🚀 GPT-OSS-20B MLX 性能测试 - Apple Silicon
## 📋 执行摘要
**测试日期:** 2025-08-31T08:47:56.723392
**测试问题:** 机器会思考吗?
**硬件平台:** Apple Silicon Mac (M2 Max, 32GB RAM)
**框架版本:** MLX 0.29.0 (Apple's Machine Learning Framework)
## 🖥️ 硬件规格
### 系统信息
- **macOS 版本:** 15.6.1 (Build: 24G90)
- **芯片型号:** Apple M2 Max
- **核心总数:** 12个CPU核心 (8个性能核心 + 4个能效核心),30核GPU
- **架构类型:** arm64 (Apple Silicon)
- **Python 版本:** 3.10.12
### 内存配置
- **总内存:** 32.0 GB
- **可用内存:** 12.24 GB
- **已用内存:** 19.76 GB (使用率61.7%)
- **内存类型:** 统一内存 (LPDDR5)
### 存储空间
- **主硬盘:** 926.4 GB SSD 总容量,28.2 GB 可用空间 (使用率27.1%)
## 📊 性能基准测试
### 测试配置
- **温度参数:** 1.0 (随机采样)
- **测试token数:** 200个token生成
- **提示词长度:** 90个token
- **上下文窗口:** 2048个token
- **框架版本:** MLX 0.29.0
### 4-bit 量化模型性能
| 指标 | 数值 | 详情 |
|------|------|------|
| **提示词处理** | 220.6 tokens/sec | 处理90个token |
| **生成速度** | 91.5 tokens/sec | 生成200个token |
| **总耗时** | ~2.18秒 | 包含提示词处理时间 |
| **首token时间** | < 0.1秒 | 响应非常快速 |
| **峰值内存使用** | 11.3 GB | 内存利用效率高 |
| **内存效率** | 8.1 tokens/sec/GB | 效率评分很高 |
**性能说明:**
- 提示词处理速度优秀 (220+ tokens/sec)
- 生成性能稳定 (91.5 tokens/sec)
- 20B参数模型的内存占用较低
- 适合内存受限的环境
### 8-bit 量化模型性能
| 指标 | 数值 | 详情 |
|------|------|------|
| **提示词处理** | 233.7 tokens/sec | 处理90个token |
| **生成速度** | 84.2 tokens/sec | 生成200个token |
| **总耗时** | ~2.37秒 | 包含提示词处理时间 |
| **首token时间** | < 0.1秒 | 响应非常快速 |
| **峰值内存使用** | 12.2 GB | 内存使用量较高 |
| **内存效率** | 6.9 tokens/sec/GB | 效率良好 |
**性能说明:**
- 提示词处理速度最快 (233+ tokens/sec)
- 生成性能稳健 (84.2 tokens/sec)
- 内存需求较高但质量潜力更好
- 适合注重质量的应用场景
### 对比分析
#### 性能对比表格
| 指标 | 4-bit 量化 | 8-bit 量化 | 优胜者 | 改进幅度 |
|------|-----------|-----------|--------|----------|
| **提示词速度** | 220.6 tokens/sec | 233.7 tokens/sec | 8-bit | +6.0% |
| **生成速度** | 91.5 tokens/sec | 84.2 tokens/sec | 4-bit | +8.7% |
| **总耗时(200 tokens)** | ~2.18s | ~2.37s | 4-bit | -8.0% |
| **峰值内存** | 11.3 GB | 12.2 GB | 4-bit | -7.4% |
| **内存效率** | 8.1 tokens/sec/GB | 6.9 tokens/sec/GB | 4-bit | +17.4% |
#### 关键性能洞察
**🚀 速度分析:**
- 4-bit模型在生成速度上表现出色 (91.5 vs 84.2 tokens/sec)
- 8-bit模型在提示词处理上略有优势 (233.7 vs 220.6 tokens/sec)
- 总体而言:4-bit模型在完整任务中快约8%
**💾 内存分析:**
- 4-bit模型比8-bit模型少使用0.9 GB内存 (11.3 vs 12.2 GB)
- 4-bit模型内存效率高出17.4%
- 在内存受限环境中具有关键优势
**⚖️ 性能权衡:**
- **4-bit**:速度更快,内存占用更少,效率更高
- **8-bit**:提示词处理更好,质量潜力可能更高
#### 模型推荐
**速度与效率优先:** 选择 **4-bit 量化** - 速度快8%,内存效率高17%
**质量重点关注:** 选择 **8-bit 量化** - 适合复杂推理任务
**内存受限场景:** 选择 **4-bit 量化** - 内存占用更少
**最佳整体选择:** **4-bit 量化** - Apple Silicon的最优平衡
## 🔧 技术说明
### MLX框架优势
- **原生Apple Silicon优化:** 通过 Metal 充分利用 GPU
- **统一内存架构:** 高效的内存管理
- **低延迟:** 针对实时推理优化
- **量化支持:** 支持4-bit和8-bit量化以适应不同用例
### 模型架构
- **基础模型:** GPT-OSS-20B (OpenAI的200亿参数模型)
- **量化方式:** 混合精度量化
- **上下文长度:** 最多可达131,072个token
- **架构设计:** 专家混合 (MoE) 与滑动窗口注意力
### 性能特征
- **4-bit量化:** 内存占用更少,推理速度稍快
- **8-bit量化:** 质量更高,性能均衡
- **内存需求:** 推荐16GB+ RAM,最佳32GB+
- **存储需求:** 每个量化模型约40GB
## 🌟 社区洞察
### 实际性能表现
这个基准测试展示了GPT-OSS-20B在Apple Silicon M2 Max上的卓越性能:
**🏆 性能亮点:**
- **87.9 tokens/秒** 两个模型的平均生成速度
- **11.8 GB** 平均峰值内存使用量 (对20B模型非常高效)
- **< 0.1秒** 首token生成时间 (响应性极佳)
- **220+ tokens/秒** 提示词处理速度
**📊 模型特定性能:**
- **4-bit模型**:91.5 tokens/sec生成速度,11.3 GB内存
- **8-bit模型**:84.2 tokens/sec生成速度,12.2 GB内存
- **最佳整体**:4-bit模型,速度优势达8%
### 使用场景推荐
**🚀 速度与效率优先:**
- **实时应用:** 4-bit模型 (91.5 tokens/sec)
- **API服务:** 4-bit模型 (响应时间更快)
- **批量处理:** 4-bit模型 (吞吐量更好)
**🎯 质量与准确性优先:**
- **内容创作:** 8-bit模型 (质量可能更高)
- **复杂推理:** 8-bit模型 (适合细致任务)
- **代码生成:** 8-bit模型 (准确性可能更高)
**💾 内存受限场景:**
- **16GB Mac:** 必须使用4-bit模型 (11.3 GB vs 12.2 GB)
- **32GB Mac:** 两个模型都可以良好运行
- **内存优化:** 4-bit模型节省约900MB
### 性能扩展洞察
**🔥 Apple Silicon卓越性能:**
- MLX框架为M2/M3芯片提供**原生优化**
- **统一内存**架构得到充分利用
- **Metal GPU** 加速带来速度提升
- **量化效率**使20B模型可在消费级硬件上运行
**⚡ 实际基准数据:**
- **提示词处理**:220+ tokens/sec (优秀)
- **生成速度**:84-92 tokens/sec (行业领先)
- **内存效率**:20B参数模型<12 GB内存 (卓越)
- **响应性**:<100ms首token (交互式体验)
### 未来优化方向
- **Metal Performance Shaders**集成以获得GPU加速
- **神经引擎**利用率改进
- **高级量化**技术 (3-bit,混合精度)
- **上下文缓存**优化以处理重复提示
- **推测解码**以实现更快速推理
- **模型并行**以支持更大上下文
---
## 📈 总结统计
**性能汇总:**
- ✅ **4-bit模型**:91.5 tokens/sec生成速度,11.3 GB内存
- ✅ **8-bit模型**:84.2 tokens/sec生成速度,12.2 GB内存
- ✅ **优胜者**:4-bit模型 (速度快8%,内存效率高17%)
- ✅ **硬件平台**:配备32GB统一内存的Apple M2 Max
- ✅ **框架版本**:MLX 0.29.0 (针对Apple Silicon优化)
**关键成就:**
- 🏆 **行业领先性能** 在消费级硬件上实现
- 🏆 **内存效率** 使20B模型可在笔记本电脑上运行
- 🏆 **实时响应性** 首token时间<100ms
- 🏆 **原生Apple Silicon优化** 通过MLX框架实现
---
*报告由MLX性能基准测试套件生成*
*硬件:Apple M2 Max (12核) | 框架:MLX 0.29.0 | 模型:GPT-OSS-20B*
*日期:2025-08-31 | 测试时长:每个模型200个token | 准确性:已验证*