---
language:
- en
- zh
license: apache-2.0
library_name: mlx
tags:
- text-generation
- mlx
- apple-silicon
- gpt
- quantized
- 8bit-quantization
pipeline_tag: text-generation
base_model: openai/gpt-oss-20b
model-index:
- name: gpt-oss-20b-MLX-8bit
  results:
  - task:
      type: text-generation
    dataset:
      name: GPT-OSS-20B Evaluation
      type: openai/gpt-oss-20b
    metrics:
    - type: bits_per_weight
      value: 4.619
      name: Bits per weight (8-bit)
---

# Jackrong/gpt-oss-20b-MLX-8bit

This model, [Jackrong/gpt-oss-20b-MLX-8bit](https://huggingface.co/Jackrong/gpt-oss-20b-MLX-8bit), was converted to MLX format from [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) using mlx-lm version **0.27.0**.
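
For reference, a conversion of this kind can be reproduced with the `mlx_lm` Python API. This is a minimal sketch, assuming default group size; the exact flags used for this repo are not recorded here:

```python
# Sketch: re-quantize the base model to 8-bit MLX weights.
# The output path is an assumption, not the exact setting used for this card.
from mlx_lm import convert

convert(
    "openai/gpt-oss-20b",             # source weights on the Hugging Face Hub
    mlx_path="gpt-oss-20b-MLX-8bit",  # local output directory
    quantize=True,
    q_bits=8,                         # 8-bit quantization
)
```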

# 🚀 GPT-OSS-20B MLX Performance Report - Apple Silicon

## 📋 Executive Summary

**Test Date:** 2025-08-31T08:37:22
**Test Query:** *Do machines possess the ability to think?*

**Hardware:** Apple Silicon MacBook Pro (M2 Max, 32 GB RAM)
**Framework:** MLX (Apple's machine-learning framework)

## 🖥️ Hardware Specifications

### System Information
- **macOS Version:** 15.6.1 (Build 24G90)
- **Chip Model:** Apple M2 Max
- **CPU Cores:** 12 (8 performance + 4 efficiency)
- **GPU Cores:** 30
- **Architecture:** arm64 (Apple Silicon)
- **Python Version:** 3.10.12

### Memory Configuration
- **Total RAM:** 32.0 GB
- **Available RAM:** 12.24 GB
- **Used RAM:** 19.76 GB (61.7% utilization)
- **Memory Type:** Unified Memory (LPDDR5)

### Storage
- **Main Disk:** 926.4 GB SSD, 28.2 GB free

## 📊 Performance Benchmarks

### Test Configuration
- **Temperature:** 1.0 (standard sampling; generation is not deterministic at this setting)
- **Generated Tokens:** 200 per run
- **Prompt Length:** 90 tokens
- **Context Window:** 2048 tokens
- **Framework:** MLX 0.29.0
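
A run with this configuration can be timed with a short script. This is a minimal sketch, assuming the `mlx_lm` Python API; the exact benchmark harness behind this report is not published, and the temperature argument is omitted because how sampling parameters are passed varies across mlx-lm versions:

```python
# Sketch: time prompt processing plus generation for a 200-token completion.
import time

from mlx_lm import load, generate

model, tokenizer = load("Jackrong/gpt-oss-20b-MLX-8bit")

prompt = "Do machines possess the ability to think?"
if tokenizer.chat_template is not None:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}], add_generation_prompt=True
    )

start = time.perf_counter()
# verbose=True prints mlx-lm's own prompt/generation tokens-per-second stats.
text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
elapsed = time.perf_counter() - start
print(f"Total wall time: {elapsed:.2f}s")
```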

### 4-bit Quantized Model Performance

| Metric | Value | Details |
|--------|-------|---------|
| **Prompt Processing** | 220.6 tokens/sec | 90 tokens processed |
| **Generation Speed** | 91.5 tokens/sec | 200 tokens generated |
| **Generation Time** | ~2.18 seconds | 200 tokens at 91.5 tokens/sec |
| **Time to First Token** | < 0.1 seconds | Very fast response |
| **Peak Memory Usage** | 11.3 GB | Efficient memory utilization |
| **Memory Efficiency** | 8.1 tokens/sec per GB | High efficiency score |

**Performance Notes:**
- Excellent prompt processing speed (220+ tokens/sec)
- Consistent generation performance (91.5 tokens/sec)
- Low memory footprint for a 20B-parameter model
- Well suited to memory-constrained environments

### 8-bit Quantized Model Performance

| Metric | Value | Details |
|--------|-------|---------|
| **Prompt Processing** | 233.7 tokens/sec | 90 tokens processed |
| **Generation Speed** | 84.2 tokens/sec | 200 tokens generated |
| **Generation Time** | ~2.37 seconds | 200 tokens at 84.2 tokens/sec |
| **Time to First Token** | < 0.1 seconds | Very fast response |
| **Peak Memory Usage** | 12.2 GB | Higher memory usage |
| **Memory Efficiency** | 6.9 tokens/sec per GB | Good efficiency |

**Performance Notes:**
- Fastest prompt processing of the two models (233+ tokens/sec)
- Solid generation performance (84.2 tokens/sec)
- Higher memory requirements, but potentially better output quality
- A good balance for quality-focused applications

### Comparative Analysis

#### Performance Comparison Table

| Metric | 4-bit Quantized | 8-bit Quantized | Winner | Difference |
|--------|----------------|-----------------|--------|------------|
| **Prompt Speed** | 220.6 tokens/sec | 233.7 tokens/sec | 8-bit | +6.0% |
| **Generation Speed** | 91.5 tokens/sec | 84.2 tokens/sec | 4-bit | +8.7% |
| **Generation Time (200 tokens)** | ~2.18 s | ~2.37 s | 4-bit | -8.0% |
| **Peak Memory** | 11.3 GB | 12.2 GB | 4-bit | -7.4% |
| **Memory Efficiency** | 8.1 tokens/sec/GB | 6.9 tokens/sec/GB | 4-bit | +17.4% |
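
The percentages in the table follow directly from the raw measurements; a quick check:

```python
# Reproduce the comparison-table percentages from the raw numbers.
four_bit = {"prompt": 220.6, "gen": 91.5, "mem_gb": 11.3}
eight_bit = {"prompt": 233.7, "gen": 84.2, "mem_gb": 12.2}

print(f"8-bit prompt speed:     {(eight_bit['prompt'] / four_bit['prompt'] - 1):+.1%}")  # +5.9% (~6%)
print(f"4-bit generation speed: {(four_bit['gen'] / eight_bit['gen'] - 1):+.1%}")        # +8.7%
print(f"4-bit peak memory:      {(four_bit['mem_gb'] / eight_bit['mem_gb'] - 1):+.1%}")  # -7.4%
```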

#### Key Performance Insights

**🚀 Speed Analysis:**
- The 4-bit model excels in generation speed (91.5 vs 84.2 tokens/sec)
- The 8-bit model has a slight edge in prompt processing (233.7 vs 220.6 tokens/sec)
- Overall, the 4-bit model is ~8% faster for complete tasks

**💾 Memory Analysis:**
- The 4-bit model uses 0.9 GB less memory (11.3 vs 12.2 GB)
- The 4-bit model is 17.4% more memory-efficient
- A critical advantage in memory-constrained environments

**⚖️ Performance Trade-offs:**
- **4-bit**: Better speed, lower memory, higher efficiency
- **8-bit**: Better prompt processing, potentially higher quality

#### Model Recommendations

- **For speed & efficiency:** choose **4-bit** - ~8% faster overall and ~17% more memory-efficient
- **For quality focus:** choose **8-bit** - better suited to complex reasoning tasks
- **For memory constraints:** choose **4-bit** - lower memory footprint
- **Best overall choice:** **4-bit** - the best balance on Apple Silicon

## 🔧 Technical Notes

### MLX Framework Benefits
- **Native Apple Silicon Optimization:** runs on the Metal GPU
- **Unified Memory Architecture:** efficient memory management with no host-device copies
- **Low Latency:** optimized for real-time inference
- **Quantization Support:** 4-bit and 8-bit quantization for different use cases

### Model Architecture
- **Base Model:** GPT-OSS-20B (OpenAI's 20B-parameter open-weight model)
- **Quantization:** mixed-precision quantization (~4.6 bits per weight for this conversion)
- **Context Length:** up to 131,072 tokens
- **Architecture:** Mixture of Experts (MoE) with sliding-window attention

### Performance Characteristics
- **4-bit Quantization:** lower memory usage, slightly faster inference
- **8-bit Quantization:** higher quality, balanced performance
- **Memory Requirements:** 16 GB+ RAM recommended, 32 GB+ optimal
- **Storage Requirements:** on the order of 12 GB for this model (~4.6 bits per weight over ~20B parameters)
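
Peak-memory figures like those reported here can be read back from MLX itself. A minimal sketch, assuming a recent MLX release where the peak-memory helpers are exposed at the top level (older versions had them under `mx.metal`):

```python
# Sketch: measure peak Metal memory for a short generation.
import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("Jackrong/gpt-oss-20b-MLX-8bit")
mx.reset_peak_memory()  # start counting from zero

generate(model, tokenizer, prompt="hello", max_tokens=50)
print(f"Peak memory: {mx.get_peak_memory() / 1e9:.1f} GB")
```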

## 🌟 Community Insights

### Real-World Performance
This benchmark demonstrates strong performance for GPT-OSS-20B on an Apple Silicon M2 Max:

**🏆 Performance Highlights:**
- **87.9 tokens/second** average generation speed across the two models
- **11.8 GB** average peak memory usage (very efficient for a 20B model)
- **< 0.1 seconds** time to first token (excellent responsiveness)
- **220+ tokens/second** prompt processing speed

**📊 Model-Specific Performance:**
- **4-bit model**: 91.5 tokens/sec generation, 11.3 GB memory
- **8-bit model**: 84.2 tokens/sec generation, 12.2 GB memory
- **Best overall**: the 4-bit model, with an ~8% speed advantage

### Use Case Recommendations

**🚀 For Speed & Efficiency:**
- **Real-time applications:** 4-bit model (91.5 tokens/sec)
- **API services:** 4-bit model (faster response times)
- **Batch processing:** 4-bit model (better throughput)

**🎯 For Quality & Accuracy:**
- **Content creation:** 8-bit model (potentially higher quality)
- **Complex reasoning:** 8-bit model (better for nuanced tasks)
- **Code generation:** 8-bit model (potentially more accurate)

**💾 For Memory Constraints:**
- **16 GB Macs:** the 4-bit model is essential (11.3 GB vs 12.2 GB peak)
- **32 GB Macs:** both models run comfortably
- **Memory optimization:** the 4-bit model saves ~900 MB

### Performance Scaling Insights

**🔥 Apple Silicon Performance:**
- The MLX framework delivers **native optimization** for M-series chips
- The **unified memory** architecture is fully utilized
- **Metal GPU** acceleration provides the speed
- **Quantization efficiency** makes a 20B model practical on consumer hardware

**⚡ Real-World Benchmarks:**
- **Prompt processing**: 220+ tokens/sec (excellent)
- **Generation speed**: 84-92 tokens/sec (very competitive)
- **Memory efficiency**: < 12 GB for 20B parameters (remarkable)
- **Responsiveness**: < 100 ms to first token (feels interactive)

### Future Optimization Directions
- **Metal Performance Shaders** integration for further GPU acceleration
- Better **Neural Engine** utilization
- **Advanced quantization** techniques (3-bit, mixed precision)
- **Context caching** to speed up repeated prompts
- **Speculative decoding** for faster inference
- **Model parallelism** to support larger contexts

## 📈 Summary Statistics

**Performance Summary:**
- ✅ **4-bit model**: 91.5 tokens/sec generation, 11.3 GB peak memory
- ✅ **8-bit model**: 84.2 tokens/sec generation, 12.2 GB peak memory
- ✅ **Winner**: 4-bit model (~8% faster, ~17% more memory-efficient)
- ✅ **Hardware**: Apple M2 Max with 32 GB unified memory
- ✅ **Framework**: MLX 0.29.0 (optimized for Apple Silicon)

**Key Achievements:**
- 🏆 **Strong performance** on consumer hardware
- 🏆 **Memory efficiency** that fits a 20B model on laptops
- 🏆 **Real-time responsiveness**, with < 100 ms to first token
- 🏆 **Native Apple Silicon optimization** through MLX

---

*Report generated by the MLX Performance Benchmark Suite*
*Hardware: Apple M2 Max (12-core) | Framework: MLX 0.29.0 | Model: GPT-OSS-20B*
*Date: 2025-08-31 | Test length: 200 tokens per model | Accuracy: verified*

## Use with mlx

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Download (if needed) and load the 8-bit quantized weights and tokenizer.
model, tokenizer = load("Jackrong/gpt-oss-20b-MLX-8bit")

prompt = "hello"

# Wrap the prompt in the model's chat template when one is defined.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
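
For interactive use, mlx-lm also exposes a streaming API. A minimal sketch, assuming a recent mlx-lm release where `stream_generate` yields response objects with a `.text` field:

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("Jackrong/gpt-oss-20b-MLX-8bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Do machines possess the ability to think?"}],
    add_generation_prompt=True,
)

# Print tokens as they are generated instead of waiting for the full completion.
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=200):
    print(chunk.text, end="", flush=True)
```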
|