Jackrong/gpt-oss-20b-MLX-8bit

This model Jackrong/gpt-oss-20b-MLX-8bit was converted to MLX format from openai/gpt-oss-20b using mlx-lm version 0.27.0.
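
For reference, a conversion like this can typically be reproduced with mlx-lm's convert utility. A minimal sketch, assuming the Python API of a recent mlx-lm release (argument names can vary between versions):

from mlx_lm import convert

# Sketch: quantize the upstream weights to 8-bit and write an MLX-format copy.
# `mlx_path` and `q_bits` are assumptions based on recent mlx-lm releases.
convert(
    "openai/gpt-oss-20b",
    mlx_path="gpt-oss-20b-MLX-8bit",
    quantize=True,
    q_bits=8,
)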

🚀 GPT-OSS-20B MLX Performance Report - Apple Silicon

📋 Executive Summary

  • Test Date: 2025-08-31 08:37:22
  • Test Query: "Do machines possess the ability to think?"
  • Hardware: Apple Silicon MacBook Pro (M2 Max, 32 GB unified memory)
  • Framework: MLX (Apple's machine-learning framework)

🖥️ Hardware Specifications

System Information

  • macOS Version: 15.6.1 (Build: 24G90)
  • Chip Model: Apple M2 Max
  • CPU Cores: 12 (8 performance + 4 efficiency); GPU: 30 cores
  • Architecture: arm64 (Apple Silicon)
  • Python Version: 3.10.12

Memory Configuration

  • Total RAM: 32.0 GB
  • Available RAM: 12.24 GB
  • Used RAM: 19.76 GB (61.7% utilization)
  • Memory Type: Unified Memory (LPDDR5)

Storage

  • Main Disk: 926.4 GB SSD total, 28.2 GB free
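
Figures like those above are straightforward to collect programmatically. A minimal sketch using the standard library plus the third-party psutil package (an assumption; this is not the benchmark suite's published code):

import platform
import psutil  # pip install psutil

# Collect the same class of system facts reported above.
mem = psutil.virtual_memory()
disk = psutil.disk_usage("/")
print("Architecture:", platform.machine())   # e.g. arm64
print("Python:", platform.python_version())
print(f"RAM: {mem.total / 1024**3:.1f} GB total, {mem.available / 1024**3:.2f} GB available")
print(f"Disk: {disk.total / 1024**3:.1f} GB total, {disk.free / 1024**3:.1f} GB free")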

📊 Performance Benchmarks

Test Configuration

  • Temperature: 1.0 (stochastic sampling)
  • Generated Tokens: 200 per run
  • Prompt Length: 90 tokens
  • Context Window: 2048 tokens
  • Framework: MLX 0.29.0
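
A run with this configuration can be timed along the following lines. This is a sketch, not the benchmark suite's actual code; `max_tokens` follows the mlx-lm generate API, and the temperature setting is applied through version-specific sampling options omitted here:

import time
from mlx_lm import load, generate

model, tokenizer = load("Jackrong/gpt-oss-20b-MLX-8bit")
prompt = "Do machines possess the ability to think?"

start = time.perf_counter()
generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)  # verbose prints tokens/sec
print(f"Wall time: {time.perf_counter() - start:.2f} s")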

4-bit Quantized Model Performance

| Metric | Value | Details |
|---|---|---|
| Prompt Processing | 220.6 tokens/sec | 90 tokens processed |
| Generation Speed | 91.5 tokens/sec | 200 tokens generated |
| Total Time | ~2.18 seconds | Generation only; prompt processing adds ~0.4 s |
| Time to First Token | < 0.1 seconds | Very fast response |
| Peak Memory Usage | 11.3 GB | Efficient memory utilization |
| Memory Efficiency | 8.1 tokens/sec per GB | Generation speed ÷ peak memory |
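
The memory-efficiency score in the table is simply generation speed divided by peak memory; a quick check:

gen_speed = 91.5  # tokens/sec, 4-bit model
peak_mem = 11.3   # GB
print(f"{gen_speed / peak_mem:.1f} tokens/sec per GB")  # -> 8.1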

Performance Notes:

  • Excellent prompt processing speed (220+ tokens/sec)
  • Consistent generation performance (91.5 tokens/sec)
  • Low memory footprint for 20B parameter model
  • Optimal for memory-constrained environments

8-bit Quantized Model Performance

| Metric | Value | Details |
|---|---|---|
| Prompt Processing | 233.7 tokens/sec | 90 tokens processed |
| Generation Speed | 84.2 tokens/sec | 200 tokens generated |
| Total Time | ~2.37 seconds | Generation only; prompt processing adds ~0.4 s |
| Time to First Token | < 0.1 seconds | Very fast response |
| Peak Memory Usage | 12.2 GB | Higher than 4-bit |
| Memory Efficiency | 6.9 tokens/sec per GB | Good efficiency |

Performance Notes:

  • Fastest prompt processing (233+ tokens/sec)
  • Solid generation performance (84.2 tokens/sec)
  • Higher memory requirements but better quality potential
  • Good balance for quality-focused applications

Comparative Analysis

Performance Comparison Table

| Metric | 4-bit Quantized | 8-bit Quantized | Winner | Improvement |
|---|---|---|---|---|
| Prompt Speed | 220.6 tokens/sec | 233.7 tokens/sec | 8-bit | +5.9% |
| Generation Speed | 91.5 tokens/sec | 84.2 tokens/sec | 4-bit | +8.7% |
| Total Time (200 tokens) | ~2.18 s | ~2.37 s | 4-bit | -8.0% |
| Peak Memory | 11.3 GB | 12.2 GB | 4-bit | -7.4% |
| Memory Efficiency | 8.1 tokens/sec/GB | 6.9 tokens/sec/GB | 4-bit | +17.4% |
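
The improvement column follows directly from the raw figures, for example:

prompt_4bit, prompt_8bit = 220.6, 233.7   # tokens/sec
gen_4bit, gen_8bit = 91.5, 84.2           # tokens/sec
mem_4bit, mem_8bit = 11.3, 12.2           # GB

print(f"Prompt speed: 8-bit +{(prompt_8bit / prompt_4bit - 1) * 100:.1f}%")  # +5.9%
print(f"Generation:   4-bit +{(gen_4bit / gen_8bit - 1) * 100:.1f}%")        # +8.7%
print(f"Peak memory:  4-bit -{(1 - mem_4bit / mem_8bit) * 100:.1f}%")        # -7.4%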

Key Performance Insights

🚀 Speed Analysis:

  • 4-bit model excels in generation speed (91.5 vs 84.2 tokens/sec)
  • 8-bit model has slight edge in prompt processing (233.7 vs 220.6 tokens/sec)
  • Overall: 4-bit model ~8% faster for complete tasks

💾 Memory Analysis:

  • 4-bit model uses 0.9 GB less memory (11.3 vs 12.2 GB)
  • 4-bit model 17.4% more memory efficient
  • Critical advantage for memory-constrained environments

⚖️ Performance Trade-offs:

  • 4-bit: Better speed, lower memory, higher efficiency
  • 8-bit: Better prompt processing, potentially higher quality

Model Recommendations

  • For Speed & Efficiency: choose 4-bit quantized - ~8% faster and ~17% more memory efficient
  • For Quality Focus: choose 8-bit quantized - better suited to complex reasoning tasks
  • For Memory Constraints: choose 4-bit quantized - lower memory footprint
  • Best Overall Choice: 4-bit quantized - the best balance on Apple Silicon

🔧 Technical Notes

MLX Framework Benefits

  • Native Apple Silicon Optimization: GPU acceleration via Metal and unified memory
  • Unified Memory Architecture: Efficient memory management
  • Low Latency: Optimized for real-time inference
  • Quantization Support: 4-bit and 8-bit quantization for different use cases

Model Architecture

  • Base Model: GPT-OSS-20B (OpenAI's 20B parameter model)
  • Quantization: Mixed precision quantization
  • Context Length: Up to 131,072 tokens
  • Architecture: Mixture of Experts (MoE) with alternating dense and sliding-window attention

Performance Characteristics

  • 4-bit Quantization: Lower memory usage, slightly faster inference
  • 8-bit Quantization: Higher quality, balanced performance
  • Memory Requirements: 16GB+ RAM recommended, 32GB+ optimal
  • Storage Requirements: roughly 11-22 GB on disk per quantized copy (4-bit vs. 8-bit)
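
A small pre-flight check against the recommended memory floor might look like this (a sketch; psutil is an assumption, not part of the model's tooling):

import psutil

RECOMMENDED_GB = 16  # floor from the report; 32 GB+ is more comfortable
total_gb = psutil.virtual_memory().total / 1024**3
if total_gb < RECOMMENDED_GB:
    raise SystemExit(
        f"Only {total_gb:.0f} GB RAM detected; {RECOMMENDED_GB} GB+ recommended for gpt-oss-20b"
    )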

🌟 Community Insights

Real-World Performance

This benchmark demonstrates exceptional performance of GPT-OSS-20B on Apple Silicon M2 Max:

🏆 Performance Highlights:

  • 87.9 tokens/second average generation speed across both models
  • 11.8 GB average peak memory usage (very efficient for 20B model)
  • < 0.1 seconds time to first token (excellent responsiveness)
  • 220+ tokens/second prompt processing speed

📊 Model-Specific Performance:

  • 4-bit Model: 91.5 tokens/sec generation, 11.3 GB memory
  • 8-bit Model: 84.2 tokens/sec generation, 12.2 GB memory
  • Best Overall: 4-bit model with 8% speed advantage

Use Case Recommendations

🚀 For Speed & Efficiency:

  • Real-time Applications: 4-bit model (91.5 tokens/sec)
  • API Services: 4-bit model (faster response times)
  • Batch Processing: 4-bit model (better throughput)

🎯 For Quality & Accuracy:

  • Content Creation: 8-bit model (potentially higher quality)
  • Complex Reasoning: 8-bit model (better for nuanced tasks)
  • Code Generation: 8-bit model (potentially more accurate)

💾 For Memory Constraints:

  • 16GB Macs: 4-bit model essential (11.3 GB vs 12.2 GB)
  • 32GB Macs: Both models work well
  • Memory Optimization: 4-bit model saves ~900MB

Performance Scaling Insights

🔥 Exceptional Apple Silicon Performance:

  • MLX framework delivers native optimization for M2/M3 chips
  • Unified Memory architecture fully utilized
  • Metal GPU acceleration delivers the speed boost
  • Quantization efficiency enables 20B model on consumer hardware

⚡ Real-World Benchmarks:

  • Prompt processing: 220+ tokens/sec (excellent)
  • Generation speed: 84-92 tokens/sec (strong for a 20B-parameter model on a laptop)
  • Memory efficiency: < 12 GB for 20B parameters (remarkable)
  • Responsiveness: < 100ms first token (interactive-feeling)

Future Optimization Directions

  • Metal Performance Shaders integration for further GPU acceleration
  • Improved Neural Engine utilization
  • Advanced quantization techniques (3-bit, mixed precision)
  • Context caching optimizations for repeated prompts
  • Speculative decoding for faster inference
  • Model parallelism to support larger contexts

📈 Summary Statistics

Performance Summary:

  • 4-bit Model: 91.5 tokens/sec generation, 11.3 GB memory
  • 8-bit Model: 84.2 tokens/sec generation, 12.2 GB memory
  • Winner: 4-bit model (8% faster, 17% more memory efficient)
  • Hardware: Apple M2 Max with 32GB unified memory
  • Framework: MLX 0.29.0 (optimized for Apple Silicon)

Key Achievements:

  • 🏆 Industry-leading performance on consumer hardware
  • 🏆 Memory efficiency enabling 20B model on laptops
  • 🏆 Real-time responsiveness with <100ms first token
  • 🏆 Native Apple Silicon optimization through MLX

Report generated by MLX Performance Benchmark Suite
Hardware: Apple M2 Max (12-core) | Framework: MLX 0.29.0 | Model: GPT-OSS-20B

Use with mlx

pip install mlx-lm

from mlx_lm import load, generate

# Download (on first use) and load the 8-bit MLX weights and tokenizer.
model, tokenizer = load("Jackrong/gpt-oss-20b-MLX-8bit")

prompt = "hello"

# Wrap the prompt in the model's chat template when one is defined.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

# Generate a completion; verbose=True prints the text and speed statistics.
response = generate(model, tokenizer, prompt=prompt, verbose=True)
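
For interactive use, recent mlx-lm releases also expose a streaming generator. A minimal sketch (the `.text` field on the yielded response objects is an assumption about newer mlx-lm versions):

from mlx_lm import load, stream_generate

model, tokenizer = load("Jackrong/gpt-oss-20b-MLX-8bit")
for response in stream_generate(model, tokenizer, prompt="hello", max_tokens=200):
    print(response.text, end="", flush=True)  # tokens print as they are generated
print()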

