---
title: AusCyberBench Evaluation Dashboard
emoji: 🛡️
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

# 🇦🇺 AusCyberBench Evaluation Dashboard

**Australia's First LLM Cybersecurity Benchmark**

An interactive dashboard for evaluating language models on Australian cybersecurity knowledge, regulations, and threat intelligence.

## 🆕 What's New (October 2025)

- **32 Tested Models** - Focus on proven, stable models for reliable evaluation
- **✅ Recommended Category** - 7 models with verified performance (DeepSeek 55%+, TinyLlama 33%+)
- **Enhanced Visuals** - Australian colour-coded charts with gold/green ranking system
- **Better Stability** - Removed experimental models that caused compatibility issues
- **Improved UI** - Quick-selection presets for recommended, security, and size-based filtering
- **Memory Optimised** - Better GPU management for HuggingFace Spaces

## About AusCyberBench

AusCyberBench is a comprehensive benchmark dataset containing **13,449 tasks** across six critical categories:

### 📋 Categories

- **🛡️ Regulatory: Essential Eight** (2,558 tasks)
  - ACSC's baseline cybersecurity mitigation strategies
  - Maturity levels 1-3 across 8 mitigation strategies
  - Application whitelisting, patching, MFA, backups, etc.
- **📜 Regulatory: ISM Controls** (7,200 tasks)
  - Information Security Manual control requirements
  - Commonwealth entity security obligations
  - Control effectiveness, implementation, and compliance
- **🔒 Regulatory: Privacy Act** (204 tasks)
  - Australian Privacy Principles (APPs)
  - Data protection and privacy obligations
  - Notifiable Data Breaches (NDB) scheme
- **⚡ Regulatory: SOCI Act** (240 tasks)
  - Security of Critical Infrastructure Act 2018
  - Critical infrastructure risk management
  - Sector-specific obligations
- **🎯 Knowledge: Threat Intelligence** (2,520 tasks)
  - ACSC threat reports and advisories
  - Australian threat landscape
  - Cyber incident response
- **📚 Knowledge: Terminology** (727 tasks)
  - Australian cybersecurity terminology
  - ACSC glossary and definitions
  - Industry-specific language

## Features

### 🤖 32 Pre-Configured Models (Tested & Stable)

Evaluate across diverse model categories with proven, reliable models:

#### ✅ Recommended (Tested) - 7 models

Models with verified performance on AusCyberBench:

- **Phi-3 & Phi-3.5** - Microsoft's efficient models (proven stable)
- **Gemma-2-2b** - Google's compact model (tested)
- **Qwen2.5** (3B, 7B) - Alibaba's reliable models (good performance)
- **DeepSeek LLM-7B** - Previously achieved **55.6% accuracy** ⭐
- **TinyLlama-1.1B** - Previously achieved **33.3% accuracy**

#### 🛡️ Cybersecurity-Focused - 5 models

- **DeepSeek Coder** - Code-focused with security awareness
- **WizardCoder-Python** - Advanced code understanding
- **StarCoder2** - BigCode's latest model
- **CodeLlama** - Meta's code specialist
- **CodeGen25** - Salesforce's code model

#### Small Models (1-4B) - 7 models

Phi-3 series, Gemma-2, Qwen2.5, Llama 3.2, StableLM, TinyLlama

#### Medium Models (7-12B) - 6 models

Mistral, Qwen2.5, Llama 3.1, Gemma-2-9b, Mistral-Nemo, Yi

#### Reasoning & Analysis - 4 models

DeepSeek LLM, SOLAR, Hermes-3, Qwen2.5-14B

#### Multilingual & Diverse - 3 models

Falcon, OpenChat, OpenHermes

### ⚡ Quick Selection Presets

- **✅ Recommended (7)** - Tested models with verified performance
- **🛡️ Security Focus (5)** - Code and cybersecurity specialists
- **Small/Medium** - Size-based selection (7/6 models)
- **Select All (32)** - Comprehensive evaluation
- **Clear All** - Reset selection

### 🎯 Customisable Evaluation

- **Sample size:** 10-500 tasks (default: 10, suited to testing multiple models)
- **⚠️ GPU limits:** The free tier has a 60-second timeout - test 1-2 models at a time for best results
- **4-bit quantisation:** Reduces memory usage for larger models
- **Temperature:** Controls response randomness (0.1-1.0)
- **Max tokens:** Limits response length (32-256)

### 📊 Real-Time Results

- Live leaderboard with rankings (🥇🥈🥉)
- Model comparison visualisation in Australian colours
- Per-category performance breakdown
- Downloadable results (JSON format)

## Usage

### 💾 Persistent Leaderboard

**NEW:** Results now persist across sessions, which solves the GPU timeout issue:

- Run models **one at a time** to avoid timeouts
- Each run merges with previous results
- The best score per model is kept automatically
- Build a comprehensive leaderboard incrementally
- Perfect for the 60-second free-tier limit

**Workflow:**

1. Select 1-2 models and run an evaluation
2. Results automatically save and merge into the leaderboard
3. Select different models and run again
4. The leaderboard updates with all results
5. Use the "Clear All Results" button to start fresh

### Standard Usage

1. **Select Models:** Use the checkboxes or quick-selection buttons
2. **Configure Settings:** Adjust sample size, quantisation, and temperature
3. **Run Evaluation:** Click "🚀 Run Evaluation"
4. **Monitor Progress:** Watch real-time progress and intermediate results
5. **Analyse Results:** Review the persistent leaderboard, charts, and category breakdowns
6. **Download:** Export results for further analysis

## Dataset

The benchmark is available on HuggingFace:

🔗 **[Zen0/AusCyberBench](https://huggingface.co/datasets/Zen0/AusCyberBench)**

### Dataset Splits

- **Full:** All 13,449 tasks across all categories
- **Australian:** 4,899 Australia-specific tasks

## Evaluation Methodology

### Prompt Formatting

Model-specific chat templates ensure optimal performance:

- **Phi-3/Phi-3.5:** `<|user|>...<|end|>\n<|assistant|>`
- **Gemma-2:** `<start_of_turn>user\n...<end_of_turn>\n<start_of_turn>model`
- **Generic (Llama, Mistral, Qwen, etc.):** `[INST] ... [/INST]`

### Answer Extraction

Robust extraction for multiple-choice tasks:

- Primary: regex pattern `\b([A-D])\b` matching
- Fallback: first-character validation
- Handles various response formats

### Memory Management

Automatic cleanup between models:

- Model and tokeniser deletion
- CUDA cache clearing
- Garbage collection
- Prevents OOM errors on GPU instances

## Performance Expectations

Based on verified benchmarking with tested models:

- **✅ Recommended Models:** 30-56% accuracy (**DeepSeek LLM: 55.6%**, **TinyLlama: 33.3%**)
- **Cybersecurity-Focused:** 20-40% accuracy (code models show domain understanding)
- **Small Models (1-4B):** 10-40% accuracy (Phi-3 and Qwen2.5 perform well)
- **Medium Models (7-12B):** 25-45% accuracy (Mistral and Llama 3.1 are strong performers)
- **Reasoning Models:** 30-50% accuracy (DeepSeek and SOLAR excel at complex tasks)

Performance varies significantly by category:

- **Essential Eight:** Higher scores (25-50%) - well-documented standards
- **ISM Controls:** Moderate scores (15-35%) - detailed technical requirements
- **Terminology:** Good scores (20-40%) - definition-based tasks
- **Threat Intelligence:** Variable (15-45%) - requires current knowledge
- **Privacy Act / SOCI Act:** Challenging (15-35%) - complex regulatory understanding

## Technical Requirements

This Space requires GPU hardware for model inference.
### ⚡ ZeroGPU Free Tier Limitations

**60-second timeout:** The free tier enforces a strict 60-second limit per evaluation session.

**Best practices:**

- ✅ **Test 1-2 models at a time** with 10 tasks each (~30-40 seconds total)
- ⚠️ **Avoid selecting 5+ models** in one run (it will time out midway)
- ✅ **Use 4-bit quantisation** for 7B+ models to speed up inference
- ✅ **Run separate evaluations** for thorough testing across many models

**Example timing:**

- 1 model × 10 tasks: ~15-25 seconds ✅
- 2 models × 10 tasks: ~30-50 seconds ✅
- 5 models × 10 tasks: ~75-125 seconds ❌ will time out

For comprehensive multi-model benchmarking, run evaluations sequentially rather than all at once.

## Citation

If you use AusCyberBench in your research, please cite:

```bibtex
@dataset{auscyberbench2025,
  title={AusCyberBench: Australia's First LLM Cybersecurity Benchmark},
  author={Zen0},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Zen0/AusCyberBench}
}
```

## License

MIT License - see the LICENSE file for details.

## Acknowledgements

- **Australian Cyber Security Centre (ACSC)** for the Essential Eight, ISM, and threat intelligence
- **Office of the Australian Information Commissioner (OAIC)** for Privacy Act guidance
- **Department of Home Affairs** for SOCI Act resources
- **HuggingFace** for infrastructure and model hosting

---

**Built with Australian orthography** 🇦🇺

*Visualise • Analyse • Optimise • Quantisation*