Zen0 commited on
Commit
7d0c82c
Β·
1 Parent(s): bd99e48

Initial deployment of AusCyberBench Evaluation Dashboard

Browse files

πŸ‡¦πŸ‡Ί Australia's First LLM Cybersecurity Benchmark

Features:
- Interactive Gradio dashboard for model evaluation
- 26 pre-configured models (small, medium, security-focused)
- Evaluates on 13,449 tasks across 6 categories
- Real-time progress tracking and leaderboard
- Australian orthography and color scheme
- Downloadable results (JSON format)

Categories:
- Regulatory: Essential Eight, ISM Controls, Privacy Act, SOCI Act
- Knowledge: Threat Intelligence, Terminology

Dataset: Zen0/AusCyberBench

Files changed (4) hide show
  1. .gitignore +36 -0
  2. README.md +182 -12
  3. app.py +503 -0
  4. requirements.txt +9 -0
.gitignore ADDED
@@ -0,0 +1,36 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+ *.so
6
+ .Python
7
+ env/
8
+ venv/
9
+ .venv
10
+
11
+ # Model cache
12
+ .cache/
13
+ models/
14
+ *.bin
15
+ *.safetensors
16
+
17
+ # Results and logs
18
+ *.json
19
+ *.log
20
+ *.csv
21
+
22
+ # HuggingFace cache
23
+ .huggingface/
24
+
25
+ # Jupyter
26
+ .ipynb_checkpoints/
27
+
28
+ # OS
29
+ .DS_Store
30
+ Thumbs.db
31
+
32
+ # IDE
33
+ .vscode/
34
+ .idea/
35
+ *.swp
36
+ *.swo
README.md CHANGED
@@ -1,12 +1,182 @@
1
- ---
2
- title: Auscyberbench Evaluator
3
- emoji: ⚑
4
- colorFrom: red
5
- colorTo: pink
6
- sdk: gradio
7
- sdk_version: 5.49.1
8
- app_file: app.py
9
- pinned: false
10
- ---
11
-
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: AusCyberBench Evaluation Dashboard
3
+ emoji: πŸ‡¦πŸ‡Ί
4
+ colorFrom: green
5
+ colorTo: yellow
6
+ sdk: gradio
7
+ sdk_version: 4.0.0
8
+ app_file: app.py
9
+ pinned: false
10
+ license: mit
11
+ ---
12
+
13
+ # πŸ‡¦πŸ‡Ί AusCyberBench Evaluation Dashboard
14
+
15
+ **Australia's First LLM Cybersecurity Benchmark**
16
+
17
+ An interactive dashboard for evaluating language models on Australian cybersecurity knowledge, regulations, and threat intelligence.
18
+
19
+ ## About AusCyberBench
20
+
21
+ AusCyberBench is a comprehensive benchmark dataset containing **13,449 tasks** across six critical categories:
22
+
23
+ ### πŸ“‹ Categories
24
+
25
+ - **πŸ›‘οΈ Regulatory: Essential Eight** (2,558 tasks)
26
+ - ACSC's baseline cybersecurity mitigation strategies
27
+ - Maturity levels 1-3 across 8 mitigation strategies
28
+ - Application whitelisting, patching, MFA, backups, etc.
29
+
30
+ - **πŸ“œ Regulatory: ISM Controls** (7,200 tasks)
31
+ - Information Security Manual control requirements
32
+ - Commonwealth entity security obligations
33
+ - Control effectiveness, implementation, and compliance
34
+
35
+ - **πŸ”’ Regulatory: Privacy Act** (204 tasks)
36
+ - Australian Privacy Principles (APPs)
37
+ - Data protection and privacy obligations
38
+ - Notifiable Data Breaches (NDB) scheme
39
+
40
+ - **⚑ Regulatory: SOCI Act** (240 tasks)
41
+ - Security of Critical Infrastructure Act 2018
42
+ - Critical infrastructure risk management
43
+ - Sector-specific obligations
44
+
45
+ - **🎯 Knowledge: Threat Intelligence** (2,520 tasks)
46
+ - ACSC threat reports and advisories
47
+ - Australian threat landscape
48
+ - Cyber incident response
49
+
50
+ - **πŸ“š Knowledge: Terminology** (727 tasks)
51
+ - Australian cybersecurity terminology
52
+ - ACSC glossary and definitions
53
+ - Industry-specific language
54
+
55
+ ## Features
56
+
57
+ ### πŸ€– 26 Pre-Configured Models
58
+
59
+ Evaluate across diverse model categories:
60
+
61
+ - **Small Models (1-4B):** Phi-3, Gemma-2, Qwen, Llama 3.2, StableLM, TinyLlama
62
+ - **Medium Models (7-12B):** Mistral, Llama 3.1, Gemma-2-9b, Qwen-7B
63
+ - **πŸ”’ Cybersecurity-Focused:** Foundationsec-8B, DeepSeek Coder, WizardCoder, StarCoder2, CodeLlama, CodeGen25
64
+ - **Reasoning & Analysis:** DeepSeek LLM, Yi, SOLAR, Hermes-3
65
+ - **Diverse & Multilingual:** Aya-23, Falcon, OpenChat, OpenHermes
66
+
67
+ ### ⚑ Quick Selection Presets
68
+
69
+ - Select all small models (7) for fast testing
70
+ - Select all security models (6) for cybersecurity focus
71
+ - Select all models (26) for comprehensive evaluation
72
+ - Clear selection with one click
73
+
74
+ ### 🎯 Customisable Evaluation
75
+
76
+ - **Sample size:** 10-500 tasks (default: 200)
77
+ - **4-bit quantisation:** Reduce memory usage for larger models
78
+ - **Temperature:** Control response randomness (0.1-1.0)
79
+ - **Max tokens:** Limit response length (32-256)
80
+
81
+ ### πŸ“Š Real-Time Results
82
+
83
+ - Live leaderboard with rankings (πŸ₯‡πŸ₯ˆπŸ₯‰)
84
+ - Model comparison visualisation in Australian colours
85
+ - Per-category performance breakdown
86
+ - Downloadable results (JSON format)
87
+
88
+ ## Usage
89
+
90
+ 1. **Select Models:** Use checkboxes or quick selection buttons
91
+ 2. **Configure Settings:** Adjust sample size, quantisation, temperature
92
+ 3. **Run Evaluation:** Click "πŸš€ Run Evaluation"
93
+ 4. **Monitor Progress:** Watch real-time progress and intermediate results
94
+ 5. **Analyse Results:** Review leaderboard, charts, and category breakdowns
95
+ 6. **Download:** Export results for further analysis
96
+
97
+ ## Dataset
98
+
99
+ The benchmark is available on HuggingFace:
100
+
101
+ πŸ”— **[Zen0/AusCyberBench](https://huggingface.co/datasets/Zen0/AusCyberBench)**
102
+
103
+ ### Dataset Splits
104
+
105
+ - **Full:** All 13,449 tasks across all categories
106
+ - **Australian:** 4,899 Australia-specific tasks
107
+
108
+ ## Evaluation Methodology
109
+
110
+ ### Prompt Formatting
111
+
112
+ Model-specific chat templates ensure optimal performance:
113
+ - **Phi-3/Phi-3.5:** `<|user|>...<|end|>\n<|assistant|>`
114
+ - **Gemma-2:** `<start_of_turn>user\n...<end_of_turn>\n<start_of_turn>model`
115
+ - **Generic (Llama, Mistral, Qwen, etc.):** `[INST] ... [/INST]`
116
+
117
+ ### Answer Extraction
118
+
119
+ Robust extraction for multiple-choice tasks:
120
+ - Primary: Regex pattern `\b([A-D])\b` matching
121
+ - Fallback: First character validation
122
+ - Handles various response formats
123
+
124
+ ### Memory Management
125
+
126
+ Automatic cleanup between models:
127
+ - Model and tokeniser deletion
128
+ - CUDA cache clearing
129
+ - Garbage collection
130
+ - Prevents OOM errors on GPU instances
131
+
132
+ ## Performance Expectations
133
+
134
+ Based on initial benchmarking:
135
+
136
+ - **Small Models (1-4B):** 10-25% accuracy
137
+ - **Medium Models (7-12B):** 15-30% accuracy
138
+ - **Cybersecurity Models:** 20-35% accuracy (domain-specific advantage)
139
+ - **Reasoning Models:** 25-40% accuracy
140
+
141
+ Performance varies significantly by category:
142
+ - **Essential Eight:** Higher scores (20-40%)
143
+ - **ISM Controls:** Lower scores (10-20%)
144
+ - **Terminology:** Moderate scores (15-30%)
145
+
146
+ ## Technical Requirements
147
+
148
+ This Space requires GPU hardware for model inference. Free-tier GPU instances may experience:
149
+ - Longer evaluation times
150
+ - Memory constraints with larger models
151
+ - 4-bit quantisation recommended for 7B+ models
152
+
153
+ ## Citation
154
+
155
+ If you use AusCyberBench in your research, please cite:
156
+
157
+ ```bibtex
158
+ @dataset{auscyberbench2025,
159
+ title={AusCyberBench: Australia's First LLM Cybersecurity Benchmark},
160
+ author={Zen0},
161
+ year={2025},
162
+ publisher={HuggingFace},
163
+ url={https://huggingface.co/datasets/Zen0/AusCyberBench}
164
+ }
165
+ ```
166
+
167
+ ## License
168
+
169
+ MIT License - See LICENSE file for details
170
+
171
+ ## Acknowledgements
172
+
173
+ - **Australian Cyber Security Centre (ACSC)** for Essential Eight, ISM, and threat intelligence
174
+ - **Office of the Australian Information Commissioner (OAIC)** for Privacy Act guidance
175
+ - **Department of Home Affairs** for SOCI Act resources
176
+ - **HuggingFace** for infrastructure and model hosting
177
+
178
+ ---
179
+
180
+ **Built with Australian orthography** πŸ‡¦πŸ‡Ί
181
+
182
+ *Visualise β€’ Analyse β€’ Optimise β€’ Quantisation*
app.py ADDED
@@ -0,0 +1,503 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ AusCyberBench Evaluation Dashboard
4
+ Interactive Gradio Space for benchmarking LLMs on Australian cybersecurity knowledge
5
+ """
6
+
7
+ import gradio as gr
8
+ import torch
9
+ import gc
10
+ import json
11
+ import re
12
+ import time
13
+ import pandas as pd
14
+ import matplotlib.pyplot as plt
15
+ import seaborn as sns
16
+ from pathlib import Path
17
+ from collections import defaultdict
18
+ from datasets import load_dataset
19
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
20
+ import numpy as np
21
+
22
+ # Australian color scheme
23
+ AUSSIE_GREEN = '#008751'
24
+ AUSSIE_GOLD = '#FFB81C'
25
+
26
+ # Model categories with all 26 models
27
+ MODELS_BY_CATEGORY = {
28
+ "Small Models (1-4B)": [
29
+ "microsoft/Phi-3-mini-4k-instruct",
30
+ "microsoft/Phi-3.5-mini-instruct",
31
+ "google/gemma-2-2b-it",
32
+ "Qwen/Qwen2.5-3B-Instruct",
33
+ "meta-llama/Llama-3.2-3B-Instruct",
34
+ "stabilityai/stablelm-2-1_6b-chat",
35
+ "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
36
+ ],
37
+ "Medium Models (7-12B)": [
38
+ "mistralai/Mistral-7B-Instruct-v0.3",
39
+ "Qwen/Qwen2.5-7B-Instruct",
40
+ "meta-llama/Llama-3.1-8B-Instruct",
41
+ "google/gemma-2-9b-it",
42
+ "mistralai/Mistral-Nemo-Instruct-2407",
43
+ ],
44
+ "πŸ”’ Cybersecurity-Focused": [
45
+ "Eldorado-AI/Foundationsec-8B",
46
+ "deepseek-ai/deepseek-coder-6.7b-instruct",
47
+ "WizardLM/WizardCoder-Python-7B-V1.0",
48
+ "bigcode/starcoder2-7b",
49
+ "meta-llama/CodeLlama-7b-Instruct-hf",
50
+ "Salesforce/codegen25-7b-instruct",
51
+ ],
52
+ "Reasoning & Analysis": [
53
+ "deepseek-ai/deepseek-llm-7b-chat",
54
+ "01-ai/Yi-1.5-9B-Chat",
55
+ "upstage/SOLAR-10.7B-Instruct-v1.0",
56
+ "NousResearch/Hermes-3-Llama-3.1-8B",
57
+ ],
58
+ "Diverse & Multilingual": [
59
+ "CohereForAI/aya-23-8B",
60
+ "tiiuae/falcon-7b-instruct",
61
+ "openchat/openchat-3.5-0106",
62
+ "teknium/OpenHermes-2.5-Mistral-7B",
63
+ ],
64
+ }
65
+
66
+ # Flatten all models
67
+ ALL_MODELS = [model for category in MODELS_BY_CATEGORY.values() for model in category]
68
+
69
+ # Global state
70
+ current_results = []
71
+ dataset_cache = None
72
+
73
+
74
+ def load_benchmark_dataset(subset="australian", num_samples=200):
75
+ """Load and sample AusCyberBench dataset"""
76
+ global dataset_cache
77
+
78
+ if dataset_cache is None:
79
+ dataset_cache = load_dataset("Zen0/AusCyberBench", split=subset)
80
+
81
+ # Proportional sampling
82
+ import random
83
+ random.seed(42)
84
+
85
+ by_category = defaultdict(list)
86
+ for item in dataset_cache:
87
+ by_category[item['category']].append(item)
88
+
89
+ total = len(dataset_cache)
90
+ samples = []
91
+
92
+ for cat, items in by_category.items():
93
+ n_cat = max(1, int(len(items) / total * num_samples))
94
+ if len(items) <= n_cat:
95
+ samples.extend(items)
96
+ else:
97
+ samples.extend(random.sample(items, n_cat))
98
+
99
+ random.shuffle(samples)
100
+ return samples[:num_samples]
101
+
102
+
103
+ def format_prompt(task, model_name):
104
+ """Format task as prompt with proper chat template"""
105
+ question = task['description']
106
+
107
+ if task.get('task_type') == 'multiple_choice' and 'options' in task:
108
+ options_text = "\n".join([f"{opt['id']}. {opt['text']}" for opt in task['options']])
109
+
110
+ if 'phi' in model_name.lower():
111
+ return f"""<|user|>
112
+ {question}
113
+
114
+ {options_text}
115
+
116
+ Respond with ONLY the letter of the correct answer (A, B, C, or D).<|end|>
117
+ <|assistant|>"""
118
+ elif 'gemma' in model_name.lower():
119
+ return f"""<start_of_turn>user
120
+ {question}
121
+
122
+ {options_text}
123
+
124
+ Respond with ONLY the letter of the correct answer (A, B, C, or D).<end_of_turn>
125
+ <start_of_turn>model
126
+ """
127
+ else:
128
+ return f"""[INST] {question}
129
+
130
+ {options_text}
131
+
132
+ Respond with ONLY the letter of the correct answer (A, B, C, or D). [/INST]"""
133
+ else:
134
+ return f"""[INST] {question} [/INST]"""
135
+
136
+
137
+ def extract_answer(response, task):
138
+ """Extract answer letter from model response"""
139
+ response = response.strip()
140
+
141
+ if task.get('task_type') == 'multiple_choice':
142
+ match = re.search(r'\b([A-D])\b', response, re.IGNORECASE)
143
+ if match:
144
+ return match.group(1).upper()
145
+ if response and response[0].upper() in ['A', 'B', 'C', 'D']:
146
+ return response[0].upper()
147
+ return ""
148
+ else:
149
+ return response[:100]
150
+
151
+
152
+ def cleanup_model(model, tokenizer):
153
+ """Thoroughly clean up model to free memory"""
154
+ if model is not None:
155
+ del model
156
+ if tokenizer is not None:
157
+ del tokenizer
158
+
159
+ if torch.cuda.is_available():
160
+ torch.cuda.empty_cache()
161
+ torch.cuda.ipc_collect()
162
+
163
+ gc.collect()
164
+
165
+
166
+ def evaluate_single_model(model_name, tasks, use_4bit=True, temperature=0.7, max_tokens=128, progress=gr.Progress()):
167
+ """Evaluate a single model on the benchmark"""
168
+ progress(0, desc=f"Loading {model_name.split('/')[-1]}...")
169
+
170
+ try:
171
+ # Load model
172
+ if use_4bit:
173
+ quant_config = BitsAndBytesConfig(
174
+ load_in_4bit=True,
175
+ bnb_4bit_compute_dtype=torch.float16,
176
+ bnb_4bit_use_double_quant=True,
177
+ bnb_4bit_quant_type="nf4"
178
+ )
179
+ else:
180
+ quant_config = None
181
+
182
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
183
+ model = AutoModelForCausalLM.from_pretrained(
184
+ model_name,
185
+ quantization_config=quant_config,
186
+ device_map="auto",
187
+ trust_remote_code=True,
188
+ torch_dtype=torch.float16 if not use_4bit else None
189
+ )
190
+
191
+ if tokenizer.pad_token is None:
192
+ tokenizer.pad_token = tokenizer.eos_token
193
+
194
+ progress(0.1, desc=f"Evaluating {model_name.split('/')[-1]}...")
195
+
196
+ # Evaluate tasks
197
+ results = []
198
+ for i, task in enumerate(tasks):
199
+ progress((0.1 + 0.8 * i / len(tasks)), desc=f"Task {i+1}/{len(tasks)}")
200
+
201
+ try:
202
+ prompt = format_prompt(task, model_name)
203
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
204
+
205
+ if 'token_type_ids' in inputs:
206
+ inputs.pop('token_type_ids')
207
+
208
+ with torch.no_grad():
209
+ outputs = model.generate(
210
+ **inputs,
211
+ max_new_tokens=max_tokens,
212
+ temperature=temperature,
213
+ do_sample=True,
214
+ top_p=0.9,
215
+ pad_token_id=tokenizer.eos_token_id
216
+ )
217
+
218
+ response = tokenizer.decode(
219
+ outputs[0][inputs['input_ids'].shape[1]:],
220
+ skip_special_tokens=True
221
+ )
222
+
223
+ predicted = extract_answer(response, task)
224
+ correct = task.get('answer', '')
225
+ is_correct = predicted.upper() == correct.upper()
226
+
227
+ results.append({
228
+ 'task_id': task.get('task_id'),
229
+ 'category': task.get('category'),
230
+ 'predicted': predicted,
231
+ 'correct': correct,
232
+ 'is_correct': is_correct
233
+ })
234
+
235
+ except Exception as e:
236
+ results.append({
237
+ 'task_id': task.get('task_id'),
238
+ 'category': task.get('category'),
239
+ 'predicted': '',
240
+ 'correct': task.get('answer', ''),
241
+ 'is_correct': False
242
+ })
243
+
244
+ # Calculate metrics
245
+ total_correct = sum(1 for r in results if r['is_correct'])
246
+ overall_accuracy = (total_correct / len(results)) * 100
247
+
248
+ category_stats = defaultdict(lambda: {'correct': 0, 'total': 0})
249
+ for result in results:
250
+ cat = result['category']
251
+ category_stats[cat]['total'] += 1
252
+ if result['is_correct']:
253
+ category_stats[cat]['correct'] += 1
254
+
255
+ category_scores = {
256
+ cat: (stats['correct'] / stats['total']) * 100 if stats['total'] > 0 else 0
257
+ for cat, stats in category_stats.items()
258
+ }
259
+
260
+ progress(1.0, desc="Complete!")
261
+
262
+ return {
263
+ 'model': model_name,
264
+ 'overall_accuracy': overall_accuracy,
265
+ 'total_correct': total_correct,
266
+ 'total_tasks': len(results),
267
+ 'category_scores': category_scores,
268
+ 'detailed_results': results
269
+ }
270
+
271
+ except Exception as e:
272
+ return {
273
+ 'model': model_name,
274
+ 'error': str(e),
275
+ 'overall_accuracy': 0,
276
+ 'total_correct': 0,
277
+ 'total_tasks': len(tasks)
278
+ }
279
+
280
+ finally:
281
+ cleanup_model(
282
+ model if 'model' in locals() else None,
283
+ tokenizer if 'tokenizer' in locals() else None
284
+ )
285
+
286
+
287
+ def run_evaluation(selected_models, num_samples, use_4bit, temperature, max_tokens, progress=gr.Progress()):
288
+ """Run evaluation on selected models"""
289
+ global current_results
290
+
291
+ if not selected_models:
292
+ return "Please select at least one model to evaluate.", None, None
293
+
294
+ # Load dataset
295
+ progress(0, desc="Loading AusCyberBench dataset...")
296
+ tasks = load_benchmark_dataset(num_samples=num_samples)
297
+
298
+ # Evaluate each model
299
+ current_results = []
300
+ for i, model_name in enumerate(selected_models):
301
+ progress((i / len(selected_models)), desc=f"Model {i+1}/{len(selected_models)}")
302
+
303
+ result = evaluate_single_model(
304
+ model_name, tasks, use_4bit, temperature, max_tokens, progress
305
+ )
306
+ current_results.append(result)
307
+
308
+ # Yield intermediate results
309
+ yield format_results_table(current_results), create_comparison_chart(current_results), None
310
+
311
+ # Final results
312
+ final_table = format_results_table(current_results)
313
+ final_chart = create_comparison_chart(current_results)
314
+ download_data = create_download_data(current_results)
315
+
316
+ yield final_table, final_chart, download_data
317
+
318
+
319
+ def format_results_table(results):
320
+ """Format results as DataFrame for display"""
321
+ if not results:
322
+ return pd.DataFrame()
323
+
324
+ rows = []
325
+ for result in results:
326
+ if 'error' in result:
327
+ rows.append({
328
+ 'Rank': '❌',
329
+ 'Model': result['model'].split('/')[-1],
330
+ 'Accuracy': '0.0%',
331
+ 'Correct/Total': f"0/{result['total_tasks']}",
332
+ 'Status': f"Error: {result['error'][:50]}"
333
+ })
334
+ else:
335
+ rows.append({
336
+ 'Rank': '',
337
+ 'Model': result['model'].split('/')[-1],
338
+ 'Accuracy': f"{result['overall_accuracy']:.1f}%",
339
+ 'Correct/Total': f"{result['total_correct']}/{result['total_tasks']}",
340
+ 'Status': 'βœ“ Complete'
341
+ })
342
+
343
+ df = pd.DataFrame(rows)
344
+
345
+ # Sort by accuracy and assign ranks
346
+ df['_sort'] = df['Accuracy'].str.replace('%', '').astype(float)
347
+ df = df.sort_values('_sort', ascending=False)
348
+ df['Rank'] = ['πŸ₯‡', 'πŸ₯ˆ', 'πŸ₯‰'] + [''] * (len(df) - 3)
349
+ df = df.drop('_sort', axis=1)
350
+
351
+ return df
352
+
353
+
354
+ def create_comparison_chart(results):
355
+ """Create bar chart comparing model accuracies"""
356
+ if not results or all('error' in r for r in results):
357
+ return None
358
+
359
+ valid_results = [r for r in results if 'error' not in r]
360
+ if not valid_results:
361
+ return None
362
+
363
+ models = [r['model'].split('/')[-1] for r in valid_results]
364
+ accuracies = [r['overall_accuracy'] for r in valid_results]
365
+
366
+ # Sort by accuracy
367
+ sorted_pairs = sorted(zip(models, accuracies), key=lambda x: x[1], reverse=True)
368
+ models, accuracies = zip(*sorted_pairs)
369
+
370
+ plt.figure(figsize=(12, max(6, len(models) * 0.4)))
371
+ bars = plt.barh(models, accuracies, color=AUSSIE_GREEN)
372
+
373
+ # Add accuracy labels
374
+ for i, (model, acc) in enumerate(zip(models, accuracies)):
375
+ plt.text(acc + 1, i, f'{acc:.1f}%', va='center', fontweight='bold')
376
+
377
+ plt.xlabel('Accuracy (%)', fontsize=12, fontweight='bold')
378
+ plt.title('AusCyberBench: Model Comparison', fontsize=14, fontweight='bold')
379
+ plt.xlim(0, 100)
380
+ plt.grid(axis='x', alpha=0.3)
381
+ plt.tight_layout()
382
+
383
+ return plt
384
+
385
+
386
+ def create_download_data(results):
387
+ """Create downloadable results file"""
388
+ if not results:
389
+ return None
390
+
391
+ # Create comprehensive results JSON
392
+ output = {
393
+ 'timestamp': time.strftime('%Y-%m-%d %H:%M:%S'),
394
+ 'benchmark': 'AusCyberBench',
395
+ 'results': results
396
+ }
397
+
398
+ # Save to file
399
+ output_path = 'auscyberbench_results.json'
400
+ with open(output_path, 'w') as f:
401
+ json.dump(output, f, indent=2)
402
+
403
+ return output_path
404
+
405
+
406
+ # Build Gradio interface
407
+ with gr.Blocks(title="AusCyberBench Evaluation Dashboard", theme=gr.themes.Soft()) as app:
408
+ gr.Markdown("""
409
+ # πŸ‡¦πŸ‡Ί AusCyberBench Evaluation Dashboard
410
+
411
+ **Australia's First LLM Cybersecurity Benchmark**
412
+
413
+ Test multiple language models on Australian cybersecurity knowledge including Essential Eight,
414
+ ISM Controls, Privacy Act, SOCI Act, and ACSC Threat Intelligence.
415
+ """)
416
+
417
+ with gr.Row():
418
+ with gr.Column(scale=1):
419
+ gr.Markdown("### πŸ“‹ Model Selection")
420
+
421
+ # Quick selection buttons
422
+ with gr.Row():
423
+ btn_small = gr.Button("Select Small Models (7)", size="sm")
424
+ btn_security = gr.Button("Select Security Models (6)", size="sm")
425
+ btn_all = gr.Button("Select All (26)", size="sm")
426
+ btn_clear = gr.Button("Clear", size="sm")
427
+
428
+ # Model checkboxes by category
429
+ model_checkboxes = []
430
+ for category, models in MODELS_BY_CATEGORY.items():
431
+ gr.Markdown(f"**{category}**")
432
+ for model in models:
433
+ short_name = model.split('/')[-1]
434
+ cb = gr.Checkbox(label=f"{short_name}", value=False)
435
+ model_checkboxes.append((cb, model))
436
+
437
+ gr.Markdown("### βš™οΈ Settings")
438
+ num_samples = gr.Slider(10, 500, value=200, step=10, label="Number of Tasks")
439
+ use_4bit = gr.Checkbox(label="Use 4-bit Quantisation", value=True)
440
+ temperature = gr.Slider(0.1, 1.0, value=0.7, step=0.1, label="Temperature")
441
+ max_tokens = gr.Slider(32, 256, value=128, step=32, label="Max Tokens")
442
+
443
+ run_btn = gr.Button("πŸš€ Run Evaluation", variant="primary", size="lg")
444
+
445
+ with gr.Column(scale=2):
446
+ gr.Markdown("### πŸ“Š Results")
447
+
448
+ results_table = gr.Dataframe(
449
+ label="Leaderboard",
450
+ headers=["Rank", "Model", "Accuracy", "Correct/Total", "Status"],
451
+ interactive=False
452
+ )
453
+
454
+ comparison_plot = gr.Plot(label="Model Comparison")
455
+
456
+ download_file = gr.File(label="Download Results (JSON)")
457
+
458
+ # Quick select button actions
459
+ def select_small():
460
+ return [gr.update(value=(model in MODELS_BY_CATEGORY["Small Models (1-4B)"]))
461
+ for cb, model in model_checkboxes]
462
+
463
+ def select_security():
464
+ return [gr.update(value=(model in MODELS_BY_CATEGORY["πŸ”’ Cybersecurity-Focused"]))
465
+ for cb, model in model_checkboxes]
466
+
467
+ def select_all():
468
+ return [gr.update(value=True) for _ in model_checkboxes]
469
+
470
+ def clear_all():
471
+ return [gr.update(value=False) for _ in model_checkboxes]
472
+
473
+ btn_small.click(select_small, outputs=[cb for cb, _ in model_checkboxes])
474
+ btn_security.click(select_security, outputs=[cb for cb, _ in model_checkboxes])
475
+ btn_all.click(select_all, outputs=[cb for cb, _ in model_checkboxes])
476
+ btn_clear.click(clear_all, outputs=[cb for cb, _ in model_checkboxes])
477
+
478
+ # Run evaluation
479
+ def prepare_evaluation(*checkbox_values):
480
+ selected = [model for (cb, model), val in zip(model_checkboxes, checkbox_values) if val]
481
+ return selected
482
+
483
+ run_btn.click(
484
+ fn=lambda *args: run_evaluation(
485
+ prepare_evaluation(*args[:-4]),
486
+ int(args[-4]),
487
+ args[-3],
488
+ args[-2],
489
+ int(args[-1])
490
+ ),
491
+ inputs=[cb for cb, _ in model_checkboxes] + [num_samples, use_4bit, temperature, max_tokens],
492
+ outputs=[results_table, comparison_plot, download_file]
493
+ )
494
+
495
+ gr.Markdown("""
496
+ ---
497
+ **Dataset:** [Zen0/AusCyberBench](https://huggingface.co/datasets/Zen0/AusCyberBench) |
498
+ **License:** Apache 2.0 |
499
+ **Models:** 26 LLMs including security-focused variants
500
+ """)
501
+
502
+ if __name__ == "__main__":
503
+ app.queue().launch()
requirements.txt ADDED
@@ -0,0 +1,9 @@
 
 
 
 
 
 
 
 
 
 
1
+ gradio>=4.0.0
2
+ transformers>=4.40.0
3
+ torch>=2.0.0
4
+ accelerate>=0.27.0
5
+ bitsandbytes>=0.43.0
6
+ datasets>=2.18.0
7
+ pandas>=2.0.0
8
+ matplotlib>=3.7.0
9
+ seaborn>=0.13.0