# Mass Evaluations
Simple benchmark tool for running predefined prompts through all checkpoints of a model.
## Usage
```bash
python benchmark.py [model_name] [options]
```
## Examples
```bash
# Benchmark all checkpoints of a model
python benchmark.py pico-decoder-tiny-dolma5M-v1
# Specify custom output directory
python benchmark.py pico-decoder-tiny-dolma5M-v1 --output my_results/
# Use custom prompts file
python benchmark.py pico-decoder-tiny-dolma5M-v1 --prompts my_prompts.json
```
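The command-line interface shown above could be parsed along these lines. This is a minimal sketch, not the actual contents of `benchmark.py`: the function name `build_parser` is hypothetical, the defaults are inferred from the `results/` directory and `prompts.json` file mentioned in this README, and the behaviour when `model_name` is omitted is not specified here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the usage line (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="benchmark.py",
        description="Run predefined prompts through all checkpoints of a model.",
    )
    # The usage line shows [model_name] in brackets, so it is treated as optional here.
    parser.add_argument("model_name", nargs="?", default=None,
                        help="model whose checkpoints should be benchmarked")
    # Assumed default: the results/ directory described in the Output section.
    parser.add_argument("--output", default="results/",
                        help="directory for the Markdown reports")
    # Assumed default: the prompts.json file described under Managing Prompts.
    parser.add_argument("--prompts", default="prompts.json",
                        help="JSON file containing an array of prompt strings")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["pico-decoder-tiny-dolma5M-v1", "--output", "my_results/"]
)
print(args.model_name, args.output, args.prompts)
```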
## Managing Prompts
Prompts are stored in `prompts.json` as a simple array of strings:
```json
[
"Hello, how are you?",
"Complete this story: Once upon a time",
"What is the capital of France?"
]
```
### Adding New Prompts
To add a prompt, edit `prompts.json` and append a new string to the array.
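After editing the file by hand, it can help to confirm it is still a valid array of strings. The following is a small sketch of such a check; `load_prompts` is a hypothetical helper and is not necessarily how `benchmark.py` reads the file.

```python
import json
from pathlib import Path


def load_prompts(path="prompts.json"):
    """Load prompts and verify the file is a JSON array of non-empty strings."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a JSON array, got {type(data).__name__}")
    for index, item in enumerate(data):
        if not isinstance(item, str) or not item.strip():
            raise ValueError(f"{path}: entry {index} is not a non-empty string")
    return data


if __name__ == "__main__":
    prompts_file = Path("prompts.json")
    if prompts_file.exists():
        print(f"{len(load_prompts(prompts_file))} prompts loaded")
```

A trailing comma after the last string is a common cause of failure, since JSON does not allow it; `json.loads` raises `json.JSONDecodeError` in that case.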
## Features
- **Auto-discovery**: Finds all `step_*` checkpoints automatically
- **JSON-based prompts**: Easily customizable prompts via JSON file
- **Readable output**: Markdown reports with clear structure
- **Error handling**: Continues on failures, logs errors
- **Progress tracking**: Shows real-time progress
- **Metadata logging**: Includes generation time and parameters
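The auto-discovery, error-handling, progress, and timing features listed above can be illustrated with a short sketch. It rests on assumptions that this README does not confirm: checkpoints are taken to be local subdirectories named `step_<number>`, and `generate` is a placeholder callable standing in for whatever the tool uses to produce text. The names `discover_checkpoints` and `run_all` are hypothetical.

```python
import re
import time
from pathlib import Path

# Assumption: checkpoint directories look like step_0, step_1000, ...
STEP_PATTERN = re.compile(r"^step_(\d+)$")


def discover_checkpoints(model_dir):
    """Return step_* subdirectories of model_dir, ordered by step number."""
    found = []
    for child in Path(model_dir).iterdir():
        match = STEP_PATTERN.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    # Sort numerically so that step_2 comes before step_10.
    return [path for _, path in sorted(found)]


def run_all(checkpoints, prompts, generate):
    """Call generate(checkpoint, prompt) for every pair and continue on errors."""
    records = []
    total = len(checkpoints) * len(prompts)
    for i, checkpoint in enumerate(checkpoints):
        for j, prompt in enumerate(prompts):
            done = i * len(prompts) + j + 1
            record = {"checkpoint": str(checkpoint), "prompt": prompt}
            start = time.perf_counter()
            try:
                record["output"] = generate(checkpoint, prompt)
            except Exception as exc:
                # Record the failure and move on to the next prompt.
                record["error"] = repr(exc)
            record["seconds"] = time.perf_counter() - start
            records.append(record)
            status = "failed" if "error" in record else "ok"
            print(f"[{done}/{total}] {checkpoint}: {status}")
    return records
```

Sorting by the parsed integer rather than by name keeps the report in training order, which a plain alphabetical sort would not do once step numbers differ in length.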
## Output
Results are saved as Markdown files in the `results/` directory, or in the directory given with `--output`:
```
results/
├── pico-decoder-tiny-dolma5M-v1_benchmark_20250101_120000.md
├── pico-decoder-tiny-dolma29k-v3_benchmark_20250101_130000.md
└── ...
```
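The file names in the listing appear to follow the pattern `<model_name>_benchmark_<YYYYMMDD>_<HHMMSS>.md`. A sketch of how such a path could be built is given below; the pattern is inferred from the example names only, and `report_path` is a hypothetical helper.

```python
from datetime import datetime
from pathlib import Path


def report_path(model_name, output_dir="results/", now=None):
    """Build a timestamped report path matching the names in the listing."""
    if now is None:
        now = datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"{model_name}_benchmark_{stamp}.md"


# A fixed timestamp makes the result reproducible for this example.
path = report_path("pico-decoder-tiny-dolma5M-v1", now=datetime(2025, 1, 1, 12, 0, 0))
print(path.name)
```

Including the timestamp means repeated runs for the same model do not overwrite earlier reports.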
## Predefined Prompts
1. "Hello, how are you?" (conversational)
2. "Complete this story: Once upon a time" (creative)
3. "Explain quantum physics in simple terms" (explanatory)
4. "Write a haiku about coding" (creative + structured)
5. "What is the capital of France?" (factual)
6. "The meaning of life is" (philosophical)
7. "In the year 2050," (futuristic)
8. "Python programming is" (technical)