# Mass Evaluations

A simple benchmark tool that runs a set of predefined prompts through every checkpoint of a model.

## Usage

```bash
python benchmark.py [model_name] [options]
```

## Examples

```bash
# Benchmark all checkpoints of a model
python benchmark.py pico-decoder-tiny-dolma5M-v1

# Specify custom output directory
python benchmark.py pico-decoder-tiny-dolma5M-v1 --output my_results/

# Use custom prompts file
python benchmark.py pico-decoder-tiny-dolma5M-v1 --prompts my_prompts.json
```
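
The options used in the examples above could be parsed roughly as follows. This is a minimal sketch, not the actual contents of `benchmark.py`: the `build_parser` helper is hypothetical, `model_name` is assumed to be a required positional argument, and the defaults for `--output` and `--prompts` are assumed from the paths mentioned elsewhere in this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the examples."""
    parser = argparse.ArgumentParser(
        description="Run predefined prompts through all checkpoints of a model."
    )
    # Assumption: the model name is a required positional argument.
    parser.add_argument("model_name", help="Name of the model to benchmark")
    # Assumption: defaults match the paths mentioned in this README.
    parser.add_argument("--output", default="results/", help="Output directory")
    parser.add_argument("--prompts", default="prompts.json", help="Prompts file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["pico-decoder-tiny-dolma5M-v1", "--output", "my_results/"]
    )
    print(args.model_name, args.output, args.prompts)
```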

## Managing Prompts

Prompts are stored in `prompts.json` as a simple array of strings:

```json
[
  "Hello, how are you?",
  "Complete this story: Once upon a time",
  "What is the capital of France?"
]
```
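
Because the file is a plain JSON array, loading it needs little more than `json.loads` plus a shape check. The sketch below is an illustration only; the `parse_prompts` and `load_prompts` helpers are hypothetical names and may not match how `benchmark.py` reads the file.

```python
import json
from pathlib import Path


def parse_prompts(text: str) -> list[str]:
    """Parse JSON text and check that it is an array of strings."""
    data = json.loads(text)
    if not isinstance(data, list) or not all(isinstance(p, str) for p in data):
        raise ValueError("prompts file must contain a JSON array of strings")
    return data


def load_prompts(path: str = "prompts.json") -> list[str]:
    """Read the prompts file from disk (hypothetical helper)."""
    return parse_prompts(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    sample = '["Hello, how are you?", "What is the capital of France?"]'
    print(parse_prompts(sample))
```

Validating the shape up front means a malformed file fails once, before any checkpoint is loaded.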

### Adding New Prompts

Edit `prompts.json` and add new prompt strings to the array. The file must remain a valid JSON array of strings.

## Features

- **Auto-discovery**: Finds all `step_*` checkpoints automatically
- **JSON-based prompts**: Easily customizable prompts via JSON file
- **Readable output**: Markdown reports with clear structure
- **Error handling**: Continues on failures, logs errors
- **Progress tracking**: Shows real-time progress
- **Metadata logging**: Includes generation time and parameters
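
As a rough illustration of the auto-discovery, progress, and error-handling behaviour listed above, the sketch below assumes checkpoints are local subdirectories named `step_<integer>`; the real tool may locate them differently. The `sort_checkpoints`, `discover_checkpoints`, and `run_all` helpers and the `generate` callback are hypothetical names introduced for this example.

```python
import re
from pathlib import Path
from typing import Callable, Iterable

# Assumption: checkpoint names look like "step_0", "step_1000", ...
STEP_PATTERN = re.compile(r"^step_(\d+)$")


def sort_checkpoints(names: Iterable[str]) -> list[str]:
    """Keep only step_* names and order them by numeric step."""
    steps = []
    for name in names:
        match = STEP_PATTERN.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


def discover_checkpoints(model_dir: str) -> list[str]:
    """List step_* subdirectories of a local model directory."""
    return sort_checkpoints(p.name for p in Path(model_dir).iterdir() if p.is_dir())


def run_all(checkpoints: list[str], generate: Callable[[str], str]):
    """Run `generate` on each checkpoint, recording failures and continuing."""
    results, errors = {}, {}
    for i, ckpt in enumerate(checkpoints, start=1):
        print(f"[{i}/{len(checkpoints)}] {ckpt}")  # simple progress line
        try:
            results[ckpt] = generate(ckpt)
        except Exception as exc:  # log the error and move on
            errors[ckpt] = str(exc)
    return results, errors
```

Sorting by the numeric step rather than by string keeps `step_200` ahead of `step_1000`.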

## Output

Results are saved as Markdown files in the `results/` directory:

```
results/
├── pico-decoder-tiny-dolma5M-v1_benchmark_20250101_120000.md
├── pico-decoder-tiny-dolma29k-v3_benchmark_20250101_130000.md
└── ...
```
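
The file names above appear to follow the pattern `<model_name>_benchmark_<YYYYMMDD>_<HHMMSS>.md`. A minimal sketch of building such a path, assuming that pattern and a hypothetical `report_path` helper (not necessarily what `benchmark.py` does):

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def report_path(
    model_name: str,
    output_dir: str = "results/",
    now: Optional[datetime] = None,
) -> Path:
    """Build a timestamped report path for one benchmark run."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")  # e.g. 20250101_120000
    return Path(output_dir) / f"{model_name}_benchmark_{stamp}.md"


if __name__ == "__main__":
    path = report_path(
        "pico-decoder-tiny-dolma5M-v1", now=datetime(2025, 1, 1, 12, 0, 0)
    )
    print(path.as_posix())
    # results/pico-decoder-tiny-dolma5M-v1_benchmark_20250101_120000.md
```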

## Predefined Prompts

1. "Hello, how are you?" (conversational)
2. "Complete this story: Once upon a time" (creative)
3. "Explain quantum physics in simple terms" (explanatory)
4. "Write a haiku about coding" (creative + structured)
5. "What is the capital of France?" (factual)
6. "The meaning of life is" (philosophical)
7. "In the year 2050," (futuristic)
8. "Python programming is" (technical)