# Mass Evaluations

A simple benchmark tool for running predefined prompts through all checkpoints of a model.

## Usage

```bash
python benchmark.py [model_name] [options]
```

## Examples

```bash
# Benchmark all checkpoints of a model
python benchmark.py pico-decoder-tiny-dolma5M-v1

# Specify a custom output directory
python benchmark.py pico-decoder-tiny-dolma5M-v1 --output my_results/

# Use a custom prompts file
python benchmark.py pico-decoder-tiny-dolma5M-v1 --prompts my_prompts.json
```

## Managing Prompts

Prompts are stored in `prompts.json` as a simple array of strings:

```json
[
  "Hello, how are you?",
  "Complete this story: Once upon a time",
  "What is the capital of France?"
]
```

### Adding New Prompts

Edit `prompts.json` and add new prompt strings to the array.

## Features

- **Auto-discovery**: Finds all `step_*` checkpoints automatically (a minimal sketch is included at the end of this README)
- **JSON-based prompts**: Easily customizable prompts via a JSON file
- **Readable output**: Markdown reports with a clear structure
- **Error handling**: Continues past failures and logs errors
- **Progress tracking**: Shows real-time progress
- **Metadata logging**: Includes generation time and parameters

## Output

Results are saved as Markdown files in the `results/` directory:

```
results/
├── pico-decoder-tiny-dolma5M-v1_benchmark_20250101_120000.md
├── pico-decoder-tiny-dolma29k-v3_benchmark_20250101_130000.md
└── ...
```

## Predefined Prompts

1. "Hello, how are you?" (conversational)
2. "Complete this story: Once upon a time" (creative)
3. "Explain quantum physics in simple terms" (explanatory)
4. "Write a haiku about coding" (creative + structured)
5. "What is the capital of France?" (factual)
6. "The meaning of life is" (philosophical)
7. "In the year 2050," (futuristic)
8. "Python programming is" (technical)
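
## Checkpoint Discovery (Illustrative Sketch)

For reference, here is a minimal sketch of how the `step_*` auto-discovery mentioned above could work. The local directory layout (`models/<model_name>/step_<N>/`) and the `discover_checkpoints` helper are illustrative assumptions, not the actual `benchmark.py` implementation.

```python
# Illustrative sketch only -- not the actual benchmark.py internals.
# Assumes checkpoints are stored as sibling directories named step_<N>
# under a local model directory, e.g. models/<model_name>/step_1000/.
from pathlib import Path


def discover_checkpoints(model_dir: str) -> list[Path]:
    """Return step_* checkpoint directories sorted by step number."""
    steps = Path(model_dir).glob("step_*")
    return sorted(steps, key=lambda p: int(p.name.split("_", 1)[1]))


if __name__ == "__main__":
    # Hypothetical path; substitute your own model directory.
    for ckpt in discover_checkpoints("models/pico-decoder-tiny-dolma5M-v1"):
        print(ckpt.name)
```

The benchmark then loads each discovered checkpoint in order and runs every prompt from `prompts.json` against it, writing one Markdown report per run.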