# TestGenEval Benchmark Evaluation

This folder contains the evaluation harness for the TestGenEval benchmark, based on the original TestGenEval work ([paper](https://arxiv.org/abs/2410.00752)). TestGenEval is designed to evaluate the ability of language models to generate unit tests for given Python functions.

## Setup Environment and LLM Configuration

1. Follow the instructions [here](../../README.md#setup) to set up your local development environment and configure your LLM.
2. Install the TestGenEval dependencies:

```bash
poetry install --with testgeneval
```

## Run Inference

To generate tests using your model, run the following command:

```bash
./evaluation/benchmarks/testgeneval/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example
./evaluation/benchmarks/testgeneval/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 100 30 1 kjain14/testgenevallite test
```

Parameters:

- `model_config`: The config group name for your LLM settings (e.g., `eval_gpt4_1106_preview`)
- `git-version`: The git commit hash or release tag of OpenHands to evaluate (e.g., `HEAD` or `0.6.2`)
- `agent`: The name of the agent to benchmark (default: `CodeActAgent`)
- `eval_limit`: Limit the evaluation to the first N instances (optional)
- `max_iter`: Maximum number of iterations for the agent to run (default: 30)
- `num_workers`: Number of parallel workers for evaluation (default: 1)
- `dataset`: HuggingFace dataset name (default: `kjain14/testgenevallite`)
- `dataset_split`: Dataset split to use (default: `test`)

After running the inference, you will obtain an `output.jsonl` file (by default saved to `evaluation/evaluation_outputs`).

## Evaluate Generated Tests

To evaluate the generated tests, use the `eval_infer.sh` script:

```bash
./evaluation/benchmarks/testgeneval/scripts/eval_infer.sh $YOUR_OUTPUT_JSONL [instance_id] [dataset_name] [split] [num_workers] [skip_mutation]

# Example
./evaluation/benchmarks/testgeneval/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
```

Optional arguments:

- `instance_id`: Evaluate a single instance (optional)
- `dataset_name`: Name of the dataset to use (default: `kjain14/testgenevallite`)
- `split`: Dataset split to use (default: `test`)
- `num_workers`: Number of parallel workers for the Docker-based evaluation (default: 1)
- `skip_mutation`: Skip mutation testing (pass `true` to skip)

The evaluation results will be saved to `evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/`, with `output.testgeneval.jsonl` containing the metrics.

## Metrics

The TestGenEval benchmark evaluates generated tests on the following metrics:

1. Correctness: Measures whether the generated tests are syntactically correct and run without errors.
2. Coverage: Assesses the code coverage achieved by the generated tests.
3. Mutation Score: Evaluates how effectively the tests detect intentionally introduced bugs (mutations).
4. Readability: Analyzes the readability of the generated tests using various metrics.
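Both `output.jsonl` and `output.testgeneval.jsonl` are JSON Lines files with one record per instance, so you can sanity-check a run with standard command-line tools. A minimal sketch, assuming the default output directory from the examples above and that `jq` is installed:

```bash
# Output directory from the examples above; adjust to match your own run.
OUT_DIR=evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0

# One JSON object per line, so the line count equals the number of instances.
wc -l "$OUT_DIR/output.jsonl" "$OUT_DIR/output.testgeneval.jsonl"

# Peek at the top-level keys of the first record in each file.
head -n 1 "$OUT_DIR/output.jsonl" | jq 'keys'
head -n 1 "$OUT_DIR/output.testgeneval.jsonl" | jq 'keys'
```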
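If you want a rough aggregate view of these per-instance metrics, you can fold the records with `jq`. The sketch below is illustrative only: `.report.coverage` is an assumed key, not the harness's documented schema, so replace it with whatever key the `jq 'keys'` call above actually shows for your run:

```bash
# Output directory as in the previous snippet.
OUT_DIR=evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0

# Average a per-instance coverage value across all evaluated instances.
# NOTE: `.report.coverage` is a hypothetical field name used only to show the pattern.
jq -s 'map(.report.coverage) | map(select(. != null)) | add / length' \
  "$OUT_DIR/output.testgeneval.jsonl"
```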
## Submit Your Evaluation Results

To contribute your evaluation results:

1. Fork [our HuggingFace evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation).
2. Add your results to the forked repository.
3. Submit a Pull Request with your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).

## Additional Resources

- [TestGenEval Paper](https://arxiv.org/abs/2410.00752)
- [OpenHands Documentation](https://github.com/All-Hands-AI/OpenHands)
- [HuggingFace Datasets](https://huggingface.co/datasets)

For any questions or issues, please open an issue in the [OpenHands repository](https://github.com/All-Hands-AI/OpenHands/issues).