To address the limitations of DevEval, we build upon EvoCodeBench, a predecessor framework that shares core architectural features but differs in labeling format and dataset scale. Although EvoCodeBench includes fewer repositories, its codebase is largely compatible with DevEval, making it a suitable foundation for our enhancements. The majority of our development and validation efforts were carried out using the EvoCodeBench dataset, as described below.
The recommended way to run evaluation is with Docker. The script uses the pre-built image konstfed/evocodebenchplus (pulled automatically) or your locally built image.
Run oracle evaluation (pass@1):
```bash
bash bash/docker_run_full.sh
```

Results appear under experiments/: test results in experiments/tests/, pass@k in experiments/pass_at_k/. Expect pass@1 = 1.0 for the oracle; if it drops (e.g. due to timeouts), lower parallelism by setting -j in the script (e.g. -j 4).
Run with your own completions: set the environment variables before invoking the script. Example:
```bash
COMPLETIONS=experiments/completions/my_model.jsonl \
TESTS_JSON=experiments/tests/my_model-results.json \
PASSATK_JSON=experiments/pass_at_k/my_model.json \
bash bash/docker_run_full.sh
```

COMPLETIONS is the input; TESTS_JSON and PASSATK_JSON are the output paths (test results and pass@k metrics). Other options (TASKS, LOGS_DIR, K_VALUES, etc.) are documented in bash/docker_run_full.sh.
From the repo root:
1. Install dependencies and load dataset
```bash
pip install -r requirements.txt
./setup.sh
```

2. Build per-repo virtual environments
```bash
python setup_venvs.py \
    -t dataset/data/oracle.jsonl \
    -o dataset/data/data-success.jsonl \
    --oracle-completions experiments/completions/oracle/oracle.jsonl \
    --repos dataset/repos \
    --venvs venvs \
    -j 8
```

3. Run tests
```bash
python run_tests.py \
    -j 8 \
    -t dataset/data/data-success.jsonl \
    -c experiments/completions/oracle/oracle.jsonl \
    --tests experiments/tests/oracle-results.json \
    -l experiments/.logs
```

4. Compute pass@k
```bash
python evaluate/testing.py \
    --tests experiments/tests/oracle-results.json \
    --output experiments/pass_at_k/oracle.json \
    -k 1 5 10
```

For your own completions, point -c and --tests/--output to your completion file and desired output paths; keep -t and --venvs consistent with step 2.
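For reference, pass@k is typically computed with the standard unbiased estimator (one minus the probability that none of k sampled completions pass). A minimal sketch of that formula is shown below; evaluate/testing.py may implement it differently in detail.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: completions generated per task, c: number that passed the tests,
    k: sampling budget. With a single oracle completion per task
    (n = c = 1, k = 1) this reduces to 1.0 when the tests pass, else 0.0.
    """
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 samples per task, 4 passing -> pass@1 = 0.4, pass@5 ~= 0.976
print(pass_at_k(10, 4, 1), pass_at_k(10, 4, 5))
```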
To verify the correctness of the benchmark evaluation pipeline, we developed an oracle completion script that injects the reference code directly into the EvoCodeBench completion format. An ideal benchmark should yield pass@1 = 1.0 on such oracle completions, since every test should pass when the ground-truth implementation is used. Surprisingly, our initial evaluation produced pass@1 = 0.0. Upon inspection, we discovered that the subprocesses responsible for executing tests were failing because of improperly configured environments: required environment variables were not propagated to the forked processes. After resolving this issue, a re-run yielded pass@1 = 0.3636, still far from the expected result. The lack of any logging around test execution made root-cause analysis particularly challenging.
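The essence of the environment fix is to hand every test subprocess an explicit copy of the parent environment extended with the venv paths. The sketch below is a hypothetical helper illustrating the idea, not the actual run_tests.py code:

```python
import os
import subprocess

def run_pytest_in_venv(venv_dir: str, repo_dir: str, test_path: str) -> subprocess.CompletedProcess:
    """Run pytest inside a per-repo venv (illustrative helper).

    The key point is the explicit env= argument: the child process inherits
    the parent's variables plus the venv's bin/ directory on PATH, instead of
    starting from an incomplete environment.
    """
    env = dict(os.environ)  # propagate the parent environment
    env["PATH"] = os.path.join(venv_dir, "bin") + os.pathsep + env.get("PATH", "")
    env["VIRTUAL_ENV"] = venv_dir
    return subprocess.run(
        [os.path.join(venv_dir, "bin", "python"), "-m", "pytest", test_path],
        cwd=repo_dir,
        env=env,
        capture_output=True,
        text=True,
    )
```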
To improve observability and reliability, we refactored the original test execution logic with a focus on transparency and debuggability. Our revised script logs critical runtime details, including:
- The return code from pytest
- Generated junitxml reports
- Standard output and standard error streams from the test subprocess

These logs enabled a more thorough error analysis, revealing that a primary cause of failure was broken or incomplete virtual environments (venvs) across many repositories. To address this, we implemented an automated setup script that iterates through all repositories and attempts to install dependencies into isolated environments. While this succeeded for the majority of cases, some environments remained broken due to unsatisfiable dependencies. Rather than fixing these manually - an approach that is both time-intensive and non-scalable - we chose to exclude such repositories from the final dataset.
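Conceptually, the setup script is a simple loop over repositories. The sketch below illustrates the idea under the assumption that each repository declares its dependencies in a requirements.txt; the real setup_venvs.py differs in flags, logging, and parallelism.

```python
import subprocess
from pathlib import Path

def build_venv_for_repo(repo_dir: Path, venvs_root: Path) -> bool:
    """Create an isolated venv for one repository and install its dependencies.

    Returns False when installation fails, so the caller can exclude repos
    whose environments cannot be built (illustrative sketch only).
    """
    venv_dir = venvs_root / repo_dir.name
    subprocess.run(["python", "-m", "venv", str(venv_dir)], check=True)
    requirements = repo_dir / "requirements.txt"
    if not requirements.exists():
        return True  # nothing to install
    result = subprocess.run(
        [str(venv_dir / "bin" / "pip"), "install", "-r", str(requirements)],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0  # False => broken/unsatisfiable dependencies

if __name__ == "__main__":
    repos_root, venvs_root = Path("dataset/repos"), Path("venvs")
    venvs_root.mkdir(exist_ok=True)
    broken = [r.name for r in sorted(repos_root.iterdir())
              if r.is_dir() and not build_venv_for_repo(r, venvs_root)]
    print(f"{len(broken)} repositories excluded:", broken)
```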
Even after fixing the environment issues, the pass@1 score on oracle completions remained below 1.0. We concluded that this was likely due to either missing dependencies or flawed test cases. To ensure the integrity of the benchmark, we filtered out test cases that failed on oracle completions. After this curation step, we achieved the expected pass@1 = 1.0 for oracle completions - confirming the validity of the updated evaluation pipeline.
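The curation itself amounts to a small filter over the oracle test results. The sketch below assumes a hypothetical results layout that maps each task namespace to a pass/fail flag and uses a hypothetical output file name; the actual structure of experiments/tests/oracle-results.json should be checked before reusing it.

```python
import json

# Assumed (hypothetical) structure: {"<namespace>": {"passed": true/false}, ...}.
with open("experiments/tests/oracle-results.json") as f:
    oracle_results = json.load(f)

kept, dropped = [], []
with open("dataset/data/data-success.jsonl") as f:
    for line in f:
        task = json.loads(line)
        status = oracle_results.get(task["namespace"], {})  # "namespace" field assumed
        (kept if status.get("passed") else dropped).append(task)

# Tasks whose tests fail even on the ground-truth code are excluded.
with open("dataset/data/data-curated.jsonl", "w") as f:
    for task in kept:
        f.write(json.dumps(task) + "\n")

print(f"kept {len(kept)} tasks, dropped {len(dropped)} with failing oracle tests")
```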