EvalPlus Commands
- `evalplus.codegen`: Code generation + Code post-processing
- `evalplus.evaluate`: Code generation + Code post-processing + Evaluation
- `evalplus.sanitize`: Code post-processing
Code Generation
`evalplus.codegen` supports the following backends:
- `vllm`: Set `--model` as a Hugging Face model ID such as `microsoft/Phi-3-mini-128k-instruct`
- `hf`: HuggingFace Transformers; set `--model` the same way
- `openai`: Configure `OPENAI_API_KEY`; one can configure `--base-url`
- `anthropic`: Configure `ANTHROPIC_API_KEY`
- `google`: Configure `GOOGLE_API_KEY`
- `bedrock`: Configure `BEDROCK_ROLE_ARN`
- `gptqmodel`: Set quantized `--model` as a Hugging Face model ID such as `ModelCloud/Qwen2.5-Coder-32B-Instruct-gptqmodel-4bit-vortex-v1`
- `ollama`: Configure `--base-url`
```shell
evalplus.codegen --model "mistralai/Mistral-7B-Instruct-v0.3" --greedy --root [result_path] --dataset [mbpp|humaneval] --backend [vllm|hf|openai|...]
```
To perform code generation using user-defined tasks and datasets:
```shell
# Override HumanEval datasets
HUMANEVAL_OVERRIDE_PATH="/path/to/HumanEvalPlus.jsonl.gz" evalplus.codegen --model "mistralai/Mistral-7B-Instruct-v0.3" --greedy --root [result_path] --dataset humaneval --backend [vllm|hf|openai|...]

# Override MBPP datasets
MBPP_OVERRIDE_PATH="/path/to/MbppPlus.jsonl.gz" evalplus.codegen --model "mistralai/Mistral-7B-Instruct-v0.3" --greedy --root [result_path] --dataset mbpp --backend [vllm|hf|openai|...]
```
Customized Code Generation
You can perform your own code generation from scratch by doing something like this:
```python
from evalplus.data import get_[human_eval|mbpp]_plus, write_jsonl

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_[human_eval|mbpp]_plus().items()
]
write_jsonl("samples.jsonl", samples)
```
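Concretely, the loop above might look like the following. Here the trivial `gen_solution` "generator" and the inline `write_jsonl` stand-in are illustrative placeholders (the real helper comes from `evalplus.data`, and a real generator would call an LLM):

```python
import json

def gen_solution(prompt: str) -> str:
    # Placeholder "generator": a real setup would query a model here.
    return prompt + "    pass\n"

def write_jsonl(path: str, rows) -> None:
    # Minimal stand-in for evalplus.data.write_jsonl (illustrative only).
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

# Illustrative problems; get_human_eval_plus() returns a dict shaped like this.
problems = {
    "HumanEval/0": {"prompt": "def add(a, b):\n", "entry_point": "add"},
}

samples = [
    dict(task_id=task_id, solution=gen_solution(problem["prompt"]))
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)
```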
Note
The main structure of `problem` is as follows:
- `task_id` is the identifier string for the task
- `entry_point` is the name of the function
- `prompt` is the function signature with docstring
- `canonical_solution` is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
- `base_input` is the test inputs in original HumanEval
- `plus_input` is the test inputs brought by EvalPlus
Note
Expected schema of `samples.jsonl`:
- `task_id`: Task ID, which are the keys of `get_[human_eval|mbpp]_plus()`
- `solution` (optional): Self-contained solution (usually including the prompt)
  - Example: `{"task_id": "HumanEval/?", "solution": "def f():\n return 1"}`
- `completion` (optional): Function body without prompt
  - Example: `{"task_id": "HumanEval/?", "completion": " return 1"}`
Only one of `solution` and `completion` is required. If both are provided, `solution` will be used.
We also accept solutions in the form of a directory, i.e., `--samples ${SAMPLE_DIR}` where `${SAMPLE_DIR}` is organized as `${SAMPLE_DIR}/${TASK_ID}/${SAMPLE_ID}.py` (with `${TASK_ID} = task_id.replace("/", "_")`).
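For example, jsonl-style samples can be converted into this directory layout with a few lines of standard-library Python (the `my_samples` path and the in-memory `samples` list here are illustrative):

```python
import os

# Illustrative samples: two candidate solutions for the same task.
samples = [
    {"task_id": "HumanEval/0", "solution": "def f():\n    return 1\n"},
    {"task_id": "HumanEval/0", "solution": "def f():\n    return 2\n"},
]

sample_dir = "my_samples"
counters = {}  # next SAMPLE_ID per task
for s in samples:
    # TASK_ID = task_id.replace("/", "_"), as the directory layout requires.
    task_dir = os.path.join(sample_dir, s["task_id"].replace("/", "_"))
    os.makedirs(task_dir, exist_ok=True)
    sample_id = counters.get(s["task_id"], 0)
    counters[s["task_id"]] = sample_id + 1
    with open(os.path.join(task_dir, f"{sample_id}.py"), "w") as f:
        f.write(s["solution"])
```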
Code post-processing
Note
This step is performed by default in `evalplus.codegen`.
However, you might want to run it separately if you generated the code using other tools.
LLM-generated text may not be directly compilable code, as it can include natural-language lines or incomplete extra code.
We provide a tool, `evalplus.sanitize`, to clean up the code:
```shell
# 💡 If you are storing codes in jsonl:
evalplus.sanitize --samples samples.jsonl
# Sanitized code will be produced to `samples-sanitized.jsonl`

# 💡 If you are storing codes in directories:
evalplus.sanitize --samples /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
```
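The core idea of sanitization can be sketched as follows: extract the first fenced code block from the raw LLM output, falling back to the raw text when no fence is present. This is a simplified illustration, not EvalPlus's actual sanitizer, which handles more cases:

```python
import re

def extract_code(raw: str) -> str:
    # Grab the first fenced code block, if any (simplified heuristic).
    m = re.search(r"```(?:python)?\n(.*?)```", raw, re.DOTALL)
    return m.group(1) if m else raw

raw_output = (
    "Here is my solution:\n"
    "```python\ndef f():\n    return 1\n```\n"
    "Hope it helps!"
)
print(extract_code(raw_output))  # only the code between the fences
```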
🔎 Checking the compilability of post-processed code :: click to expand ::
To double-check the post-processing results, you can use `evalplus.syncheck` to check code validity before and after sanitization; it prints erroneous code snippets and explains why they are wrong:
```shell
# 💡 If you are storing codes in jsonl:
evalplus.syncheck --samples samples.jsonl --dataset [humaneval|mbpp]

# 💡 If you are storing codes in directories:
evalplus.syncheck --samples /path/to/vicuna-[??]b_temp_[??] --dataset [humaneval|mbpp]
```
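A minimal version of such a validity check can be sketched with the standard-library `ast` module (`evalplus.syncheck` does more, e.g. per-sample diagnostics, but the essence is "does this snippet parse?"):

```python
import ast

def is_valid_python(code: str) -> bool:
    # A snippet is syntactically valid if it parses as a Python module.
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(is_valid_python("def f():\n    return 1\n"))  # valid code
print(is_valid_python("def f(:\n    return 1\n"))   # syntax error
```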
Code Evaluation
You are strongly recommended to use a sandbox such as Docker:
```shell
docker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evaluate --dataset humaneval \
           --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
```

Or if you want to try it locally regardless of the risks ⚠️:

```shell
evalplus.evaluate --dataset [humaneval|mbpp] --samples samples.jsonl
```

To use a user-defined dataset locally, you can set `HUMANEVAL_OVERRIDE_PATH` or `MBPP_OVERRIDE_PATH`:

```shell
HUMANEVAL_OVERRIDE_PATH="/path/to/HumanEvalPlus.jsonl.gz" evalplus.evaluate --dataset humaneval --samples samples.jsonl
```

🤔 Evaluate with local GitHub repo? :: click to expand ::
```shell
export PYTHONPATH=$PYTHONPATH:$(pwd)
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
```
⌨️ More command-line flags :: click to expand ::
- `--parallel`: by default half of the cores
- `--base-only` (store_true): only run base HumanEval tests
- `--i-just-wanna-run`: force a re-run
The output should look like the following (below is a GPT-4 greedy decoding example):
```
Computing expected output...
Expected outputs computed in 15.18s
Reading samples...
164it [00:04, 37.79it/s]
Evaluating samples...
100%|██████████████████████████████████████████| 164/164 [00:03<00:00, 44.75it/s]
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.768}
```
- `Base` is the `pass@k` for the original HumanEval
- `Base + Extra` is the `pass@k` for our HumanEval+ (with extra tests)
- The "k" includes `[1, 10, 100]`, where k values `<=` the sample size will be used
- A cache file named like `samples_eval_results.jsonl` will be saved; remove it to re-run the evaluation
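The reported `pass@k` follows the standard unbiased estimator from the HumanEval paper, `1 - C(n-c, k) / C(n, k)` for `n` samples per task of which `c` pass; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: probability that at least one of k samples
    # drawn without replacement (out of n, with c correct) passes.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With greedy decoding there is one sample per task, so pass@1 is
# simply the fraction of tasks solved:
print(pass_at_k(1, 1, 1))  # 1.0 (solved)
print(pass_at_k(1, 0, 1))  # 0.0 (unsolved)
```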
Test input generation using EvalPlus
Please check `evalplus/inputgen.py`.
Useful tools
We provide some useful tools for curation, visualization, and analysis of the EvalPlus datasets in the `tools/` directory.
To use these tools, please first install the repository from GitHub:
```shell
git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r tools/requirements.txt
```