📙 About • 🔥 Quick Start • 🚀 LLM Backends • 📚 Documents • 📜 Citation • 🙏 Acknowledgement
📢 News
Who's using EvalPlus datasets? EvalPlus has been used by various LLM teams, including:
- Meta Llama 3.1 and 3.3
- Allen AI TÜLU 1/2/3
- Qwen2.5-Coder
- CodeQwen 1.5
- DeepSeek-Coder V2
- Qwen2
- Snowflake Arctic
- StarCoder2
- Magicoder
- WizardCoder
Notable EvalPlus updates are tracked below:
- [2024-10-20 `v0.3.1`]: EvalPlus `v0.3.1` is officially released! Highlights: (i) Code efficiency evaluation via EvalPerf, (ii) one command to run all: generation + post-processing + evaluation, (iii) support for more inference backends such as Google Gemini & Anthropic, etc.
- [2024-06-09 pre `v0.3.0`]: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to EvalArena.
- [2024-04-17 pre `v0.3.0`]: MBPP+ is upgraded to `v0.2.0` by removing some broken tasks (399 -> 378 tasks). ~4pp pass@1 improvement can be expected.
Earlier news :: click to expand ::
- (`v0.2.1`) You can use EvalPlus datasets via bigcode-evaluation-harness! HumanEval+ oracle fixes (32).
- (`v0.2.0`) MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
- (`v0.1.7`) Leaderboard release; HumanEval+ contract and input fixes (32/166/126/6).
- (`v0.1.6`) Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140).
- (`v0.1.5`) HumanEval+ mini is released for ultra-fast evaluation when you have too many samples!
- (`v0.1.1`) Optimizing user experiences: evaluation speed, PyPI package, Docker, etc.
- (`v0.1.0`) HumanEval+ is released!
📙 About
EvalPlus is a rigorous evaluation framework for LLM4Code, with:
- ✨ HumanEval+: 80x more tests than the original HumanEval!
- ✨ MBPP+: 35x more tests than the original MBPP!
- ✨ EvalPerf: evaluating the efficiency of LLM-generated code!
- ✨ Framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.
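For a programmatic taste of the datasets, here is a minimal sketch that inspects the HumanEval+ problems; it assumes the `evalplus.data` helpers (e.g., `get_human_eval_plus`) shipped in recent releases:

```python
# Minimal sketch (assuming the `evalplus.data` helpers from recent releases):
# inspect the HumanEval+ problems that back the evaluation.
from evalplus.data import get_human_eval_plus

problems = get_human_eval_plus()  # dict: task_id -> problem fields
print(len(problems), "tasks")
task = problems["HumanEval/0"]
print(task["prompt"])  # function signature + docstring given to the model
```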
Why EvalPlus?
- ✨ Precise evaluation: See our leaderboard for latest LLM rankings before & after rigorous evaluation.
- ✨ Coding rigorousness: Look at the score differences before & after applying EvalPlus tests! A smaller drop means more rigorous code generation, while a bigger drop means the generated code tends to be fragile.
- ✨ Code efficiency: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.
Want to know more details? Read our papers & materials!
- EvalPlus: NeurIPS'23 paper, Slides, Poster, Leaderboard
- EvalPerf: COLM'24 paper, Poster, Documentation, Leaderboard
🔥 Quick Start
Code Correctness Evaluation: HumanEval(+) or MBPP(+)
```bash
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --greedy
```
🛡️ Safe code execution within Docker :: click to expand ::
```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset humaneval \
                 --backend vllm \
                 --greedy

# Code execution within Docker
docker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evaluate --dataset humaneval \
           --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
```
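The `--samples` file is plain JSONL with one generated solution per task. A hedged sketch of producing such a file yourself, assuming the `evalplus.data` helpers (`get_human_eval_plus`, `write_jsonl`) available in recent releases:

```python
# Hedged sketch: build a samples file for `evalplus.evaluate --samples ...`.
# Assumes `get_human_eval_plus` and `write_jsonl` from `evalplus.data`.
from evalplus.data import get_human_eval_plus, write_jsonl

def my_generate(prompt: str) -> str:
    """Placeholder: call your own model here and return a full solution."""
    raise NotImplementedError

samples = [
    {"task_id": task_id, "solution": my_generate(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("my_samples.jsonl", samples)
```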
Code Efficiency Evaluation: EvalPerf (*nix only)
```bash
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
```
🛡️ Safe code execution within Docker :: click to expand ::
```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset evalperf \
                 --backend vllm \
                 --temperature 1.0 \
                 --n-samples 100

# Code execution within Docker
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
docker run --cap-add PERFMON --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl
```
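Both commands above lower `perf_event_paranoid` so that EvalPerf can read hardware performance counters. A quick Linux-only sanity check, as a sketch:

```python
# Quick sanity check (Linux only): EvalPerf relies on perf counters, which
# require kernel.perf_event_paranoid to be 0 (or lower), as set above.
from pathlib import Path

value = int(Path("/proc/sys/kernel/perf_event_paranoid").read_text())
status = "ok" if value <= 0 else "run the sudo command above"
print(f"perf_event_paranoid = {value} ({status})")
```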
🚀 LLM Backends
HuggingFace models
`transformers` backend:
```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --greedy
```
Note
EvalPlus uses different prompts for base and chat models. By default, the model type is detected via `tokenizer.chat_template` when using the `hf`/`vllm` backends; all other backends only support chat mode. Therefore, if your base model ships a `tokenizer.chat_template`, please add `--force-base-prompt` to avoid evaluating it in chat mode.
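To see which mode will be picked for a given model, here is a small sketch using `transformers` (the model name is just an example):

```python
# Check whether a tokenizer ships a chat template; if it does but the model
# is a base model, pass --force-base-prompt to evalplus.evaluate.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ise-uiuc/Magicoder-S-DS-6.7B")
print("chat template present:", tok.chat_template is not None)
```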
Enable Flash Attention 2 :: click to expand ::
```bash
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problems, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases

# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --attn-implementation [flash_attention_2|sdpa] \
                  --greedy
```
`vllm` backend:
```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --tp [TENSOR_PARALLEL_SIZE] \
                  --greedy
```
`openai`-compatible servers (e.g., vLLM):
```bash
# OpenAI models
export OPENAI_API_KEY="{KEY}" # https://platform.openai.com/settings/organization/api-keys
evalplus.evaluate --model "gpt-4o-2024-08-06" \
                  --dataset [humaneval|mbpp] \
                  --backend openai --greedy

# DeepSeek
export OPENAI_API_KEY="{KEY}" # https://platform.deepseek.com/api_keys
evalplus.evaluate --model "deepseek-chat" \
                  --dataset [humaneval|mbpp] \
                  --base-url https://api.deepseek.com \
                  --backend openai --greedy

# Grok
export OPENAI_API_KEY="{KEY}" # https://console.x.ai/
evalplus.evaluate --model "grok-beta" \
                  --dataset [humaneval|mbpp] \
                  --base-url https://api.x.ai/v1 \
                  --backend openai --greedy

# vLLM server
# First, launch a vLLM server: https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --base-url http://localhost:8000/v1 \
                  --backend openai --greedy

# GPTQModel
evalplus.evaluate --model "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1" \
                  --dataset [humaneval|mbpp] \
                  --backend gptqmodel --greedy
```
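Conceptually, the `openai` backend sends each task prompt to an OpenAI-compatible chat endpoint. A minimal sketch against a local vLLM server (the URL, key, and prompt are illustrative only):

```python
# Sketch of what the `openai` backend does per task: one chat completion
# request against an OpenAI-compatible endpoint (here, a local vLLM server).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="ise-uiuc/Magicoder-S-DS-6.7B",
    messages=[{"role": "user", "content": "Write a Python function to ..."}],
    temperature=0.0,  # matches --greedy
)
print(resp.choices[0].message.content)
```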
OpenAI models
- Access OpenAI APIs from OpenAI Console
```bash
export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o" \
                  --dataset [humaneval|mbpp] \
                  --backend openai \
                  --greedy
```
Anthropic models
- Access Anthropic APIs from Anthropic Console
```bash
export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
                  --dataset [humaneval|mbpp] \
                  --backend anthropic \
                  --greedy
```
Google Gemini models
- Access Gemini APIs from Google AI Studio
```bash
export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro" \
                  --dataset [humaneval|mbpp] \
                  --backend google \
                  --greedy
```
Amazon Bedrock models
```bash
export BEDROCK_ROLE_ARN="[BEDROCK_ROLE_ARN]"
evalplus.evaluate --model "anthropic.claude-3-5-sonnet-20241022-v2:0" \
                  --dataset [humaneval|mbpp] \
                  --backend bedrock \
                  --greedy
```
You can check out the generations and results at `evalplus_results/[humaneval|mbpp]/`.
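The generated samples are plain JSONL, so they are easy to inspect. A small sketch; the path follows the naming pattern from the Docker example above, and the per-row fields are an assumption based on the `task_id`/`solution` schema used for `--samples`:

```python
# Peek at the first generated sample; path and row fields are assumptions
# based on the naming pattern and samples schema shown above.
import json

path = "evalplus_results/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl"
with open(path) as f:
    first = json.loads(next(f))
print(first["task_id"])
print(first.get("solution", "")[:200])
```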
⏬ Using EvalPlus as a local repo? :: click to expand ::
```bash
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
```
📚 Documents
To learn more about how to use EvalPlus, please refer to the project documentation.
📜 Citation
```bibtex
@inproceedings{evalplus,
  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
  url = {https://openreview.net/forum?id=1qvx610Cu7},
}

@inproceedings{evalperf,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
```