This repo provides the code for reproducing the experiments in Code Execution with Pre-trained Language Models. CodeExecutor is a pre-trained model that learns to predict execution traces through a code execution pre-training task and curriculum learning.
The pre-trained checkpoint of CodeExecutor is available on Huggingface.
Our dataset is available on Zenodo.
1. Dependency
- pip install torch
- pip install transformers
- pip install python-Levenshtein
2. Data
The Python Code Execution datasets are a series of datasets following an easy-to-hard paradigm, comprising the SingleLine, Tutorial, and CodeNetMut datasets. We provide the test set of each of the three on Zenodo.
Demo data (simplified version):
```json
{
    "id": 0,
    "code": "s = ['x', 'y', 'z']",
    "code_tokens": ["<0>", "s", "=", "[", "'x'", ",", "'y'", ",", "'z'", "]"],
    "trace": ["<line> <0> <state> s : [ x , y , z ] </state>"],
    "trace_tokens": ["<line>", "<0>", "<state>", "s", ":", "[", "x", ",", "y", ",", "z", "]", "</state>"]
}
```

We also construct a new dataset for the zero-shot code-to-code search task by collecting 9,987 Python functions from CodeNet. Each function solves one of 48 problems.
Demo data (simplified version):
```json
{
    "id": 0,
    "code_id": "s204511158",
    "problem_id": 340,                      # which problem this code solves
    "original_code": "s = list(input())",   # code without the test case provided
    "code": "s = ['x', 'y', 'z']",          # code with a test case provided
    "code_tokens": ["<0>", "s", "=", "[", "'x'", ",", "'y'", ",", "'z'", "]"],
    "trace": ["<line> <0> <state> s : [ x , y , z ] </state>"],
    "trace_tokens": ["<line>", "<0>", "<state>", "s", ":", "[", "x", ",", "y", ",", "z", "]", "</state>"]
}
```

3. Pre-training
```shell
# prepare model checkpoint and datasets
cd pretrain
bash run.sh
```
A demo bash script (run.sh) is shown:
```shell
# Change the arguments as required:
#   output_dir: the output directory to save inference results
#   data_cache_dir: the output directory to save the data cache
#   train_data_path: the path of the pre-training file
#   eval_data_path: the path of the test file
#   model_name_or_path: the path of the model to be evaluated
PER_NODE_GPU=8
python -m torch.distributed.launch --nproc_per_node=${PER_NODE_GPU} run.py \
    --output_dir ../saved_models/pretrain_codeexecutor_stage_3 \
    --data_cache_dir ../saved_models/pretrain_codeexecutor_stage_3 \
    --train_data_path /drive/pretrain_codenetmut.json \
    --another_train_data_path /drive/pretrain_tutorial.json \
    --third_train_data_path /drive/single_line_hard_3_million.json \
    --eval_data_path ../data/codenetmut_test.json \
    --model_name_or_path ../saved_models/pretrain_codeexecutor_stage_2 \
    --block_size 1024 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --learning_rate 4e-4 \
    --node_index=0 \
    --gpu_per_node $PER_NODE_GPU \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 1.0 \
    --max_steps 1000000 \
    --warmup_steps 10000 \
    --save_steps 5000 \
    --seed 123
```
4. Inference
Please download the datasets first, unzip them, and move them to ./data.
```shell
# prepare model checkpoint and datasets
cd inference
bash run.sh
```
A demo bash script (run.sh) is shown:
```shell
# Change the arguments as required:
#   prefix: dataset type (codenet/tutorial/singleline)
#   output_dir: the output directory to save inference results
#   data_cache_dir: the output directory to save the data cache
#   eval_data_path: the path of the test file
#   model_name_or_path: the path of the model to be evaluated
CUDA_VISIBLE_DEVICES=0 python run.py \
    --prefix codenet \
    --output_dir ../../saved_models/inference \
    --data_cache_dir ../../saved_models/inference \
    --eval_data_path ../data/codenetmut_test.json \
    --model_name_or_path microsoft/codeexecutor \
    --block_size 1024 \
    --per_gpu_train_batch_size 8 \
    --per_gpu_eval_batch_size 16 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --node_index 0 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 1.0 \
    --max_steps 1000 \
    --warmup_steps 10000 \
    --save_steps 5000 \
    --seed 123456
```
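The predicted traces follow the format shown in the demo data above: each executed line is marked by `<line>` and a line identifier, optionally followed by a `<state> ... </state>` block recording variable values. As a minimal sketch of turning such a trace string back into structured records (the helper name and the simple `" : "` splitting are our own illustrative assumptions, not code from this repo):

```python
import re

def parse_trace(trace):
    """Parse "<line> <id> <state> name : value </state>" steps into
    (line_id, {name: value}) records. The format is inferred from the
    demo data above; this is an illustrative sketch only."""
    records = []
    # Each step starts with "<line> <id>" and runs until the next "<line>".
    for line_id, rest in re.findall(r"<line> (<\d+>)\s*(.*?)(?=<line>|$)", trace, re.S):
        state = {}
        m = re.search(r"<state>(.*?)</state>", rest, re.S)
        if m:
            # Assume a single "name : value" pair per state block, as in the demo.
            name, _, value = m.group(1).partition(" : ")
            state[name.strip()] = value.strip()
        records.append((line_id, state))
    return records

trace = "<line> <0> <state> s : [ x , y , z ] </state>"
print(parse_trace(trace))  # [('<0>', {'s': '[ x , y , z ]'})]
```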
5. Downstream tasks
We apply CodeExecutor to code intelligence tasks such as zero-shot code-to-code search. Here we provide example code that uses UniXcoder as the baseline model.
First, generate traces for the code-to-code search test set. We provide the prediction file code_to_code_search_preds.txt on Zenodo.
Alternatively, use the following script to generate the prediction file (it will be written to ../saved_models/code_to_code_search/preds.txt).
```shell
# prepare model checkpoint and datasets
cd inference
CUDA_VISIBLE_DEVICES=0 python run.py \
    --prefix codenet \
    --output_dir ../saved_models/code_to_code_search \
    --data_cache_dir ../saved_models/code_to_code_search \
    --eval_data_path ../data/code_to_code_search_test.json \
    --model_name_or_path microsoft/codeexecutor \
    --block_size 1024 \
    --per_gpu_train_batch_size 8 \
    --per_gpu_eval_batch_size 16 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --node_index 0 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 1.0 \
    --max_steps 1000 \
    --warmup_steps 10000 \
    --save_steps 5000 \
    --seed 123456
```
Second, utilize the program outputs extracted from the execution trace generated by CodeExecutor to facilitate the code-to-code search task.
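One simple way to picture this step (a sketch only; the repo's downstream run.py has its own extraction logic, and the demo traces above only show `<state>` blocks): take the content of the last state block of a predicted trace as a stand-in for the program's final result, so that two functions whose traces end in the same state provide a matching signal for search.

```python
import re

def final_state(trace):
    """Return the content of the last <state> ... </state> block in a
    predicted trace, used here as a proxy for the program's output.
    Illustrative only; the repo's extraction may differ."""
    states = re.findall(r"<state>(.*?)</state>", trace, re.S)
    return states[-1].strip() if states else ""

a = "<line> <0> <state> s : [ x , y , z ] </state>"
b = "<line> <0> <state> t : 1 </state> <line> <1> <state> s : [ x , y , z ] </state>"
print(final_state(a) == final_state(b))  # prints True
```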
```shell
cd downstream
bash run.sh
```

A demo bash script (run.sh) is shown:
```shell
# Change the arguments as required:
#   trace_file: the path to the prediction file, either downloaded or generated in the last step
source_lang=python
target_lang=python
python run.py \
    --model_name_or_path microsoft/unixcoder-base \
    --query_data_file ../data/code_to_code_search_test.json \
    --candidate_data_file ../data/code_to_code_search_test.json \
    --trace_file ../data/code_to_code_search_preds.txt \
    --query_lang ${source_lang} \
    --candidate_lang ${target_lang} \
    --code_length 512 \
    --eval_batch_size 256
```
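In this zero-shot setup, each query function should retrieve the candidates that solve the same CodeNet problem, and retrieval quality for such tasks is commonly summarized as mean average precision (MAP). A minimal sketch of the metric, assuming a candidate is relevant when its `problem_id` matches the query's (illustrative only; the repo's run.py computes its own evaluation):

```python
def mean_average_precision(queries):
    """queries: list of (query_problem_id, ranked_candidate_problem_ids).
    A candidate counts as relevant when it solves the same problem as
    the query. Illustrative metric sketch, not the repo's evaluation code."""
    ap_sum = 0.0
    for query_pid, ranked_pids in queries:
        hits, precision_sum = 0, 0.0
        for rank, pid in enumerate(ranked_pids, start=1):
            if pid == query_pid:
                hits += 1
                precision_sum += hits / rank  # precision at this relevant rank
        ap_sum += precision_sum / hits if hits else 0.0
    return ap_sum / len(queries)

# Query solves problem 340; relevant candidates sit at ranks 1 and 3.
print(mean_average_precision([(340, [340, 7, 340, 12])]))  # (1/1 + 2/3) / 2 = 0.8333...
```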
Reference
If you use this code or CodeExecutor, please consider citing us.
@article{liu2023code,
title={Code Execution with Pre-trained Language Models},
author={Liu, Chenxiao and Lu, Shuai and Chen, Weizhu and Jiang, Daxin and Svyatkovskiy, Alexey and Fu, Shengyu and Sundaresan, Neel and Duan, Nan},
journal={arXiv preprint arXiv:2305.05383},
year={2023}
}