CodeBERT/LongCoder at master · microsoft/CodeBERT

This repo will provide the code for reproducing the experiments on LCC datasets in LongCoder: A Long-Range Pre-trained Language Model for Code Completion. LongCoder is a sparse and efficient pre-trained Transformer model for long code modeling.

1. Dependency

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install --upgrade transformers fuzzywuzzy tree_sitter datasets

2. Dataset

In this repo, the LCC dataset will be automatically downloaded when running the fine-tuning script. If you want to download LCC datasets by yourself, you can find them in the following links:

https://huggingface.co/datasets/microsoft/LCC_python
https://huggingface.co/datasets/microsoft/LCC_java
https://huggingface.co/datasets/microsoft/LCC_csharp

3. Fine-Tune Setting

Here we provide fine-tune settings for code completion on LCC datasets in C# programming language, whose results are reported in the paper.

Note that it requires 8 v100-32G GPUs, and you can adjust batch size or source length based on your requirements.

lang=csharp #csharp, python, java
lr=2e-4
batch_size=16
beam_size=5
source_length=3968
target_length=128
global_length=64
window_size=512
epochs=10
output_dir=saved_models/$lang
mkdir -p $output_dir

python run.py \
--do_train \
--do_eval \
--lang $lang \
--output_dir $output_dir \
--model_name_or_path microsoft/longcoder-base \
--filename microsoft/LCC_$lang \
--max_source_length $source_length \
--max_target_length $target_length \
--max_global_length $global_length \
--window_size $window_size \
--beam_size $beam_size \
--train_batch_size $batch_size \
--eval_batch_size $batch_size \
--learning_rate $lr \
--num_train_epochs $epochs  2>&1| tee $output_dir/train.log

4. Evaluating LongCoder

lang=csharp #csharp, python, java
batch_size=16
beam_size=5
source_length=3968
target_length=128
global_length=64
window_size=512
output_dir=saved_models/$lang
reload_model=$output_dir/checkpoint-best-acc/model.bin

python run.py \
--do_test \
--lang $lang \
--load_model_path $reload_model \
--output_dir $output_dir \
--model_name_or_path microsoft/longcoder-base \
--filename microsoft/LCC_$lang \
--max_source_length $source_length \
--max_target_length $target_length \
--max_global_length $global_length \
--window_size $window_size \
--beam_size $beam_size \
--train_batch_size $batch_size \
--eval_batch_size $batch_size \
--num_train_epochs $epochs 2>&1| tee $output_dir/test.log

Reference

If you use this code or LongCoder, please consider citing us.

@article{longcoder,
    title={LongCoder: A Long-Range Pre-trained Language Model for Code Completion},
    author={Daya Guo and Canwen Xu and Nan Duan and Jian Yin and Julian McAuley},
    journal={arXiv preprint arXiv:2306.14893},
    year={2023}
}