The source code of the paper "Automatic Generation of Pull Request Description".
Dataset
Raw Data
Our collected 333K pull requests can be downloaded from here. Here is a PR example in the json file:
{
"id": "elastic/elasticsearch_37980",
"body": "'Eclipse build files were missing so .eclipse project files were not being generated.\\r\\nCloses #37973\\r\\n\\r\\n'",
"cms": [
"'Added missing eclipse-build.gradle files\\n\\nCloses #fix/37973'"
],
"commits": {
"'3e10ee798c932cc1cab1ea6ca679417408fc1416'": {
"cm": "'Added missing eclipse-build.gradle files\\n\\nCloses #fix/37973'",
"comments": []
}
}
}- id:
$user/$project_$prid - body: PR description
- cms: the commit messages in this PR
- commis: the commits in this PR
- key is the SHA1 hash digest
- cm: commit message
- comments: source code comments added in this commit
- key is the SHA1 hash digest
Preprocessed Dataset
Our dataset can be downloaded from here, which contains:
- the train, validation and test sets
- a json file for building vocabulary
Regular Expressions
To preprocess the raw data, we used the following regular expressions:
email_pattern = r'(^|\s)<[\w.-]+@(?=[a-z\d][^.]*\.)[a-z\d.-]*[^.]>' url_pattern = r'https?://[-a-zA-Z0-9@:%._+~#?=/]+(?=($|[^-a-zA-Z0-9@:%._+~#?=/]))' reference_pattern = r'#[\d]+' signature_pattern = r'^(signed-off-by|co-authored-by|also-by):' at_pattern = r'@\S+' structure_pattern = r'^#+' version_pattern = r'(^|\s|-)[\d]+(\.[\d]+){1,}' sha_pattern = r'(^|\s)[\dA-Fa-f-]{7,}(?=(\s|$))' digit_pattern = r'(^|\s|-)[\d]+(?=(\s|$))'
Installation
Clone and Prepare Dataset
$ git clone https://github.com/Tbabm/PRSummarizer.git $ cd PRSummarizer $ mkdir data # download our preprocessed dataset and place the four files in `data` $ mkdir models
Install ROUGE
- See here for instructions about installing ROUGE
- Please make sure you have correctly set environment variable
ROUGEto/absolute/path/to/ROUGE-RELEASE-1.5.5
Install Dependencies
Through conda:
$ conda env create -f environment.yml
OR through pip
$ pip install -r requirements.txt
Install pyrouge
- install and test pyrouge if you haven't done it.
$ git clone https://github.com/bheinzerling/pyrouge $ cd pyrouge $ pip install . # set rouge path for pyrouge $ pyrouge_set_rouge_path ${ROUGE} # test the installation of pyrouge $ python -m pyrouge.test
Usage
Train
Train Attn+PG first:
python3 -m prsum.prsum train --param-path params_attn_pg.json
After training, suppose the models are stored in models/train_12345678/model/. Select the best Attn+PG model:
python3 -m prsum.prsum select_model \
--param_path params_attn_pg.json \
--model_pattern "models/train_12345678/model/model_{}_" \
--start_iter 1000 \
--end_iter 26000Suppose the best model is model_12000_87654321. Train Attn+PG+RL based on the best model:
python3 -m prsum.prsum train \
--param_path params_attn_pg_rl.json \
--model_path "models/train_12345678/model/model_12000_87654321"Validate
Select the best Attn+PG model:
# start_iter = the best iteration of `Attn+PG` (here, 12000) + save_interval (here, 1000) START_ITER=13000 python3 -m prsum.prsum select_model --param_path params_attn_pg_rl.json \ --model_pattern "models/train_12345678/model/rl_model_{}_" \ --start_iter $START_ITER \ --end_iter 41000
Suppose the best model is model_34000_98765432.
Test
Test the best Attn+PG+RL model:
python3 -m prsum.prsum decode \
--param_path params_attn_pg_rl.json \
--model_path "models/train_12345678/model/rl_model_34000_98765432" \
--ngram_filter 1Now, you will get the test results.
NOTE: Your test results may be slightly different from those reported in our paper. Because the pointer generator uses the scatter_add function in pytorch. When using GPUs, this function is undeterministic. See here for more details.
Pre-trained Model
Our pre-trained model and test results can be downloaded here. To test with our pre-atrained model:
mkdir models
mv rl_model_34000 ./models
python3 -m prsum.prsum decode \
--param_path params_attn_pg_rl.json \
--model_path "models/rl_model_34000" \
--ngram_filter 1Citation
If you use this code, please consider citing our paper:
@inproceedings{liu2019automatic, title={Automatic generation of pull request descriptions}, author={Liu, Zhongxin and Xia, Xin and Treude, Christoph and Lo, David and Li, Shanping}, booktitle={Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering}, pages={176--188}, year={2019}, }
Thanks!
Reference
- Our paper: "Automatic Generation of Pull Request Description"
- https://github.com/atulkum/pointer_summarizer
- https://github.com/rohithreddy024/Text-Summarizer-Pytorch