The official code for our paper: Efficient Contrastive Learning via Novel Data Augmentation and Curriculum Learning (accepted as a short paper at EMNLP 2021).
The implementation is based on the paper DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations and its code implementation at https://github.com/JohnGiorgi/DeCLUTR.
Installation
This repository requires Python 3.6.1 or later.
Setting up a virtual environment
Before installing, you should create and activate a Python virtual environment. See here for detailed instructions.
Installing the library and dependencies
git clone https://github.com/vano1205/EfficientCL
cd EfficientCL
pip install -r requirements.txt
Usage
Preparing a dataset
A dataset is simply a file containing one item of text (a document, a scientific paper, etc.) per line. For demonstration purposes, we have provided a script that will download the WikiText-103 dataset and apply our minimal preprocessing:
python scripts/preprocess_wikitext_103.py path/to/output/wikitext-103/train.txt --min-length 2048
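If you want to prepare your own corpus instead, the only requirement is the one-item-of-text-per-line format described above. The following is a minimal sketch of producing such a file; the input path raw_corpus.txt, whitespace tokenization, and the 2048-token threshold are illustrative assumptions, and the provided preprocessing scripts remain the reference.

# Minimal sketch (illustrative, not the repository's preprocessing script):
# write a corpus as one document per line, dropping documents shorter than
# `min_length` whitespace tokens (mirroring the --min-length option above).
min_length = 2048

with open("raw_corpus.txt") as infile, open("train.txt", "w") as outfile:
    for document in infile:
        document = document.strip()
        if len(document.split()) >= min_length:
            outfile.write(document + "\n")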
See scripts/preprocess_openwebtext.py for a script that can be used to recreate the (much larger) OpenWebText dataset used in our paper.
You can specify the train set path in the configs under "train_data_path".
Training
To train the model, use the allennlp train command with our efficientcl.jsonnet config. Run the following:
allennlp train "training_config/efficientcl.jsonnet" \ --serialization-dir "output" \ --overrides "{'train_data_path': 'path/to/your/dataset/train.txt'}" \ --include-package "efficientcl"
The --overrides flag allows you to override any field in the config with a JSON-formatted string; you can equivalently edit the config itself if you prefer. During training, models, vocabulary, configuration, and log files will be saved to the directory provided by --serialization-dir, which can be any directory you like.
Multi-GPU training
To train on more than one GPU, provide a list of CUDA devices in your call to allennlp train. For example, to train with four CUDA devices with IDs 0, 1, 2, and 3:
--overrides "{'distributed.cuda_devices': [0, 1, 2, 3]}"Training with mixed-precision
If your GPU supports it, mixed-precision will be used automatically during training and inference.
Exporting a trained model to HuggingFace Transformers
We have provided a simple script to export a trained model so that it can be loaded with Hugging Face Transformers:
python scripts/save_pretrained_hf.py "output" "pretrained"
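The exported directory can then be loaded like any other Hugging Face checkpoint. Below is a minimal sketch: loading from the "pretrained" directory comes from the command above, while the example sentence and the mean-pooling step are illustrative assumptions rather than part of the repository.

import torch
from transformers import AutoModel, AutoTokenizer

# Load the exported model and tokenizer from the directory created above.
tokenizer = AutoTokenizer.from_pretrained("pretrained")
model = AutoModel.from_pretrained("pretrained")

# Encode an example sentence and mean-pool the token embeddings
# (masking out padding) to obtain a single fixed-size embedding.
inputs = tokenizer(["A sentence to embed."], padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # (1, hidden_size)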
Evaluating with GLUE
The jiant package is used to evaluate on the GLUE benchmark. First, download the desired dataset (CoLA, MNLI, MRPC, QNLI, QQP, RTE, SST, STSB) with the following command:
cd jiant
export PYTHONPATH=/path/to/jiant:$PYTHONPATH
python jiant/scripts/download_data/runscript.py \
    download \
    --tasks mrpc \
    --output_path data
Then evaluate on each dataset by loading the exported model from above:
python jiant/proj/simple/runscript.py \
    run \
    --run_name simple \
    --data_dir data \
    --hf_pretrained_model_name_or_path ../pretrained \
    --tasks mrpc \
    --train_batch_size 16 \
    --num_train_epochs 10 \
    --exp_dir roberta
Citing
@article{ye2021efficient,
  title={Efficient Contrastive Learning via Novel Data Augmentation and Curriculum Learning},
  author={Ye, Seonghyeon and Kim, Jiseon and Oh, Alice},
  journal={arXiv preprint arXiv:2109.05941},
  year={2021}
}