MultiPL-T/source_lang_processing at main · nuprl/MultiPL-T

0. Configure environment

pip3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt

Source Language Processing Directory

This directory contains all scripts for processing the source Python dataset that will be then used for semi-synthetic data generation.

Extract Python Functions With Docstrings from The Stack

To extract raw Python functions with docstrings from the Stack, run the following command:

python3 generate_from_the_stack --num-workers <cpus needed> --output <output repo name on huggingface>

A pre-extracted dataset is available at nuprl/stack-dedup-python-fns on Hugging Face.

Filter The Dataset

We then filter the dataset to remove functions that do not typecheck or that do not return a value. You can run the following command to filter the dataset:

python3 high_quality_subset.py --dataset <dataset name on huggingface> --push <output repo name on huggingface>

For this step too, a pre-filtered dataset is available at nuprl/stack-dedup-python-fns-returns-typechecks on Hugging Face.

Source Language Test Generation Instructions (Python)

1. Run the code execution docker container

pushd ./code_exec_server/
./build_and_run.sh
popd

2. Run the dataset aggregator server

This will start a server that will listen for incoming requests to add to the dataset. Open a new terminal and run:

python3 dataset_aggregator.py --name <output repo name on huggingface>

3. Run the test generator script

accelerate config # and follow the instructions
accelerate launch test_gen.py

Further Filtering and Transformation -- Remove Benchmark Data and Type Inference

After test generation, further filtering is needed to remove benchmark data and infer python types using test cases. To do this, run the following command:

python3 filter_and_transform.py --dataset <dataset name on huggingface> --result <output repo name on huggingface> --inference

WARNING: this kind of type inference utilizes code execution, which could potentially be dangerous. use with caution.

We provide two pre-processed datasets, one with type inference and one without:

  • Plain: nuprl/stack-dedup-python-testgen-starcoder-filter-v2
  • Type-inferred: nuprl/stack-dedup-python-testgen-starcoder-filter-inferred-v2