CodeGen2.5
Official research release of the CodeGen2.5 models for program synthesis.
Title: CodeGen2.5: Small, but mighty
Authors: Erik Nijkamp*, Hiroaki Hayashi*, Yingbo Zhou, Caiming Xiong (* equal contribution)
Hugging Face Integration
Model checkpoints are published on the Hugging Face Hub.
- CodeGen2.5-7B-multi (Apache-2.0)
- CodeGen2.5-7B-mono (Apache-2.0)
- CodeGen2.5-7B-instruct (Research purposes only)
Model cards outline how to use the model for causal and infill sampling. Please refer to each model card for more details.
The models are pre-trained on StarCoderData, a programming language dataset developed by BigCode.
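For the infill sampling mentioned above, the prompt has to mark where the missing span goes. The sketch below only builds such a prompt string. The sentinel tokens (`<mask_1>`, `<|endoftext|>`, `<sep>`) and the helper name `format_infill_prompt` are assumptions modeled on the convention used by CodeGen2-family model cards, so confirm the exact format against the model card of the checkpoint you use.

```python
def format_infill_prompt(prefix: str, suffix: str) -> str:
    """Build an infill prompt: the model is asked to produce the text for <mask_1>.

    The sentinel tokens here are an assumption; check the model card.
    """
    return prefix + "<mask_1>" + suffix + "<|endoftext|>" + "<sep>" + "<mask_1>"


# Code before and after the span that should be filled in.
prefix = "def hello_world():\n    "
suffix = "    return name"
prompt = format_infill_prompt(prefix, suffix)
```

The resulting `prompt` would then be tokenized and passed to `model.generate` in the same way as a causal prompt, and the generated text after the prompt is taken as the infilled span.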
Requirements
transformers>=4.29.2
tiktoken==0.4.0
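To see whether an environment already has these packages, one option is to query the installed versions with the standard library. This is a minimal sketch: the helper names `installed_version` and `report` are hypothetical, and the script only prints what it finds rather than enforcing the version constraints.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional

# Requirements as listed above.
REQUIRED = {"transformers": ">=4.29.2", "tiktoken": "==0.4.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report(requirements: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each required package name to its installed version (or None)."""
    return {name: installed_version(name) for name in requirements}


if __name__ == "__main__":
    for name, found in report(REQUIRED).items():
        print(f"{name}: required {REQUIRED[name]}, installed {found or 'not installed'}")
```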
Sampling
Program synthesis in the form of auto-regressive sampling can be performed as follows:
    from transformers import AutoTokenizer, AutoModelForCausalLM

    # The tokenizer ships custom code, so trust_remote_code=True is needed.
    tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-mono", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen25-7b-mono")

    # Encode the prompt and generate up to 128 tokens in total (prompt included).
    inputs = tokenizer("def hello_world():", return_tensors="pt")
    sample = model.generate(**inputs, max_length=128)
    print(tokenizer.decode(sample[0]))
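Because generation runs until `max_length`, the decoded sample often continues past the function that was asked for. A common post-processing step is to cut the text at the first sign of a new top-level definition. The sketch below is one way to do that. The helper `truncate_completion` and the stop sequences are illustrative assumptions, not part of this release.

```python
from typing import Sequence

# Hypothetical markers for "a new top-level block starts here" in Python code.
STOP_SEQUENCES = ("\ndef ", "\nclass ", "\nif __name__", "\nprint(")


def truncate_completion(text: str, stop_sequences: Sequence[str] = STOP_SEQUENCES) -> str:
    """Cut text at the earliest occurrence of any stop sequence, if one is present."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# Example with a hand-written string standing in for a decoded sample.
generated = 'def hello_world():\n    print("Hello")\n\ndef other():\n    pass\n'
first_function = truncate_completion(generated)
```

The prompt itself is not cut, since a stop sequence only matches after a newline and indented lines such as the `print` call inside the function body do not start at column zero.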
Citation
Please cite the CodeGen2 paper:
    @article{Nijkamp2023codegen2,
      title={CodeGen2: Lessons for Training LLMs on Programming and Natural Languages},
      author={Nijkamp, Erik and Hayashi, Hiroaki and Xiong, Caiming and Savarese, Silvio and Zhou, Yingbo},
      journal={ICLR},
      year={2023}
    }