- Supports Llama (7B, 13B, 70B), Yi (6B, 34B), Mistral (7B), TinyLlama, CodeLlama (7B, 13B, 34B), and all Llama / Mistral derived architectures!
- All kernels written in OpenAI's Triton language.
- 0% loss in accuracy - no approximation methods - all exact.
- No change of hardware necessary. Supports NVIDIA GPUs from 2018 onwards. Minimum CUDA Compute Capability 7.0 (V100, T4, Titan V, RTX 20/30/40 series, A100, H100, L40, etc.). Check your GPU (see the snippet after this list).
- NEW! Works on Linux and Windows via WSL.
- NEW! Experimental support for DPO (Direct Preference Optimization)!
- Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
- The open source version trains 5x faster, or check out the Unsloth Pro and Max code paths for up to 30x faster training!
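To check your GPU's compute capability, here is a quick snippet (a convenience we add here; `torch.cuda.get_device_capability` is standard PyTorch):
```python
import torch

# Prints the CUDA compute capability of GPU 0, e.g. (7, 5) for a Tesla T4.
# Unsloth needs at least (7, 0), i.e. V100 or newer.
print(torch.cuda.get_device_capability(0))
```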
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x |
| LAION Chip2 | 1x | 0.92x | 1.61x | 1.84x | 7.05x | 20.73x |
| OASST | 1x | 1.19x | 2.17x | 2.66x | 5.04x | 14.83x |
| Slim Orca | 1x | 1.18x | 2.22x | 2.64x | 5.04x | 14.82x |
Join our Discord!
If you trained a model with Unsloth, we made a cool sticker!!

Installation Instructions - Conda
Unsloth currently only supports Linux distros and PyTorch == 2.1.
```bash
conda install cudatoolkit xformers bitsandbytes pytorch pytorch-cuda=12.1 \
  -c pytorch -c nvidia -c xformers -c conda-forge -y
pip install "unsloth[kaggle] @ git+https://github.com/unslothai/unsloth.git"
```
Installation Instructions - Pip
- Find your CUDA version via `import torch; torch.version.cuda`
- We only support PyTorch 2.1 (2.1.1 bugs out for now). You can update PyTorch via pip (interchange cu121 / cu118):
```bash
pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.0 triton \
  --index-url https://download.pytorch.org/whl/cu121
```
- Select either cu118 for CUDA 11.8 or cu121 for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100, etc.), use the "ampere" path:
```bash
pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118_ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121_ampere] @ git+https://github.com/unslothai/unsloth.git"
```
Go to https://pytorch.org/ to learn more about matching your CUDA version.
- If you get errors, try the below first, then go back to step 1:
```bash
pip install --upgrade pip
```
Documentation
We support Hugging Face's TRL, Trainer, Seq2SeqTrainer, and even plain PyTorch code!
```python
from unsloth import FastLlamaModel, FastMistralModel
import torch

max_seq_length = 2048  # Can change to any number <= 4096
dtype = None  # None for auto detection. Float16 for Tesla T4, V100; Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.

# Load Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
    model_name = "unsloth/llama-2-7b",  # Supports any Llama model, e.g. meta-llama/Llama-2-7b-hf
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...",  # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Do model patching and add fast LoRA weights
model = FastLlamaModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,  # Currently only supports dropout = 0
    bias = "none",     # Currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)
```
Then wire up training with Hugging Face's Trainer and dataset loading (TRL, transformers, etc.).
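For instance, a minimal TRL `SFTTrainer` setup (the dataset choice and hyperparameters here are illustrative assumptions, not an official recipe):
```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Alpaca's "text" column already contains the fully formatted prompts.
dataset = load_dataset("tatsu-lab/alpaca", split = "train")

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        learning_rate = 2e-4,
        num_train_epochs = 1,
        optim = "adamw_8bit",
        output_dir = "outputs",
    ),
)
trainer.train()
```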
DPO (Direct Preference Optimization) - Experimental support
152334H hacked Unsloth to work with DPO via TRL!
- Hack the model's `config.json` to be a Llama model. Example gist.
- Use Unsloth for DPO for both the base and reference models. Example gist.
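For orientation, a rough sketch of the TRL wiring (the names `ref_model`, `dpo_dataset`, and `training_args` are placeholders; see the gists above for the actual working setup):
```python
from trl import DPOTrainer

# `model` and `ref_model` are both loaded via FastLlamaModel.from_pretrained
# as shown earlier. `dpo_dataset` needs the "prompt" / "chosen" / "rejected"
# columns TRL expects, and `training_args` is a transformers.TrainingArguments.
dpo_trainer = DPOTrainer(
    model = model,
    ref_model = ref_model,
    args = training_args,
    beta = 0.1,  # strength of the KL penalty against the reference model
    train_dataset = dpo_dataset,
    tokenizer = tokenizer,
)
dpo_trainer.train()
```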
Future Milestones and limitations
- Support Mixtral.
- Does not support non-Llama models yet - we will add them in the future.
Performance comparisons on 1 Tesla T4 GPU:
Time taken for 1 epoch
One Tesla T4 on Google Colab
bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10
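For reference, that configuration expressed as `transformers.TrainingArguments` (a best-effort mapping on our part; we assume `schedule_steps` refers to warmup steps):
```python
from transformers import TrainingArguments

benchmark_args = TrainingArguments(
    per_device_train_batch_size = 2,  # bsz
    gradient_accumulation_steps = 4,  # ga
    max_grad_norm = 0.3,
    num_train_epochs = 1,
    seed = 3047,
    learning_rate = 2e-4,             # lr
    weight_decay = 0.01,              # wd
    optim = "adamw_8bit",
    lr_scheduler_type = "linear",     # schedule
    warmup_steps = 10,                # schedule_steps (assumed to be warmup)
    output_dir = "outputs",
)
```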
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
|---|---|---|---|---|---|
| Hugging Face | 1 T4 | 23h 15m | 56h 28m | 8h 38m | 391h 41m |
| Unsloth Open | 1 T4 | 13h 7m (1.8x) | 31h 47m (1.8x) | 4h 27m (1.9x) | 240h 4m (1.6x) |
| Unsloth Pro | 1 T4 | 3h 6m (7.5x) | 5h 17m (10.7x) | 1h 7m (7.7x) | 59h 53m (6.5x) |
| Unsloth Max | 1 T4 | 2h 39m (8.8x) | 4h 31m (12.5x) | 0h 58m (8.9x) | 51h 30m (7.6x) |
Peak Memory Usage
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
|---|---|---|---|---|---|
| Hugging Face | 1 T4 | 7.3GB | 5.9GB | 14.0GB | 13.3GB |
| Unsloth Open | 1 T4 | 6.8GB | 5.7GB | 7.8GB | 7.7GB |
| Unsloth Pro | 1 T4 | 6.4GB | 6.4GB | 6.4GB | 6.4GB |
| Unsloth Max | 1 T4 | 11.4GB | 12.4GB | 11.9GB | 14.4GB |
Performance comparisons on 2 Tesla T4 GPUs via DDP:
Time taken for 1 epoch
Two Tesla T4s on Kaggle
bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
|---|---|---|---|---|---|
| Hugging Face | 2 T4 | 84h 47m | 163h 48m | 30h 51m | 1301h 24m * |
| Unsloth Pro | 2 T4 | 3h 20m (25.4x) | 5h 43m (28.7x) | 1h 12m (25.7x) | 71h 40m (18.1x) * |
| Unsloth Max | 2 T4 | 3h 4m (27.6x) | 5h 14m (31.3x) | 1h 6m (28.1x) | 54h 20m (23.9x) * |
Peak Memory Usage on a Multi GPU System (2 GPUs)
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
|---|---|---|---|---|---|
| Hugging Face | 2 T4 | 8.4GB \| 6GB | 7.2GB \| 5.3GB | 14.3GB \| 6.6GB | 10.9GB \| 5.9GB * |
| Unsloth Pro | 2 T4 | 7.7GB \| 4.9GB | 7.5GB \| 4.9GB | 8.5GB \| 4.9GB | 6.2GB \| 4.7GB * |
| Unsloth Max | 2 T4 | 10.5GB \| 5GB | 10.6GB \| 5GB | 10.6GB \| 5GB | 10.5GB \| 5GB * |

Memory cells show per-GPU usage (GPU 0 \| GPU 1).
\* Slim Orca uses bsz=1 for all benchmarks, since bsz=2 OOMs. We can handle bsz=2, but we benchmark with bsz=1 for consistency.
Full benchmarking tables
Click "Code" for a fully reproducible example.
"Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remains identical.
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x |
| code | Code | Code | Code | Code | | |
| seconds | 1040 | 1001 | 525 | 419 | 196 | 67 |
| memory MB | 18235 | 15365 | 9631 | 8525 | | |
| % saved | | 15.74 | 47.18 | 53.25 | | |
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| LAION Chip2 | 1x | 0.92x | 1.61x | 1.84x | 7.05x | 20.73x |
| code | Code | Code | Code | Code | | |
| seconds | 581 | 631 | 361 | 315 | 82 | 28 |
| memory MB | 7763 | 8047 | 7763 | 6441 | | |
| % saved | | -3.66 | 0.00 | 17.03 | | |
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| OASST | 1x | 1.19x | 2.17x | 2.66x | 5.04x | 14.83x |
| code | Code | Code | Code | Code | | |
| seconds | 1852 | 1558 | 852 | 696 | 367 | 125 |
| memory MB | 26431 | 16565 | 12267 | 11223 | | |
| % saved | | 37.33 | 53.59 | 57.54 | | |
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Slim Orca | 1x | 1.18x | 2.22x | 2.64x | 5.04x | 14.82x |
| code | Code | Code | Code | Code | | |
| seconds | 1824 | 1545 | 821 | 691 | 362 | 123 |
| memory MB | 24557 | 15681 | 10595 | 9007 | | |
| % saved | | 36.14 | 56.86 | 63.32 | | |
Mistral 7b
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Mistral 7B Slim Orca | 1x | 1.15x | 2.15x | 2.53x | 4.61x | 13.69x |
| code | Code | Code | Code | Code | | |
| seconds | 1813 | 1571 | 842 | 718 | 393 | 132 |
| memory MB | 32853 | 19385 | 12465 | 10271 | | |
| % saved | | 40.99 | 62.06 | 68.74 | | |
CodeLlama 34b
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Code Llama 34B | OOM ❌ | 0.99x | 1.87x | 2.61x | 4.27x | 12.82x |
| code | Code | Code | Code | Code | | |
| seconds | 1953 | 1982 | 1043 | 748 | 458 | 152 |
| memory MB | 40000 | 33217 | 27413 | 22161 | | |
| % saved | | 16.96 | 31.47 | 44.60 | | |
1 Tesla T4
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.09x | 1.69x | 1.79x | 2.93x | 8.3x |
| code | Code | Code | Code | Code | | |
| seconds | 1599 | 1468 | 942 | 894 | 545 | 193 |
| memory MB | 7199 | 7059 | 6459 | 5443 | | |
| % saved | | 1.94 | 10.28 | 24.39 | | |
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| LAION Chip2 | 1x | 0.99x | 1.80x | 1.75x | 4.15x | 11.75x |
| code | Code | Code | Code | Code | | |
| seconds | 952 | 955 | 529 | 543 | 229 | 81 |
| memory MB | 6037 | 6033 | 5797 | 4855 | | |
| % saved | | 0.07 | 3.98 | 19.58 | | |
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| OASST | 1x | 1.19x | 1.95x | 1.86x | 2.58x | 7.3x |
| code | Code | Code | Code | Code | | |
| seconds | 2640 | 2222 | 1355 | 1421 | 1024 | 362 |
| memory MB | 14827 | 10391 | 8413 | 7031 | | |
| % saved | | 29.92 | 43.26 | 52.58 | | |
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Slim Orca | 1x | 1.21x | 1.77x | 1.85x | 2.71x | 7.67x |
| code | Code | Code | Code | Code | | |
| seconds | 2735 | 2262 | 1545 | 1478 | 1009 | 356 |
| memory MB | 13933 | 10489 | 7661 | 6563 | | |
| % saved | | 24.72 | 45.02 | 52.90 | | |
2 Tesla T4s via DDP
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 0.99x | 4.95x | 4.44x | 7.28x | 20.61x |
| code | Code | Code | Code | | | |
| seconds | 9882 | 9946 | 1996 | 2227 | 1357 | 480 |
| memory MB | 9176 | 9128 | 6904 | 6782 | | |
| % saved | | 0.52 | 24.76 | 26.09 | | |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| LAION Chip2 | 1x | 1.12x | 5.28x | 4.21x | 10.01x | 28.32x |
| code | Code | Code | Code | | | |
| seconds | 5418 | 4854 | 1027 | 1286 | 541 | 191 |
| memory MB | 7316 | 7316 | 5732 | 5934 | | |
| % saved | | 0.00 | 21.65 | 18.89 | | |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| OASST (bsz=1) | 1x | 1.14x | 5.56x | 5.09x | 5.64x | 15.97x |
| code | Code | Code | Code | | | |
| seconds | 4503 | 3955 | 811 | 885 | 798 | 282 |
| memory MB | 11896 | 11628 | 6616 | 7105 | | |
| % saved | | 2.25 | 44.38 | 40.27 | | |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Slim Orca (bsz=1) | 1x | 0.97x | 5.54x | 4.68x | 6.88x | 19.46x |
| code | Code | Code | Code | | | |
| seconds | 4042 | 4158 | 729 | 863 | 588 | 208 |
| memory MB | 11010 | 11042 | 6492 | 7410 | | |
| % saved | | -0.29 | 41.04 | 32.70 | | |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| OASST (bsz=2) | OOM ❌ | OOM ❌ | ✓ | ✓ | ✓ | ✓ |
| code | Code | Code | Code | | | |
| seconds | OOM | OOM | 2719 | 3391 | 2794 | 987 |
| memory MB | OOM | OOM | 8134 | 9600 | | |
| % saved | OOM | OOM | | | | |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Slim Orca (bsz=2) | OOM ❌ | OOM ❌ | ✓ | ✓ | ✓ | ✓ |
| code | Code | Code | Code | | | |
| seconds | OOM | OOM | 2990 | 3444 | 2351 | 831 |
| memory MB | OOM | OOM | 7594 | 8881 | | |
| % saved | OOM | OOM | | | | |
How did we make it faster?
Manual autograd, Triton kernels, and more. See our Benchmark Breakdown for more info! For example, here is the RMS LayerNorm forward pass and its hand-derived backward pass:
$$
\begin{align}
y &= \frac{x_i}{\sqrt{\frac{1}{n}\sum{x_i^2} + \epsilon}} \cdot w \\
r &= \frac{1}{\sqrt{\frac{1}{n}\sum{x_i^2} + \epsilon}} \\
\frac{dC}{dX} &= \frac{1}{n} r \bigg( n \,(dY \cdot w) - \bigg( x_i \cdot r \cdot \sum{dY \cdot y_i} \bigg) \bigg)
\end{align}
$$
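A minimal PyTorch sketch of those two passes (illustrative only; the actual kernels are written in Triton, and the function names here are ours, not Unsloth's API):
```python
import torch

def rms_layernorm_forward(X, w, eps = 1e-6):
    # r = 1 / sqrt(mean(x_i^2) + eps), computed per row
    r = torch.rsqrt(X.pow(2).mean(dim = -1, keepdim = True) + eps)
    return X * r * w, r

def rms_layernorm_backward(dY, X, w, r):
    # dC/dX = (1/n) * r * (n * (dY * w) - x_i * r * sum(dY * y_i))
    n = X.shape[-1]
    Y = X * r * w
    row_sum = (dY * Y).sum(dim = -1, keepdim = True)
    return r * (dY * w - X * r * row_sum / n)

# Check the hand-derived gradient against autograd:
X = torch.randn(4, 8, dtype = torch.float64, requires_grad = True)
w = torch.randn(8, dtype = torch.float64)
Y, r = rms_layernorm_forward(X, w)
dY = torch.randn_like(Y)
Y.backward(dY)
assert torch.allclose(X.grad, rms_layernorm_backward(dY, X.detach(), w, r.detach()))
```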
Troubleshooting
- Sometimes `bitsandbytes` or `xformers` does not link properly. Try running:
```bash
!ldconfig /usr/lib64-nvidia
```
- Windows is not natively supported yet (WSL works) - we rely on Xformers and Triton, so Unsloth will support Windows once both packages officially do.
- If it doesn't install, try updating pip first: `pip install --upgrade pip`.
Credits
- RandomInternetPreson for confirming WSL support
- 152334H for experimental DPO support
- atgctg for syntax highlighting
