- Supports Llama, Yi, Mistral, CodeLlama, and their derived models (Open Hermes, etc.).
- All kernels are written in OpenAI's Triton language. Manual backpropagation engine.
- 0% loss in accuracy - no approximation methods - all exact.
- No change of hardware necessary. Supports NVIDIA GPUs from 2018 onwards with a minimum CUDA Compute Capability of 7.0 (V100, T4, Titan V, RTX 20/30/40 series, A100, H100, L40, etc.). Check your GPU!
- NEW! Works on Linux, and on Windows via WSL.
- NEW! Support for DPO (Direct Preference Optimization), PPO, and Reward Modelling via TRL.
- NEW! Download 4-bit models 4x faster directly from Hugging Face!
- Supports 4-bit and 16-bit QLoRA / LoRA finetuning via bitsandbytes.
- The open-source version trains 5x faster - check out Unsloth Max for 30x faster training!
| 1 A100 40GB | Huggingface | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|-------------|-------------|-----------------|--------------|---------------|-------------|-------------|
| Alpaca      | 1x          | 1.04x           | 1.98x        | 2.48x         | 5.32x       | 15.64x      |
| LAION Chip2 | 1x          | 0.92x           | 1.61x        | 1.84x         | 7.05x       | 20.73x      |
| OASST       | 1x          | 1.19x           | 2.17x        | 2.66x         | 5.04x       | 14.83x      |
| Slim Orca   | 1x          | 1.18x           | 2.22x        | 2.64x         | 5.04x       | 14.82x      |
Join our Discord!
If you trained a model with Unsloth, we made a cool sticker you can use!

Installation Instructions - Conda
Select either pytorch-cuda=11.8 for CUDA 11.8 or pytorch-cuda=12.1 for CUDA 12.1.
conda install cudatoolkit xformers bitsandbytes pytorch pytorch-cuda=12.1 \
-c pytorch -c nvidia -c xformers -c conda-forge -y
pip install "unsloth[conda] @ git+https://github.com/unslothai/unsloth.git"
Installation Instructions - Pip
Do NOT use this if you have Anaconda. You must use the Conda install method, or else stuff will BREAK.
- Find your CUDA version via `import torch; torch.version.cuda`
- For Pytorch 2.1.0: you can update Pytorch via pip (interchange cu121 / cu118). Go to https://pytorch.org/ to learn more. Select either cu118 for CUDA 11.8 or cu121 for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100, etc.), use the "ampere" path (a quick compute-capability check is sketched after this list).
pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.0 triton \
--index-url https://download.pytorch.org/whl/cu121
pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118_ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121_ampere] @ git+https://github.com/unslothai/unsloth.git"
- For Pytorch 2.1.1: use the "ampere" path for newer RTX 30xx GPUs or higher.
pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.1 triton \
--index-url https://download.pytorch.org/whl/cu121
pip install "unsloth[cu118_torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121_torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118_ampere_torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121_ampere_torch211] @ git+https://github.com/unslothai/unsloth.git"
- We're working on Pytorch 2.1.2 support.
- If you get errors, try the below first, then go back to step 1:
pip install --upgrade pip
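
If you are unsure whether your GPU needs the "ampere" extras, a quick compute-capability check (a small sketch, not an official helper) is:

```python
import torch
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
# 7.x (V100, T4, RTX 20xx)            -> use the plain cu118 / cu121 extras
# 8.0+ (A100, RTX 30xx/40xx, H100)    -> use the "ampere" extras
```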
Documentation
We support Hugging Face's TRL, Trainer, Seq2SeqTrainer, and even plain PyTorch code!
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!

# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# Load Llama model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-2-7b", # Supports Llama, Mistral - replace this!
    max_seq_length = max_seq_length,
    dtype = None,      # None = auto detection
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,  # Currently only supports dropout = 0
    bias = "none",     # Currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()
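
After trainer.train() finishes, the LoRA adapters can be saved with the standard PEFT API (a minimal sketch; the "lora_model" directory name is just an example):

```python
model.save_pretrained("lora_model")      # saves only the LoRA adapter weights
tokenizer.save_pretrained("lora_model")  # save the tokenizer alongside for easy reloading
```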
DPO (Direct Preference Optimization) Support
DPO, PPO, and Reward Modelling all appear to work, as per third-party independent testing by Llama-Factory.
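
For reference, a minimal DPO sketch with TRL's DPOTrainer might look like the one below. It assumes model and tokenizer were loaded and patched as in the SFT example above, and preference_dataset is a hypothetical dataset with "prompt", "chosen", and "rejected" columns; exact arguments vary across TRL versions.

```python
from trl import DPOTrainer
from transformers import TrainingArguments

dpo_trainer = DPOTrainer(
    model,
    ref_model = None,   # with a PEFT/LoRA model, TRL uses the adapter-disabled model as the reference
    beta = 0.1,         # DPO temperature
    train_dataset = preference_dataset,  # hypothetical: needs prompt / chosen / rejected columns
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        learning_rate = 5e-6,
        optim = "adamw_8bit",
        output_dir = "dpo_outputs",
    ),
)
dpo_trainer.train()
```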
Future Milestones and limitations
- Support Mixtral.
- Non-Llama architectures are not yet supported - we plan to support them in the future.
Performance comparisons on 1 Tesla T4 GPU:
Time taken for 1 epoch
One Tesla T4 on Google Colab
bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10
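
For context, these benchmark settings map roughly onto the following TrainingArguments (a sketch; interpreting schedule_steps = 10 as warmup_steps is an assumption on our part):

```python
from transformers import TrainingArguments

benchmark_args = TrainingArguments(
    per_device_train_batch_size = 2,   # bsz
    gradient_accumulation_steps = 4,   # ga
    max_grad_norm = 0.3,
    num_train_epochs = 1,
    seed = 3047,
    learning_rate = 2e-4,              # lr
    weight_decay = 0.01,               # wd
    optim = "adamw_8bit",
    lr_scheduler_type = "linear",      # schedule = "linear"
    warmup_steps = 10,                 # schedule_steps (assumed to be warmup)
    output_dir = "outputs",
)
```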
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
|--------|-----|--------------|------------------|----------------------|-----------------|
| Huggingface | 1 T4 | 23h 15m | 56h 28m | 8h 38m | 391h 41m |
| Unsloth Open | 1 T4 | 13h 7m (1.8x) | 31h 47m (1.8x) | 4h 27m (1.9x) | 240h 4m (1.6x) |
| Unsloth Pro | 1 T4 | 3h 6m (7.5x) | 5h 17m (10.7x) | 1h 7m (7.7x) | 59h 53m (6.5x) |
| Unsloth Max | 1 T4 | 2h 39m (8.8x) | 4h 31m (12.5x) | 0h 58m (8.9x) | 51h 30m (7.6x) |
Peak Memory Usage
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
|--------|-----|--------------|------------------|----------------------|-----------------|
| Huggingface | 1 T4 | 7.3GB | 5.9GB | 14.0GB | 13.3GB |
| Unsloth Open | 1 T4 | 6.8GB | 5.7GB | 7.8GB | 7.7GB |
| Unsloth Pro | 1 T4 | 6.4GB | 6.4GB | 6.4GB | 6.4GB |
| Unsloth Max | 1 T4 | 11.4GB | 12.4GB | 11.9GB | 14.4GB |
Performance comparisons on 2 Tesla T4 GPUs via DDP:
Time taken for 1 epoch
Two Tesla T4s on Kaggle
bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
|--------|-----|--------------|------------------|----------------------|-------------------|
| Huggingface | 2 T4 | 84h 47m | 163h 48m | 30h 51m | 1301h 24m * |
| Unsloth Pro | 2 T4 | 3h 20m (25.4x) | 5h 43m (28.7x) | 1h 12m (25.7x) | 71h 40m (18.1x) * |
| Unsloth Max | 2 T4 | 3h 4m (27.6x) | 5h 14m (31.3x) | 1h 6m (28.1x) | 54h 20m (23.9x) * |
Peak Memory Usage on a Multi GPU System (2 GPUs)
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
|--------|-----|--------------|------------------|----------------------|-------------------|
| Huggingface | 2 T4 | 8.4GB \| 6GB | 7.2GB \| 5.3GB | 14.3GB \| 6.6GB | 10.9GB \| 5.9GB * |
| Unsloth Pro | 2 T4 | 7.7GB \| 4.9GB | 7.5GB \| 4.9GB | 8.5GB \| 4.9GB | 6.2GB \| 4.7GB * |
| Unsloth Max | 2 T4 | 10.5GB \| 5GB | 10.6GB \| 5GB | 10.6GB \| 5GB | 10.5GB \| 5GB * |
- Slim Orca (*): bsz = 1 is used for all benchmarks since bsz = 2 OOMs. We can handle bsz = 2, but we benchmark with bsz = 1 for consistency.
Llama-Factory 3rd party benchmarking
| Method | Bits | TGS | GRAM | Speed |
|--------|------|-----|------|-------|
| HF | 16 | 2392 | 18GB | 100% |
| HF+FA2 | 16 | 2954 | 17GB | 123% |
| Unsloth+FA2 | 16 | 4007 | 16GB | 168% |
| HF | 4 | 2415 | 9GB | 101% |
| Unsloth+FA2 | 4 | 3726 | 7GB | 160% |
Link to performance table. TGS: tokens per GPU per second. Model: LLaMA2-7B. GPU: NVIDIA A100 * 1. Batch size: 4. Gradient accumulation: 2. LoRA rank: 8. Max length: 1024.
How did we make it faster?
Manual autograd, Triton kernels etc. See our Benchmark Breakdown for more info!
$$
\begin{align}
y_i &= \frac{x_i}{\sqrt{\frac{1}{n}\sum x_i^2 + \epsilon}} \cdot w \\
r &= \frac{1}{\sqrt{\frac{1}{n}\sum x_i^2 + \epsilon}} \\
\frac{dC}{dX} &= \frac{1}{n} r \bigg( n \, (dY \cdot w) - \bigg( x_i \cdot r \cdot \sum dY \cdot y_i \bigg) \bigg)
\end{align}
$$
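
As a concrete illustration, here is a plain-PyTorch sketch of that RMS LayerNorm forward pass and its manual backward formula (a reference check only, not the actual Triton kernel):

```python
import torch

def rms_norm_forward(X, W, eps = 1e-6):
    # X: (batch, n), W: (n,) -> y = x * r * w, with r = 1/sqrt(mean(x^2) + eps)
    r = torch.rsqrt(X.pow(2).mean(dim = -1, keepdim = True) + eps)
    return X * r * W, r

def rms_norm_backward(dY, X, W, r):
    # dC/dX = (1/n) * r * (n*(dY*w) - x*r*sum(dY*y)), matching the formula above
    n = X.shape[-1]
    Y = X * r * W
    return (1.0 / n) * r * (n * (dY * W) - X * r * (dY * Y).sum(dim = -1, keepdim = True))

# Gradient check against PyTorch autograd (float64 for tight tolerances)
X = torch.randn(4, 8, dtype = torch.float64, requires_grad = True)
W = torch.randn(8, dtype = torch.float64)
Y, r = rms_norm_forward(X, W)
dY = torch.randn_like(Y)
Y.backward(dY)
assert torch.allclose(X.grad, rms_norm_backward(dY, X.detach(), W, r.detach()))
```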
Troubleshooting
- Sometimes bitsandbytes or xformers does not link properly. Try running:
!ldconfig /usr/lib64-nvidia
- Native Windows is not supported as of yet - we rely on Xformers and Triton, so once both packages officially support Windows, Unsloth will too. In the meantime, Windows works via WSL (see above).
- If it doesn't install, try updating pip (pip install --upgrade pip).
Full benchmarking tables
Click "Code" for a fully reproducible example.
"Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remains identical.
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|-------------|--------------|-------------------|--------------|---------------|-------------|-------------|
| Alpaca | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x |
| code | Code | Code | Code | Code | | |
| seconds | 1040 | 1001 | 525 | 419 | 196 | 67 |
| memory MB | 18235 | 15365 | 9631 | 8525 | | |
| % saved | | 15.74 | 47.18 | 53.25 | | |
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|-------------|--------------|-------------------|--------------|---------------|-------------|-------------|
| LAION Chip2 | 1x | 0.92x | 1.61x | 1.84x | 7.05x | 20.73x |
| code | Code | Code | Code | Code | | |
| seconds | 581 | 631 | 361 | 315 | 82 | 28 |
| memory MB | 7763 | 8047 | 7763 | 6441 | | |
| % saved | | -3.66 | 0.00 | 17.03 | | |
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|-------------|--------------|-------------------|--------------|---------------|-------------|-------------|
| OASST | 1x | 1.19x | 2.17x | 2.66x | 5.04x | 14.83x |
| code | Code | Code | Code | Code | | |
| seconds | 1852 | 1558 | 852 | 696 | 367 | 125 |
| memory MB | 26431 | 16565 | 12267 | 11223 | | |
| % saved | | 37.33 | 53.59 | 57.54 | | |
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|-------------|--------------|-------------------|--------------|---------------|-------------|-------------|
| Slim Orca | 1x | 1.18x | 2.22x | 2.64x | 5.04x | 14.82x |
| code | Code | Code | Code | Code | | |
| seconds | 1824 | 1545 | 821 | 691 | 362 | 123 |
| memory MB | 24557 | 15681 | 10595 | 9007 | | |
| % saved | | 36.14 | 56.86 | 63.32 | | |
Mistral 7B
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|-------------|--------------|-------------------|--------------|---------------|-------------|-------------|
| Mistral 7B Slim Orca | 1x | 1.15x | 2.15x | 2.53x | 4.61x | 13.69x |
| code | Code | Code | Code | Code | | |
| seconds | 1813 | 1571 | 842 | 718 | 393 | 132 |
| memory MB | 32853 | 19385 | 12465 | 10271 | | |
| % saved | | 40.99 | 62.06 | 68.74 | | |
CodeLlama 34B
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|-------------|--------------|-------------------|--------------|---------------|-------------|-------------|
| Code Llama 34B | OOM ❌ | 0.99x | 1.87x | 2.61x | 4.27x | 12.82x |
| code | Code | Code | Code | Code | | |
| seconds | 1953 | 1982 | 1043 | 748 | 458 | 152 |
| memory MB | 40000 | 33217 | 27413 | 22161 | | |
| % saved | | 16.96 | 31.47 | 44.60 | | |
1 Tesla T4
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|-----------|--------------|-----------------|--------------|-------------------|-------------|-------------|
| Alpaca | 1x | 1.09x | 1.69x | 1.79x | 2.93x | 8.3x |
| code | Code | Code | Code | Code | | |
| seconds | 1599 | 1468 | 942 | 894 | 545 | 193 |
| memory MB | 7199 | 7059 | 6459 | 5443 | | |
| % saved | | 1.94 | 10.28 | 24.39 | | |
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|-----------|--------------|-----------------|--------------|-------------------|-------------|-------------|
| LAION Chip2 | 1x | 0.99x | 1.80x | 1.75x | 4.15x | 11.75x |
| code | Code | Code | Code | Code | | |
| seconds | 952 | 955 | 529 | 543 | 229 | 81 |
| memory MB | 6037 | 6033 | 5797 | 4855 | | |
| % saved | | 0.07 | 3.98 | 19.58 | | |
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|-----------|--------------|-----------------|--------------|-------------------|-------------|-------------|
| OASST | 1x | 1.19x | 1.95x | 1.86x | 2.58x | 7.3x |
| code | Code | Code | Code | Code | | |
| seconds | 2640 | 2222 | 1355 | 1421 | 1024 | 362 |
| memory MB | 14827 | 10391 | 8413 | 7031 | | |
| % saved | | 29.92 | 43.26 | 52.58 | | |
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|-----------|--------------|-----------------|--------------|-------------------|-------------|-------------|
| Slim Orca | 1x | 1.21x | 1.77x | 1.85x | 2.71x | 7.67x |
| code | Code | Code | Code | Code | | |
| seconds | 2735 | 2262 | 1545 | 1478 | 1009 | 356 |
| memory MB | 13933 | 10489 | 7661 | 6563 | | |
| % saved | | 24.72 | 45.02 | 52.90 | | |
2 Tesla T4s via DDP
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|----------|--------------|-----------------|--------------|---------------|-------------|-------------|
| Alpaca | 1x | 0.99x | 4.95x | 4.44x | 7.28x | 20.61x |
| code | Code | Code | Code | | | |
| seconds | 9882 | 9946 | 1996 | 2227 | 1357 | 480 |
| memory MB | 9176 | 9128 | 6904 | 6782 | | |
| % saved | | 0.52 | 24.76 | 26.09 | | |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|----------|--------------|-----------------|--------------|---------------|-------------|-------------|
| LAION Chip2 | 1x | 1.12x | 5.28x | 4.21x | 10.01x | 28.32x |
| code | Code | Code | Code | | | |
| seconds | 5418 | 4854 | 1027 | 1286 | 541 | 191 |
| memory MB | 7316 | 7316 | 5732 | 5934 | | |
| % saved | | 0.00 | 21.65 | 18.89 | | |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|----------|--------------|-----------------|--------------|---------------|-------------|-------------|
| OASST (bsz=1) | 1x | 1.14x | 5.56x | 5.09x | 5.64x | 15.97x |
| code | Code | Code | Code | | | |
| seconds | 4503 | 3955 | 811 | 885 | 798 | 282 |
| memory MB | 11896 | 11628 | 6616 | 7105 | | |
| % saved | | 2.25 | 44.38 | 40.27 | | |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|----------|--------------|-----------------|--------------|---------------|-------------|-------------|
| Slim Orca (bsz=1) | 1x | 0.97x | 5.54x | 4.68x | 6.88x | 19.46x |
| code | Code | Code | Code | | | |
| seconds | 4042 | 4158 | 729 | 863 | 588 | 208 |
| memory MB | 11010 | 11042 | 6492 | 7410 | | |
| % saved | | -0.29 | 41.04 | 32.70 | | |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|----------|--------------|-----------------|--------------|---------------|-------------|-------------|
| OASST (bsz=2) | OOM ❌ | OOM ❌ | ✓ | ✓ | ✓ | ✓ |
| code | Code | Code | Code | | | |
| seconds | OOM | OOM | 2719 | 3391 | 2794 | 987 |
| memory MB | OOM | OOM | 8134 | 9600 | | |
| % saved | OOM | OOM | | | | |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|----------|--------------|-----------------|--------------|---------------|-------------|-------------|
| Slim Orca (bsz=2) | OOM ❌ | OOM ❌ | ✓ | ✓ | ✓ | ✓ |
| code | Code | Code | Code | | | |
| seconds | OOM | OOM | 2990 | 3444 | 2351 | 831 |
| memory MB | OOM | OOM | 7594 | 8881 | | |
| % saved | OOM | OOM | | | | |
Credits
- RandomInternetPreson for confirming WSL support
- 152334H for experimental DPO support
- atgctg for syntax highlighting
