Scale Unsloth to multiple GPUs with just `torchrun`. No configuration files, no custom frameworks - pure PyTorch DDP.
- 🚀 2-4x faster than single GPU
- 🎯 Zero configuration - works out of the box
- 💾 Same VRAM per GPU as single GPU Unsloth
- 🔧 Any Unsloth model - Qwen, Llama, Gemma, etc.
## Installation

```bash
# Install dependencies
uv add torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
uv add unsloth datasets transformers trl
uv add git+https://github.com/anhvth/opensloth.git
```

## Quick Start
Replace `python` with `torchrun`:

```bash
# Single GPU
python train_scripts/train_ddp.py

# Multi-GPU
torchrun --nproc_per_node=2 train_scripts/train_ddp.py  # 2 GPUs
torchrun --nproc_per_node=4 train_scripts/train_ddp.py  # 4 GPUs
```
OpenSloth automatically handles GPU distribution, gradient sync, and batch sizing.
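To make that concrete, here is a minimal sketch of what `torchrun` gives each worker process. The environment variable names (`LOCAL_RANK`, `WORLD_SIZE`) are the standard ones `torchrun` sets; the global batch size of 32 is an arbitrary illustration, not an OpenSloth default.

```python
import os

# torchrun launches one process per GPU and sets these variables for
# each one; the defaults below let the same script also run under
# plain `python` as a single process.
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # GPU index on this node
world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes

# With a fixed global batch size, each rank processes 1/world_size of it.
global_batch = 32  # illustrative value
per_gpu_batch = global_batch // world_size
print(f"rank {local_rank}/{world_size}: per-GPU batch = {per_gpu_batch}")
```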
## Performance
| Setup | Time | Speedup |
|---|---|---|
| 1 GPU | 19m 34s | 1.0x |
| 2 GPUs | 8m 28s | 2.3x |
Expected scaling: 2 GPUs = ~2.3x, 4 GPUs = ~4.5x, 8 GPUs = ~9x
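As a quick sanity check, the 2-GPU speedup in the table follows directly from the measured wall-clock times:

```python
# Speedup from the table above: 19m 34s on one GPU vs. 8m 28s on two.
one_gpu_s = 19 * 60 + 34   # 1174 seconds
two_gpu_s = 8 * 60 + 28    # 508 seconds
speedup = one_gpu_s / two_gpu_s
print(f"{speedup:.2f}x")   # ~2.31x
```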
## Usage

```python
import os

from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

from opensloth.patching.ddp_patch import ddp_patch

ddp_patch()  # Enable DDP compatibility

# Standard Unsloth setup; each process loads the model onto its own GPU
local_rank = int(os.environ.get("LOCAL_RANK", 0))
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-0.6B",
    device_map={"": local_rank},
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16)

trainer = SFTTrainer(model=model, tokenizer=tokenizer, ...)
trainer.train()
```
Run: `torchrun --nproc_per_node=4 your_script.py`
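One DDP detail worth handling in your own script: every rank runs the same code, so side effects such as saving checkpoints or logging are usually guarded to run on rank 0 only. The helper below is a common pattern, not part of the OpenSloth API; `RANK` is the standard variable `torchrun` sets per process.

```python
import os

def is_main_process() -> bool:
    # torchrun sets RANK per process; default 0 covers single-process runs.
    return int(os.environ.get("RANK", 0)) == 0

if is_main_process():
    print("rank 0: save checkpoints / write logs here")
```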
## Migration from Old Approach

Current (Recommended): Simple `torchrun` + DDP patch

```python
from opensloth.patching.ddp_patch import ddp_patch

ddp_patch()
# ... standard Unsloth code
```
Old Approach (v0.1.8): For the configuration-file workflow, check out the tagged release (https://github.com/anhvth/opensloth/releases/tag/v0.1.8):

```bash
git checkout v0.1.8
```
## Links
- Unsloth - 2x faster training library
- TRL - Transformer Reinforcement Learning
- PyTorch DDP - Distributed training
```bash
git clone https://github.com/anhvth/opensloth.git
cd opensloth
torchrun --nproc_per_node=4 train_scripts/train_ddp.py
```

Happy training! 🦥⚡
