GitHub - MachineLearningSystem/DrMAS: Dr. MAS is an end-to-end RL training framework for multi-agent LLM systems, supporting the co-training of multiple (heterogeneous) LLMs.

Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

Dr.MAS is designed for end-to-end post-training of Multi-Agent LLM Systems via Reinforcement Learning (RL), enabling multiple LLM-based agents to collaborate on complex reasoning and decision-making tasks.

This framework features flexible agent registry, customizable multi-agent orchestration, LLM sharing/non-sharing (e.g., heterogeneous LLMs), per-agent configuration, and shared resource pooling, making it well suited for training multi-agent LLM systems with RL.

Feature Summary

Feature Category	Supported Capabilities
Flexible Agent Registry	✅ User-defined agent registration via `@AgentRegistry.register` ✅ Clear role specialization per agent
Multi-Agent Orchestration	✅ User-defined multi-agent orchestration ✅ Sequential, hierarchical, and conditional workflows ✅ Built-in Search/Math Orchestra
Agent-Model Assignment	✅ Logical agents (1,...,K) mapped to LLM worker groups ✅ LLM non-sharing: one LLM per agent (supports heterogeneous model families/checkpoints) ✅ LLM Sharing: agents using the same model share one LLM worker group
Per-Agent Configuration	✅ Per-agent training overrides for fine-grained control ✅ Per-agent learning rates, micro-batch sizes, and other hyperparameters
Shared Resource Pooling	✅ Shared GPU pool across multiple LLM worker groups for efficient hardware utilization ✅ Gradient updates applied independently for each worker group during optimization
Environments	✅ Math ✅ Search
Model Support	✅ Qwen2.5 ✅ Qwen3 ✅ LLaMA3.2 and more
RL Algorithms	✅ Dr.MAS ✅ GRPO 🧪 GiGPO (experimental) 🧪 DAPO (experimental) 🧪 RLOO (experimental) 🧪 PPO (experimental) and more

Installation

Install veRL

conda create -n DrMAS python==3.12 -y
conda activate DrMAS

pip3 install -r requirements_sglang.txt
pip3 install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir

pip3 install -e .

Install Supported Environments

1. Search

conda activate DrMAS
cd ./agent_system/environments/env_package/search/third_party
pip install -e .
pip install gym==0.26.2

Prepare dataset:

cd repo_root/
python examples/data_preprocess/drmas_search.py

Since faiss-gpu is not available via pip, we setup a separate conda environment for the local retrieval server. Running this server will use around 6GB of GPU memory per GPU, so make sure to account for this in your training run configuration. Build Retriever environments:

conda create -n retriever python=3.10 -y
conda activate retriever

conda install numpy==1.26.4
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install transformers datasets pyserini huggingface_hub
conda install faiss-gpu==1.8.0 -c pytorch -c nvidia -y
pip install uvicorn fastapi

Download the index:

conda activate retriever

local_dir=~/data/searchR1
python examples/search/searchr1_download.py --local_dir $local_dir
cat $local_dir/part_* > $local_dir/e5_Flat.index
gzip -d $local_dir/wiki-18.jsonl.gz

Start the local flat e5 retrieval server:

conda activate retriever

# redirect the output to a file to avoid cluttering the terminal
# we have observed outputting to the terminal causing spikes in server response times
bash examples/search/retriever/retrieval_launch.sh > retrieval_server.log

2. Math

Prepare the dataset:

cd repo_root/
python examples/data_preprocess/drmas_math.py

Run Examples

Search

Search (hierarchical routing): a 3-agent hierarchy where Verifier decides whether information is sufficient; it routes to Search Agent (generate queries) or Answer Agent (final response). See agent_system/agent/orchestra/search/README.md.

bash examples/drmas_trainer/run_search.sh

After training completes, evaluate the multi-agent system on the full test dataset:

bash examples/drmas_trainer/run_search.sh eval

Math

Math (iterative refinement): a 2-agent loop where Solver proposes step-by-step solutions and Verifier checks them; items are iterated until approved or max loops reached. See agent_system/agent/orchestra/math/README.md.

bash examples/drmas_trainer/run_math.sh

After training completes, evaluate the multi-agent system on the full test dataset:

bash examples/drmas_trainer/run_math.sh eval

Compute Requirements

1. Search Task

Strategy	Model Configuration	Resources	Est. Time
LLM Sharing	1 $\times$ Qwen2.5-3B	4 $\times$ H100	~12h
LLM Non-Sharing	3 $\times$ Qwen2.5-3B	4 $\times$ H100	~13h
LLM Sharing	1 $\times$ Qwen2.5-7B	8 $\times$ H100	~25h
LLM Non-Sharing	3 $\times$ Qwen2.5-7B	8 $\times$ H100	~26h
Heterogeneous	2 $\times$ Llama-3.2-3B + 1 $\times$ Qwen2.5-7B	8 $\times$ H100	~16h

2. Math Task

Strategy	Model Configuration	Resources	Est. Time
LLM Sharing	1 $\times$ Qwen3-4B	4 $\times$ H100	~35h
LLM Non-Sharing	2 $\times$ Qwen3-4B	4 $\times$ H100	~38h
LLM Sharing	1 $\times$ Qwen3-8B	8 $\times$ H100	~37h
LLM Non-Sharing	2 $\times$ Qwen3-8B	8 $\times$ H100	~42h

Multi-Agent Development Guide

For a comprehensive guide on developing custom multi-agent LLM systems, including detailed examples and best practices, see the Multi-Agent Development Guide.

The guide covers:

Architecture overview and core components
Step-by-step agent creation and registration
Orchestra development patterns
Configuration options and per-agent parameter overrides

Acknowledgement

This codebase is built upon verl-agent and verl. The Search environment is adapted from Search-R1 and SkyRL-Gym. The Math environment is adapted from DeepScaleR and DAPO.

We extend our gratitude to the authors and contributors of these projects for their valuable work.