Feat: context parallel v2.0 by S1ro1 · Pull Request #3700 · huggingface/accelerate

I'm curious why this is gated in the first place. Can we not use CP in accelerate sans FSDP? They should be independent.

They are (sort of), but FSDP is a free lunch with CP, so imo it should be the default. While we're computing the ring-attention we can prefetch the next FSDP layer for free, which shards model/optimizer/grads down to 1/(cp_size * fsdp_size) per GPU. I have some profiling for this in the concept guide.

TLDR: it can be independent, but there's (almost) no world where it's worth not doing FSDP on top.
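
To put rough numbers on the 1/(cp_size * fsdp_size) claim, here's a back-of-the-envelope sketch. It is not accelerate code: `per_gpu_state_gib` is a hypothetical helper, and the 16 bytes per parameter (fp32 weights + fp32 grads + two fp32 Adam moments) is an assumption that ignores activations, buffers, and any mixed-precision copies.

```python
def per_gpu_state_gib(num_params, cp_size=1, fsdp_size=1, bytes_per_param=16):
    """Rough per-GPU memory (GiB) for params + grads + optimizer state.

    Assumes the state is sharded evenly across cp_size * fsdp_size ranks.
    bytes_per_param=16 is an assumed fp32 + Adam footprint (4 + 4 + 8).
    """
    shard_degree = cp_size * fsdp_size
    return num_params * bytes_per_param / shard_degree / 2**30


if __name__ == "__main__":
    n = 8_000_000_000  # illustrative 8B-parameter model
    for cp, fsdp in [(1, 1), (2, 1), (2, 4)]:
        gib = per_gpu_state_gib(n, cp_size=cp, fsdp_size=fsdp)
        print(f"cp_size={cp} fsdp_size={fsdp}: ~{gib:.1f} GiB per GPU")
```

Under these assumptions, going from no sharding to cp_size=2, fsdp_size=4 divides the per-GPU state by 8, which is the saving you'd leave on the table by running CP alone.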