Support model_parallel for LoRA

Hi, I ran into an out-of-memory problem when using LoRA for LLaMA.

So I changed the hparams to:

device: cuda
model_parallel: true

But I get another CUDA error:

  File "/data//miniconda3/envs/chat310new/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 91, in forward
    return self.weight * hidden_states
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
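From the traceback, the error comes from an elementwise multiply between a layer parameter and an activation that ended up on different GPUs. Below is a minimal CPU-only sketch of the pattern involved (the variable names `weight` and `hidden` are illustrative, not the library's code): the usual fix is to move the incoming activation to the parameter's device before the multiply.

```python
import torch

weight = torch.ones(4)      # stands in for a parameter pinned to one GPU (e.g. cuda:0)
hidden = torch.randn(2, 4)  # stands in for an activation arriving from another GPU (e.g. cuda:1)

# Workaround pattern: align the activation with the parameter's device
hidden = hidden.to(weight.device)
out = weight * hidden
print(out.shape)  # torch.Size([2, 4])
```

This is just a sketch of the device-alignment pattern; I'm not sure where in the LoRA/model-parallel setup the alignment should actually happen.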

Can you help me use LoRA with multiple GPUs?