2026, Jan 11 15:00

Solving the cuda:0 vs cuda:1 RuntimeError in multi-GPU PEFT/LoRA training with Transformers and device_map='auto'

Hit a cuda:0 vs cuda:1 RuntimeError while fine-tuning LLMs with PEFT/LoRA and device_map="auto"? Here's the cause and the fix: pin Hugging Face Transformers to 4.49.0.

Training large language models with PEFT and LoRA across multiple GPUs is a reliable way to fit heavyweight backbones into limited VRAM. Yet a deceptively simple error can stop the run cold: tensors landing on different CUDA devices during a single operation. Here’s a concise walkthrough of the issue, how it manifests with Hugging Face Transformers and PyTorch, and the exact change that resolved it.

Reproducing the issue

The setup uses Kaggle's 2xT4 configuration with a model that does not fit into a single GPU's memory, so training relies on device_map="auto" to shard the model across both GPUs. The dataset has 7,317 rows with the fields instruction, output, retrieved_context, and text.
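The format_prompt_fn used in the script below is not shown in the original post; a minimal sketch, assuming it joins these fields into a single text string delimited by the tags the collator expects, might look like this:

def format_prompt_fn(example):
    # Hypothetical implementation: the original post does not include this function.
    # Assumed to build the training text from the instruction, retrieved context,
    # and reference output, using the ### PROMPT: / ### OUTPUT: tags.
    example["text"] = (
        f"### PROMPT:\n{example['instruction']}\n{example['retrieved_context']}\n"
        f"### OUTPUT:\n{example['output']}"
    )
    return example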

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

# data_path is defined elsewhere in the notebook; format_prompt_fn maps the raw fields to training text.
train_ds = load_dataset(data_path, split="train").map(format_prompt_fn)

base_model_id = "yandex/YandexGPT-5-Lite-8B-pretrain"
lm = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

tok = AutoTokenizer.from_pretrained(
    base_model_id, trust_remote_code=True,
    padding_side="left",
    add_eos_token=True, add_bos_token=True,
    use_fast=True
)

tok.pad_token = tok.eos_token

prompt_tag = "### PROMPT:"
answer_tag = "### OUTPUT:"

# SafeCollator is defined elsewhere in the notebook; its arguments match those of
# TRL's completion-only data collator (instruction/response templates, mlm=False).
batch_builder = SafeCollator(
    instruction_template=prompt_tag,
    response_template=answer_tag,
    tokenizer=tok, mlm=False
)

lora_cfg = LoraConfig(...)
sft_cfg = SFTConfig(...)

coach = SFTTrainer(
    lm,
    peft_config=lora_cfg,
    train_dataset=train_ds,
    data_collator=batch_builder,
    args=sft_cfg
)

coach.train()

At runtime the process fails with a device placement error.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

What’s actually going wrong

PyTorch enforces that tensors involved in a single operation reside on the same device. When training a sharded model with device_map="auto", parts of the model legitimately live on different GPUs. That’s fine until an operation tries to mix tensors from, say, cuda:0 and cuda:1 in a way that isn’t supported. The error message is a direct signal that, somewhere in the forward or backward pass, tensors ended up split across devices for a single op.

In this case the model must span both T4s, so relying on automatic placement is the right call; the failure means that somewhere in the training loop an operation received tensors from different shards, i.e. mismatched device placement during training.
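A quick way to see how the model has actually been split is to inspect the device map that Accelerate attaches to the model when device_map="auto" is used; this diagnostic is not part of the original script:

# Mapping of module names to devices chosen by device_map="auto",
# e.g. {'model.embed_tokens': 0, ..., 'lm_head': 1}.
print(lm.hf_device_map)

# Distinct devices that hold parameters; a sharded model reports both GPUs.
print({p.device for p in lm.parameters()})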

The fix that worked

Pinning Transformers to an older release resolved the issue: after downgrading to 4.49.0, training proceeded without the cross-device tensor error.

pip install transformers==4.49.0

No changes to the training script were required beyond using that version.
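On Kaggle it is worth confirming that the pinned version is the one the notebook actually imports, since an already-imported copy keeps its old version until the kernel is restarted. A minimal check:

import transformers

# Fail fast if the runtime still resolves to a different Transformers build.
assert transformers.__version__ == "4.49.0", transformers.__version__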

Working script after applying the fix

The script is identical to the reproduction above. With transformers==4.49.0 installed, the same pipeline (loading and mapping the dataset, loading the sharded model with device_map="auto", building the tokenizer and collator, and calling coach.train()) runs to completion without the device mismatch error.

Why this matters

Multi-GPU fine-tuning with PEFT and LoRA is particularly sensitive to framework behavior around device placement. A minor change in the stack can surface as cross-device tensor operations and break training. Pinning the Transformers version provides predictable behavior and prevents runtime surprises, which is essential when the model is sharded to fit into limited GPU memory.

Takeaways

If you hit a RuntimeError about tensors on cuda:0 and cuda:1 while training a sharded model with device_map="auto", lock your Transformers dependency to a version that is known to work. In the setup above, transformers==4.49.0 solved the problem and restored stable multi-GPU training with PEFT and LoRA on Kaggle's 2xT4 configuration.
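Once a working combination is found, it also helps to record the exact versions of the surrounding stack so the environment can be recreated later; this command is a generic suggestion, not something from the original setup:

pip freeze | grep -E "transformers|peft|trl|accelerate|torch"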