2025, Sep 16 01:00

Solving LangChain ReAct Agent GPU/CPU Device Mismatch in transformers: Pin HuggingFacePipeline device=0

Getting RuntimeError from a LangChain ReAct agent on GPU? Fix transformers device mismatch by pinning HuggingFacePipeline inputs to cuda:0 using device=0.

When wiring up a LangChain ReAct agent on top of a locally hosted HuggingFace model via transformers, it’s easy to end up with a device mismatch: the model sits on the GPU, while the inputs remain on the CPU. The result is a hard failure inside generation, even if device_map is set to auto and the model loads onto cuda:0 without complaint.
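
The failure is not LangChain-specific; the same error class can be reproduced in a few lines of plain PyTorch. A minimal sketch, assuming a CUDA device is available:

import torch

# Embedding weights on cuda:0, index tensor left on the CPU (the default).
embedding = torch.nn.Embedding(100, 16).to("cuda:0")
token_ids = torch.tensor([[1, 2, 3]])

try:
    embedding(token_ids)  # fails: "Expected all tensors to be on the same device"
except RuntimeError as err:
    print(err)

print(embedding(token_ids.to("cuda:0")).shape)  # moving the indices resolves it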

Repro case and context

The setup uses transformers.pipeline for text-generation, wrapped by LangChain’s HuggingFacePipeline, plus a simple DuckDuckGoSearchRun tool. The model initializes on the GPU with device_map="auto", but the AgentExecutor crashes the moment it calls into generate.

RuntimeError: Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0
You are calling .generate() with the input_ids being on a device type different than your model's device. input_ids is on cpu, whereas the model is on cuda.

Minimal script

The following reproduces the issue and shows the working adjustment. The logic is unchanged from the original report; only local variable names differ.

import os
from langchain_community.tools import DuckDuckGoSearchRun
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline
import torch

# Path to local model checkpoint directory
ckpt_dir = "../gpt-oss-20b-local"

try:
    # Build a transformers generation pipeline
    gen_pipe = pipeline(
        "text-generation",
        model=ckpt_dir,
        dtype="auto",
        device_map="auto",
        max_new_tokens=256,
    )

    # Wrap for LangChain and explicitly pin device
    core_llm = HuggingFacePipeline(
        pipeline=gen_pipe,
        model_kwargs={"temperature": 0.5, "device": 0},
    )
    print("Local model is ready.")
except Exception as exc:
    print(f"Load failure: {exc}")
    exit()

# One tool for web search
web_search = DuckDuckGoSearchRun()
skillset = [web_search]

# ReAct-style prompt
prompt_str = """
Answer the following questions as best you can. You have access to the following tools:
{tools}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {input}
Thought:{agent_scratchpad}
"""
schema_prompt = PromptTemplate.from_template(prompt_str)

# Assemble the agent and executor
think_agent = create_react_agent(core_llm, skillset, schema_prompt)
runner = AgentExecutor(agent=think_agent, tools=skillset, verbose=True)

print("Agent initialized.")
print("=" * 50)

# Run a query
user_q = "Who is the current prime minister of the United Kingdom and what is their political party?"
outcome = runner.invoke({"input": user_q})

print("-" * 50)
print(f"Final Response: {outcome['output']}")

What’s actually going wrong

The stack trace spells it out: the model’s weights are on cuda:0, but the input indices used in the embedding lookup arrive on the CPU. During generation, transformers surfaces a warning that input_ids is on cpu while the model is on cuda, and the forward pass eventually fails with “Expected all tensors to be on the same device.” In short, the wrapper isn’t moving inputs to the same device the model is using.
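
One way to confirm the diagnosis is to sidestep the wrapper and drive the pipeline objects directly. The sketch below reuses gen_pipe from the script above and assumes the model landed on cuda:0; generation succeeds once the tokenized prompt is moved to the model's device:

# Diagnostic sketch using the objects from the script above.
print(gen_pipe.model.device)  # expected: cuda:0 under device_map="auto"

prompt = "The capital of France is"
enc = gen_pipe.tokenizer(prompt, return_tensors="pt")           # tensors start on the CPU
enc = {k: v.to(gen_pipe.model.device) for k, v in enc.items()}  # align with the model

out_ids = gen_pipe.model.generate(**enc, max_new_tokens=20)
print(gen_pipe.tokenizer.decode(out_ids[0], skip_special_tokens=True))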

The fix

Pin the device used by the HuggingFacePipeline wrapper. Passing device in model_kwargs ensures inputs are placed on the GPU that holds the model. Setting device to 0 resolves the mismatch:

core_llm = HuggingFacePipeline(
    pipeline=gen_pipe,
    model_kwargs={"temperature": 0.5, "device": 0},
)

This aligns input_ids with the model on cuda:0 and unblocks generation inside the AgentExecutor.
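
A quick sanity check is to invoke the wrapped LLM on its own before handing it to the agent; a minimal sketch with an arbitrary prompt:

# Sanity check after pinning the device.
print(gen_pipe.model.device)  # should still report cuda:0
print(core_llm.invoke("Name the capital of the United Kingdom."))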

Why this matters

Mixed-device execution leads to hard failures in core operators like embedding and index_select and can also degrade performance if it silently falls back to CPU. Ensuring that the model and its inputs live on the same device is essential for stable inference when coupling transformers pipelines with LangChain’s AgentExecutor.

There’s also a deprecation signal in the logs: LangChain’s HuggingFacePipeline in langchain_community is slated for removal in a future release, and an updated class lives in the langchain-huggingface package, imported from langchain_huggingface. Keeping an eye on that migration path can save time during upgrades.
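
If that migration is planned, the change is essentially a package install plus an import swap; a minimal sketch, assuming the langchain-huggingface package is installed and the constructor arguments stay as in the script above:

# pip install -U langchain-huggingface
from langchain_huggingface import HuggingFacePipeline

core_llm = HuggingFacePipeline(
    pipeline=gen_pipe,
    model_kwargs={"temperature": 0.5, "device": 0},
)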

Takeaways

If a LangChain agent throws a device mismatch at generation time even though the model loaded with device_map="auto", enforce the device on the wrapper by passing device: 0 in model_kwargs. This makes input tensors land on cuda:0 consistently and clears the runtime error. Along the way, watch the deprecation notice for HuggingFacePipeline and plan the import change to langchain_huggingface when appropriate.

The article is based on a question from StackOverflow by meysam and an answer by meysam.