2025, Dec 03 11:00

PyTorch: Ordering to(device) and unsqueeze When Moving Tensors Between CPU and GPU

Learn whether the order of to(device) and unsqueeze matters in PyTorch. Move tensors to the target device first for clarity, speed, and fewer CPU-GPU transfers.

Moving tensors between CPU and GPU in PyTorch often goes hand in hand with small shape tweaks. A common pattern is calling to(device) somewhere in a chain of tensor ops, and it is easy to swap its position with something lightweight like unsqueeze. The natural question follows: does the order matter for performance and correctness?

Example that triggers the question

Consider two seemingly equivalent snippets. They produce tensors with the same shape, but the calls are ordered differently:

payload.unsqueeze(0).to(target_device)
payload.to(target_device).unsqueeze(0)
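
To make the comparison concrete, here is a minimal, self-contained version of both snippets. The tensor name payload and its image-like shape are just assumptions for illustration; the device falls back to CPU if CUDA is unavailable.

import torch

target_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
payload = torch.randn(3, 224, 224)  # assumed example: an image-like tensor on the CPU

a = payload.unsqueeze(0).to(target_device)   # add batch dim, then move
b = payload.to(target_device).unsqueeze(0)   # move, then add batch dim

print(a.shape, b.shape)    # torch.Size([1, 3, 224, 224]) for both
print(a.device, b.device)  # both live on target_device
print(torch.equal(a, b))   # True: same values, same shape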

What actually happens

Operations are executed in the order they are called. In the first line, unsqueeze happens first and the result is then moved to target_device. In the second line, the tensor is moved first and then unsqueeze is applied.

For unsqueeze specifically, only tensor metadata changes: the shape and strides are updated without touching the underlying data, so placing it before or after to(target_device) makes little difference. The moment you replace unsqueeze with something that touches the data, the order can start to matter a lot more.
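
A quick way to see the metadata-only behavior is that unsqueeze returns a view over the same storage; this small check (using an arbitrary tensor for illustration) shows no data is copied:

import torch

t = torch.randn(4, 5)
v = t.unsqueeze(0)

print(v.shape, v.stride())           # torch.Size([1, 4, 5]) with an extra stride entry
print(v.data_ptr() == t.data_ptr())  # True: same underlying storage, no copy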

A simple, consistent way to structure the code

If the intent is to do work on a specific device, it is safer to move the tensor first and then run the operations. This keeps the execution model explicit and consistent.

result = payload.to(target_device).unsqueeze(0)

The same idea can be written in two steps if that reads more clearly in your codebase:

tensor_on_dev = payload.to(target_device)
result = tensor_on_dev.unsqueeze(0)

For unsqueeze, both orders yield effectively the same performance in practice because only metadata changes. The more general takeaway is about future-proofing: once the chain includes operations that do real work on the data, the order can have a significant impact.
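
To illustrate, here is a sketch where the lightweight op is replaced by arithmetic; the scaling is just a stand-in for any real computation, and payload/target_device are the same assumed names as above. The first line does the math on the CPU and then copies the result, while the second copies the raw tensor and does the math on target_device.

import torch

target_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
payload = torch.randn(3, 224, 224)

# Arithmetic runs on the CPU; only the finished result is transferred:
result_cpu_math = (payload * 2.0 + 1.0).to(target_device)

# Raw tensor is transferred first; the arithmetic runs on target_device:
result_dev_math = payload.to(target_device) * 2.0 + 1.0

print(torch.allclose(result_cpu_math, result_dev_math))  # True, but the work ran on different devices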

Why this is worth remembering

Chaining makes code concise, but it also hides where data lives at each step. Being deliberate about calling to(target_device) up front helps avoid unnecessary transfers and keeps the performance model predictable. When in doubt, run the code and measure; it is the most direct way to see the effect in your context.
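
If you do want to measure it, a rough timing sketch might look like the following. It assumes a CUDA device is available; time_it is a hypothetical helper, and the synchronize calls are needed because CUDA kernels are launched asynchronously.

import time
import torch

assert torch.cuda.is_available(), "this sketch assumes a CUDA device"

def time_it(fn, warmup=10, iters=100):
    # Hypothetical helper: average wall-clock time of fn(),
    # synchronizing so pending GPU work is included in the measurement.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

payload = torch.randn(3, 224, 224)
target_device = torch.device("cuda")

print("unsqueeze then to:", time_it(lambda: payload.unsqueeze(0).to(target_device)))
print("to then unsqueeze:", time_it(lambda: payload.to(target_device).unsqueeze(0)))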

Conclusion

Call order matters because PyTorch executes operations in the sequence you specify. For unsqueeze, the difference is negligible since it only adjusts shape/stride. As a habit, move tensors to the target device first and then apply transformations. Keep your device semantics explicit and benchmark when performance matters.