2025, Oct 21 19:00

Vectorized sliding window processing in PyTorch with torch.unfold: faster 1D tensor operations and cleaner code

Learn how to replace Python loops with a vectorized sliding window using torch.unfold in PyTorch. Boost 1D tensor speed with broadcasting and scalable code.

Sliding windows over 1D tensors are a common pattern in time series, signal processing, and sequence modeling. A straightforward loop with torch.roll works, but it leaves performance on the table and doesn’t scale well. The goal is to compute all windowed results in a single, vectorized pass and assemble them into a 2D tensor.

Baseline loop implementation

The following code processes a 1D tensor by repeatedly taking a window of four adjacent values, passing it to a function, and shifting the series by one position with torch.roll on each iteration. The five intermediate outputs are stacked into a 5×4 tensor.

import torch


def apply_block(win: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return win * scale


win_width = 4
series = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
idx_pick = torch.arange(win_width)  # positions 0..3 of the current window

chunks = []
for _ in range(5):
    # Take the first four elements, scale them by the current leading element,
    # then roll the series so the next window moves to the front.
    slice_vals = torch.index_select(series, 0, idx_pick)
    out_vec = apply_block(slice_vals, series[0])
    chunks.append(out_vec)
    series = torch.roll(series, -1)

stacked_loop = torch.stack(chunks)

What actually causes the bottleneck

The logic is inherently a sliding window problem: each step uses the next four consecutive elements. Doing this with a Python loop, repeated indexing, and torch.roll moves the work out of fast, vectorized tensor ops and into the Python interpreter. A better approach is to create a 2D view of all windows at once and pass it to the same computation. That view can be formed without copying data using Tensor.unfold, which exposes the windows as rows of a matrix.
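As a quick sketch of what that view looks like (the variable names here are illustrative, not from the original code), unfolding an 8-element tensor along dimension 0 with window size 4 and step 1 yields a 5×4 matrix whose rows are the consecutive windows, and the result shares storage with the source tensor:

import torch

base = torch.arange(1.0, 9.0)        # tensor([1., 2., ..., 8.])
view = base.unfold(0, 4, 1)          # (dimension, window size, step)

print(view.shape)                    # torch.Size([5, 4])
print(view[0])                       # tensor([1., 2., 3., 4.])
print(view[1])                       # tensor([2., 3., 4., 5.])

# The result is a view, not a copy: it points at the same underlying memory.
print(view.data_ptr() == base.data_ptr())  # True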

Vectorized approach with a sliding window view

Below, the same operation is performed in one call by building a 2D “sliding window view” and feeding it to the function. The scale term is taken from the first element of each window and broadcast to match shapes.

import torch


def apply_block(win: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return win * scale


win_width = 4
base = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# All five windows of width 4 (step 1) as rows of a 5x4 view; no data is copied.
windows = base.unfold(0, win_width, 1)
# The scale is the first element of each window, kept as a (5, 1) column so it
# broadcasts across each row.
vectorized_out = apply_block(windows, windows[:, 0:1])

# Optional cross-check against the loop version
check_series = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
idx_pick = torch.arange(win_width)
accum = []
for _ in range(5):
    sl = torch.index_select(check_series, 0, idx_pick)
    accum.append(apply_block(sl, check_series[0]))
    check_series = torch.roll(check_series, -1)
stacked_loop = torch.stack(accum)

assert torch.allclose(stacked_loop, vectorized_out)

Why this matters

Replacing Python-side loops with a single tensor operation keeps the workload inside PyTorch’s optimized execution path. The code also becomes clearer: the 2D window matrix makes the intent explicit, and broadcasting the scale along the second dimension aligns naturally with the data layout. The benefit grows with the number of windows, where Python loop overhead and repeated indexing become noticeable.
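As a rough illustration (a sketch only; the numbers depend on hardware and tensor size, and the loop below uses plain slicing instead of torch.roll, which is equivalent for this comparison), the gap is easy to measure on a longer series:

import time

import torch


def apply_block(win: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return win * scale


win_width = 4
series = torch.arange(1.0, 100_001.0)            # 100,000 elements
n_windows = series.numel() - win_width + 1

# Loop: slice out each window and scale it one at a time.
start = time.perf_counter()
chunks = [apply_block(series[i:i + win_width], series[i])
          for i in range(n_windows)]
loop_out = torch.stack(chunks)
loop_time = time.perf_counter() - start

# Vectorized: one unfold call plus one broadcasted multiply.
start = time.perf_counter()
windows = series.unfold(0, win_width, 1)
vec_out = apply_block(windows, windows[:, 0:1])
vec_time = time.perf_counter() - start

assert torch.allclose(loop_out, vec_out)
print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.3f}s")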

Conclusion

When you need consecutive groups from a tensor, construct a sliding window view with unfold and apply your computation once across all windows. When you do need explicit indexing, keep the indices as a tensor so torch.index_select can consume them directly, and rely on broadcasting to match shapes without extra copies. This approach preserves the original logic while improving readability and scalability.
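On the indexing point, a brief sketch (not from the original post): torch.index_select expects an integer tensor of indices, so keeping the indices as a tensor avoids rebuilding one from a Python list on every call:

import torch

series = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# index_select takes an integer tensor of positions along the chosen dimension.
idx = torch.arange(4)                             # tensor([0, 1, 2, 3])
first_window = torch.index_select(series, 0, idx)
print(first_window)                               # tensor([1., 2., 3., 4.])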

The article is based on a question from StackOverflow by FlumeRS and an answer by simon.