2025, Dec 18 05:00

Rust vs NumPy on Poisson bootstrap repeat: why np.repeat is faster and how to match it with SIMD memcpy

Learn why NumPy's np.repeat outperforms Rust iterator loops in Poisson bootstrap: memcpy, SIMD, and contiguous copies vs indexing, plus tips to speed up Rust.

When you implement a Poisson bootstrap in Rust, it’s tempting to compare your core building blocks with NumPy. One such block is a repeat function: given an array and a per-element weight, produce a new array where each input value is duplicated according to its weight. A straightforward iterator-based Rust implementation can be surprisingly slow compared with NumPy’s np.repeat, and, even more surprisingly, np.repeat often outperforms seemingly equivalent NumPy approaches like slicing or np.take.

Problem setup

The operation we’re after looks like this: values [1, 2, 3] with weights [1, 2, 0] should produce [1, 2, 2]. An idiomatic Rust implementation might lean on iterators and repeat_n. Here is a direct translation of that idea.

/// Repeat each value `times[i]` times, e.g.
/// inflate(&[1.0, 2.0, 3.0], &[1, 2, 0]) == [1.0, 2.0, 2.0].
pub fn inflate(values: &[f64], times: &[u64]) -> Vec<f64> {
    values
        .iter()
        .zip(times.iter())
        .flat_map(|(&v, &k)| std::iter::repeat_n(v, k as usize))
        .collect()
}

On the NumPy side, a quick benchmark demonstrates the shape of the difference. Building a list of indices and calling np.take should be logically equivalent to repeating, yet np.repeat is around three times faster than slicing/take in this scenario.

import timeit
import numpy as np
N = 100_000
src = np.random.rand(N)
wts = np.random.poisson(1, len(src))
# Precompute the indices (element j repeated wts[j] times) so Python looping
# doesn’t contaminate the timing; np.take(src, idx) then matches np.repeat(src, wts)
idx = []
for j, w in enumerate(wts):
    for _ in range(w):
        idx.append(j)
idx = np.asarray(idx)
print(timeit.timeit(lambda: np.repeat(src, wts), number=1_000))
print(timeit.timeit(lambda: np.take(src, idx), number=1_000))

What’s really causing the gap

The short version is that NumPy takes advantage of wider loads and stores, while the naive Rust version does not. The iterator-based Rust code ends up performing scalar moves, essentially handling one 64-bit value at a time in its hot loops. There is no memcpy involved, and there is no loop-level use of SIMD registers. That alone puts a ceiling on throughput.
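
To make that concrete, here is roughly the loop shape the iterator chain boils down to. This is a conceptual sketch of the hot loop, not the exact code the compiler emits.

// Roughly what the iterator version does: one push, i.e. one scalar 8-byte
// store, per output element, with occasional reallocation as the Vec grows.
pub fn inflate_scalar(values: &[f64], times: &[u64]) -> Vec<f64> {
    let mut out = Vec::new();
    for (&v, &k) in values.iter().zip(times.iter()) {
        for _ in 0..k {
            out.push(v); // one f64 at a time: no memcpy, no wide stores
        }
    }
    out
}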

NumPy, on the other hand, uses memcpy in nested loops whenever it can. For plain contiguous arrays of a native dtype, this path applies. memcpy implementations are aggressively tuned and can employ SIMD under the hood on platforms where it pays off. For sufficiently large runs of data, that translates to much wider transfers per instruction and far better sustained bandwidth than scalar moves can achieve. On modern CPUs with wide vector units, this difference alone can account for a several-fold speedup.

Why is np.repeat so much faster than slicing or np.take? Indexing is hostile to SIMD in this context. With np.take, the implementation needs to read both data and indices, and the access pattern is no longer a contiguous copy. While there are gather instructions on some x86-64 CPUs, they are not as efficient as packed loads/stores in practice, especially for 64-bit items. Even if gathers are available, the extra memory traffic for indices means the indexed path cannot beat straight contiguous copies. Put simply, index-based operations are less SIMD-friendly than bulk copies, and that gap shows up clearly in timings.
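
For contrast, the index-based strategy has the shape below. This is an illustrative Rust sketch, not NumPy’s actual np.take implementation: every iteration reads an index and then performs a data-dependent load, so there is no contiguous run to hand off to a bulk copy.

// Index-based "take": each iteration consumes an index and then loads from a
// data-dependent address, so the output is built from scattered reads rather
// than contiguous runs.
pub fn take(src: &[f64], idx: &[usize]) -> Vec<f64> {
    idx.iter().map(|&i| src[i]).collect()
}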

What to do instead

If you can keep the operation as a sequence of contiguous copies, you want to do that. That’s exactly what makes np.repeat so fast: it devolves to bulk copying when possible. In Rust, aim for an approach that mirrors this behavior rather than iterating scalars; the closer you are to contiguous memcpy-like transfers, the more you benefit from wide loads/stores. Also, make sure you are building and timing optimized code; release builds matter for realistic performance comparisons.
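
One way to move in that direction, offered as a sketch rather than a definitive implementation, is to preallocate the exact output length and write each run with slice::fill. Whether the compiler turns this into wide stores depends on the target and optimization level, but it at least presents the work as contiguous runs instead of per-element pushes.

/// Same operation as `inflate`, expressed as contiguous runs: preallocate the
/// exact output length, then fill each run in place.
pub fn inflate_runs(values: &[f64], times: &[u64]) -> Vec<f64> {
    let total: usize = times.iter().map(|&k| k as usize).sum();
    let mut out = vec![0.0; total];
    let mut pos = 0;
    for (&v, &k) in values.iter().zip(times.iter()) {
        let k = k as usize;
        out[pos..pos + k].fill(v); // one contiguous fill per run instead of k scalar pushes
        pos += k;
    }
    out
}

Whatever variant you end up with, compare it against the baseline in an optimized build (cargo build --release or cargo bench); a debug build will make every version look uniformly slow.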

Why this matters

Data engineering and numerical code often swing between two techniques that look equivalent on paper: indexing into a source versus copying contiguous chunks. The former incurs irregular access patterns and extra reads of indices, and the CPU cannot fully light up its vector units. The latter lets the toolchain use well-tuned memcpy routines that exploit SIMD and the memory subsystem efficiently. Understanding this distinction helps you avoid chasing micro-optimizations in the wrong place and directs you toward algorithmic shapes that the hardware favors.

Takeaways

NumPy’s advantage in this case comes from using memcpy, which maps to wide vectorized loads and stores. A naive iterator approach in Rust performs scalar moves and pays for it in throughput. Index-based strategies like slicing or np.take are slower because they cannot leverage the same contiguous-copy path and must process indices in addition to data. Whenever you can, choose designs that let the runtime copy contiguous memory in large chunks, avoid unnecessary indexing, and always measure in optimized builds.