2025, Nov 17 09:00

Fast jagged diagonal summation in 3D NumPy arrays with vectorized index formation and reusable buffers

Learn how to compute b[i, j] = Σ_k a[i+k, j+k, k] in NumPy fast: build diagonal indices, avoid per-element loops, and cache buffers for repeated runs. With code and benchmarks.

Summing a jagged diagonal across a 3D array sounds trivial until you try to vectorise it. Given a cubic tensor a with shape N×N×N, the goal is to compute b such that b[i, j] = Σ_k a[i+k, j+k, k]. A straightforward pair of loops with an einsum-based inner computation works, but it is slow. Below is a practical way to compute this efficiently in NumPy, and a path to make it even faster when you need to run it many times with the same shape and dtype.

Baseline: the direct approach

The following snippet illustrates the naïve implementation. It is correct, but the nested loops and repeated slicing are expensive.

import numpy as np

N = 10
cube = np.random.rand(N, N, N)
grid = np.zeros((N, N))
for r in range(N):
    for c in range(N):
        # The diagonal starting at (r, c) has length N - max(r, c).
        span = max(r, c)
        # 'iii' sums the main diagonal of the cubic sub-block, i.e.
        # sum of cube[r+k, c+k, k] for k in range(N - span).
        grid[r, c] = np.einsum('iii', cube[r:N-span+r, c:N-span+c, :N-span])

Why this is slow

The summation is not over a rectangular block; each (i, j) pair requires walking a diagonal of varying length that depends on both i and j. This "jagged" nature resists direct broadcasting tricks. Iterating per element and slicing inside the loop amplifies overhead.
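One way to see the jaggedness: the diagonal starting at (i, j) has length N - max(i, j), so no single rectangular slice covers all pairs at once. A minimal sketch (illustrative only, not part of the solution below) tabulates those lengths:

import numpy as np
N = 4
# Length of the diagonal starting at (i, j) is N - max(i, j).
lengths = N - np.maximum(np.arange(N)[:, np.newaxis], np.arange(N))
print(lengths)
# [[4 3 2 1]
#  [3 3 2 1]
#  [2 2 2 1]
#  [1 1 1 1]]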

Efficient computation via index formation

A faster option builds indices that select all needed diagonal segments for each segment length, then reduces along the appropriate axis. This avoids per-element Python overhead and leverages NumPy’s vectorised indexing and summation.

import time
import numpy as np

def pairwise_scan(cube: np.ndarray) -> np.ndarray:
    # Pure-Python reference: walk every jagged diagonal element by element.
    grid = np.zeros(cube.shape[:2], dtype=cube.dtype)
    for r in range(cube.shape[0]):
        for c in range(cube.shape[1]):
            kmax = min(cube.shape[0] - r, cube.shape[1] - c)
            for k in range(kmax):
                grid[r, c] += cube[r+k, c+k, k]
    return grid

def diagonal_pack(cube: np.ndarray) -> np.ndarray:
    # Semi-vectorised: for each segment length, gather every diagonal of
    # that length in one fancy-indexing call and reduce along the k axis.
    h, w = hw = cube.shape[:2]  # assumes a cubic input, so h == w
    grid = np.zeros(hw, dtype=cube.dtype)
    for seg in range(1, 1 + w):
        # idx_mat[c, k] = c + k: middle-axis indices for every start c.
        idx_mat = np.arange(w - seg + 1)[:, np.newaxis] + np.arange(seg)
        tail = range(w - seg, w)  # first-axis indices r + k with r = w - seg
        head = range(seg)         # third-axis indices k
        # Row w - seg: all cells with r >= c share segment length seg.
        grid[w - seg, range(1 + w - seg)] = cube[
            tail, idx_mat, head,
        ].sum(axis=1)
        # Column w - seg: cells with r < c (strict, so the corner cell
        # written above is not written twice).
        grid[range(w - seg), w - seg] = cube[
            idx_mat[:-1, :], tail, head,
        ].sum(axis=1)
    return grid

def diag_trace(cube: np.ndarray) -> np.ndarray:
    # Literal approach: row k of the stacked list is cube[r+k, c+k, :],
    # so its trace picks out cube[r+k, c+k, k] for each k.
    N = cube.shape[0]
    grid = np.zeros((N, N))  # float64; cast back when comparing to int input
    for r in range(N):
        for c in range(N):
            u = cube[r:, c:, :]
            grid[r, c] = np.trace([u[k, k] for k in range(min(u.shape))])
    return grid

def run_check() -> None:
    rng = np.random.default_rng(seed=0)
    cube = rng.integers(low=0, high=10, size=(12, 12, 12))
    g1 = pairwise_scan(cube)
    g2 = diagonal_pack(cube)
    g3 = diag_trace(cube)
    assert np.array_equal(g1, g2)
    assert np.array_equal(g1, g3.astype(cube.dtype))

def timeit_run() -> None:
    rng = np.random.default_rng(seed=0)
    cube = rng.integers(low=0, high=99, size=(100, 100, 100))
    # pairwise_scan is left out of the timing; its pure-Python loops
    # dominate at this size.
    for fn in (diagonal_pack, diag_trace):
        t0 = time.perf_counter()
        fn(cube)
        t1 = time.perf_counter()
        print(f'{fn.__name__}: {t1-t0:.4f}')

if __name__ == '__main__':
    run_check()
    timeit_run()

On a representative run, the timings were:

diagonal_pack: 0.0082
diag_trace: 0.2085

This demonstrates that building and summing over preformed indices is roughly 25× faster than the literal diagonal tracing approach in this context.

When the operation repeats: cache indices and reuse buffers

In many real workloads the same computation is invoked repeatedly with constant shape and dtype. A common pattern is to form a as the product of a precomputed x and a per-call y broadcast via y[..., np.newaxis]. When the tensor shape and dtype are invariant across calls, you can precompute the indexing once and reuse temporary buffers to cut allocation overhead.

import typing
import numpy as np

class DiagRunner(typing.NamedTuple):
    n: int
    buf3: np.ndarray  # reusable (n, n, n) workspace for x * y[..., np.newaxis]
    out2: np.ndarray  # reusable (n, n) output buffer
    slots: tuple[tuple, ...]  # per-segment-length index plan, built once

    @classmethod
    def from_like(cls, arr: np.ndarray) -> typing.Self:
        return cls.make(n=arr.shape[0], dtype=arr.dtype)

    @classmethod
    def make(cls, n: int, dtype: np.dtype) -> typing.Self:
        ir = np.arange(1 + n, dtype=np.int32)
        ij = ir[:, np.newaxis] + ir  # ij[c, k] = c + k, sliced by every slot
        return cls(
            n=n,
            buf3=np.empty((n, n, n), dtype=dtype),
            out2=np.empty((n, n), dtype=dtype),
            slots=tuple([
                (
                    # gather indices for row n - seg (cells with r >= c) ...
                    (tail := range(n - seg, n), ij[:n - seg + 1, :seg], head := range(seg)),
                    # ... and for column n - seg (cells with r < c)
                    (ij[:n - seg, :seg], tail, head),
                    # matching destinations in the output
                    (n - seg, range(1 + n - seg)),
                    (range(n - seg), n - seg),
                )
                for seg in range(1, 1 + n)
            ]),
        )

    def apply(self, x: np.ndarray, y: np.ndarray) -> np.ndarray:
        a, b = self.buf3, self.out2
        # Form a = x * y[..., np.newaxis] in the preallocated workspace.
        np.multiply(x, y[..., np.newaxis], out=a)
        for av1, av2, bv1, bv2 in self.slots:
            b[bv1] = a[av1].sum(axis=1)
            b[bv2] = a[av2].sum(axis=1)
        return b

# usage
# runner = DiagRunner.from_like(x)
# result = runner.apply(x, y)

This strategy suits workloads where the computation is executed 10²–10⁴ times, each axis is on the order of 10², and the data type is a complex float, with x shared across calls and y varying per call. By caching the indexing plan and reusing the same work buffers, you minimise both Python-level overhead and per-call allocations.
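As a concrete sketch of that pattern (the sizes, the call count, and the names x and y are illustrative assumptions, not taken from the original code):

import numpy as np
n = 100
rng = np.random.default_rng(seed=0)
# Shared factor, fixed across calls; complex dtype as in the target workload.
x = rng.standard_normal((n, n, n)) + 1j * rng.standard_normal((n, n, n))
runner = DiagRunner.from_like(x)  # index plan and buffers built exactly once
for _ in range(1000):  # stands in for the 10^2-10^4 repeated invocations
    y = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))  # per-call factor
    result = runner.apply(x, y)

Note that apply returns the runner's internal out2 buffer, which is overwritten on the next call; take a result.copy() if a value has to outlive the iteration.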

Why this matters

Understanding how to vectorise non-rectangular reductions is a practical skill for high-performance NumPy. It lets you stay in pure NumPy without resorting to Cython. For repeated evaluations, pushing the indexing and memory-layout decisions outside the hot path yields consistent speed-ups and keeps indexing itself from becoming the bottleneck.

Takeaways

If you need b[i, j] = Σ_k a[i+k, j+k, k], avoid per-element loops and slicing. Build index sets per diagonal length, sum along the last axis, and write into the appropriate rows and columns of the output. When running the same operation many times with constant shape and dtype, precompute the indexing and reuse buffers. Validate correctness against a simple reference and profile on your actual data sizes to confirm the expected gains.
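For that validation step, a minimal sketch that checks the cached-plan runner against the loop reference, reusing pairwise_scan and DiagRunner from the snippets above:

import numpy as np
n = 8
rng = np.random.default_rng(seed=1)
x = rng.standard_normal((n, n, n))
y = rng.standard_normal((n, n))
runner = DiagRunner.from_like(x)
# The runner must agree with the elementwise reference on a = x * y[..., np.newaxis].
assert np.allclose(runner.apply(x, y), pairwise_scan(x * y[..., np.newaxis]))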