2025, Sep 23 19:00
Fast NumPy updates for variable-length row slices: keep the simple loop and compile it with Numba @njit
Learn how to speed up ragged per-row writes in NumPy by compiling the loop with Numba's @njit. Skip forced vectorization and get roughly 10x faster performance.
Filling large NumPy arrays gets tricky when each row needs a different number of elements updated. Trying to vectorize such a pattern quickly runs into indexing limits or obscure broadcasting pitfalls. A straightforward Python loop produces the right result but becomes painfully slow when the index set is very large (think well over ten million iterations). Here’s a clean way to make it fast without changing the algorithm.
Problem statement
You have an m×n array and three 1×m arrays that drive the fill. For each index from a pre-sorted list, you compute a value and write it into a prefix slice of a row selected by an angle bin. Conceptually, this looks like “for a given angle row, write the same value into columns [0:elev)”. Attempting to broadcast this without a loop leads to issues such as non-integer indices or incompatible shapes. A direct loop works, but with tens of millions of iterations it becomes a bottleneck.
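For concreteness, here is a minimal setup sketch; the shapes, dtypes, and random data are illustrative assumptions, not taken from the original question.
import numpy as np

rng = np.random.default_rng(0)
m, n = 3600, 500                        # grid dimensions (illustrative)
grid = np.zeros((m, n))                 # m x n array to fill
heights = rng.integers(1, n, size=m)    # per-index column boundary (elevation)
radii = rng.uniform(0, 10813, size=m)   # per-index radius feeding the fill value
ang_bins = rng.integers(0, m, size=m).astype(float)  # row selector; may arrive as float
order_idx = np.argsort(radii)[::-1]     # stand-in for the pre-sorted index list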
Baseline code that works but is slow
The following loop demonstrates the update pattern that produces correct results but takes too long on very large inputs.
import numpy as np
# grid: m x n array
# heights, radii, ang_bins: 1 x m arrays
# order_idx: index array (sorted elsewhere)
def fill_naive(grid, order_idx, heights, radii, ang_bins):
    # Walk the pre-sorted indices; later writes overwrite earlier ones.
    for j in order_idx:
        stop = int(heights[j])                 # slice bounds must be integers
        val = 1 - (radii[j] / 10813)           # fill value derived from the radius
        grid[int(ang_bins[j]), 0:stop] = val   # write the row prefix [0:stop)
If you try to remove the loop with a single advanced-indexing expression over the ragged slices, you run into a well-known constraint of NumPy indexing.
IndexError: arrays used as indices must be of integer (or boolean) type
This happens because the update requires a different column range per row, and standard NumPy indexing doesn’t accept per-row variable-length slices in a single expression.
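To see the constraint concretely, here are two hypothetical one-shot attempts (using the names from the setup above) and the errors they raise:
grid[ang_bins] = 0.5
# IndexError: arrays used as indices must be of integer (or boolean) type
# (a float array cannot serve as a row index)

grid[ang_bins.astype(int), :heights] = 0.5
# TypeError: only integer scalar arrays can be converted to a scalar index
# (a slice bound must be one integer, not an array of per-row stops)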
Why vectorization is hard here
The core of the operation is a ragged write: each selected row is filled only up to a per-row boundary given by an elevation value. That means you don’t have a rectangular block to assign in one shot, and you also need integer indices for rows. These two constraints together make the obvious vectorized approaches either invalid or awkward, and they don’t eliminate the loop without substantial restructuring.
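A mask-based rewrite is possible in principle, but it is not a drop-in replacement. The sketch below is an assumption for illustration, not the original author's code: it reproduces the loop's result only when ang_bins contains no duplicate rows, because fancy-index assignment does not guarantee the last-write-wins ordering the sorted loop relies on, and it materializes m×n temporaries, which is prohibitive at tens of millions of indices.
rows = ang_bins.astype(int)[order_idx]
vals = 1 - radii[order_idx] / 10813
mask = np.arange(grid.shape[1]) < heights[order_idx][:, None]  # True inside each row prefix
# Whole-row assignment; with duplicated rows, which write "wins" is unspecified.
grid[rows] = np.where(mask, vals[:, None], grid[rows])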
Fast and simple: compile the loop with Numba
You don’t need to abandon the loop. Instead, compile it with Numba’s @njit to remove Python overhead while preserving the same logic. This approach keeps the code readable and delivers a large speedup.
from numba import njit
@njit
def paint_spans(canvas, idx_sorted, elev_vec, rad_vec, theta_vec):
    # Same algorithm as fill_naive; Numba compiles the loop to native code.
    for j in idx_sorted:
        end = int(elev_vec[j])                    # slice bounds must be integers in nopython mode
        value = 1 - (rad_vec[j] / 10813)          # fill value derived from the radius
        canvas[int(theta_vec[j]), 0:end] = value  # write the row prefix [0:end)
Call this function with your arrays and the pre-sorted index list. The algorithm stays the same; the loop runs at native speed. In practice this already shows about a 10× improvement.
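A minimal benchmark sketch, reusing the setup, fill_naive, and paint_spans from above (timings are illustrative, and the warm-up call pays the one-time JIT compilation cost):
import time

paint_spans(grid, order_idx, heights, radii, ang_bins)  # warm-up: triggers compilation

t0 = time.perf_counter()
fill_naive(grid, order_idx, heights, radii, ang_bins)
t1 = time.perf_counter()
paint_spans(grid, order_idx, heights, radii, ang_bins)
t2 = time.perf_counter()

print(f"interpreted loop: {t1 - t0:.4f} s")
print(f"compiled loop:    {t2 - t1:.4f} s")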
Why this matters
When the index list is on the order of many millions, interpreter overhead dominates even though each iteration is simple. The data access pattern here doesn’t map neatly to pure NumPy vectorization because each row has a different span to fill. JIT-compiling the loop with Numba sidesteps both problems while keeping the codebase compact and maintainable.
Takeaways
If you need to assign variable-length slices per row, don’t force vectorization that doesn’t fit. Keep the clear loop, but JIT-compile it. Ensure that row indices and slice bounds are integers, keep the per-row slice as [0:elev), and compute the fill value as 1 - (r / 10813). With @njit, you retain correctness and get the performance you need without contortions.
The article is based on a question from StackOverflow by Hank Golding and an answer by Aadvik.