2025, Sep 23 19:00
Fast NumPy updates for variable-length row slices: keep the simple loop and compile it with Numba @njit
Learn how to speed up ragged per-row writes in NumPy by compiling the loop with Numba's @njit. Skip forced vectorization and get roughly 10x faster performance.
Filling large NumPy arrays gets tricky when each row needs a different number of elements updated. Trying to vectorize such a pattern quickly runs into indexing limits or obscure broadcasting pitfalls. A straightforward Python loop produces the right result but becomes painfully slow when the index set is very large (think well over ten million iterations). Here’s a clean way to make it fast without changing the algorithm.
Problem statement
You have an m×n array and three 1×m arrays that drive the fill. For each index from a pre-sorted list, you compute a value and write it into a prefix slice of a row selected by an angle bin. Conceptually, this looks like “for a given angle row, write the same value into columns [0:elev)”. Attempting to broadcast this without a loop leads to issues such as non-integer indices or incompatible shapes. A direct loop works, but with tens of millions of iterations it becomes a bottleneck.
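For concreteness, here is a minimal setup sketch; the shapes, dtypes, and random data are illustrative assumptions, not taken from the original question.
import numpy as np

rng = np.random.default_rng(0)
m, n = 3600, 500                        # grid dimensions (illustrative)
grid = np.zeros((m, n))                 # m x n array to fill
heights = rng.integers(1, n, size=m)    # per-index column boundary (elevation)
radii = rng.uniform(0, 10813, size=m)   # per-index radius feeding the fill value
ang_bins = rng.integers(0, m, size=m).astype(float)  # row selector; may arrive as float
order_idx = np.argsort(radii)[::-1]     # stand-in for the pre-sorted index list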
Baseline code that works but is slow
The following loop demonstrates the update pattern that produces correct results but takes too long on very large inputs.
import numpy as np
# grid: m x n array
# heights, radii, ang_bins: 1 x m arrays
# order_idx: index array (sorted elsewhere)
def fill_naive(grid, order_idx, heights, radii, ang_bins):
    # Walk the pre-sorted indices; later writes overwrite earlier ones.
    for j in order_idx:
        stop = int(heights[j])                 # slice bounds must be integers
        val = 1 - (radii[j] / 10813)           # fill value derived from the radius
        grid[int(ang_bins[j]), 0:stop] = val   # write the row prefix [0:stop)
If you try to remove the loop with a single advanced-indexing expression over the ragged slices, you run into a well-known constraint of NumPy indexing.
IndexError: arrays used as indices must be of integer (or boolean) type
This happens because the update requires a different column range per row, and standard NumPy indexing doesn’t accept per-row variable-length slices in a single expression.
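To see the constraint concretely, here are two hypothetical one-shot attempts (using the names from the setup above) and the errors they raise:
grid[ang_bins] = 0.5
# IndexError: arrays used as indices must be of integer (or boolean) type
# (a float array cannot serve as a row index)

grid[ang_bins.astype(int), :heights] = 0.5
# TypeError: only integer scalar arrays can be converted to a scalar index
# (a slice bound must be one integer, not an array of per-row stops)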
Why vectorization is hard here
The core of the operation is a ragged write: each selected row is filled only up to a per-row boundary given by an elevation value. That means you don’t have a rectangular block to assign in one shot, and you also need integer indices for rows. These two constraints together make the obvious vectorized approaches either invalid or awkward, and they don’t eliminate the loop without substantial restructuring.
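A mask-based rewrite is possible in principle, but it is not a drop-in replacement. The sketch below is an assumption for illustration, not the original author's code: it reproduces the loop's result only when ang_bins contains no duplicate rows, because fancy-index assignment does not guarantee the last-write-wins ordering the sorted loop relies on, and it materializes m×n temporaries, which is prohibitive at tens of millions of indices.
rows = ang_bins.astype(int)[order_idx]
vals = 1 - radii[order_idx] / 10813
mask = np.arange(grid.shape[1]) < heights[order_idx][:, None]  # True inside each row prefix
# Whole-row assignment; with duplicated rows, which write "wins" is unspecified.
grid[rows] = np.where(mask, vals[:, None], grid[rows])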
Fast and simple: compile the loop with Numba
You don’t need to abandon the loop. Instead, compile it with Numba’s @njit to remove Python overhead while preserving the same logic. This approach keeps the code readable and delivers a large speedup.
from numba import njit
@njit
def paint_spans(canvas, idx_sorted, elev_vec, rad_vec, theta_vec):
    # Same algorithm as fill_naive; Numba compiles the loop to native code.
    for j in idx_sorted:
        end = int(elev_vec[j])                    # slice bounds must be integers in nopython mode
        value = 1 - (rad_vec[j] / 10813)          # fill value derived from the radius
        canvas[int(theta_vec[j]), 0:end] = value  # write the row prefix [0:end)
Call this function with your arrays and the pre-sorted index list. The algorithm stays the same; the loop runs at native speed. In practice this already shows about a 10× improvement.
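A minimal benchmark sketch, reusing the setup, fill_naive, and paint_spans from above (timings are illustrative, and the warm-up call pays the one-time JIT compilation cost):
import time

paint_spans(grid, order_idx, heights, radii, ang_bins)  # warm-up: triggers compilation

t0 = time.perf_counter()
fill_naive(grid, order_idx, heights, radii, ang_bins)
t1 = time.perf_counter()
paint_spans(grid, order_idx, heights, radii, ang_bins)
t2 = time.perf_counter()

print(f"interpreted loop: {t1 - t0:.4f} s")
print(f"compiled loop:    {t2 - t1:.4f} s")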
Why this matters
When the index list is on the order of many millions, interpreter overhead dominates even though each iteration is simple. The data access pattern here doesn’t map neatly to pure NumPy vectorization because each row has a different span to fill. JIT-compiling the loop with Numba sidesteps both problems while keeping the codebase compact and maintainable.
Takeaways
If you need to assign variable-length slices per row, don’t force vectorization that doesn’t fit. Keep the clear loop, but JIT-compile it. Ensure that row indices and slice bounds are integers, keep the per-row slice as [0:elev), and compute the fill value as 1 - (r / 10813). With @njit, you retain correctness and get the performance you need without contortions.
The article is based on a question from StackOverflow by Hank Golding and an answer by Aadvik.