2025, Dec 14 23:00

Compute minimal L1 distances between arrays with pure NumPy: vectorized broadcasting, axis-wise reductions, no Python loops

Learn how to compute a similarity score via minimal L1 distances between arrays using NumPy broadcasting and reductions—no Python loops, faster and cleaner.

Vectorizing similarity computation across arrays with different row counts is a common pattern in data processing pipelines. The goal is simple: for each row in one array, find the minimal L1 distance to any row in another array and sum these minima into a single score. The straightforward approach works, but it leans on Python loops. There is a clean NumPy-native solution that removes the Python overhead and makes the computation noticeably faster.

Example: the baseline implementation

The following code evaluates a similarity score between two NumPy arrays of different sizes. For every row in the second array, it computes L1 distances to all rows of the first array, takes the smallest, and finally sums these minima.

import numpy as np
from numpy.linalg import norm as l1norm

x_mat = np.array([(1, 2, 3), (1, 4, 9), (2, 4, 4)])
y_mat = np.array([(1, 3, 3), (1, 5, 9)])

# For each row of y_mat: broadcast-subtract it from every row of x_mat,
# reduce the differences to L1 distances, keep the smallest, then sum.
score = sum(min(l1norm(x_mat - row_vec, ord=1, axis=1)) for row_vec in y_mat)
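Tracing the comprehension on the example data makes the per-row minima concrete; the closest match for each row of the second array happens to sit at distance 1:

```python
import numpy as np
from numpy.linalg import norm as l1norm

x_mat = np.array([(1, 2, 3), (1, 4, 9), (2, 4, 4)])
y_mat = np.array([(1, 3, 3), (1, 5, 9)])

# Closest x-row to (1, 3, 3) is (1, 2, 3): |0| + |1| + |0| = 1
# Closest x-row to (1, 5, 9) is (1, 4, 9): |0| + |1| + |0| = 1
score = sum(min(l1norm(x_mat - row_vec, ord=1, axis=1)) for row_vec in y_mat)
print(score)  # → 2.0
```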

What’s going on under the hood

The logic is consistent and easy to follow. Subtracting a single row from the first array leverages broadcasting to get per-row differences, the L1 norm with axis=1 reduces those differences into distances, and taking the minimum picks the closest row. Doing this inside a Python comprehension repeats the process for each row of the second array and then aggregates the result with a sum. The only drawback is the explicit Python-level loop, which limits efficiency.
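One iteration of that loop, unrolled for the first row of the second array, shows the broadcasting and the reduction step by step:

```python
import numpy as np
from numpy.linalg import norm as l1norm

x_mat = np.array([(1, 2, 3), (1, 4, 9), (2, 4, 4)])
row_vec = np.array([1, 3, 3])  # first row of y_mat

diffs = x_mat - row_vec               # (3, 3) - (3,) broadcasts row-wise
dists = l1norm(diffs, ord=1, axis=1)  # L1 distance to each row of x_mat
print(dists)        # [1. 7. 3.]
print(dists.min())  # 1.0 — the closest row of x_mat
```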

The NumPy-only expression

The loop can be removed entirely by constructing a pairwise distance matrix with broadcasting and reducing it along the right axes. This keeps the same math while pushing all the work into vectorized operations.

import numpy as np
from numpy.linalg import norm as l1norm

x_mat = np.array([(1, 2, 3), (1, 4, 9), (2, 4, 4)])
y_mat = np.array([(1, 3, 3), (1, 5, 9)])

# Broadcast to a (3, 2, 3) tensor of pairwise differences, reduce the
# feature axis to L1 distances, take the column-wise minima, and sum.
score_opt = l1norm(x_mat[:, None, :] - y_mat[None], ord=1, axis=2).min(axis=0).sum()

The indexing expressions x_mat[:, None, :] and y_mat[None] introduce singleton dimensions so that the subtraction broadcasts into a three-dimensional array of pairwise differences. The L1 norm with axis=2 collapses the last dimension into distances, yielding a matrix where each column corresponds to one row of the second array. Taking min(axis=0) selects the smallest distance per column, exactly mirroring the earlier logic of choosing the best match per row, and the final sum aggregates those minima into the same scalar score. Writing min(axis=0) rather than the equivalent min(0) keeps the intent explicit.
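Since an L1 norm is just a sum of absolute values, the same reduction can also be spelled without numpy.linalg.norm. A quick sketch confirming that both forms agree on the example data:

```python
import numpy as np
from numpy.linalg import norm as l1norm

x_mat = np.array([(1, 2, 3), (1, 4, 9), (2, 4, 4)])
y_mat = np.array([(1, 3, 3), (1, 5, 9)])

diff = x_mat[:, None, :] - y_mat[None]  # pairwise differences, shape (3, 2, 3)
dist = np.abs(diff).sum(axis=2)         # L1 distance matrix, shape (3, 2)
score_abs = dist.min(axis=0).sum()
score_opt = l1norm(diff, ord=1, axis=2).min(axis=0).sum()
assert score_abs == score_opt           # identical results
```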

Why this matters

Sticking to NumPy APIs for array-wide computations reduces Python overhead and takes advantage of optimized vectorized kernels. In practical terms, this approach is significantly faster than the loop-based version and even edges out a straightforward Numba rewrite for the same task.
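A rough way to check the speedup on larger inputs is a timeit comparison; the array shapes and repetition count below are arbitrary choices for illustration, and absolute numbers will vary by machine:

```python
import timeit
import numpy as np
from numpy.linalg import norm as l1norm

rng = np.random.default_rng(0)
x_mat = rng.standard_normal((500, 16))
y_mat = rng.standard_normal((200, 16))

def loop_score():
    # Baseline: one Python-level iteration per row of y_mat
    return sum(min(l1norm(x_mat - row_vec, ord=1, axis=1)) for row_vec in y_mat)

def vec_score():
    # Vectorized: one broadcasted tensor, then axis-wise reductions
    return l1norm(x_mat[:, None, :] - y_mat[None], ord=1, axis=2).min(axis=0).sum()

assert np.isclose(loop_score(), vec_score())
print(timeit.timeit(loop_score, number=20))
print(timeit.timeit(vec_score, number=20))
```

Note that the vectorized version materializes the full pairwise-difference tensor, so it trades memory for speed as the row counts grow.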

Takeaway

When you need a similarity score based on minimal L1 distances across rows of two arrays with different sizes, broadcasting plus axis-wise reductions is a direct fit. Shape the arrays to a pairwise difference tensor, apply the norm along the feature axis, take the column-wise minima, and sum. Keep the axis keyword explicit for readability, and resist dropping back to Python loops when NumPy can express the whole pipeline in a single, clear expression.