2025, Oct 01 01:00
Fast vectorized grouping in NumPy: group by key and sum values with unique(return_inverse) and add.at
Learn how to group sparse ndarray keys and sum values in NumPy using unique(return_inverse) and add.at. Avoid bincount blowups and speed up aggregation.
When you have an ndarray of pairs like [curveLocations, distinctCrossings] and some curveLocations repeat, the task is to consolidate rows by that key and sum the associated counts. Sorting is not a requirement, but a fast, vectorized grouping is. A common detour is numpy.bincount, which is tempting but can be disastrous for sparse, high-valued keys. The right fit here is numpy.unique with return_inverse plus a scatter-add via numpy.add.at.
Problem setup
The input is an unsorted 2D ndarray of unsigned integers. The first column stores curveLocations (keys), the second contains distinctCrossings (values to sum). If a key repeats, its values must be aggregated. The output does not have to be sorted.
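A tiny, hand-made instance (the numbers are invented purely for illustration) shows the intended transformation:
import numpy as np

# Unsorted input: column 0 is curveLocations, column 1 is distinctCrossings.
pairs2d = np.array([[12, 1],
                    [ 5, 4],
                    [12, 2]], dtype=np.uint64)

# Desired consolidated output (row order does not matter):
# [[ 5, 4],
#  [12, 3]]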
Keys are sparse, and the maximum key can be very large. As the docs state,
"The length of out is equal to np.amax(x)+1."
That makes numpy.bincount a poor match here because the number of bins would explode when the maximum grows, while most bins remain empty. In the referenced scenario, the maximum key is bounded by curveLocationsMAXIMUM = 1 << (2 * n + 4). With n = 25, you end up near 2^54 bins — clearly not viable.
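A quick back-of-the-envelope check, assuming the n = 25 bound quoted above, makes the scale concrete:
n = 25
curveLocationsMAXIMUM = 1 << (2 * n + 4)  # 2**54

# numpy.bincount would allocate np.amax(x) + 1 bins; at 8 bytes per 64-bit
# count, the output array alone would need roughly 128 pebibytes.
bins_needed = curveLocationsMAXIMUM + 1
print(bins_needed)                  # 18014398509481985
print(bins_needed * 8 / 2**50)      # ~128.0 (PiB)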
Code example that demonstrates the issue
The following snippet shows a vectorized consolidation using a sorted set of unique keys and a separate look-up step per input row. It mirrors the approach that relies on numpy.unique to extract the keys and numpy.searchsorted to map each original row to its group index. The identifiers are renamed for clarity; the behavior matches the original routine.
import numpy as np
def pack_by_location(pairs2d: np.ndarray) -> np.ndarray:
    # pairs2d has shape (N, 2): [:, 0] are keys (curveLocations), [:, 1] are values (distinctCrossings)
    unique_keys = np.unique(pairs2d[:, 0])
    # Prepare a 2-column result: [key, sum]
    grouped = np.tile(unique_keys, (2, 1)).T
    grouped[:, 1] = 0
    # Map each input key to its position among unique_keys, then scatter-add the values
    idx_map = np.searchsorted(grouped[:, 0], pairs2d[:, 0])
    np.add.at(grouped[:, 1], idx_map, pairs2d[:, 1])
    return grouped
This uses two vectorized passes: one to extract and sort keys, and one to search for the index of each original row among those keys. It works, but the combination of numpy.unique and numpy.searchsorted can dominate runtime, as seen in profiling that attributed most of the cost to these functions in aggregateCurveLocations.
Why this happens
The consolidation itself is straightforward: map each original row to a group index and add its value to the corresponding bucket. The nuance is how to build that mapping efficiently. numpy.bincount cannot be used because it allocates up to the maximum key, which is enormous for sparse, wide key spaces. The search approach in the previous snippet is correct but performs an extra pass of binary searches after generating sorted unique keys.
numpy.unique can provide exactly the mapping you need via return_inverse. That “inverse” array is a direct index into the deduplicated keys for every element of the input, avoiding a separate search step. Once you have those indices, numpy.add.at performs an efficient scatter-add to accumulate the sums.
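A minimal illustration of what return_inverse yields on a toy array (not part of the original code):
import numpy as np

keys = np.array([7, 3, 7, 5, 3])
unique_keys, inverse = np.unique(keys, return_inverse=True)
print(unique_keys)  # [3 5 7]
print(inverse)      # [2 0 2 1 0]; unique_keys[inverse] reconstructs keys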
Solution
The function below consolidates by key using numpy.unique(return_inverse=True) and numpy.add.at. It avoids the extra numpy.searchsorted pass while keeping the memory footprint tight and independent of key magnitude.
import numpy as np
def collapse_pairs(pairs2d: np.ndarray) -> np.ndarray:
    # pairs2d has shape (N, 2): [:, 0] are keys (curveLocations), [:, 1] are values (distinctCrossings)
    keys, rev_idx = np.unique(pairs2d[:, 0], return_inverse=True)
    totals = np.zeros_like(keys)
    np.add.at(totals, rev_idx, pairs2d[:, 1])
    return np.column_stack((keys, totals))
This produces a two-column result of [curveLocations, summedDistinctCrossings]. numpy.unique returns the keys in sorted order, which is acceptable here because the output does not need to preserve the original ordering.
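As a quick sanity check, calling collapse_pairs on a small hand-made input (values invented for illustration) produces the expected sums:
import numpy as np

pairs2d = np.array([[12, 1],
                    [ 5, 4],
                    [12, 2],
                    [ 5, 1]], dtype=np.uint64)
print(collapse_pairs(pairs2d))
# [[ 5  5]
#  [12  3]]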
Extending beyond two columns
If additional columns are required after consolidation, compute them on the grouped result. For example, bitwise extractions like groupZulu and groupAlpha can be derived from the consolidated curveLocations using the same masks and shifts as before. The consolidation step is orthogonal to those calculations and simply replaces the slower “grouping by search” routine.
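As a sketch of that idea, the snippet below derives two extra columns from the consolidated keys. The field names come from the original discussion, but the mask width and shift used here are placeholder assumptions, not the real bit layout:
import numpy as np

BITS_ZULU = 16                      # placeholder width; the real layout differs
MASK_ZULU = (1 << BITS_ZULU) - 1

def add_derived_columns(grouped: np.ndarray) -> np.ndarray:
    # grouped has shape (G, 2): [:, 0] consolidated curveLocations, [:, 1] sums
    curveLocations = grouped[:, 0]
    groupZulu = curveLocations & MASK_ZULU      # low bits of each key
    groupAlpha = curveLocations >> BITS_ZULU    # remaining high bits
    return np.column_stack((grouped, groupZulu, groupAlpha))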
Why you want this in your toolkit
Grouping by key and summing values is a foundational operation in data-intensive code, and getting it right at the ndarray level matters. Sparse key spaces rule out bin-counting tricks; per-row Python loops are out of the question; repeated searches add overhead. The unique + inverse + add.at pattern is compact, avoids huge allocations, and scales cleanly with large, unsorted inputs.
Takeaways
To consolidate duplicated keys in NumPy, express the problem as "build a group index per row, then scatter-add." When keys are sparse and potentially huge, skip numpy.bincount. Use numpy.unique with return_inverse=True to form the group index in one shot, and accumulate with numpy.add.at. If your pipeline includes post-aggregation fields derived from curveLocations, compute them on the grouped result exactly as before.
The article is based on a question from StackOverflow by hunterhogan and an answer by Warren Weckesser.