2025, Dec 07 01:00

Working with Jagged Data in NumPy Object Arrays: np.vectorize pitfalls, faster loops, frompyfunc

Learn why np.vectorize breaks on NumPy object arrays with jagged data, how to fix it with otypes or frompyfunc, and why a simple Python loop is often faster.

Working with jagged data inside NumPy can be tempting, especially when you want array semantics with lists of different lengths. A common pattern is to keep Python lists in an object dtype array and try to apply operations “vectorized-style.” That tends to break in surprising ways. Here is a minimal, reproducible guide to what goes wrong, why it happens, and what actually works.

Reproducing the issue

Suppose we have an object array holding 500 Python lists of varying lengths, and we want to append the value 100 to a subset of rows at once.

import numpy as np

# 500 Python lists of random length (0 to 9), stored in an object-dtype array
arr_jag = np.array([[t for t in range(np.random.randint(10))] for _ in range(500)], dtype=object)

# The rows we want to modify
targets = [0, 10, 20, 30, 40, 50]

# "Vectorized" append of 100 to each selected list
v_add = np.vectorize(lambda seq: seq + [100])

# Raises: ValueError: setting an array element with a sequence
arr_jag[targets] = v_add(arr_jag[targets])

The code tries to “broadcast” a Python-level operation over a selection of lists. Instead, it fails with ValueError.

Why this fails

The core issue is twofold. First, NumPy arrays with dtype=object store arbitrary Python objects; there is no elementwise vectorized arithmetic for lists. Second, np.vectorize does not introduce low-level vectorization; it is essentially a thin loop that calls your Python function repeatedly. When you do not specify otypes, np.vectorize calls the function once on the first input element and infers the output dtype from that result. Here the first result is a list of integers, so it infers a numeric dtype, and NumPy then tries to cast the full set of results, lists of different lengths, into a regular numeric ndarray. That cast is what raises ValueError: setting an array element with a sequence.
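You can reproduce the failing step in isolation. The snippet below is a minimal sketch of the inference np.vectorize performs; the sample lists are illustrative, not taken from the original code.

import numpy as np

# What np.vectorize does without otypes: call the function on the first
# element and infer the output dtype from the result.
first_result = [0, 1, 100]                 # illustrative: the first transformed list
inferred = np.asarray(first_result).dtype  # a numeric dtype, typically int64
print(inferred)

# Casting jagged results into that numeric dtype fails the same way:
try:
    np.array([[0, 100], [0, 1, 100]], dtype=inferred)
except ValueError as exc:
    print(exc)  # setting an array element with a sequence ...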

A working approach with np.vectorize

You can make the vectorized call assignable by forcing an object return type. That aligns the right-hand side with the left-hand-side selection.

import numpy as np

arr_jag = np.array([[t for t in range(np.random.randint(10))] for _ in range(500)], dtype=object)

targets = [0, 10, 20, 30, 40, 50]

add_tail = lambda seq: seq + [100]
# otypes=[object] tells np.vectorize to keep the results as Python objects
vec_obj = np.vectorize(add_tail, otypes=[object])

# The right-hand side is now an object array, so the assignment succeeds
arr_jag[targets] = vec_obj(arr_jag[targets])

This preserves the intended behavior: each selected list gets a 100 appended, and the assignment succeeds because the result is an object array. Note that seq + [100] builds a new list, so the selected rows are replaced rather than mutated in place.

The simple loop is faster

Even though the name suggests otherwise, np.vectorize is still a Python-level loop with overhead. A direct loop over the indices is often faster when you are mutating Python objects inside an object array.

import numpy as np

arr_jag = np.array([[t for t in range(np.random.randint(10))] for _ in range(500)], dtype=object)

targets = [0, 10, 20, 30, 40, 50]

add_tail = lambda seq: seq + [100]

# One indexed read and one indexed write per target row, no wrapper overhead
for pos in targets:
    arr_jag[pos] = add_tail(arr_jag[pos])

Empirical comparisons show that the explicit loop is quite a bit faster than np.vectorize for this use case. Using np.frompyfunc to build a Python ufunc can be faster than np.vectorize, though iteration still comes out ahead on the same task. The following timings illustrate the relative performance on a smaller sample and selection:

np.vectorize(..., otypes=[object]): about 19.4 μs per call
explicit for-loop: about 2 μs per loop
np.frompyfunc: about 9.34 μs per call

Iteration is the fastest of the three in this scenario.
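To reproduce a comparison like this yourself, here is a minimal sketch using the standard-library timeit; the array size and targets are illustrative, and exact numbers will vary by machine.

import timeit
import numpy as np

arr_jag = np.array([[t for t in range(np.random.randint(10))] for _ in range(50)], dtype=object)
targets = [0, 10, 20, 30, 40]

add_tail = lambda seq: seq + [100]
vec_obj = np.vectorize(add_tail, otypes=[object])
apply_py = np.frompyfunc(add_tail, 1, 1)

def loop_apply():
    # apply without assigning back, so the array stays unchanged between runs
    return [add_tail(seq) for seq in arr_jag[targets]]

print(timeit.timeit(lambda: vec_obj(arr_jag[targets]), number=10_000))
print(timeit.timeit(loop_apply, number=10_000))
print(timeit.timeit(lambda: apply_py(arr_jag[targets]), number=10_000))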

An alternative API with np.frompyfunc

If you prefer a ufunc-style call signature, np.frompyfunc provides that without changing the underlying cost model. It returns an object array and is generally lighter than np.vectorize.

import numpy as np

arr_jag = np.array([[t for t in range(np.random.randint(10))] for _ in range(500)], dtype=object)

targets = [0, 10, 20, 30, 40, 50]

add_tail = lambda seq: seq + [100]
# Build a ufunc from a Python callable: one input, one output; it always returns object dtype
apply_py = np.frompyfunc(add_tail, 1, 1)

arr_jag[targets] = apply_py(arr_jag[targets])

This produces an object array suitable for assigning back into arr_jag at the chosen indices.

Why this matters

NumPy shines with fixed-size, homogeneous, numeric arrays. Arrays of lists are stored as Python objects, and operations on them run in Python space, which means vectorize-like wrappers do not unlock C-level speedups. For jagged and otherwise irregular data, consider structures designed for that shape. For example, representing ragged data as a flat array of values plus start–end offsets can be efficient to process, and dedicated libraries are better suited for irregular layouts. There are also sparse representations that mirror this “array of objects” concept. In particular, the scipy.sparse LIL format uses two object-dtype arrays internally and is convenient for iterative construction, although it is neither the most compact format nor the best for computation; conversions between sparse formats are readily available.
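To make the flattened layout concrete, here is a small sketch, not from the original post, that stores jagged rows as one flat values array plus row offsets and reduces each row with np.add.reduceat; the rows and the per-row sum operation are illustrative.

import numpy as np

rows = [[0, 1, 2], [3], [4, 5, 6, 7]]  # illustrative jagged data; all rows non-empty

lengths = np.array([len(r) for r in rows])
offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))  # start index of each row
values = np.array([x for r in rows for x in r])            # one flat, numeric array

# One C-level reduction per row instead of a Python loop
# (note: np.add.reduceat assumes non-empty segments; empty rows need special handling)
row_sums = np.add.reduceat(values, offsets)
print(row_sums)  # [ 3  3 22]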

Conclusion

If you keep Python lists inside a NumPy object array and need to append or transform selected entries, ensure any vectorized wrapper returns object elements, or simply use a clear loop. When performance matters, direct iteration often wins for this pattern, and building a Python ufunc via np.frompyfunc can be a pragmatic alternative with an array API. For large-scale, irregular data workflows, rethink the container: either encode ragged structures in flat form with offsets, adopt a library designed for jagged arrays, or use a sparse representation when that matches the computation model.