2025, Nov 12 07:00

Partial Trace in NumPy Without Building the Density Matrix: A Memory-Safe Vector Method

Hit numpy._core._exceptions._ArrayMemoryError building a 91204x91204 density matrix? Compute the partial trace directly from the state vector in NumPy and avoid the 124 GiB allocation.

Building the full outer product of a large complex vector just to take a partial trace is a classic way to run out of memory. If the state dimension is 91204 (complex128), the intermediate density matrix is 91204 × 91204 and immediately exceeds available RAM. The error says it all:

numpy._core._exceptions._ArrayMemoryError: Unable to allocate 124. GiB for an array with shape (91204, 91204) and data type complex128
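Before fixing anything, it is worth confirming the arithmetic behind that traceback:

```python
# Sanity check: the allocation the traceback reports.
n = 91204                  # state dimension: 4 * (150 + 1)**2
bytes_needed = n * n * 16  # complex128 = 16 bytes per element
print(f"{n}x{n} complex128 -> {bytes_needed / 2**30:.0f} GiB")  # 124 GiB
```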

The fix is to avoid materializing the full matrix and compute the partial trace directly from the vector. Below is a walkthrough of the problem, why it happens, and how to implement a memory-safe solution.

Naive approach that blows up memory

The following code constructs the full outer product and then tries to take a partial trace via np.einsum over a reshaped array. The logic is straightforward but memory-inefficient.

import numpy as np

def keep_trace(mat, stay, axes, use_opt=False):
    """Partial trace of `mat`, keeping the subsystems listed in `stay`.

    `axes` holds the dimension of each subsystem, so `mat` must have
    shape (prod(axes), prod(axes)).
    """
    stay = np.asarray(stay)
    axes = np.asarray(axes)
    n_axes = axes.size
    kept_prod = np.prod(axes[stay])

    # Pair up bra/ket indices: kept subsystems get distinct output labels,
    # while traced subsystems share a label and are summed out by einsum.
    a_idx = list(range(n_axes))
    b_idx = [n_axes + k if k in stay else k for k in range(n_axes)]
    shaped = mat.reshape(np.tile(axes, 2))
    reduced = np.einsum(shaped, a_idx + b_idx, optimize=use_opt)
    return reduced.reshape(kept_prod, kept_prod)

nB = 150
vec_len = 4*(nB+1)**2  # 91204
phi_shape = (1, vec_len)

phi = np.random.uniform(-1, 1, phi_shape) + 1.j * np.random.uniform(-1, 1, phi_shape)
phi = phi / np.linalg.norm(phi)

# This allocation is the failure point: 91204 x 91204 complex128 ~ 124 GiB.
# (np.outer flattens its inputs, so the explicit transpose is redundant.)
density = np.outer(phi, phi.conj())
trace_kept = keep_trace(density, [0], [nB+1, 2, nB+1, 2], True)

Why it fails

The outer product density has quadratic memory growth in the state size. For a 91204-dimensional complex state, that is a matrix with about 8.3 billion entries, and allocating it requires roughly 124 GiB for complex128, which is infeasible for most environments. Worse, only the diagonal entries of each 604 × 604 block ever contribute to this partial trace, roughly one element in 604, so the vast majority of the memory and work is wasted.

Efficient approach: compute the partial trace directly from the vector

Work with the state vector itself and skip the explicit density matrix. Let nB be given, so the vector length is 4·(nB+1)², and set v = 4·(nB+1). Viewed as an (nB+1) × (nB+1) grid of v × v blocks, the density matrix has a partial trace whose (i, j) entry is the trace of block (i, j). Because each density element is density[a, b] = phi[a]·conj(phi[b]), that block trace collapses to a sum of elementwise products between two length-v segments of the original vector and its conjugate:

P[i, j] = sum(phi[i*v : (i+1)*v] * phi.conj()[j*v : (j+1)*v])

That produces the same partial trace entries without ever creating the O(N²) matrix.

import numpy as np

nB = 150
phi_shape = (1, 4*(nB+1)**2)

phi = np.random.uniform(-1, 1, phi_shape) + 1.j * np.random.uniform(-1, 1, phi_shape)
phi = phi / np.linalg.norm(phi)

result = np.zeros((nB+1, nB+1), dtype=np.complex128)

block_len = 4*(nB+1)  # v = 604: segment length / block side

# result[r, c] is the trace of the (r, c) block of the (never-built)
# density matrix: a sum over one elementwise product of two segments.
for r in range(nB+1):
    for c in range(nB+1):
        result[r, c] = np.sum(
            phi[0, r*block_len:(r+1)*block_len] *
            phi.conj()[0, c*block_len:(c+1)*block_len]
        )
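To confirm the equivalence on a size where the density matrix is still affordable, the block-sum can be compared against the explicit outer-product route. A small check with nB = 3 (so the density matrix is only 64 × 64):

```python
import numpy as np

nB = 3
block_len = 4 * (nB + 1)  # v = 16
n = 4 * (nB + 1)**2       # 64
phi = np.random.uniform(-1, 1, n) + 1j * np.random.uniform(-1, 1, n)
phi = phi / np.linalg.norm(phi)

# Vector method: sum of elementwise products over aligned segments.
vec_result = np.zeros((nB + 1, nB + 1), dtype=np.complex128)
phi_c = phi.conj()
for r in range(nB + 1):
    for c in range(nB + 1):
        vec_result[r, c] = np.sum(
            phi[r*block_len:(r+1)*block_len] * phi_c[c*block_len:(c+1)*block_len]
        )

# Reference: build the full density matrix and trace each v x v block.
density = np.outer(phi, phi.conj())
ref = np.zeros_like(vec_result)
for r in range(nB + 1):
    for c in range(nB + 1):
        ref[r, c] = np.trace(
            density[r*block_len:(r+1)*block_len, c*block_len:(c+1)*block_len]
        )

print(np.allclose(vec_result, ref))  # True
```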

Why this matters

This technique removes the quadratic memory footprint and eliminates wasted calculations on elements that never contribute to the partial trace. It also keeps the computation in the cache-friendly realm of contiguous vector slices. A small practical tweak: hoist the conjugate and the r-slice out of the loops, since calling phi.conj() in the inner loop conjugates the entire 91204-element vector on every iteration. That reduces overhead while preserving the same numerical result.
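That tweak can be taken to its limit. Since P[i, j] is an inner product between the i-th and j-th length-v segments, reshaping the vector into an (nB+1, block_len) array turns the whole double loop into a single Gram-matrix product. A sketch, using the same random-state setup as above:

```python
import numpy as np

nB = 150
block_len = 4 * (nB + 1)
n = 4 * (nB + 1)**2
phi = np.random.uniform(-1, 1, n) + 1j * np.random.uniform(-1, 1, n)
phi = phi / np.linalg.norm(phi)

# Row i of `segments` is the i-th length-v slice of the state vector, so
# P[i, j] = sum_k segments[i, k] * conj(segments[j, k]) is a matrix product.
segments = phi.reshape(nB + 1, block_len)
result = segments @ segments.conj().T  # shape (nB+1, nB+1)
```

Since phi is normalized, the trace of the resulting reduced matrix is 1, which makes a convenient sanity check.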

Takeaways

When a task involves an outer product purely as an intermediate step, ask whether the downstream operation can be expressed directly on the original vectors. For partial trace in this setting, summing over aligned segments of the state vector is equivalent to tracing submatrices of the outer product, but it avoids the 124 GiB allocation and makes the computation feasible on commodity hardware. Keep the data in its most compact form for as long as possible, and materialize large intermediate arrays only when they are strictly necessary.