2025, Sep 23 05:00

Silent Data Corruption in Large NumPy/Numba complex128 Arrays: A Hardware Case Study

Intermittent wrong maxima in NumPy/Numba complex128 arrays? This case study traces silent data corruption to hardware memory errors: MemTest86 confirms the fault, and ECC memory is what would have caught it early.

Silent data corruption in large numerical arrays is one of those bugs that looks like a software edge case but turns out to be a hardware failure. If you work with NumPy, Numba, and complex128 arrays in memory-heavy simulations, you may run into scenarios where results look randomly wrong without any crash, exception, or warning. Here’s a case study of how that surfaced and what ultimately fixed it.

Context: unexpected corruption in complex128 arrays

The workload runs on a Linux machine with 128 GB RAM and pushes memory hard with multiple arrays around 26 GB. Arrays are complex128. Allocation succeeds, assignments succeed, memory consumption looks right. But when querying extrema, real parts behave, while imaginary parts sometimes return wildly wrong maxima. The minima are usually plausible but incorrect; the maxima are often near the double-precision limit, occasionally NaN, never Inf. Overwriting the suspicious element doesn’t change its value. The issue reproduces more often under high memory pressure and irregularly across runs.
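
One cheap triage step for the “writes don’t stick” symptom is to overwrite a suspicious element and read it back. The helper below is a minimal sketch (the function name and the bad_idx placeholder are illustrative, not from the original report); note that a clean read-back does not prove the memory is healthy, because the value may still be served from the CPU cache rather than DRAM.

import numpy as np


def write_readback_check(arr, flat_index, probe=0.0 + 0.0j):
    # View the array as 1-D without copying, so the write lands in the
    # original buffer (true for contiguous arrays like the ones here).
    flat = arr.reshape(-1)
    flat[flat_index] = probe
    readback = flat[flat_index]
    # False means the store did not take effect as seen from Python,
    # which points below NumPy: driver, kernel, or hardware.
    return readback == probe, readback


# Hypothetical usage, with `cube` the big complex128 array and `bad_idx`
# the flat index of an element whose imaginary part looks impossible:
# ok, value = write_readback_check(cube, bad_idx)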

Reproducer: NumPy + Numba on large complex arrays

The following example demonstrates the behavior by scanning re/im extrema across slices. The logic is unchanged; names are different for readability.

import numpy as np
import numba


# Return (real min, real max, imag min, imag max) of a 1-D complex array.
@numba.jit(cache=True)
def extrema_re_im(arr):
    r_hi = arr[0].real
    r_lo = arr[0].real
    i_hi = arr[0].imag
    i_lo = arr[0].imag

    for z in arr[1:]:
        if z.real > r_hi:
            r_hi = z.real
        elif z.real < r_lo:
            r_lo = z.real
        if z.imag > i_hi:
            i_hi = z.imag
        elif z.imag < i_lo:
            i_lo = z.imag
    return (r_lo, r_hi, i_lo, i_hi)


n_tau = 2048
side = 1024

# Random complex128 mesh and per-slice weights, with real and imaginary
# parts drawn uniformly from [0, 1).
mesh = np.empty((side, side), dtype=complex)  # dtype=complex is complex128
mesh.real[:] = np.random.rand(*mesh.shape)[:]
mesh.imag[:] = np.random.rand(*mesh.shape)[:]
weights = np.empty(n_tau, dtype=complex)
weights.real[:] = np.random.rand(*weights.shape)[:]
weights.imag[:] = np.random.rand(*weights.shape)[:]
# Baseline extrema from the first weighted slice.
plane = mesh[:, :] * weights[0]
(rmin, rmax, imin, imax) = extrema_re_im(plane.flatten())

# Fill the large 3-D array slice by slice while tracking global extrema;
# at (2048, 1024, 1024) complex128 values this is tens of gigabytes.
cube = np.empty((n_tau, side, side), dtype=complex)
for t in range(n_tau):
    plane = mesh[:, :] * weights[t]
    (rmin2, rmax2, imin2, imax2) = extrema_re_im(plane.flatten())
    if rmin2 < rmin:
        rmin = rmin2
    elif rmax2 > rmax:
        rmax = rmax2
    if imin2 < imin:
        imin = imin2
    elif imax2 > imax:
        imax = imax2

    cube[t] = plane[:, :]

# Products of values drawn from [0, 1) keep the extrema small; an imaginary
# maximum near the float64 limit, or a NaN, signals corruption.
print((rmin, rmax, imin, imax))
print(extrema_re_im(cube.flatten()))
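
A cheap follow-up check, appended to the reproducer above (it reuses cube and extrema_re_im from that listing): evaluate the same reduction twice on data that has not been modified. On a healthy machine the two tuples match exactly; divergence on identical input points at the platform rather than the code.

# Continuing from the reproducer above; `cube` is unchanged between calls.
# .ravel() returns a view here, avoiding the large copy that .flatten() makes.
first = extrema_re_im(cube.ravel())
second = extrema_re_im(cube.ravel())
print(first == second)  # expected: True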

There’s also a smaller but still indicative snippet that fills a large 3D array and reads the imaginary max directly:

import numpy as np

n_tau = 2048
side = 1024

cube = np.empty((n_tau, side, side), dtype=complex)
cube.real[:] = np.random.rand(*cube.shape)[:]
cube.imag[:] = np.random.rand(*cube.shape)[:]

# On healthy hardware this prints a value just below 1.0; on the failing
# machine it could intermittently return values near the float64 maximum.
print(np.max(cube.imag))

What’s actually going wrong

The symptoms point to non-deterministic corruption that appears under high memory pressure, affects specific bits in floating-point representations, and doesn’t trigger Python exceptions or OS faults. Real parts remain consistent while imaginary parts occasionally spike to near-maximum float64 values, sometimes with characteristic bit patterns. Attempts to overwrite affected values don’t stick.
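
When a value looks impossible, dumping its raw bit pattern can show whether it is one flipped bit away from something sane. The sketch below (helper names and the example value are illustrative) flips the most significant exponent bit of an ordinary float64 drawn from [0, 1); the result lands near the double-precision limit, and the same kind of flip on a value in [1, 2) produces NaN, which matches the symptoms described above.

import numpy as np


def float64_bits(x):
    # Reinterpret the 8 bytes of a float64 as an unsigned 64-bit integer.
    return int(np.array([x], dtype=np.float64).view(np.uint64)[0])


def flip_bit(x, bit):
    # Return the float64 obtained by flipping a single bit of x.
    flipped = float64_bits(x) ^ (1 << bit)
    return float(np.array([flipped], dtype=np.uint64).view(np.float64)[0])


x = 0.73                 # an ordinary value in [0, 1)
print(f"{float64_bits(x):064b}  {x}")
y = flip_bit(x, 62)      # flip the top bit of the exponent field
print(f"{float64_bits(y):064b}  {y}")  # roughly 1.3e308, near the float64 maximum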

The root cause in this case was not NumPy, Numba, or how the arrays were used. Running MemTest86 revealed persistent memory errors, so many that the run stopped after exceeding 100,000 of them. The failures were single-bit and recurred consistently, including in the first two bytes. Every combination of the four memory modules across the four DIMM slots produced failures, indicating the system was not reliable at the hardware level under this workload. The machine had 128 GB of non-ECC UDIMM on a Ryzen 9 3900X. With all modules and slots exhibiting errors, the likely suspects narrowed to the PSU or the CPU’s memory controller, pending further swap tests.

Resolution: verify memory reliability before debugging software

The practical fix here was to validate the system with a dedicated memory test and treat the results as authoritative. MemTest86 surfaced extensive single-bit errors. That confirmed the corruption was hardware-induced and explained why the bug was intermittent, appeared at scale, and didn’t throw software exceptions.

It’s worth noting that ECC protects against unexpected random bit flips and can report when it’s correcting errors or when it can’t correct them, which makes fault localization much easier. It doesn’t make failing memory cells good, but it helps you detect and diagnose instead of silently corrupting data.
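
On Linux, machines with working ECC typically expose corrected and uncorrected error counters through the kernel’s EDAC subsystem under /sys/devices/system/edac. The sketch below polls those counters; it assumes the usual sysfs layout, which can vary by platform and driver, and the directory is normally absent on a non-ECC system like the one in this case study.

from pathlib import Path


def edac_error_counts(root="/sys/devices/system/edac/mc"):
    # Collect (corrected, uncorrected) error counts per memory controller.
    base = Path(root)
    counts = {}
    if not base.exists():
        return counts  # EDAC driver not loaded, or no ECC reporting
    for mc in sorted(base.glob("mc*")):
        ce, ue = mc / "ce_count", mc / "ue_count"
        if ce.is_file() and ue.is_file():
            counts[mc.name] = (int(ce.read_text()), int(ue.read_text()))
    return counts


# A rising ce_count means ECC is actively correcting bit flips; any
# nonzero ue_count means errors it could not correct.
print(edac_error_counts())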

Because the issue is hardware, there’s no code change that fixes it. The examples above are correct as written; they simply expose how silent memory corruption manifests in large complex128 arrays.

Why this matters for numerical work

Large-scale numerical simulations can stress every part of a system, from DRAM timing margins to the CPU’s memory controller. Silent corruption can masquerade as a logic bug in your kernels or a corner case in your math. When arrays grow into tens of gigabytes and results turn sporadic without exceptions, it’s critical to rule out hardware before refactoring software or redesigning algorithms.

Takeaways

If extrema, norms, or other reductions intermittently return absurd values on large arrays, especially in only one component such as the imaginary part, suspect the platform. Validate the machine with a tool like MemTest86, and if errors show up, stop chasing software ghosts. Choose reliability strategies appropriate for your workload: run diagnostics, swap components to isolate the PSU, CPU memory controller, DIMMs, or motherboard, and use ECC memory when the risk of silent corruption is unacceptable.
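
None of this can be fixed from Python, but a lightweight software-side guard can at least surface corruption sooner: hash the raw bytes of a large array right after filling it and verify the hash again before trusting downstream results. A minimal sketch (the helper name is illustrative):

import hashlib

import numpy as np


def array_digest(arr):
    # Hash the raw bytes without copying; the cast requires a C-contiguous array.
    buf = memoryview(np.ascontiguousarray(arr)).cast("B")
    return hashlib.sha256(buf).hexdigest()


# Hypothetical usage around the reproducer above:
# reference = array_digest(cube)            # right after filling `cube`
# ... long-running computation that only reads `cube` ...
# assert array_digest(cube) == reference    # re-check before using results

A changed digest proves the bytes in memory are no longer what was written; a stable digest is not proof of healthy hardware, so this complements rather than replaces a tool like MemTest86.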

In short, when your complex128 arrays start reporting impossible maxima and won’t accept writes, don’t overfit a software explanation. Prove the hardware first.

The article is based on a question from StackOverflow by laserpropsims and an answer by laserpropsims.