2025, Nov 23 23:00
numpy.fabs vs numpy.abs on float32 arrays: why the fabs path is slower and how to fix it
Learn why numpy.fabs is slower than numpy.abs on float32 arrays: C math library calls block SIMD and inlining. See benchmarks, notes, and the faster fix.
Why is numpy.fabs noticeably slower than numpy.abs on the same float32 array, even though numpy.abs supports more types and features? The short answer: one of them takes a detour through the C math library, and that detour blocks the optimizations you actually want.
Minimal reproducer
The following script demonstrates the timing gap on a float32 array. It measures microsecond-scale latency per call using timeit.
import numpy as np, timeit as tm
buf = np.random.rand(1000).astype(np.float32)
print('Minimum, median and maximum execution time in us:')
for expr in ('np.fabs(buf)', 'np.abs(buf)'):
    # One call per repetition; setup=expr warms up each code path before the timed run.
    timings = 10**6 * np.array(tm.repeat(stmt=expr, setup=expr, globals=globals(), number=1, repeat=999))
    print(f'{expr:20} {np.amin(timings):8,.3f} {np.median(timings):8,.3f} {np.amax(timings):8,.3f}')
Representative output on an AMD Ryzen 7 3800X shows numpy.fabs more than 2x slower than numpy.abs for the same data size.
What is actually happening
The root cause is not about correctness or edge cases in NumPy user code. It is about which implementation path NumPy chooses under the hood for floating point absolute value.
numpy.fabs always calls the C math library function of the same name (for float32, the fabsf variant); this has been verified by interposing a custom implementation via LD_PRELOAD. In other words, numpy.fabs pays the cost of an external library call, and that call boundary shuts the door on inlining and on the kind of vectorization that makes array code fly.
On glibc, fabsf maps to __builtin_fabsf(x), and the generated code is not intrinsically more complex than a fast absolute value bit operation. The point is not that the library is slow, but that calling into it prevents the compiler and NumPy’s fast paths from doing their best work.
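To see how trivial the operation is in principle, here is a minimal sketch (illustrative only, not what NumPy does internally) of the bit operation mentioned above: clearing the IEEE-754 sign bit of a float32 array with a mask.
import numpy as np
buf = np.random.rand(1000).astype(np.float32) - 0.5
bits = buf.view(np.uint32)                                   # reinterpret the raw float32 bits
fast_abs = (bits & np.uint32(0x7FFFFFFF)).view(np.float32)   # clear bit 31, the sign bit
assert np.array_equal(fast_abs, np.abs(buf))
When the compiler is allowed to inline, absolute value boils down to essentially this single AND (or its SIMD equivalent); routing every element through an external fabsf call forfeits that.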
NumPy appears to route the whole f… family (fabs, fmin, fmax) through the C math library, so a similar penalty can be expected for fmin and fmax relative to numpy.minimum and numpy.maximum, though fmin and fmax do carry additional behavior beyond the simplest min or max.
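For reference, that additional behavior is NaN handling: NumPy's fmin and fmax follow the C convention of ignoring NaN where possible, while numpy.minimum and numpy.maximum propagate it. A small check:
import numpy as np
a = np.array([1.0, np.nan], dtype=np.float32)
b = np.array([2.0, 3.0], dtype=np.float32)
print(np.minimum(a, b))  # [ 1. nan] -> NaN propagates
print(np.fmin(a, b))     # [1. 3.]   -> NaN is ignored, matching C fmin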
There is also a platform dimension. Old reports show that the performance difference and signaling math behavior are not universal. On MIPS, abs used to be slower because the compiler could not safely turn the generic C expression into a bit mask due to potential floating point exceptions, while fabs is not supposed to raise them.
Finally, absolute value for floats has a subtle correctness trap: implementing it as x < 0 ? -x : x breaks on negative zero, because the test is false for -0.0 and the expression returns -0.0 instead of the +0.0 that IEEE-754 semantics require. Modern NumPy makes numpy.abs behave correctly for floating point types, whereas the naïve C expression would not, which helps explain why a simple homegrown replacement is not a drop-in substitute.
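A short demonstration of that pitfall, using a float32 scalar for clarity:
import numpy as np
x = np.float32(-0.0)
naive = -x if x < 0 else x               # x < 0 is False for -0.0, so -0.0 comes back
print(naive, np.signbit(naive))          # -0.0 True  -> sign bit still set
print(np.abs(x), np.signbit(np.abs(x)))  # 0.0 False  -> +0.0, as IEEE-754 requires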
Practical fix
For array workloads, prefer numpy.abs over numpy.fabs. This keeps the operation on the vectorized, inlineable path that NumPy can optimize well.
import numpy as np
buf = np.random.rand(1000).astype(np.float32)
res = np.abs(buf)  # stays on NumPy's vectorized fast path
If you time larger arrays, the gap can widen substantially. Reports include double-digit speedups at 100K elements, and on some CPUs a 100x larger array magnified the difference from modest to very large. It has also been noted that AVX2 can accelerate float32 absolute value by up to 8x, which aligns with the idea that staying on the vectorized path matters.
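To check the scaling on your own hardware, here is a sketch along the lines of the reproducer above (the sizes are arbitrary, and the resulting ratios depend on your CPU, NumPy build, and available SIMD extensions):
import numpy as np, timeit as tm
for n in (1_000, 100_000):
    buf = np.random.rand(n).astype(np.float32)
    for fn in (np.fabs, np.abs):
        # Best of 20 rounds of 100 calls each, reported as microseconds per call.
        per_call = min(tm.repeat(lambda: fn(buf), number=100, repeat=20)) / 100
        print(f'{fn.__name__:10} n={n:>7,} {per_call * 1e6:10.3f} us')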
Why this matters for performance work
Array code performance often hinges on whether operations can be fused, inlined, and vectorized. Routing through a C math library function like fabsf rules that out, so even a semantically minimal operation can turn into a bottleneck in hot loops. The opposite is true for numpy.abs, which modern NumPy implements in a way that respects floating point corner cases while still allowing fast execution.
Keep in mind that behavior and speed can vary across platforms and toolchains. Historical cases such as MIPS show that compiler capabilities and exception semantics can flip the performance story. The takeaway is not that one name is universally faster everywhere, but that the library call boundary in numpy.fabs is a consistent limiter on platforms where SIMD and inlining are available.
Conclusion
If you need absolute values for NumPy arrays, reach for numpy.abs. It preserves IEEE-754 details like negative zero and typically lands on an inlinable, vectorizable path. Be cautious about using the f… family when you care about throughput, and remember that floating point behavior can be platform dependent. When in doubt, benchmark on your target hardware with realistic array sizes, and prefer APIs that let NumPy apply its optimized kernels.