2025, Oct 28 05:00
Matching Keras Dense and NumPy: CPU vs GPU numerical differences explained and how to get identical outputs
Learn why Keras Dense and NumPy matmul can differ by 1e-5 due to GPU vs CPU execution, and how to force parity or compare with tolerance for reproducible tests.
Matching a simple Dense layer from TensorFlow/Keras with a pure NumPy forward pass looks trivial. Multiply an input matrix by the weight matrix, skip the bias, and you should get identical numbers. In practice, you may see tiny discrepancies around 1e-5, which is puzzling when the only visible operation is a matrix multiplication.
Minimal example that exhibits the mismatch
import numpy as np
import keras
from keras import layers
print("Keras version:", keras.__version__)
print("Backend", keras.backend.backend())
# Build a tiny model
src = layers.Input((2,), name='inp')
dense_out = layers.Dense(5, kernel_initializer='random_normal', use_bias=False, name='dense')(src)
toy_net = keras.Model(inputs=src, outputs=dense_out)
# Random input
feed = np.random.random(size=(5, 2)).astype(np.float32)
# Keras forward pass
y_keras = toy_net.predict(feed)
# Extract the Dense kernel
[weight_mat] = toy_net.layers[1].get_weights()
# NumPy forward pass
y_np = np.matmul(feed, weight_mat)
# Compare
print("Keras result:\n", y_keras)
print("NumPy result:\n", y_np)
print("Same result:", np.allclose(y_keras, y_np))
What actually causes the difference
The observed gap does not come from custom math in the Dense layer or from a hidden extra operation. It comes from where the math runs. NumPy executes on the CPU, while TensorFlow/Keras may execute the same matmul on the GPU. Floating-point arithmetic is not associative, and different hardware paths use different kernels, instruction orderings, and fused multiply-add behavior for the same mathematical operation, so they round intermediate results differently and produce slightly different outputs.
Even with a small input of shape (5, 2) multiplied by a kernel of shape (2, 5), these rounding differences across execution units can shift individual elements by values on the order of 1e-5.
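A minimal illustration of the underlying effect, independent of Keras: in float32, the result of a sum depends on how the terms are grouped, so two correct implementations of the same matmul can legitimately disagree in the last bits. The specific values below are only illustrative.
import numpy as np
# The same three float32 values summed in two groupings
a, b, c = np.float32(1e-3), np.float32(1.0), np.float32(-1.0)
left = (a + b) + c   # the small term is rounded into the large one first
right = a + (b + c)  # the large terms cancel exactly before the small one is added
print(left, right)              # two slightly different results
print("Equal:", left == right)  # False: same math, different rounding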
How to align Keras with NumPy
If you run the Keras model on the CPU, you get the same numbers as NumPy for this case. Setting CUDA_VISIBLE_DEVICES to -1 before TensorFlow is imported hides the GPU, which forces TensorFlow/Keras onto the CPU path and removes the discrepancy.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
import numpy as np
import keras
from keras import layers
print("Keras version:", keras.__version__)
print("Backend", keras.backend.backend())
src = layers.Input((2,), name='inp')
dense_out = layers.Dense(5, kernel_initializer='random_normal', use_bias=False, name='dense')(src)
toy_net = keras.Model(inputs=src, outputs=dense_out)
feed = np.random.random(size=(5, 2)).astype(np.float32)
y_keras = toy_net.predict(feed)
[weight_mat] = toy_net.layers[1].get_weights()
y_np = np.matmul(feed, weight_mat)
print("Same result:", np.allclose(y_keras, y_np))
Why this matters for practitioners
Small numerical shifts can propagate across layers and affect tests, reproducibility checks, or regression thresholds. If you verify a forward pass against a reference NumPy implementation, the compute device behind your deep learning framework directly affects bitwise and tolerance‑based comparisons.
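When running both sides on the same device is not practical, for example in tests that must pass on both CPU-only and GPU machines, the usual pattern is to compare with an explicit tolerance instead of exact equality. A minimal sketch, reusing y_keras and y_np from the example above; the tolerance values are illustrative and should be tuned to your layer sizes and dtype.
# Raises with a readable report if any element deviates beyond the tolerance
np.testing.assert_allclose(y_keras, y_np, rtol=1e-4, atol=1e-6)
# Or inspect the worst-case deviation directly
print("Max abs diff:", np.max(np.abs(y_keras - y_np)))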
Takeaways
If you need parity between TensorFlow/Keras and NumPy for linear layers, run both on CPU or both on GPU. When that is not possible, compare with a tolerance and expect tiny differences, even for simple matrix multiplications. Understanding that the discrepancy comes from CPU versus GPU execution helps you choose the right environment for debugging, validating, and writing reproducible tests.