2025, Nov 10 01:00

NumPy MNIST Classifier Stuck at 10%? Fix Gradient Axes, Bias Shapes, and Use a Stable Softmax

Why a tiny NumPy MNIST network sticks near 10% and how to fix it: correct the gradient axes, sum bias gradients over axis 0, and use a numerically stable softmax so backprop produces correctly shaped updates.

Training a tiny MNIST classifier from scratch in NumPy often fails for a very mundane reason: gradients computed along the wrong axes. The model looks fine, the loss seems to move, but accuracy stubbornly sits near chance. In a two-layer net, a small axis mistake in just one or two places is enough to corrupt every update step. Below is a minimal, reproducible path from the problem to a working fix.

Problem setup and symptoms

The network is a straightforward ReLU hidden layer followed by a softmax head. Data is flattened to shape (samples, features). The forward computation matches that convention, but the backward pass mixes up matrix orientations and reduces the bias gradients along the wrong axis. The immediate effect is that parameter updates use mis-shaped tensors, which NumPy quietly broadcasts over the weights, so the model hovers around 10–14% accuracy and then drifts back to chance.

Code that exhibits the issue

The following snippet mirrors the faulty gradient flow. Names are concise, but the logic matches the broken version exactly, including the problematic axes and transposes; the comments flag where the shapes go wrong.

import numpy as np
w_in = np.random.randn(784, 10)
b_in = np.random.randn(10)
w_out = np.random.randn(10, 10)
b_out = np.random.randn(10)
def act_relu(z, deriv=False):
    if deriv:
        return z > 0
    return np.maximum(z, 0)
def act_softmax(y):
    ex = np.exp(y - np.max(y))  # subtracts the global max rather than the per-row max
    return ex / ex.sum(axis=1, keepdims=True)
def to_onehot(y):
    z = np.zeros((len(y), 10))
    for i in range(len(y)):
        z[i][y[i]] = 1
    return z
def forward_pass(x):
    z1 = x.dot(w_in) + b_in
    a1 = act_relu(z1)
    z2 = a1.dot(w_out) + b_out
    a2 = act_softmax(z2)
    return z1, a1, z2, a2
def accuracy(probs, y_true):
    preds = probs.argmax(axis=1)
    total = 0
    i = 0
    while i < len(preds):
        if preds[i] == y_true[i]:
            total += 1
        i += 1
    return (total / len(preds)) * 100
def apply_step(lr, d_w1, d_b1, d_w2, d_b2):
    global w_in, b_in, w_out, b_out
    w_in = w_in - lr * d_w1.T   # the .T only compensates for d_w1 being built in the wrong orientation
    b_in = b_in - lr * d_b1
    w_out = w_out - lr * d_w2
    b_out = b_out - lr * d_b2
def fit_loop(x_data, y_data, epochs=40, lr=0.1):
    for ep in range(epochs):
        y_hot = to_onehot(y_data)
        m = len(y_hot)
        z1, a1, z2, a2 = forward_pass(x_data)
        d_z2 = a2 - y_hot
        d_w2 = (1 / m) * d_z2.T.dot(a1)        # (10, 10), but the transpose of the correct gradient a1.T @ d_z2
        d_b2 = (1 / m) * np.sum(d_z2, axis=1)  # sums over classes -> shape (m,), not (10,)
        d_z1 = w_out.dot(d_z2.T).T * act_relu(z1, deriv=True)  # equivalent to d_z2 @ w_out.T, just written awkwardly
        d_w1 = (1 / m) * d_z1.T.dot(x_data)    # (10, 784), the transpose of w_in's (784, 10)
        d_b1 = (1 / m) * np.sum(d_z1)          # sums over everything -> a scalar, not (10,)
        apply_step(lr, d_w1, d_b1, d_w2, d_b2)
        probs = forward_pass(x_data)[-1]
        print("epoch", ep + 1, "acc", accuracy(probs, y_data))

What actually goes wrong

The gradient tensors are oriented inconsistently with the data layout. The weight gradients use unnecessary transposes, and both bias gradients are reduced over the wrong axis, producing shapes that do not match the parameter vectors. Because NumPy broadcasting masks these mistakes, the code runs but the updates mangle the parameters on each step. That’s why the model sticks around chance, approximately 10%.
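
To make the failure mode concrete, here is a toy sketch (values made up, not taken from the training loop) of how a mis-shaped bias gradient slips through: a scalar "gradient" broadcasts against a (10,) bias without any error, so nothing crashes even though the update carries no per-class information.

import numpy as np
b = np.zeros(10)                  # bias vector, shape (10,)
d_z = np.ones((5, 10))            # delta for a batch of 5 samples
bad_grad = d_z.sum()              # scalar instead of the intended d_z.sum(axis=0) with shape (10,)
b -= 0.1 * bad_grad               # broadcasts silently: every entry shifts by the same amount
print(b.shape, b[:3])             # still (10,), no error, but the update has no per-class signal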

The fix: correct axes and shapes in backprop

The resolution is simple: compute gradients with consistent matrix shapes, sum bias gradients over the batch dimension (axis 0), and remove the compensating transposes from the update. A numerically stable softmax also helps. Below is a corrected pass using the same interface as above.

import numpy as np
w_in = np.random.randn(784, 10)
b_in = np.random.randn(10)
w_out = np.random.randn(10, 10)
b_out = np.random.randn(10)
def act_relu(z, deriv=False):
    if deriv:
        return z > 0
    return np.maximum(z, 0)
def softmax_stable(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
def to_onehot(y):
    z = np.zeros((len(y), 10))
    for i in range(len(y)):
        z[i][y[i]] = 1
    return z
def forward_pass(x):
    z1 = x @ w_in + b_in              # (m,784) @ (784,10) -> (m,10)
    a1 = act_relu(z1)
    z2 = a1 @ w_out + b_out           # (m,10)
    a2 = softmax_stable(z2)           # (m,10)
    return z1, a1, z2, a2
def accuracy(probs, y_true):
    preds = probs.argmax(axis=1)
    total = 0
    i = 0
    while i < len(preds):
        if preds[i] == y_true[i]:
            total += 1
        i += 1
    return (total / len(preds)) * 100
def fit_loop(x_data, y_data, epochs=40, lr=0.1):
    global w_in, b_in, w_out, b_out   # must precede the first use of these names in this scope
    for ep in range(epochs):
        y_hot = to_onehot(y_data)
        m = x_data.shape[0]
        z1, a1, z2, a2 = forward_pass(x_data)
        d_z2 = a2 - y_hot              # (m,10)
        d_w_out = a1.T @ d_z2 / m      # (10,10)
        d_b_out = d_z2.sum(0) / m      # (10,)
        d_z1 = (d_z2 @ w_out.T) * (z1 > 0)   # (m,10)
        d_w_in = x_data.T @ d_z1 / m         # (784,10)
        d_b_in = d_z1.sum(0) / m             # (10,)
        # SGD step with matching shapes, no extra transposes
        w_out -= lr * d_w_out
        b_out -= lr * d_b_out
        w_in  -= lr * d_w_in
        b_in  -= lr * d_b_in
        probs = forward_pass(x_data)[-1]   # re-run the forward pass so accuracy reflects the updated weights
        print("epoch", ep + 1, "acc", accuracy(probs, y_data))

Notice that in each matrix product the transpose sits on the left operand (input.T @ delta, hidden.T @ delta), which is where it belongs for accumulating gradients over the batch. The bias gradients sum over axis 0, the batch dimension, leaving one entry per class so they line up with the (10,) bias vectors. With these fixes in place, the model stops trashing its weights and starts to learn.
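
To see why the operand order matters, a quick shape trace with throwaway arrays is enough (the batch size of 32 is arbitrary):

import numpy as np
m = 32
x_demo    = np.zeros((m, 784))          # flattened inputs
a1_demo   = np.zeros((m, 10))           # hidden activations
d_z1_demo = np.zeros((m, 10))           # hidden-layer delta
d_z2_demo = np.zeros((m, 10))           # output-layer delta
print((x_demo.T @ d_z1_demo).shape)     # (784, 10) -- input.T @ delta matches w_in
print((d_z1_demo.T @ x_demo).shape)     # (10, 784) -- flipped operands give the transposed shape
print((a1_demo.T @ d_z2_demo).shape)    # (10, 10)  -- hidden.T @ delta matches w_out
print(d_z1_demo.sum(axis=0).shape)      # (10,)     -- bias gradient matches b_in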

Safer softmax for stability

A stable softmax subtracts the per-row maximum before exponentiation, which prevents overflow and improves numerical behavior throughout training. The function above, softmax_stable, implements exactly that pattern and matches the intended (m, 10) output shape.
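
A short check makes the difference visible; the large logits below are purely illustrative and reuse softmax_stable from the corrected snippet above:

import numpy as np
big = np.array([[1000.0, 1001.0, 1002.0]])                      # large logits
naive = np.exp(big) / np.exp(big).sum(axis=1, keepdims=True)    # exp overflows to inf, so the ratio becomes nan
print(naive)                                                    # [[nan nan nan]]
print(softmax_stable(big))                                      # finite probabilities that sum to 1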

Why it’s worth internalizing

Backprop in NumPy is unforgiving: if even one axis is off, NumPy will happily broadcast and compute something else. The model will run, appear stable, and flatline around chance. Correcting the axes aligns the math with the data layout and removes the need for compensating transposes in the parameter update.

When the shapes are fixed and a numerically safe softmax is used, the same two-layer design trains as expected. With the corrections plus He initialization and a smaller learning rate such as 0.01, this setup reaches about 93% on MNIST in 15–20 epochs. Reported results depend strongly on the learning rate: one run with He initialization reached 91% at a learning rate of 0.25 over 200 epochs, while another produced only around 60% at a learning rate of 0.5 over 300 epochs.
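
For reference, He initialization just scales the random weights by sqrt(2 / fan_in); a minimal sketch that keeps the 784-10-10 layout from the snippets above (the zero biases and the seed are arbitrary choices, not from the original experiments):

import numpy as np
rng = np.random.default_rng(0)
w_in  = rng.standard_normal((784, 10)) * np.sqrt(2.0 / 784)   # He init: std = sqrt(2 / fan_in)
b_in  = np.zeros(10)
w_out = rng.standard_normal((10, 10)) * np.sqrt(2.0 / 10)
b_out = np.zeros(10)
lr = 0.01                                                      # smaller step size, per the numbers above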

Practical takeaways

If your NumPy classifier sits near 10–14% and refuses to improve, inspect gradient shapes first. Ensure dW terms are formed as input.T @ delta and hidden.T @ delta for their respective layers, and that both db terms reduce along axis 0 to match the number of units. Remove any transposes in the update that were compensating for upstream mistakes, and use a stable softmax. Once the axes are consistent, the same minimal network becomes a solid baseline for MNIST.
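
A lightweight guard makes that inspection automatic: assert that every gradient matches the shape of the parameter it updates. The helper below is a hypothetical sketch using the names from the corrected loop:

def check_grad_shapes(pairs):
    # pairs: iterable of (gradient, parameter) tuples
    for grad, param in pairs:
        assert grad.shape == param.shape, f"gradient {grad.shape} does not match parameter {param.shape}"

# inside the training loop, right before the SGD step:
# check_grad_shapes([(d_w_in, w_in), (d_b_in, b_in), (d_w_out, w_out), (d_b_out, b_out)])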

The article is based on a question from StackOverflow by buzzbuzz20xx and an answer by Dmitry543.