2025, Dec 24 13:00

Fixing Gradient Descent in Linear Regression: Use Real Gradients, Not Cost, for Stable MSE Training

Learn why linear regression gradient descent diverges and how to fix the update rule using true MSE gradients—Python code to stop exploding parameters.

Linear regression is one of those places where the math is simple, but a tiny mistake in the update rule can send your parameters straight to infinity. A cost function that looks reasonable doesn’t guarantee your gradient descent is using the right signal for updates. Below is a compact walkthrough of a divergence issue and a working fix, with code you can drop in and run.

Problem overview

The cost function returns plausible values, but the gradient descent step makes the parameters explode within a handful of iterations until both cost and parameters become inf or NaN. That’s a telltale sign that the update direction or magnitude is off.

Code that reproduces the issue

Here is a minimal version of the reported setup. The cost function behaves fine, while the update rule is the source of the blow-up.

n = len(X)          # assumes X is already defined at module level

def loss_fn(theta, beta, X, Y):
    # Halved MSE: J = (1 / 2n) * sum((theta*x + beta - y)^2)
    J = 0.0
    for k in range(n):
        y_pred = theta * X[k] + beta
        diff = y_pred - Y[k]
        J += diff**2
    J = J / (n * 2)
    return J

def bad_gd(theta, beta, X, Y, steps, lr):
    dtheta = 0
    dbeta = 0
    for t in range(steps):
        J = loss_fn(theta, beta, X, Y)
        dtheta = J * X[t]   # bug: the scalar cost stands in for dJ/dtheta, indexed by the iteration counter
        dbeta = J           # bug: the scalar cost stands in for dJ/dbeta
        theta = theta - lr * dtheta
        beta = beta - lr * dbeta
    return dtheta, dbeta    # note: returns the last "gradients", not the fitted parameters
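
To actually run it, a tiny synthetic dataset is enough; the values below are hypothetical, not taken from the original report. Note that X and Y must be defined before the n = len(X) line above, since loss_fn reads that module-level n.

# Hypothetical data: roughly y = 3x + 1 with a little noise
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
Y = [1.1, 3.9, 7.2, 9.8, 13.1, 16.0]

dtheta, dbeta = bad_gd(theta=0.0, beta=0.0, X=X, Y=Y, steps=6, lr=0.1)
# The parameters blow up by orders of magnitude within these six steps; on a
# larger dataset with more iterations the cost and parameters eventually hit
# inf/nan, and if steps exceeds len(X) the X[t] lookup raises an IndexError.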

What’s actually wrong

The update rule uses the scalar cost value J as if it were the gradient. J is a magnitude of error, not a direction of steepest descent. For linear regression with mean squared error, the partial derivatives depend on the residuals (prediction minus target) and the inputs. Using J directly throws away that directional information and scales the updates in a way that quickly destabilizes the parameters. Two smaller problems compound it: the step multiplies J by X[t], indexing the data by the iteration counter, so each update is tied to one arbitrary sample (and raises an IndexError once steps exceeds the dataset size), and no update aggregates information across all training examples before the parameter step, so each iteration lacks the averaged signal the cost itself is computed from.
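
For the cost defined above, the 1/2 cancels the 2 that comes from differentiating the square, so the true partial derivatives (written with the same names as the code) are:

∂J/∂theta = (1/n) * Σ_k (theta*X[k] + beta − Y[k]) * X[k]
∂J/∂beta  = (1/n) * Σ_k (theta*X[k] + beta − Y[k])

Each term is a residual (weighted by its input for theta), and the sum runs over all n training examples before either parameter is updated.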

A minimal fix

The update should be driven by actual partial derivatives: accumulate error-weighted inputs for theta and raw errors for beta across the entire dataset, average them, then take a step.

def run_gd(theta, beta, X, Y, steps, lr):
    m = len(X)
    for t in range(steps):
        # Accumulate the true MSE partial derivatives over the whole dataset
        grad_theta = 0.0
        grad_beta = 0.0
        for idx in range(m):
            y_hat = theta * X[idx] + beta
            diff = y_hat - Y[idx]        # residual: prediction minus target
            grad_theta += diff * X[idx]  # contribution to dJ/dtheta
            grad_beta += diff            # contribution to dJ/dbeta
        grad_theta /= m
        grad_beta /= m
        # Step both parameters against the averaged gradient
        theta = theta - lr * grad_theta
        beta = beta - lr * grad_beta
        loss_val = loss_fn(theta, beta, X, Y)
        print(f"Iteration {t} || Cost = {loss_val} || w = {theta} || b = {beta}")
    return theta, beta

This keeps the cost calculation intact and drives the update with the correct derivatives, averaged over the full set of training examples each iteration.
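
With the hypothetical data from earlier and a modest learning rate, the cost now falls instead of exploding; the step count and learning rate below are arbitrary but stable choices, not values from the original post.

theta, beta = run_gd(theta=0.0, beta=0.0, X=X, Y=Y, steps=200, lr=0.05)
# Cost decreases every iteration, and theta, beta drift toward the slope and
# intercept of the data (around 3 and 1 for this hypothetical set) instead of overflowing.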

Why this detail matters

Even tiny discrepancies between “cost” and “gradient of the cost” can dominate optimization dynamics. The cost is a scalar score; the gradient encodes both direction and scale for how to change the parameters. Mixing them up will look fine for a moment, then explode as updates compound in the wrong direction.

Takeaway

When implementing gradient descent for linear regression, compute the residuals, form the parameter-wise derivatives from those residuals, average across all samples, and only then update parameters. If the cost escalates rapidly and parameters jump by orders of magnitude, verify that your update uses the true gradients rather than the cost itself.
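
One cheap safeguard is a finite-difference check: nudge each parameter by a small epsilon and compare the resulting slope of the cost against the analytic gradient. A minimal sketch, assuming loss_fn and the hypothetical X, Y from above:

def analytic_grads(theta, beta, X, Y):
    # Same averaged MSE derivatives used inside run_gd
    m = len(X)
    g_theta = sum((theta * X[i] + beta - Y[i]) * X[i] for i in range(m)) / m
    g_beta = sum((theta * X[i] + beta - Y[i]) for i in range(m)) / m
    return g_theta, g_beta

def numeric_grads(theta, beta, X, Y, eps=1e-6):
    # Central differences of the scalar cost with respect to each parameter
    g_theta = (loss_fn(theta + eps, beta, X, Y) - loss_fn(theta - eps, beta, X, Y)) / (2 * eps)
    g_beta = (loss_fn(theta, beta + eps, X, Y) - loss_fn(theta, beta - eps, X, Y)) / (2 * eps)
    return g_theta, g_beta

print(analytic_grads(0.5, -1.0, X, Y))
print(numeric_grads(0.5, -1.0, X, Y))
# The two pairs should agree to several decimal places; plugging the raw cost J
# in as the "gradient" would fail this check immediately.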