2025, Dec 30 07:00

PCA Changes After Row Reordering: Understanding Floating-Point Effects and How to Stabilize scikit-learn Results

Why PCA outputs shift after row reordering: floating-point centering and ARPACK sensitivity. Fix in scikit-learn with float64 or full solver; reproducible demo

Minor numeric shifts in PCA outputs when you merely reshuffle the rows of the same matrix often look suspicious, but in this case they are expected. The differences are small and come from floating point precision and the behavior of an iterative eigensolver.

Reproducible case

The example below builds two matrices with identical values but different row order, runs scikit-learn PCA with svd_solver="arpack", and compares the transformed scores. The arrays differ at the level of 1e−5 to 1e−3 in this setup.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import scipy.sparse as sp

# Synthetic sparse-like setup with two differently ordered views of the same data
n_rows = 10000
n_cols = 2000

rs = np.random.default_rng(42)
A_main = rs.uniform(0, 5, size=(n_rows, n_cols)).astype(np.float32)
A_reordered = np.concatenate((A_main[5000:,], A_main[:5000,]), axis=0)

S_main = sp.csr_matrix(A_main)
S_reordered = sp.csr_matrix(A_reordered)
print("CSR A shape:", S_main.shape, "nnz:", S_main.nnz)
print("CSR B shape:", S_reordered.shape, "nnz:", S_reordered.nnz)

# PCA with ARPACK on both orderings
pca_a = PCA(n_components=20, svd_solver="arpack", random_state=321)
Z_a = pca_a.fit_transform(S_main.toarray())

pca_b = PCA(n_components=20, svd_solver="arpack", random_state=321)
Z_b = pca_b.fit_transform(S_reordered.toarray())

# Align rows back to the original order
row_labels = [f"cell{i}" for i in range(n_rows)]
df_a = pd.DataFrame(Z_a, index=row_labels)
df_b = pd.DataFrame(Z_b, index=row_labels[5000:] + row_labels[0:5000])
df_b = df_b.loc[df_a.index,]

print("\nAre components numerically close (1e-6)?",
      np.allclose(np.abs(df_a), np.abs(df_b), atol=1e-6))

A typical difference here looks like this: one value could be −2.9909344 while the other is −2.9909678, a gap of about 3.34e−5. In one run the maximum absolute difference (after rounding) reached 0.00677, with a mean absolute difference of around 0.0005612017.
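
To quantify the gap in your own run, a quick check along these lines, reusing df_a and df_b from the snippet above, reports the maximum and mean absolute differences; the exact figures depend on your platform and BLAS build.

# Per-entry gaps between the two orderings, computed on absolute values
# to match the allclose check above
diff = np.abs(np.abs(df_a.to_numpy()) - np.abs(df_b.to_numpy()))
print("max abs difference: ", diff.max())
print("mean abs difference:", diff.mean())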

Why the numbers move when rows are reordered

PCA in scikit-learn centers the data by subtracting the per-feature mean. The estimator exposes this explicitly:

mean_ : ndarray of shape (n_features,)

Per-feature empirical mean, estimated from the training set.

Equal to X.mean(axis=0).

The mean is computed as a sum over floating point numbers, and floating point addition is not associative. Reordering the samples changes the sequence of additions, which changes the rounding at each step and slightly perturbs the computed means and everything derived from them. The effect is tiny but real. There is also an algorithmic angle: arpack is an iterative solver, and iterative methods are sensitive to perturbations in their input, so small rounding differences can be amplified over the iterations. Together, these two effects produce the small but noticeable differences after PCA when only the row order changes.
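
Two quick checks make this concrete: floating point addition gives different results under different groupings, and the per-feature means fitted on the two orderings in the float32 demo above typically differ slightly (pca_a and pca_b refer to the estimators fitted earlier; the size of the gap varies by platform).

# Floating point addition is not associative: the grouping changes the result
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# The fitted per-feature means from the two orderings are the same quantity
# mathematically, but float32 summation order typically leaves tiny gaps
# (pca_a and pca_b are the fitted estimators from the demo above)
mean_gap = np.abs(pca_a.mean_ - pca_b.mean_)
print("max |mean_a - mean_b|:", mean_gap.max())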

How to make PCA outputs more stable to row order

If you need tighter agreement across different row orders, increase numeric precision: switching from np.float32 to np.float64 reduces the rounding error accumulated in the centering step and in the downstream linear algebra. Another practical lever is to switch from arpack to the "full" solver, which computes an exact LAPACK-based SVD instead of iterating and so avoids the iterative amplification. Both approaches curb the sensitivities that show up with row reordering.

The following snippet keeps the same logic but computes in 64-bit precision while still using arpack:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

n_rows = 10000
n_cols = 2000

rs = np.random.default_rng(42)
A_main64 = rs.uniform(0, 5, size=(n_rows, n_cols)).astype(np.float64)
A_reordered64 = np.concatenate((A_main64[5000:,], A_main64[:5000,]), axis=0)

# PCA with ARPACK on both orderings, now in float64
pca_a64 = PCA(n_components=20, svd_solver="arpack", random_state=321)
Z_a64 = pca_a64.fit_transform(A_main64)

pca_b64 = PCA(n_components=20, svd_solver="arpack", random_state=321)
Z_b64 = pca_b64.fit_transform(A_reordered64)

# Align rows back to the original order
row_labels = [f"cell{i}" for i in range(n_rows)]
DF_a64 = pd.DataFrame(Z_a64, index=row_labels)
DF_b64 = pd.DataFrame(Z_b64, index=row_labels[5000:] + row_labels[0:5000])
DF_b64 = DF_b64.loc[DF_a64.index,]
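
A comparison analogous to the earlier allclose check can confirm the improvement; runs like this typically agree far more closely across orderings in float64, though the exact tolerance that passes can vary by platform.

# Same style of check as in the float32 demo; float64 centering and
# arithmetic usually bring the two orderings into much closer agreement
print("Close at atol=1e-6?",
      np.allclose(np.abs(DF_a64), np.abs(DF_b64), atol=1e-6))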

Alternatively, keep 32-bit inputs and use the "full" solver:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

n_rows = 10000
n_cols = 2000

rs = np.random.default_rng(42)
A32 = rs.uniform(0, 5, size=(n_rows, n_cols)).astype(np.float32)
B32 = np.concatenate((A32[5000:,], A32[:5000,]), axis=0)

# PCA with the exact "full" solver on both orderings (random_state is ignored here)
pca_full_a = PCA(n_components=20, svd_solver="full", random_state=321)
U_a = pca_full_a.fit_transform(A32)

pca_full_b = PCA(n_components=20, svd_solver="full", random_state=321)
U_b = pca_full_b.fit_transform(B32)

# Align rows back to the original order
labels = [f"cell{i}" for i in range(n_rows)]
U_a_df = pd.DataFrame(U_a, index=labels)
U_b_df = pd.DataFrame(U_b, index=labels[5000:] + labels[0:5000])
U_b_df = U_b_df.loc[U_a_df.index,]

Why this matters in real pipelines

This is more than a numerical curiosity. In one practical workflow, switching to np.float64 eliminated the discrepancy and aligned results across differently ordered inputs. In domains like single-cell RNA sequencing, matrices are often held as np.float32 to save memory, and small numeric differences at the PCA stage can carry over into downstream steps such as Leiden clustering. Understanding where the variability originates makes it easier to choose precision and solver settings that fit the stability you need.

Takeaways

Yes, seeing slightly different floating point numbers from PCA after shuffling rows is expected. The centering step depends on floating point summation, which is order-sensitive, and arpack can magnify tiny perturbations. If tight agreement across orderings matters, lean on np.float64 or prefer the "full" solver; neither guarantees bit-for-bit identical output, but both shrink the gaps substantially. Even when the numbers differ, the gaps are typically small and consistent with the limits of floating point arithmetic.