2025, Nov 07 09:00

Efficient column-wise Spearman correlation with a reference vector: avoid full pairwise matrices, use NumPy ranks or tie-safe methods

Compute column-wise Spearman correlation to a reference vector: slice scipy.stats.spearmanr output or use fast NumPy rank-based methods, with tie-safe ranks.

When you need Spearman correlation between a reference vector and every column of a matrix, the obvious loop works but is slow. Trying to “vectorize” with scipy.stats.spearmanr on the full arrays often surprises by returning a large square matrix. Let’s unpack why that happens and how to get exactly the column-wise correlations efficiently, including a pure NumPy approach that avoids computing unnecessary pairs.

Problem setup

Suppose you have a data matrix with 37 observations and N samples, and a 1D reference vector of the same 37 observations. The goal is to compute the Spearman correlation between each sample (each column) and the reference, producing a result of shape (N, 1).

Here is a straightforward implementation that produces the right answer:

from scipy.stats import spearmanr
import numpy as np

prng = np.random.default_rng()
X = prng.standard_normal((37, 100))
ref_vec = prng.standard_normal((37, 1))

corr_list = []
for col in X.T:
    corr_list.append(spearmanr(col, ref_vec).statistic)

result = np.array(corr_list)
print(result.shape)  # (100,)

This computes what you want, but does it one column at a time. Attempting to call spearmanr on the full arrays looks promising at first glance:

from scipy.stats import spearmanr
import numpy as np

prng = np.random.default_rng()
X = prng.standard_normal((37, 100))
ref_vec = prng.standard_normal((37, 1))

pairwise = spearmanr(X, ref_vec)
print(pairwise.statistic.shape)  # (101, 101)

Instead of an (N, 1) vector, you get a (N+1, N+1) matrix. That isn’t a bug; it’s how the API works in 2D.

Why the shape explodes with spearmanr on 2D inputs

In the 1D case, spearmanr(x, y) returns a single correlation. In the 2D case, spearmanr(A) treats each column of A as a separate variable and computes the full pairwise correlation matrix across columns. Passing two 2D arrays is equivalent to concatenating their columns, then computing the pairwise correlations across all resulting columns. That’s why passing X with 100 columns and ref_vec with 1 column yields a 101×101 matrix: it computes every correlation among all 101 variables.

In other words, you asked for only N correlations, but ended up computing all N×N within X, plus everything involving the reference column, and then their symmetric counterparts. The information you actually want is just the correlations between each column of X and the reference vector.

Vectorized result without extra loops

Because the reference is the last column in the conceptual concatenation, you can extract exactly the column you need from the big matrix. Let n be the number of columns in X. The desired column-wise Spearman correlations are located in the last column, restricted to the first n rows.

from scipy.stats import spearmanr
import numpy as np

prng = np.random.default_rng()
X = prng.standard_normal((37, 100))
ref_vec = prng.standard_normal((37, 1))

n = X.shape[1]
full_corr = spearmanr(X, ref_vec).statistic
colwise_corr = full_corr[:n, n]  # shape: (n,)

This avoids manual Python loops and is fast for moderate N, even though it computes far more correlations than you ultimately keep. For the dimensions above, computing the full 101×101 matrix and slicing the last column is faster than looping column by column. As a caution, if N grows very large, paying for the full pairwise matrix will eventually become wasteful.

You might consider np.vectorize to keep the function signature clean, but it won’t fundamentally speed things up; it behaves similarly to a loop from a performance perspective.

from scipy.stats import spearmanr
import numpy as np

prng = np.random.default_rng()
X = prng.standard_normal((37, 100))
ref_vec = prng.standard_normal((37, 1))

def one_corr(u, v):
    return spearmanr(u, v).statistic

vec_apply = np.vectorize(one_corr, signature='(m),(m,1)->()')
colwise_corr = vec_apply(X.T, ref_vec)  # similar speed to a loop

Pure NumPy Spearman via ranks (no ties)

Spearman correlation is Pearson correlation of ranks. If your data are continuous and ties are not expected, you can compute ranks quickly with the double-argsort trick and then use a closed-form mean and variance for ranks. This approach avoids the big pairwise matrix and is vectorized across all columns.

import numpy as np

def spearman_fast_noties(A, b):
    # A: (m, n), b: (m, 1)
    m = b.shape[0]
    mean_r = (m - 1) / 2
    var_r = (m - 1) * (m + 1) / 12
    rA = A.argsort(axis=0).argsort(axis=0)             # ranks 0..m-1
    rb = b.argsort(axis=0).argsort(axis=0)             # ranks 0..m-1
    # Pearson on ranks with known mean/var
    return ((rA * rb).mean(axis=0) - mean_r**2) / var_r

# Example
prng = np.random.default_rng()
X = prng.standard_normal((37, 100))
ref_vec = prng.standard_normal((37, 1))
colwise_corr = spearman_fast_noties(X, ref_vec)

This version is both vectorized and avoids computing any unnecessary pairwise correlations. In tests shown above, it was drastically faster than the loop, because the rank computation with argsort is cheap for small m and everything happens in NumPy.

Handling ties correctly

If ties are possible and must be handled exactly, use scipy.stats.rankdata to compute ranks with tie handling. In that case you cannot use the closed-form mean and variance of ranks; compute them from the ranked arrays and then apply the usual Pearson formula on ranks.

import numpy as np
from scipy.stats import rankdata

def spearman_tie_safe(A, b):
    # A: (m, n), b: (m, 1)
    rA = rankdata(A, axis=0)
    rb = rankdata(b, axis=0)
    # Pearson on ranks with explicit centering and scaling
    return ((rA * rb).mean(axis=0) - rA.mean(axis=0) * rb.mean()) / (rA.std(axis=0) * rb.std())

# Example
prng = np.random.default_rng()
X = prng.standard_normal((37, 100))
ref_vec = prng.standard_normal((37, 1))
colwise_corr = spearman_tie_safe(X, ref_vec)

This approach is vectorized and exact with ties. For small m, rankdata can be slower than the double-argsort trick, but it guarantees correct tie handling.

Why this matters

Understanding how spearmanr treats 2D inputs prevents confusion and unnecessary computation. If you pass multiple columns, spearmanr gives you the full pairwise correlation matrix across all columns of the concatenated inputs. That can be useful, but when you only need correlations to a single reference vector, it’s often overkill. Knowing where your desired column lives lets you slice it out directly, and in turn, avoid writing manual loops. For workloads where the number of observations stays small and the number of samples is large, working directly with ranks can be significantly faster while computing only what you need.

Takeaways

Use a direct slice when calling spearmanr on the matrix and the reference to extract the last column of the result efficiently. If you want to skip the pairwise work entirely, compute ranks yourself and apply Pearson on ranks. When ties matter, rely on rankdata; when they do not, the double-argsort trick is an efficient alternative. Avoid np.vectorize for performance; it behaves similarly to a Python loop. With these patterns, you get predictable shapes, fewer surprises, and a solution that scales with your data layout.

The article is based on a question from StackOverflow by QuanticDisaster and an answer by chrslg.