2025, Nov 05 09:00

Understanding statsmodels.ztest: no population standard deviation, pooled vs unequal variance

Learn why statsmodels ztest estimates variance from data, how usevar toggles pooled vs unequal in two-sample tests, and when a population std is unnecessary

Why statsmodels.ztest doesn’t ask for population std — and what it does instead

It’s common to see Python z-tests written without any explicit population standard deviation. This looks suspicious at first glance, especially if you’re used to textbook formulas that plug in a known population std_dev. Let’s look at what the popular statsmodels implementation actually does under the hood and how to control its behavior when you have two samples.

Minimal example

from statsmodels.stats.weightstats import ztest as run_ztest

# IQ scores for 20 patients after medication
iq_after_drug = [88, 92, 94, 94, 96, 97, 97, 97, 99, 99,
                 105, 109, 109, 109, 110, 112, 112, 113, 114, 115]

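# Test H0: population mean == 100
# ztest returns a tuple: (z statistic, two-sided p-value)
z_stat, p_value = run_ztest(iq_after_drug, value=100)
print(z_stat, p_value)  # roughly 1.598 and 0.110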

What’s going on under the hood

The library does not require a population standard deviation because it estimates the variance from the data you pass in. The sample variance, and therefore the standard deviation and the standard error of the mean, is derived internally. With a reasonably large sample, that estimate stands in for the unknown population value, which is why the statistic is still compared against the normal distribution. The implementation is visible in the statsmodels source code at https://www.statsmodels.org/dev/_modules/statsmodels/stats/weightstats.html#ztest.
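
To make this concrete, the one-sample calculation can be reproduced by hand. The following is a minimal sketch using only the Python standard library, assuming the default settings (sample variance with n - 1 in the denominator, two-sided alternative). The helper name one_sample_z is made up for illustration and is not part of statsmodels.

```python
import math
from statistics import NormalDist, mean, variance


def one_sample_z(sample, value=0.0):
    """Hand-rolled one-sample z-test (illustrative helper, not a statsmodels function)."""
    n = len(sample)
    # Sample variance (n - 1 denominator), estimated from the data itself
    s2 = variance(sample)
    # Standard error of the sample mean
    std_err = math.sqrt(s2 / n)
    z = (mean(sample) - value) / std_err
    # Two-sided p-value from the standard normal distribution
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


iq_after_drug = [88, 92, 94, 94, 96, 97, 97, 97, 99, 99,
                 105, 109, 109, 109, 110, 112, 112, 113, 114, 115]

z_stat, p_value = one_sample_z(iq_after_drug, value=100)
print(f"z = {z_stat:.4f}, p = {p_value:.4f}")  # z = 1.5976, p = 0.1101
```

No population value enters the calculation anywhere: the only inputs are the sample and the hypothesized mean.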

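When you run a two-sample z-test, the default way to handle standard deviations is usevar="pooled": the function assumes both samples share the same variance and estimates a single common value from the combined data. If you don’t want that assumption, set the usevar argument to "unequal", which makes the function use each sample's own variance. Whether ztest itself accepts "unequal" can depend on the installed statsmodels version (its docstring has pointed to CompareMeans.ztest_ind for the other options), so it is worth checking the documentation for your release.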
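
The difference between the two modes comes down to how the standard error of the mean difference is built. Below is a rough sketch of the textbook formulas, again using only the standard library. The function name two_sample_z and the placebo numbers are hypothetical, and the code is meant to mirror the idea rather than reproduce the library's source line by line.

```python
import math
from statistics import NormalDist, mean, variance


def two_sample_z(a, b, usevar="pooled", value=0.0):
    """Illustrative two-sample z-test; not a statsmodels function."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a), variance(b)  # sample variances (n - 1 denominator)
    if usevar == "pooled":
        # One common variance estimated from both samples together
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        var_diff = pooled * (1.0 / n1 + 1.0 / n2)
    elif usevar == "unequal":
        # Each sample contributes its own variance
        var_diff = v1 / n1 + v2 / n2
    else:
        raise ValueError('usevar must be "pooled" or "unequal"')
    z = (mean(a) - mean(b) - value) / math.sqrt(var_diff)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


iq_after_drug = [88, 92, 94, 94, 96, 97, 97, 97, 99, 99,
                 105, 109, 109, 109, 110, 112, 112, 113, 114, 115]
# Hypothetical comparison group, invented purely for illustration
iq_placebo = [90, 95, 98, 100, 101, 103, 104, 107, 110, 121]

for mode in ("pooled", "unequal"):
    z, p = two_sample_z(iq_after_drug, iq_placebo, usevar=mode)
    print(f"{mode:8s} z = {z:.3f}, p = {p:.3f}")
```

When both samples have the same size, the two standard errors coincide algebraically, so the choice of mode only changes the result when the groups differ in size.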

If by "std" you mean "std_dev", both names refer to the same quantity, the standard deviation. What differs between teaching materials is whether it is treated as a known population value or estimated from the sample; many step-by-step guides use the sample estimate, just as statsmodels does.

Solution and how to control variance handling

There’s nothing missing in the one-sample call: the function estimates variability from the data. For two-sample scenarios, you can keep the default pooled approach or request separate variance handling through the usevar parameter, which takes "pooled" or "unequal". The standard error is then computed internally according to the chosen mode.

# One-sample z-test: variance is computed from the sample
run_ztest(iq_after_drug, value=100)

# In two-sample form (illustrative signature):
# run_ztest(sample_a, sample_b, usevar="unequal")
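
For a runnable two-sample call, one option is the class-based interface. The ztest docstring points to CompareMeans.ztest_ind for the variance options, so the sketch below takes that route, which does not depend on whether the installed ztest accepts usevar="unequal". The placebo data is hypothetical and exists only to give the example a second sample.

```python
from statsmodels.stats.weightstats import CompareMeans

iq_after_drug = [88, 92, 94, 94, 96, 97, 97, 97, 99, 99,
                 105, 109, 109, 109, 110, 112, 112, 113, 114, 115]
# Hypothetical comparison group, invented for illustration only
iq_placebo = [90, 95, 98, 100, 101, 103, 104, 107, 110, 121]

# Build the comparison object from the two raw samples
cm = CompareMeans.from_data(iq_after_drug, iq_placebo)

# Same data, two different assumptions about the variances
z_pooled, p_pooled = cm.ztest_ind(usevar="pooled")
z_unequal, p_unequal = cm.ztest_ind(usevar="unequal")

print(f"pooled   z = {z_pooled:.3f}, p = {p_pooled:.3f}")
print(f"unequal  z = {z_unequal:.3f}, p = {p_unequal:.3f}")
```

Running both modes side by side is a quick way to see how sensitive a conclusion is to the equal-variance assumption.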

Why this matters

Understanding that the function estimates variability from your data clarifies why you aren’t asked for a population std_dev. It also makes the default behavior in two-sample tests explicit: "pooled" assumes equal variances, while "unequal" keeps them separate. Knowing which mode you’re in helps you interpret results consistently and avoid silent mismatches with your statistical intent.

Takeaways

If you’re performing a one-sample z-test with statsmodels, you don’t need to pass a population standard deviation because the method computes the sample variance internally. For two-sample z-tests, remember that the default is "pooled" variance and you can opt into "unequal" via usevar when you want the standard deviations computed separately. When in doubt, consult the implementation details in the official source to see exactly how the calculation is performed.

The article is based on a question from StackOverflow by Rushank Savant and an answer by paisanco.

data-science python statistics statsmodels