2025, Dec 17 15:00

Cumulative Mean and Standard Deviation by Group in Polars: Clean, Readable Patterns with Rolling Windows

Learn how to compute cumulative mean and standard deviation by group in Polars using rolling windows and cum_sum/cum_count for clear, concise, reliable code.

Computing cumulative mean and cumulative standard deviation in a Polars DataFrame sounds straightforward, but the most obvious approach can get verbose quickly, especially when you partition the calculation by a category. Here’s a concise walkthrough that keeps the logic clear and minimizes the risk of errors.

Baseline example: cumulative mean by group

Assume a small DataFrame with a numeric column and a grouping key. One way to derive a running mean is to divide a cumulative sum by a cumulative count within each group.

import polars as pl

tbl = pl.DataFrame({
    "val": [4, 6, 8, 11, 5, 6, 8, 15],
    "grp": ["A", "A", "B", "A", "B", "A", "B", "B"]
})

res = tbl.with_columns(
    # running mean per group: cumulative sum divided by a 1-based row index within the group
    running_avg=pl.col("val").cum_sum().over("grp")
    / pl.int_range(pl.len()).add(1).over("grp")
)

What’s going on and why this feels clunky

The expression builds a per-group running mean: a partitioned cumulative sum divided by a synthetic 1-based row index, obtained from pl.int_range(pl.len()) evaluated within each partition. It works, but the manual index arithmetic is not the most readable approach, and it becomes even less comfortable once you try to extend it to a cumulative standard deviation.

A cleaner approach with rolling windows

You can achieve cumulative mean and cumulative std using rolling functions. The idea is simple: apply rolling_mean and rolling_std over each group with a window as large as the entire frame and set min_samples=1. Since no group can be longer than the frame itself, the group length never exceeds the window size, so within each group the rolling window effectively behaves like a cumulative one. For the cumulative mean, you can also keep a compact variant based on cum_sum and cum_count.

clean = tbl.with_columns(
    # compact cumulative mean: running sum divided by running count, per group
    avg_cum=pl.col("val").cum_sum().over("grp")
    / pl.col("val").cum_count().over("grp"),
    # a rolling window spanning the whole frame acts cumulatively within each group
    avg_cum_roll=pl.col("val").rolling_mean(
        window_size=tbl.shape[0],
        min_samples=1
    ).over("grp"),
    std_cum_roll=pl.col("val").rolling_std(
        window_size=tbl.shape[0],
        min_samples=1
    ).over("grp")
)

The rolling mean aligns with the cumulative mean. For the rolling standard deviation, the first entry of each group is null, which matches the usual behavior: with the default ddof=1, the sample standard deviation is undefined for a single observation.

Why this matters

Running aggregates are a staple in analytics pipelines, monitoring, feature computation and exploratory work. A readable expression is easier to maintain and reason about. The rolling approach avoids manual index arithmetic, and the compact cumulative-mean formula stays precise without extra moving parts. A more specific built-in for this pattern isn’t available, but the rolling family covers the need well.

Conclusion

For cumulative mean by group, prefer the short formula based on cum_sum divided by cum_count. For cumulative standard deviation, rely on rolling_std with a window equal to the number of rows and min_samples=1, and apply both via over to partition by the grouping key. This keeps the code focused, expressive and less error-prone when the logic grows.