2025, Nov 03 23:00

Ranking Category Frequencies by Month in Polars with rank().over() and a Clean Pivot

Learn how to rank category frequencies per month in Polars using windowed rank, not cumulative sums, then pivot results by rank. Clear code, tie behavior.

When you need to identify the most frequent categories per month in a Polars DataFrame and then lay them out by rank across months, the transformation looks straightforward on paper but can get verbose in code. The goal is to compute frequencies, rank categories within each month, and pivot the ranks so that each column represents a month and each row is a rank position.

Example dataset and a baseline approach

Below is a compact dataset that captures IDs, months, and category labels. Then comes a working, but less direct approach that manufactures ranks via a cumulative sum after sorting.

import polars as pl

records = pl.DataFrame(
    {
        "rec_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
        "period": [1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2],
        "label": [
            "C",
            "B",
            "A",
            "C",
            "B",
            "A",
            "C",
            "C",
            "A",
            "B",
            "A",
            "A",
            "C",
            "C",
            "A",
            "B",
            "B",
        ],
    }
)

(
    records.group_by(pl.col("period"), pl.col("label"))
    .agg(pl.col("rec_id").len().alias("freq"))
    .sort(by=["period", "freq"], descending=True)
    .with_columns(pl.lit(1).alias("flag"))
    .with_columns(pl.col("flag").cum_sum().over(["period"]).alias("rank_idx"))
    .pivot(index="rank_idx", values="label", on="period", sort_columns=True)
)

This gets the job done: you first count occurrences, sort by frequency within each month, fake a running counter per month using a cumulative sum, and then pivot. It works, but it is not the most direct expression of “rank within partition.”

What’s really going on

The heart of the task is ranking category counts per month. While the cumulative sum trick produces a rank-like index, Polars already has a built-in ranking operation. The missing piece is to apply rank within each month, which is exactly what the window expression with over accomplishes. Another subtlety is tie handling: using the ordinal method means ties are broken arbitrarily. In practice, with ties you might see a different order such as CBA instead of CAB in a given month’s column.

A simpler solution with rank and over

Here is a more concise and idiomatic pipeline that counts, ranks per month, and pivots in one smooth flow.

(
    records.group_by("period", "label")
    .len()
    .with_columns(
        pl.col("len").rank("ordinal", descending=True).over("period").alias("rank_idx")
    )
    .sort("rank_idx")
    .pivot("period", index="rank_idx", values="label", sort_columns=True)
)

One practical detail: the resulting rank_idx is a u32, which differs from i64; if the integer type matters in your downstream logic, you can cast at the end.

Why this matters

Using rank with a window over the month eliminates incidental complexity and expresses intent directly. The pipeline is easier to read, reason about, and maintain, especially when the ranking step is the crux of the transformation. It also makes tie behavior explicit via the chosen rank method.

Takeaways

For ranking within groups in Polars, prefer rank with over rather than building ranks indirectly with cumulative sums. Count per (month, category), rank counts in descending order partitioned by month, sort by that rank, and pivot to align categories by rank across months. Be aware that ordinal ranking breaks ties arbitrarily, and cast the rank type if your output schema requires it.

The article is based on a question from StackOverflow by Simon and an answer by BallpointBen.

python python-polars