2025, Sep 23 01:00

Polars head(n) in group_by().agg(): Keep columns aligned and get the first N rows per group

Learn how Polars group_by().agg() with head(n) preserves row alignment across columns, when misalignment can occur, and why group_by(...).head(n) can be safer.

When you aggregate in Polars and apply head(n) on multiple columns within the same group_by().agg(), do you actually get values taken from the same original rows? That’s a practical question if you’re slicing the first n items per group across several fields and expect them to line up.

Example

The setup below groups by a key and applies head(2) to three columns at once. The goal is consistent alignment, i.e., that topic, vec and flag originate from the same two rows within each group.

import polars as pl

# toy data
tbl = pl.DataFrame({
    "grp": ["A", "A", "A", "B", "B"],
    "topic": ["i1", "i2", "i3", "i4", "i5"],
    "vec": ["e1", "e2", "e3", "e4", "e5"],
    "flag": ["a1", "a2", "a3", "a4", "a5"],
})

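# head(2) is evaluated per group, so each output column holds a list
# with the first two values of that group.
# Note: the order of the groups in the output is not guaranteed unless
# you pass maintain_order=True to group_by.
res = (
    tbl.group_by("grp")
    .agg([
        pl.col("topic").head(2).alias("topics"),
        pl.col("vec").head(2).alias("vecs"),
        pl.col("flag").head(2).alias("flags"),
    ])
)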

print(res)

What’s actually happening

Within each group, Polars preserves the row order. Applying head on a column inside agg() therefore takes the first n rows of that specific group. Because the group’s internal order is stable, doing this for multiple columns will pick values from the same original row indices.

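The Polars documentation for the maintain_order parameter of group_by says so explicitly: "Within each group, the order of rows is always preserved, regardless of this argument." That parameter only controls the order in which the groups themselves appear in the output.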

So there’s no misalignment risk when using head this way. A relevant contrast is sample(): if you sample several columns separately within agg(), each expression can draw different rows, so the resulting lists need not correspond to each other.

Solution and a safer alternative

The head-in-agg approach is fine and aligns as expected. If you prefer to work with actual rows instead of list-aggregated columns, you can also use group_by(...).head(n), which returns pre-exploded rows from each group. This removes any doubt about alignment because you directly get the first n rows per group.

import polars as pl

# same data
tbl = pl.DataFrame({
    "grp": ["A", "A", "A", "B", "B"],
    "topic": ["i1", "i2", "i3", "i4", "i5"],
    "vec": ["e1", "e2", "e3", "e4", "e5"],
    "flag": ["a1", "a2", "a3", "a4", "a5"],
})

# returns the first 2 rows within each group (not the whole dataframe)
rows = tbl.group_by("grp").head(2)
print(rows)

It’s easy to assume head(n) without agg would operate on the entire DataFrame, but in this form it scopes to each group and takes the first n rows per group.

Why this matters

Feature engineering, ranking, and top-K extraction often rely on taking the first n items from multiple columns in lockstep. Knowing that head preserves row alignment inside groups lets you confidently aggregate without post-fixups. If you need row-shaped outputs rather than lists, group_by(...).head(n) keeps results simple and predictable.

Takeaways

Applying head(n) on multiple columns inside group_by().agg() preserves alignment because group order is preserved and head selects the first rows of each group. For an even clearer path, group_by(...).head(n) returns the first n rows within every group as rows, not lists. Keep in mind that not all operations share this property; for instance, sampling multiple columns inside agg() can misalign them.

The article is based on a question from StackOverflow by K_Augus and an answer by BallpointBen.