2025, Nov 12 15:00

How to Sample from Polars List Columns: Take Head, Random Middle, Tail Safely and Keep Unique Values

Learn a robust Polars pattern to sample list columns: take head and tail, pick two random middle values safely, avoid ShapeError, and return unique results.

Sampling from list columns in Polars is straightforward until you need to mix fixed slices with a random subset and keep the result safe for short lists. This guide shows how to pick the first two values, the last two values, and two random values from the middle of each list, then keep unique elements. It also explains why a naïve approach fails and how to implement a robust solution that works for any list length.

Reproducing the setup and the failure

The data has a list column, and the requirement is to take two from the head, two random values from the middle portion, and two from the tail. If the list has six or fewer elements, return the whole list. Here is a compact example and an initial attempt that breaks:

import polars as pl

data = pl.DataFrame(
    {
        "key": ["a", "b"],
        "nums": [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]],
    }
)

broken = (
    data.select(
        top=pl.col("nums").list.head(2),
        mid=pl.col("nums")
            .list.set_difference(pl.col("nums").list.head(2))
            .list.set_difference(pl.col("nums").list.tail(2))
            .list.sample(2, seed=1234),
        bottom=pl.col("nums").list.tail(2),
    )
    .select(pick=pl.concat_list(["top", "mid", "bottom"]).list.unique())
)

This attempt throws an exception because the middle slice may contain fewer than two elements in short lists:

ShapeError: cannot take a larger sample than the total population when `with_replacement=false`

Why this breaks

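The middle part is computed by removing the first two and the last two values from the list. For lists with five or fewer elements, that remainder holds fewer than two values; for [1, 2, 3, 4, 5], only [3] is left. Sampling without replacement with n greater than the available population is invalid, which is exactly what triggers the ShapeError. Note also that list.set_difference removes by value rather than by position, so repeated values in a list can shrink the remainder further.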

An idiomatic way to pick two from the middle

Instead of asking list.sample for a fixed count that might be infeasible, shuffle the middle segment and take the first two from the shuffled result. By using fraction=1 with shuffle=True, the middle list is permuted but not resized, and head(2) safely limits the output to two or fewer values depending on availability.

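data.select(
    pl.col("nums")
      # keep everything except the last two elements
      .list.head(pl.col("nums").list.len() - 2)
      # then drop the first two, leaving only the middle
      .list.slice(2)
      # permute the middle without changing its length
      .list.sample(fraction=1, shuffle=True)
      # keep at most two of the shuffled values
      .list.head(2)
      .alias("mid_two")
)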

This expression isolates the middle portion by cutting off two elements from each end, shuffles that portion, and picks up to two values without risking an oversample.

Assembling the final selection with a length guard

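The next step is to combine head, shuffled middle, and tail, and to return the original list when it is too short. The switch condition is whether the list length is greater than five. A six-element list takes the sampling branch but still comes back whole, because two head, two middle, and two tail values cover all six positions. The result keeps unique elements as requested.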

result = data.with_columns(
    pl.when(pl.col("nums").list.len() > 5)
      .then(
          pl.concat_list(
              pl.col("nums").list.head(2),
              pl.col("nums")
                .list.head(pl.col("nums").list.len() - 2)
                .list.slice(2)
                .list.sample(fraction=1, shuffle=True)
                .list.head(2),
              pl.col("nums").list.tail(2),
          )
      )
      .otherwise(pl.col("nums"))
      .list.unique()
      .alias("pick")
)

For lists longer than five, this yields two from the start, two randomly chosen from the middle segment, and two from the end. For shorter lists, it falls back to the original list. Because list.unique is applied after both branches, the final list contains unique values in either case; element order is not guaranteed unless you pass maintain_order=True to list.unique.

Alternative: make sampling size conditional

Another approach is to leave the middle list unshuffled and conditionally set how many items to sample. The n argument of list.sample accepts an expression, so it can be computed per row: request two items when the list is long enough and zero otherwise, which makes the entire expression safe for all lengths.

alt = data.with_columns(
    pl.when(pl.col("nums").list.len() > 5)
      .then(
          pl.concat_list(
              pl.col("nums").list.head(2),
              pl.col("nums")
                .list.head(pl.col("nums").list.len() - 2)
                .list.slice(2)
                .list.sample(
                    n=pl.when(pl.col("nums").list.len() > 5).then(2).otherwise(0)
                ),
              pl.col("nums").list.tail(2),
          )
      )
      .otherwise(pl.col("nums"))
      .list.unique()
      .alias("pick")
)

This variant avoids shuffling entirely and uses a guard on the sample size to prevent errors. It produces the same kind of result as the shuffled version, up to six values per row, and honors the uniqueness requirement.

Why this pattern is worth remembering

Mixing deterministic slices with randomized picks is a common data-wrangling task. The pattern of shuffling with fraction=1 and trimming with head is robust and keeps you away from edge cases where the requested sample size exceeds the available population. Using when/then to switch between the transformed list and the original one makes the entire pipeline predictable and maintainable.

Takeaways

When sampling from list columns in Polars, avoid requesting a fixed n that can be larger than the available population. Either shuffle and slice or make n conditional. Combine list.head, list.tail, and list.slice to segment the list, use list.sample with fraction=1, shuffle=True to permute a segment safely, and select between the original and the assembled result with when/then. Finish with list.unique when you need to deduplicate the final selection.
