2025, Nov 01 01:00

Avoid OOM when sampling massive columns in Polars LazyFrame: use row count, random indices, and gather for streaming-friendly selection

Sampling huge columns in Polars LazyFrame can OOM by materializing data. Use a memory-safe method: count rows, choose random indices, and gather for streaming.

Sampling a massive column from a Polars LazyFrame looks deceptively simple until the process explodes in memory. If your dataset is large and a single wide column (like raw web page text) is involved, a naive sample on a LazyFrame can still trigger a full materialization and crash. Here is how to reason about it and an alternative that can help in memory-bound workflows.

The setup and the failing approach

Data is scanned from Parquet with pl.scan_parquet(...) because a full DataFrame would not fit in memory. Sampling one large column causes the pipeline to fail even when the requested sample size is 1, while the same operation succeeds on a smaller column.

import polars as pl
src_path = "path/to/data.parquet"
lz = pl.scan_parquet(src_path)
samp_n = 1  # example size; even this blows up when the column is huge
subset = lz.select(
    pl.col("content_col").sample(n=samp_n, seed=0)
)
# (1) writing in one go fails here in this scenario
subset.sink_parquet("subset.parquet")
# (2) collecting the sampled result also fails
subset.collect()

What is really going on

Streaming sinks such as sink_parquet (and a collect run on the streaming engine) are designed to process data in batches, but sample is likely not optimized for streaming in this context and can force the engine to read the entire column into memory. With a very large text-like column, that is enough to blow past available RAM, even though the final sample itself would be tiny.
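
One way to sanity-check this before running anything heavy is to print the optimized query plan. What the plan shows depends on your Polars version and on whether the streaming engine is selected, so treat the following as a quick diagnostic sketch rather than a definitive answer; the path and column name are the placeholders from the setup above.

import polars as pl

lz = pl.scan_parquet("path/to/data.parquet")
query = lz.select(pl.col("content_col").sample(n=1, seed=0))

# Inspect the optimized logical plan before executing it. If the sample step
# sits directly above a full scan of the column, the whole column may be
# materialized when the query actually runs.
print(query.explain())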

A more memory-conscious path

Instead of sampling via sample, select by explicit row indices. First determine the number of rows, generate a set of random indices, and then gather only those rows. The key idea is to avoid an operation that implicitly materializes the whole column. The following demonstrates the approach.

import polars as pl
import numpy as np

plan = pl.LazyFrame({"x": [1, 2, 3], "y": [4, 5, 6]})  # stand-in for the real lazy scan
k = 2
# 1. Count rows without materializing the data itself.
total_rows = plan.select(pl.len()).collect().item()
# 2. Draw k distinct random row positions (sorted for more sequential access).
take_idx = sorted(np.random.choice(total_rows, size=k, replace=False))
# 3. Gather only those rows from the column of interest.
plan.select(pl.col("x").gather(take_idx)).collect()

This method uses gather with a precomputed list of row indices. What remains to verify in practice is whether gather, in your particular pipeline, can leverage streaming well enough to avoid running out of memory. If it does, this pattern lets you take a small, truly random slice from a massive dataset without inflating memory usage.
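
Applied back to the original Parquet scenario, the same pattern looks roughly like the sketch below. The file path and content_col are the placeholders from the setup above, and k, n_rows, and sample_idx are illustrative names introduced here.

import numpy as np
import polars as pl

lz = pl.scan_parquet("path/to/data.parquet")
k = 1_000  # desired sample size

# Counting rows does not require loading the heavy text column.
n_rows = lz.select(pl.len()).collect().item()

# Draw k distinct row positions up front; sorting favors sequential reads.
rng = np.random.default_rng(0)
sample_idx = np.sort(rng.choice(n_rows, size=k, replace=False)).tolist()

# Gather only those rows from the heavy column; the collected result is small.
sample = lz.select(pl.col("content_col").gather(sample_idx)).collect()

If you need the result on disk rather than in memory, sinking the gathered selection with sink_parquet is worth trying as well, subject to the same streaming caveats discussed above.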

Why remembering this helps

In lazy, streaming-first engines the behavior of each expression matters. Two seemingly similar operations can have very different execution plans. Knowing that sample can pull everything into memory, while an index-driven selection may avoid that, gives you a concrete strategy when you are constrained by RAM.

Wrap-up

If sampling a large column in Polars LazyFrame leads to out-of-memory errors, avoid sample and switch to an index-based selection: compute the row count, choose random row positions, and gather those rows. This keeps the workflow closer to the streaming execution model and reduces the chance of materializing the full, heavy column. Test this approach in your environment, especially on the wide column that caused the crash, and keep your sampling logic aligned with operations that can be planned efficiently by the engine.

The article is based on a question from StackOverflow by q.uijote and an answer by BallpointBen.