2025, Oct 18 14:00

Polars Duplicate Detection Without Short-Circuiting: Faster Single- and Multi-Column Strategies

Learn why Polars' is_duplicated().any() doesn't short-circuit in LazyFrames, and which duplicate checks are faster instead: unique vs len for a single column, rle_id for multi-column keys.

Detecting duplicates in Polars often looks like a case for early exit: as soon as the first duplicate appears, stop and report true. In practice, the naive approach doesn’t short-circuit, and that detail alone can turn a seemingly optimal plan into a full scan of the entire dataset. Below is a walkthrough of why that happens, how to verify it, and what to use instead for both single- and multi-column checks.

Problem setup

Consider a helper that attempts to return true on the first duplicate across a subset of columns. The intent is to exit early by combining is_duplicated with any.

import polars as pl

def try_fast_dupe_flag(lf: pl.LazyFrame, keys: list[str]) -> bool:
    """Attempt to exit early on the first duplicate."""
    return (
        lf.select(
            # is_duplicated flags every row; any() then reduces the mask to a bool
            pl.struct(keys).is_duplicated().any()
        )
        .collect()
        .item()
    )

The question is whether this is the most efficient approach, especially when duplicates appear very early.

Why it doesn’t short-circuit

Even if a duplicate is found immediately, this expression doesn’t end early. The core reason is that is_duplicated doesn’t know about the later any call in the expression tree. It has no information that a single true is enough to stop, so it proceeds to compute over the full input. That means the placement of duplicates—early or late—doesn’t change the overall runtime.
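
One quick way to convince yourself is to inspect the optimized query plan. A minimal sketch (the exact plan text varies by Polars version):

lf = pl.LazyFrame({'x': [1, 1, 2]})
print(lf.select(pl.struct('x').is_duplicated().any()).explain())
# the printed plan is a plain SELECT of the whole expression; nothing
# in it encodes "stop after the first True"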

You can observe that by timing two large LazyFrame pipelines: one with duplicates at the head and another with duplicates at the tail. The results are essentially identical.

# 100M integers with an extra 0 prepended: the duplicate pair sits at the head
lead = (
    pl.select(x=pl.concat_list(0, pl.int_ranges(0, 100_000_000)))
      .explode('x')
      .lazy()
)

# the same values with the extra 0 appended: the duplicate pair sits at the tail
trail = (
    pl.select(x=pl.concat_list(pl.int_ranges(0, 100_000_000), 0))
      .explode('x')
      .lazy()
)

%%timeit
try_fast_dupe_flag(lead, ['x'])
# 5.12 s ± 290 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
try_fast_dupe_flag(trail, ['x'])
# 5.14 s ± 157 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Because the computation cannot short-circuit inside the expression graph, both cases cost about the same. To truly short-circuit, you would need a plugin designed to stop early or resort to map_batches with a manual Python loop, which will typically be slower except in extreme scenarios where the first elements immediately collide.

Short-circuiting in Python with map_batches

If you want to see early exit behavior in action, a Python implementation with map_batches can do it, although this approach is not suitable for production pipelines due to the Python-level loop and its overhead.

import time

def python_dupe_probe(s: pl.Series) -> pl.Series:
    """Quadratic pairwise scan that returns as soon as two equal values meet."""
    start = time.time()
    for ii, lval in enumerate(s):
        for jj, rval in enumerate(s[ii + 1:]):
            # bail out rather than let the O(n^2) scan run indefinitely
            if time.time() - start > 10:
                print(f"took too long, at i={ii} j={jj}")
                return pl.Series([None], dtype=pl.Boolean)
            if lval == rval:
                return pl.Series([True])  # genuine early exit
    return pl.Series([False])

lead.select(pl.struct('x').map_batches(python_dupe_probe)).collect()
# this takes 0.0sec

trail.select(pl.struct('x').map_batches(python_dupe_probe)).collect()
# took too long, at i=0 j=27725000

In the extreme case where the first two values are equal, the function returns almost instantly. When duplicates are far apart, processing becomes prohibitive.

Faster alternatives without early exit

If you only need to check a single column, comparing the cardinality of unique values to the total count is very efficient and works well regardless of duplicate positions.

lead.select(pl.col('x').unique().len() != pl.col('x').len()).collect()
trail.select(pl.col('x').unique().len() != pl.col('x').len()).collect()
# each takes about 0.7s
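
As an aside, the same comparison can be written a bit more compactly with n_unique. This sketch should be equivalent, though I haven't timed it separately:

lead.select(pl.col('x').n_unique() != pl.col('x').len()).collect()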

Switching to a struct-based unique on a single column, however, is dramatically slower.

lead.select(pl.struct('x').unique().len() != pl.struct('x').len()).collect()
trail.select(pl.struct('x').unique().len() != pl.struct('x').len()).collect()
# each takes about 1m 8s

For multiple columns, sorting and leveraging rle_id is a practical technique that outperforms the is_duplicated approach, though it is still slower than the single-column unique vs len method.

lead.select(
    pl.struct('x').sort().rle_id().max() + 1 != pl.struct('x').len()
).collect()

trail.select(
    pl.struct('x').sort().rle_id().max() + 1 != pl.struct('x').len()
).collect()
# each takes about 3s

The idea is straightforward. Sorting brings equal rows next to each other, so they form runs, and rle_id assigns each run a zero-based index. If all rows are unique, every run has length one, and the maximum rle_id is exactly one less than the total length. Any duplicate merges at least two rows into a single run and lowers that maximum.
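
A toy example makes the counting concrete (values chosen for illustration):

pl.DataFrame({'x': [3, 1, 2, 1]}).select(pl.col('x').sort().rle_id())
# sorted order is [1, 1, 2, 3], so the run ids are [0, 0, 1, 2]:
# max id + 1 == 3 runs != 4 rows, hence a duplicate exists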

Practical function

Putting the pieces together, a practical helper can branch between the highly efficient single-column strategy and the multi-column rle_id trick. It doesn’t short-circuit, but it is faster in practice than the initial is_duplicated + any approach.

def dupe_exists(lazy_ds: pl.LazyFrame, keys: list[str]) -> bool:
    """No early exit, but fast in practice."""
    if len(keys) == 1:
        # single column: compare the count of unique values to the total count
        return (
            lazy_ds.select(
                pl.col(keys[0]).unique().len() != pl.col(keys[0]).len()
            )
            .collect()
            .item()
        )
    else:
        # multiple columns: sort the struct and count runs via rle_id
        return (
            lazy_ds.select(
                pl.struct(keys).sort().rle_id().max() + 1 != pl.struct(keys).len()
            )
            .collect()
            .item()
        )
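
For example, on the frames from earlier:

dupe_exists(lead, ['x'])   # True: the prepended 0 collides
dupe_exists(trail, ['x'])  # True as well, in roughly the same time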

Why this matters

When working with Polars LazyFrame expressions, it is easy to assume that an aggregation like any will signal the nodes feeding it to stop as soon as the answer is known. In reality, upstream nodes like is_duplicated operate without awareness of later aggregations in the chain, so they compute over the full input. Understanding this helps avoid performance traps and makes it clear when a manual short-circuit is even worth considering. If the data distribution only occasionally places duplicates early, a Python loop via map_batches is unlikely to pay off. When duplicates are consistently at the very beginning, a specialized plugin or custom logic might be justified.
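
If that situation applies, one option short of writing a plugin is a coarse-grained probe that collects the LazyFrame slice by slice and runs the fast vectorized check per chunk. The sketch below is a hypothetical illustration, not part of the original discussion; the function name, chunk size, and single-key restriction are all assumptions:

def probe_dupes_chunked(lf: pl.LazyFrame, key: str, chunk: int = 1_000_000) -> bool:
    """Hypothetical coarse early exit: stop after the first chunk
    that introduces a duplicate instead of scanning everything."""
    seen: pl.DataFrame | None = None
    offset = 0
    while True:
        batch = lf.slice(offset, chunk).select(key).collect()
        if batch.height == 0:
            return False  # frame exhausted without finding a duplicate
        # duplicate inside the chunk itself?
        if batch[key].n_unique() != batch.height:
            return True
        # duplicate against keys seen in earlier chunks?
        if seen is not None and batch.join(seen, on=key, how="semi").height > 0:
            return True
        seen = batch if seen is None else pl.concat([seen, batch])
        offset += chunk

The cross-chunk join grows with the prefix, so this only pays off when duplicates really do appear early; in the worst case it does strictly more work than the one-shot vectorized check.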

Conclusion

Early exit sounds tempting, but Polars expressions won’t short-circuit just because any appears later in the pipeline. For a single column, comparing unique length to total length is an efficient and simple way to detect duplicates. For multiple columns, sorting and using rle_id provides a strong option without resorting to Python-level loops. If true short-circuiting is a hard requirement, the viable routes are a purpose-built plugin or a manual map_batches implementation, with the latter making sense only in extreme cases.

This article is based on a StackOverflow question by Nicolò Cavalleri and an answer by Dean MacGregor.