2026, Jan 12 17:00
Find customer pairs who share a StockCode in Polars using an efficient lazy self-join
Learn how to find customer pairs sharing at least one StockCode in Polars with a lazy self-join, avoiding blowups and speeding up co-purchase analysis
Finding all customer pairs who bought at least one common product is a classic co-occurrence problem. If you try to build pairs by enumerating all customers and then comparing their product sets, you hit a combinatorial wall fast. With Polars, you can push this work into a join that surfaces only the pairs that actually share a product, skipping the blow-up of generating everything first.
Repro: the situation and the naive direction
Suppose you have a transactional table with CustomerID, StockCode and Total. The goal is to emit unique pairs of customers that purchased at least one identical StockCode.
import polars as pl
tx = pl.from_repr("""
┌────────────┬───────────┬───────┐
│ CustomerID ┆ StockCode ┆ Total │
│ ---        ┆ ---       ┆ ---   │
│ str        ┆ str       ┆ f64   │
╞════════════╪═══════════╪═══════╡
│ A          ┆ 123       ┆ 45.78 │
│ A          ┆ 140       ┆ 10.26 │
│ B          ┆ 125       ┆ 99.62 │
│ B          ┆ 128       ┆ 23.65 │
│ B          ┆ 140       ┆ 92.95 │
│ C          ┆ 123       ┆ 45.78 │
│ D          ┆ 145       ┆ 7.58  │
└────────────┴───────────┴───────┘
""")A straightforward, but inefficient approach is to enumerate all possible customer pairs and compare their purchases later. For example:
from itertools import combinations
# Distinct, non-null customer IDs as a plain Python list.
id_pool = (
    tx.get_column("CustomerID")
    .unique()
    .drop_nulls()
    .to_list()
)
every_pair = combinations(id_pool, 2)

This creates all possible pairs whether they share a product or not, pushing unnecessary work downstream.
What actually causes the inefficiency
The bottleneck comes from generating the full Cartesian space of customer pairs before you have any evidence that a given pair co-purchased anything. You pay the cost up front and only filter afterwards. With n customers there are n(n-1)/2 candidate pairs, so the work grows quadratically even though most pairs never overlap on any StockCode.
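A quick back-of-the-envelope check in plain Python shows how fast this grows:

from math import comb

# Unordered customer pairs: n * (n - 1) / 2, i.e. quadratic growth.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} customers -> {comb(n, 2):>13,} candidate pairs")

At 100,000 customers that is already about five billion candidate pairs, before a single purchase comparison has happened.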
Efficient approach: self-join on StockCode
Instead, filter at the source by joining the table to itself on StockCode. This way you only ever materialize pairs that share at least one product. To avoid mirrored duplicates like (A, B) and (B, A), keep a single ordering.
pairs_df = (
    tx.lazy()
    .join_where(
        tx.lazy(),
        # Same product, and a strict ordering on IDs so each pair
        # appears exactly once, never mirrored as both (A, B) and (B, A).
        (pl.col("StockCode") == pl.col("StockCode_right"))
        & (pl.col("CustomerID") < pl.col("CustomerID_right")),
    )
    # Keep both CustomerID columns from the join result.
    .select("^CustomerID.*$")
    # A pair sharing several products would otherwise appear several times.
    .unique()
    .collect(engine="streaming")
)

This yields exactly the customer pairs that share at least one StockCode, once per pair. The comparison CustomerID < CustomerID_right enforces a canonical order, so mirrored duplicates never survive. The regex projection ^CustomerID.*$ keeps both CustomerID columns, and unique collapses the repeats that appear when a pair shares more than one product. Passing engine="streaming" to collect runs the query on the streaming engine, which processes the data in batches rather than materializing intermediate results all at once.
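join_where is a relatively recent addition to Polars; on versions that lack it, the same result can be expressed, as a sketch under that assumption, with an ordinary equi-join on StockCode and the ordering applied as a post-join filter:

# Equivalent formulation for Polars versions without join_where:
# equi-join on the shared key, then keep one ordering per pair.
pairs_df = (
    tx.lazy()
    .join(tx.lazy(), on="StockCode")
    .filter(pl.col("CustomerID") < pl.col("CustomerID_right"))
    .select("CustomerID", "CustomerID_right")
    .unique()
    .collect(engine="streaming")
)

Both forms return the same pairs; the ordering condition simply moves from the join predicate into a filter applied after the join.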
┌────────────┬──────────────────┐
│ CustomerID ┆ CustomerID_right │
│ --- ┆ --- │
│ str ┆ str │
╞════════════╪══════════════════╡
│ A          ┆ C                │
│ A          ┆ B                │
└────────────┴──────────────────┘

Why this matters
Downstream tasks like computing pairwise similarity (dot product, cosine, or any co-purchase metric) benefit massively from receiving only relevant pairs as input. By using a self-join keyed on StockCode, you prune the dataset early and avoid the all-pairs combinatorial expansion. This pattern keeps both time and memory in check as the data grows, and it aligns with Polars’ strengths in lazy query planning and efficient joins.
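As a hypothetical next step, the same join can feed a simple co-purchase strength score by counting how many distinct products each pair shares; the shared_products column name below is illustrative, not from the original:

# Hypothetical follow-up: score each surviving pair by the number of
# distinct products it shares. "shared_products" is an illustrative name.
pair_strength = (
    tx.lazy()
    .join_where(
        tx.lazy(),
        (pl.col("StockCode") == pl.col("StockCode_right"))
        & (pl.col("CustomerID") < pl.col("CustomerID_right")),
    )
    .group_by("CustomerID", "CustomerID_right")
    .agg(pl.col("StockCode").n_unique().alias("shared_products"))
    .collect(engine="streaming")
)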
Takeaways
When you need pairs that share a property, do not build the entire universe of combinations. Push the constraint into the data pipeline with a self-join on the shared key and enforce ordering to keep each pair only once. In Polars, a lazy self-join with a selective projection and a final unique step gives you exactly the signal you want, ready to feed into your similarity computation with optional thresholds applied later.
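For example, applying an optional threshold to the co-purchase count sketched above is a one-line filter; the cutoff of 2 is illustrative:

# Keep only pairs sharing at least two distinct products (illustrative cutoff).
strong_pairs = pair_strength.filter(pl.col("shared_products") >= 2)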