2025, Nov 01 11:00
Row-wise comparisons in Polars: handling None semantics with is_in, struct keys, and multi-key joins
Learn how Polars handles nulls in row-wise comparisons: is_in on a struct versus multi-key joins, plus how to get pandas-like behavior with explicit rules.
Row-wise comparisons across DataFrames look deceptively simple until nulls enter the picture. In Polars, the default behavior for None can differ depending on whether you use is_in with a struct or a multi-key join. If you expect pandas-style semantics where None never matches, you need to be explicit about it. This guide shows the mismatch, explains what’s going on, and demonstrates how to force consistent behavior.
Reproducing the mismatch
The setup uses a full table and its subset. We will compare rows across the two, including cases where a key column contains None.
import polars as pl
rec1 = {"foo": "a", "bar": "b", "baz": "c"}
rec2 = {"foo": "x", "bar": "y", "baz": "z"}
rec3 = {"foo": "a", "bar": "b", "baz": None}
rec4 = {"foo": "m", "bar": "n", "baz": "o"}
rec5 = {"foo": "x", "bar": "y", "baz": None}
rec6 = {"foo": "a", "bar": "b", "baz": None}
all_rows = [rec1, rec2, rec3, rec4, rec5, rec6]
subset_rows = [rec1, rec2, rec3]
key_cols = ["foo", "bar", "baz"]
df_all = pl.DataFrame(all_rows)
df_sub = pl.DataFrame(subset_rows)
key_struct = (
    df_sub
    .select(pl.struct(pl.col(key_cols)).alias("key_struct"))
    .get_column("key_struct")
)
Using is_in on a struct of all columns marks row membership in the subset. Notice how rows with None in baz can come back as matched.
df_all.with_columns(
    pl.struct(pl.all()).is_in(key_struct.implode()).alias("hit")
)
A plain multi-key join, however, does not match nulls and therefore returns only the fully non-null matches.
df_all.join(df_sub, on=key_cols)
Joining on a single struct key aligns with the is_in behavior and can match rows where the composite key contains None.
df_all.join(df_sub, on=pl.struct(key_cols))
Why the results differ
The discrepancy stems from how the operations interpret equality with nulls. A multi-key join on separate columns does not treat null as equal to null. In contrast, using a composite struct key for either is_in or join evaluates membership or equality over the entire struct, and in the example above this can result in rows with None being considered matches. The two approaches are therefore not interchangeable when nulls are present.
Version note on is_in and null handling
The null propagation of is_in changed between Polars 1.27.1 and 1.28.0. The following minimal example shows that a scalar null checked against a list containing null used to yield true and now yields null.
import polars as pl
tab = pl.select(a=None, b=[None])
tab = tab.cast({"a": pl.String, "b": pl.List(pl.String)})
print(tab.with_columns(c=pl.col.a.is_in("b")))
With polars 1.27.1 the result was:
shape: (1, 3)
┌──────┬───────────┬──────┐
│ a ┆ b ┆ c │
│ str ┆ list[str] ┆ bool │
╞══════╪═══════════╪══════╡
│ null ┆ [null] ┆ true │
└──────┴───────────┴──────┘
With polars 1.28.0 it became:
shape: (1, 3)
┌──────┬───────────┬──────┐
│ a ┆ b ┆ c │
│ str ┆ list[str] ┆ bool │
╞══════╪═══════════╪══════╡
│ null ┆ [null] ┆ null │
└──────┴───────────┴──────┘
For a nested left-hand side, such as a list containing a null compared against a list of lists containing a null, the example yields true:
import polars as pl
tab2 = pl.select(a=[None], b=[[None]])
tab2 = tab2.cast({"a": pl.List(pl.String), "b": pl.List(pl.List(pl.String))})
print(tab2.with_columns(c=pl.col.a.is_in("b")))
shape: (1, 3)
┌───────────┬─────────────────┬──────┐
│ a ┆ b ┆ c │
│ list[str] ┆ list[list[str]] ┆ bool │
╞═══════════╪═════════════════╪══════╡
│ [null] ┆ [[null]] ┆ true │
└───────────┴─────────────────┴──────┘
These examples show that null semantics in is_in are nuanced and version-dependent. When correctness depends on a specific interpretation, encode it explicitly.
Getting pandas-like behavior in Polars
By default, pandas treats None as not matching in a typical DataFrame.isin followed by all(axis=1) workflow. Reproducing that in Polars requires ruling out rows that contain nulls before checking row-wise membership.
import polars as pl
rec1 = {"foo": "a", "bar": "b", "baz": "c"}
rec2 = {"foo": "x", "bar": "y", "baz": "z"}
rec3 = {"foo": "a", "bar": "b", "baz": None}
rec4 = {"foo": "m", "bar": "n", "baz": "o"}
rec5 = {"foo": "x", "bar": "y", "baz": None}
rec6 = {"foo": "a", "bar": "b", "baz": None}
all_rows = [rec1, rec2, rec3, rec4, rec5, rec6]
subset_rows = [rec1, rec2, rec3]
key_cols = ["foo", "bar", "baz"]
df_all = pl.DataFrame(all_rows)
df_sub = pl.DataFrame(subset_rows)
key_struct = (
    df_sub
    .select(pl.struct(pl.col(key_cols)).alias("key_struct"))
    .get_column("key_struct")
)
result = df_all.with_columns(
    pl.all_horizontal(
        pl.all().is_not_null(),
        pl.struct(pl.all()).is_in(key_struct.implode())
    ).alias("hit")
)
print(result)
If you prefer to see the pandas baseline for comparison, the equivalent there requires no extra guard because None values do not match by default in this pattern:
import pandas as pd
pd_all = pd.DataFrame(all_rows)
pd_sub = pd.DataFrame(subset_rows)
pd_all["hit_like_pandas"] = pd_all[key_cols].isin(pd_sub).all(axis=1)
print(pd_all)
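The Polars expression uses all_horizontal to combine two conditions: first, that every column in the row is non-null, and second, that the whole row as a struct is a member of the subset’s struct series. This produces the same outcome as the pandas snippet for the provided data. Keep in mind that DataFrame.isin with a DataFrame argument aligns on index and column labels, so the pandas snippet compares rows position by position; the two results agree here because the subset rows sit at the same positions in both frames.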
Why this matters
Cross-frame comparisons are fundamental for de-duplication, filtering, and integrity checks. Small differences in None semantics can surface as surprising mismatches or missing matches, depending on whether you reach for a multi-key join, a struct join, or is_in. Since is_in behavior around nulls has changed between versions, relying on implicit defaults can yield different results over time. Encoding the intended semantics—whether nulls should match or not—makes the logic robust.
Takeaways
Use a struct when you need true row-wise matching across multiple columns. Expect a multi-key join on separate columns to drop matches where any key is null, and expect a struct-based is_in or join to be able to match rows whose composite key contains None. If you want pandas-style behavior where None never counts as a match in this scenario, combine a non-null guard with struct-based is_in, as shown above. When results depend on null handling, make the rule explicit rather than relying on defaults.
The article is based on a question from StackOverflow by dewser_the_board and an answer by jqurious.