2025, Oct 17 02:00
Filter PySpark arrays of structs by per-row IDs: use array_contains in expr, avoid Python's in
Learn how to filter PySpark arrays of structs by per-row IDs using array_contains inside expr. Avoid Python's in, prevent syntax errors, keep it in Spark SQL.
Filtering array columns of structs in PySpark is a common need: you have an array of events per row and a companion array of ids that marks which events to keep. The sticking point appears when you try to use the Python in operator inside expr: it throws a syntax error instead of doing what you expect.
Problem setup
Imagine a DataFrame with a column activities of type array<struct<id, time>> and another column wanted_ids that stores the list of ids to keep per row. When the filter condition references a single integer column such as wanted_id, the pattern works smoothly.
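For concreteness, here is a minimal, hypothetical setup that matches this shape; the field types and sample values are illustrative assumptions, not part of any particular dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative rows: each carries its events, a single id, and a list of ids to keep.
dataset = spark.createDataFrame(
    [
        ([(1, '09:00'), (2, '10:30'), (3, '11:15')], 2, [1, 3]),
        ([(4, '08:45'), (5, '12:00')], 5, [5]),
    ],
    'activities array<struct<id: int, time: string>>, wanted_id int, wanted_ids array<int>',
)
Against data like this, the single-id filter reads as follows.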
from pyspark.sql import functions as Fn

# Keep only the events whose id equals this row's wanted_id,
# then project their time field (the result is an array of times).
dataset = dataset.withColumn(
    'kept_time',
    Fn.expr('filter(activities, it -> it.id == wanted_id)').getField('time')
)
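Note that getField('time') on the filtered array yields an array of times, not a single value, which is why kept_time above still holds an array. If at most one event is expected to match, one option, sketched against the hypothetical schema above, is to take the first element with element_at.
from pyspark.sql import functions as Fn

# element_at is 1-based; this picks the first (and presumably only) matching time.
dataset = dataset.withColumn(
    'kept_time',
    Fn.element_at(
        Fn.expr('filter(activities, it -> it.id == wanted_id)').getField('time'),
        1,
    )
)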
But as soon as the single wanted_id becomes an array column, wanted_ids, and you try to switch == to in, the expression parser fails.
from pyspark.sql import functions as Fn

# Fails: Spark's SQL parser rejects `in wanted_ids` with a syntax error.
dataset = dataset.withColumn(
    'kept_events',
    Fn.expr('filter(activities, it -> it.id in wanted_ids)')
)
Why this breaks
The string passed to expr is parsed as Spark SQL, not Python, so the familiar Python in operator does not carry over. Spark SQL's IN expects an explicit value list such as IN (1, 2, 3) or a subquery on its right-hand side, not an array column, which is why it.id in wanted_ids fails with a syntax error rather than producing a valid predicate. In Spark SQL, the idiomatic way to test whether an element exists in an array is array_contains.
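For contrast, Spark SQL's IN does parse when the right-hand side is an explicit literal list; it is only the array-column form that the parser rejects. A quick sketch against the hypothetical data above:
from pyspark.sql import functions as Fn

# Parses and runs: IN with an inline literal list is valid Spark SQL.
dataset.select(Fn.expr('filter(activities, it -> it.id in (1, 3))')).show(truncate=False)

# Fails to parse: IN cannot take an array column as its right-hand side.
# dataset.select(Fn.expr('filter(activities, it -> it.id in wanted_ids)'))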
The fix
Use array_contains inside the filter lambda within expr. This keeps the logic per-row: each struct's id is compared against the wanted_ids array in the same row, and only matching events are returned.
from pyspark.sql import functions as Fn
dataset = dataset.withColumn(
    'kept_events',
    Fn.expr('filter(activities, it -> array_contains(wanted_ids, it.id))')
)
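If you prefer to stay in the DataFrame API instead of a SQL string, a roughly equivalent sketch, assuming Spark 3.1+ where pyspark.sql.functions.filter accepts a Python lambda, looks like this:
from pyspark.sql import functions as Fn

# Same per-row logic, written with the higher-order filter function and a lambda.
dataset = dataset.withColumn(
    'kept_events',
    Fn.filter('activities', lambda it: Fn.array_contains('wanted_ids', it['id']))
)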
If you only need the time field of the filtered events, project it from the resulting array of structs.
from pyspark.sql import functions as Fn
dataset = dataset.withColumn(
    'kept_times',
    Fn.expr('filter(activities, it -> array_contains(wanted_ids, it.id))').getField('time')
)
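Equivalently, the projection can stay entirely inside the SQL expression by wrapping the filtered array in transform; this sketch should yield the same array of times as the getField version.
from pyspark.sql import functions as Fn

# transform maps each kept struct to its time field, producing an array of times.
dataset = dataset.withColumn(
    'kept_times',
    Fn.expr(
        'transform(filter(activities, it -> array_contains(wanted_ids, it.id)), it -> it.time)'
    )
)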
Why this matters
Using array_contains is the reliable way to express membership checks on array columns inside expr. It avoids parser errors, keeps the logic entirely in Spark SQL, and cleanly handles row-wise filtering of array<struct> data.
Takeaways
When filtering an array of structs by a per-row list of ids, avoid the Python in operator inside expr. Lean on array_contains within filter to express the condition directly in Spark SQL. If you need a specific field from the filtered structs, access it with getField on the result.