2025, Oct 17 02:00

Filter PySpark arrays of structs by per-row IDs: use array_contains in expr, avoid Python's in

Learn how to filter PySpark arrays of structs by per-row IDs using array_contains inside expr. Avoid Python's in, prevent syntax errors, keep it in Spark SQL.

Filtering array-of-struct columns in PySpark is a common need: you have an array of events per row and a companion array of ids that marks which events to keep. The sticking point appears when you reach for a Python-style in check inside expr: it throws a syntax error instead of doing what you expect.

Problem setup

Imagine a DataFrame with a column activities of type array<struct<id, time>> and another column wanted_ids that stores the list of ids to keep per row. As long as the filter condition references a single integer id, say a wanted_id column, the pattern works smoothly, as the example after the sample data below shows.
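
To make the snippets concrete, here is one hypothetical way to build such a DataFrame; the column names wanted_id and wanted_ids, the field types, and the sample values are illustrative, not taken from the original question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample: one row with three timed events, a single id to match,
# and a per-row array of ids to keep
dataset = spark.createDataFrame(
    [([(1, '08:00'), (2, '09:30'), (3, '11:15')], 2, [1, 3])],
    'activities array<struct<id:int, time:string>>, wanted_id int, wanted_ids array<int>'
)

With the sample data in place, the single-id version of the filter looks like this.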

from pyspark.sql import functions as Fn

# Keep only the events whose id equals this row's wanted_id,
# then pull the time field out of each surviving struct
dataset = dataset.withColumn(
    'kept_time',
    Fn.expr('filter(activities, it -> it.id == wanted_id)').getField('time')
)

But as soon as the id to match becomes an array column, wanted_ids, and you try to switch == to in, the expression parser fails.

from pyspark.sql import functions as Fn

# Broken: the parser rejects 'in wanted_ids' with a syntax error
dataset = dataset.withColumn(
    'kept_events',
    Fn.expr('filter(activities, it -> it.id in wanted_ids)')
)

Why this breaks

The string passed to expr is parsed as Spark SQL, not Python, so the Python in operator is not available there. What the parser sees is SQL's IN, which expects an explicit value list (for example id IN (1, 2, 3)) or a subquery, not an array column, so it.id in wanted_ids produces a syntax error rather than a valid predicate. In Spark SQL, the idiomatic way to test whether an element exists in an array column is array_contains.
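
A quick contrast makes this visible. With the hypothetical sample data above, IN parses fine when given a parenthesized list of literals inside the lambda; it just cannot take a column that holds an array.

from pyspark.sql import functions as Fn

# Parses and runs: SQL's IN accepts an explicit, hard-coded value list,
# though that fixes the ids for every row instead of reading them per row
dataset.withColumn(
    'literal_filtered',
    Fn.expr('filter(activities, it -> it.id in (1, 3))')
).show(truncate=False)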

The fix

Use array_contains inside the filter lambda within expr. This keeps the logic per-row, compares each struct.id against the array in the same row, and returns only matching events.

from pyspark.sql import functions as Fn

# Keep only the events whose id appears in this row's wanted_ids array
dataset = dataset.withColumn(
    'kept_events',
    Fn.expr('filter(activities, it -> array_contains(wanted_ids, it.id))')
)

If you only need the time field of the filtered events, project it from the resulting array of structs.

from pyspark.sql import functions as Fn

# Same filter, then project the time field of each surviving struct
# (the result is still an array, one time value per kept event)
dataset = dataset.withColumn(
    'kept_times',
    Fn.expr('filter(activities, it -> array_contains(wanted_ids, it.id))').getField('time')
)
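
Equivalently, the projection can stay entirely inside the SQL string by wrapping the filter in transform. This is just an alternative spelling of the same step, assuming the same columns as above.

from pyspark.sql import functions as Fn

# Filter the structs, then map each survivor to its time field, all in one expression
dataset = dataset.withColumn(
    'kept_times',
    Fn.expr(
        'transform(filter(activities, it -> array_contains(wanted_ids, it.id)), it -> it.time)'
    )
)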

Why this matters

Using array_contains is the reliable way to express membership checks on array columns inside expr. It avoids parser errors, keeps the logic entirely in Spark SQL, and cleanly handles row-wise filtering of array<struct> data.
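
If you prefer to avoid raw SQL strings, the same per-row check can be sketched with the DataFrame API's higher-order filter function. This assumes a recent Spark release: functions.filter exists from Spark 3.1, and passing a column as the value argument of array_contains is supported in recent versions.

from pyspark.sql import functions as Fn

# DataFrame-API spelling of the same per-row filter (assumes Spark 3.1+)
dataset = dataset.withColumn(
    'kept_events',
    Fn.filter('activities', lambda it: Fn.array_contains('wanted_ids', it['id']))
)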

Takeaways

When filtering an array of structs by a per-row list of ids, avoid the Python in operator inside expr. Lean on array_contains within filter to express the condition directly in Spark SQL. If you need a specific field from the filtered structs, access it with getField on the result.

The article is based on a question from StackOverflow by Nourless and an answer by jei.