2025, Oct 17 02:00
Filter PySpark arrays of structs by per-row IDs: use array_contains in expr, avoid Python's in
Learn how to filter PySpark arrays of structs by per-row IDs using array_contains inside expr. Avoid Python's in, prevent syntax errors, keep it in Spark SQL.
Filtering array columns of structs in PySpark is a common need: you have an array of events per row and a companion array of ids that marks which events to keep. The sticking point appears when you try to use the Python in operator inside expr: it throws a syntax error instead of doing what you expect.
Problem setup
Imagine a DataFrame with a column activities of type array<struct<id, time>> and another column wanted_ids that stores the list of ids to keep per row. When the filter condition references a single integer column such as wanted_id, the pattern works smoothly.
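For concreteness, here is a minimal, hypothetical setup that matches this shape; the field types and sample values are illustrative assumptions, not part of any particular dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative rows: each carries its events, a single id, and a list of ids to keep.
dataset = spark.createDataFrame(
    [
        ([(1, '09:00'), (2, '10:30'), (3, '11:15')], 2, [1, 3]),
        ([(4, '08:45'), (5, '12:00')], 5, [5]),
    ],
    'activities array<struct<id: int, time: string>>, wanted_id int, wanted_ids array<int>',
)
Against data like this, the single-id filter reads as follows.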
from pyspark.sql import functions as Fn

# Keep only the events whose id equals this row's wanted_id,
# then project their time field (the result is an array of times).
dataset = dataset.withColumn(
    'kept_time',
    Fn.expr('filter(activities, it -> it.id == wanted_id)').getField('time')
)
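Note that getField('time') on the filtered array yields an array of times, not a single value, which is why kept_time above still holds an array. If at most one event is expected to match, one option, sketched against the hypothetical schema above, is to take the first element with element_at.
from pyspark.sql import functions as Fn

# element_at is 1-based; this picks the first (and presumably only) matching time.
dataset = dataset.withColumn(
    'kept_time',
    Fn.element_at(
        Fn.expr('filter(activities, it -> it.id == wanted_id)').getField('time'),
        1,
    )
)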
But as soon as the single wanted_id becomes an array column, wanted_ids, and you try to switch == to in, the expression parser fails.
from pyspark.sql import functions as Fn

# Fails: Spark's SQL parser rejects `in wanted_ids` with a syntax error.
dataset = dataset.withColumn(
    'kept_events',
    Fn.expr('filter(activities, it -> it.id in wanted_ids)')
)
Why this breaks
The string passed to expr is parsed as Spark SQL, not Python, so the familiar Python in operator does not carry over. Spark SQL's IN expects an explicit value list such as IN (1, 2, 3) or a subquery on its right-hand side, not an array column, which is why it.id in wanted_ids fails with a syntax error rather than producing a valid predicate. In Spark SQL, the idiomatic way to test whether an element exists in an array is array_contains.
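For contrast, Spark SQL's IN does parse when the right-hand side is an explicit literal list; it is only the array-column form that the parser rejects. A quick sketch against the hypothetical data above:
from pyspark.sql import functions as Fn

# Parses and runs: IN with an inline literal list is valid Spark SQL.
dataset.select(Fn.expr('filter(activities, it -> it.id in (1, 3))')).show(truncate=False)

# Fails to parse: IN cannot take an array column as its right-hand side.
# dataset.select(Fn.expr('filter(activities, it -> it.id in wanted_ids)'))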
The fix
Use array_contains inside the filter lambda within expr. This keeps the logic per-row: each struct's id is compared against the wanted_ids array in the same row, and only matching events are returned.
from pyspark.sql import functions as Fn
dataset = dataset.withColumn(
    'kept_events',
    Fn.expr('filter(activities, it -> array_contains(wanted_ids, it.id))')
)
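If you prefer to stay in the DataFrame API instead of a SQL string, a roughly equivalent sketch, assuming Spark 3.1+ where pyspark.sql.functions.filter accepts a Python lambda, looks like this:
from pyspark.sql import functions as Fn

# Same per-row logic, written with the higher-order filter function and a lambda.
dataset = dataset.withColumn(
    'kept_events',
    Fn.filter('activities', lambda it: Fn.array_contains('wanted_ids', it['id']))
)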
If you only need the time field of the filtered events, project it from the resulting array of structs.
from pyspark.sql import functions as Fn
dataset = dataset.withColumn(
    'kept_times',
    Fn.expr('filter(activities, it -> array_contains(wanted_ids, it.id))').getField('time')
)
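Equivalently, the projection can stay entirely inside the SQL expression by wrapping the filtered array in transform; this sketch should yield the same array of times as the getField version.
from pyspark.sql import functions as Fn

# transform maps each kept struct to its time field, producing an array of times.
dataset = dataset.withColumn(
    'kept_times',
    Fn.expr(
        'transform(filter(activities, it -> array_contains(wanted_ids, it.id)), it -> it.time)'
    )
)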
Why this matters
Using array_contains is the reliable way to express membership checks on array columns inside expr. It avoids parser errors, keeps the logic entirely in Spark SQL, and cleanly handles row-wise filtering of array<struct> data.
Takeaways
When filtering an array of structs by a per-row list of ids, avoid the Python in operator inside expr. Lean on array_contains within filter to express the condition directly in Spark SQL. If you need a specific field from the filtered structs, access it with getField on the result.