2025, Dec 19 11:00

How to Select Pandas DataFrame Columns by Substring: Build One Regex or Use a Python List Comprehension

Learn how to select pandas DataFrame columns by substring using a regex or a Python list comprehension. Avoid exact-match traps and filter columns reliably.

When you need to select pandas DataFrame columns by a set of substrings rather than exact names, a straightforward intersection won’t help. This is a common scenario in analytics pipelines: you have a list like ["date", "cat", "rabbit"] and want every column whose name contains these tokens, such as cats_fostered, cats_adopted, rabbits_fostered, and so on.

Problem setup

The goal is to filter columns by substrings from a list. Directly intersecting with column names fails because only exact matches are retained, and passing a list to regex-based filtering raises an error.

import pandas as pd
raw_map = {
    "date": ["2023-01-22","2023-11-16","2024-06-30","2024-08-16","2025-01-22"],
    "cats_fostered": [1,2,3,4,5],
    "cats_adopted": [1,2,3,4,5],
    "dogs_fostered": [1,2,3,4,5],
    "dogs_adopted": [1,2,3,4,5],
    "rabbits_fostered": [1,2,3,4,5],
    "rabbits_adopted": [1,2,3,4,5]
}
pet_df = pd.DataFrame(raw_map)
lookup_terms = ["date", "cat", "rabbit"]
# Does not work: intersection requires exact column name matches
pet_df[pet_df.columns.intersection(lookup_terms)]
# Does not work: filter(regex=...) expects a string pattern, not a list
pet_df.filter(regex=lookup_terms)

What’s actually going on

The core issue is a mismatch between exact matching and substring matching. The DataFrame has columns like cats_fostered and rabbits_adopted, but your list contains cat and rabbit. Using columns.intersection(...) keeps only column names that exactly equal elements in the list, which drops everything except date. Meanwhile, DataFrame.filter(regex=...) expects a single regex string, not a list of strings, so passing the list directly raises an error.
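To make both failure modes concrete, here is a minimal sketch (using a trimmed version of the example frame) showing that intersection silently keeps only the exact match date, while passing a list as the regex raises an exception:

```python
import pandas as pd

pet_df = pd.DataFrame({"date": [1], "cats_fostered": [2], "rabbits_adopted": [3]})
lookup_terms = ["date", "cat", "rabbit"]

# Exact matching: only "date" equals an element of the list, so the
# animal columns are silently dropped
kept = pet_df[pet_df.columns.intersection(lookup_terms)]

# filter(regex=...) expects a single pattern string; a list raises
try:
    pet_df.filter(regex=lookup_terms)
    raised = False
except Exception:
    raised = True
```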

Solution

Build a single regex pattern from the substrings and hand it to filter. It's convenient to hardcode date in the pattern and let the list hold only the animal keys. This way, you match any column whose name contains cat or rabbit anywhere, plus the date column.

# Focus the list on animals only and hardcode "date" in the pattern
animal_keys = ["cat", "rabbit"]
regex_rule = "date|" + "|".join(animal_keys)
pet_df.filter(regex=regex_rule)
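One caveat worth hedging against: the tokens are joined into a regex verbatim, so a token containing regex metacharacters (say, a hypothetical column family named c++_total) would be misinterpreted. Wrapping each token in re.escape keeps the matching literal; a minimal sketch with a trimmed frame:

```python
import re
import pandas as pd

pet_df = pd.DataFrame({
    "date": ["2023-01-22"],
    "cats_fostered": [1],
    "rabbits_adopted": [2],
    "dogs_adopted": [3],
})
animal_keys = ["cat", "rabbit"]
# re.escape treats each token as a literal string, so characters like
# "+" or "." in a token cannot be misread as regex syntax
regex_rule = "date|" + "|".join(re.escape(k) for k in animal_keys)
selected = pet_df.filter(regex=regex_rule)
```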

If you need only one animal, the approach is unchanged; just adjust the list.

animal_keys = ["cat"]
regex_rule = "date|" + "|".join(animal_keys)
pet_df.filter(regex=regex_rule)

There is also a non-regex, string-operations approach that mirrors the same idea: assemble the list of columns to keep with a comprehension and then subset the frame. The in check matches a substring anywhere in the column name.

animal_keys = ["cat", "rabbit"]
chosen_cols = [c for c in pet_df.columns if c == "date" or any(mark in c for mark in animal_keys)]
pet_df.loc[:, chosen_cols]
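A closely related vectorized variant, sketched here as an alternative to the comprehension, builds a boolean mask over the column Index with str.contains and subsets with .loc:

```python
import pandas as pd

pet_df = pd.DataFrame({
    "date": ["2023-01-22"],
    "cats_fostered": [1],
    "cats_adopted": [2],
    "rabbits_adopted": [3],
    "dogs_adopted": [4],
})
animal_keys = ["cat", "rabbit"]
# True for every column name that matches the joined pattern
mask = pet_df.columns.str.contains("date|" + "|".join(animal_keys))
subset = pet_df.loc[:, mask]
```

Note that str.contains interprets its argument as a regex by default, so the same escaping considerations apply as with filter(regex=...).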

Why this matters

In data wrangling, column naming conventions often encode semantics through prefixes or infixes. Relying on exact matches is brittle and leads to empty selections or errors. Constructing a single regex or using simple string checks makes the intent explicit, scales well as the schema evolves, and avoids surprising failures when names include suffixes like _fostered or _adopted.

Takeaways

If you’re filtering columns by a list of tokens, treat the operation as substring matching. Build one regex string with "|".join(...) and pass it to DataFrame.filter(regex=...), optionally hardcoding stable columns like date. If regex isn’t your preference, derive the column list with a comprehension that checks in or startswith and subset with .loc. This keeps your selection logic concise, predictable, and easy to adapt as new similarly named columns show up.
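If prefix matching is what you actually want, rather than matching anywhere in the name, the startswith variant mentioned above can be sketched like this (startswith accepts a tuple of prefixes, so no inner loop is needed):

```python
import pandas as pd

pet_df = pd.DataFrame({
    "date": ["2023-01-22"],
    "cats_fostered": [1],
    "dogs_adopted": [2],
    "rabbits_adopted": [3],
})
animal_keys = ("cat", "rabbit")
# Keep "date" plus any column whose name begins with one of the prefixes
chosen = [c for c in pet_df.columns if c == "date" or c.startswith(animal_keys)]
subset = pet_df.loc[:, chosen]
```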