2025, Sep 22 17:00
Filter pandas GroupBy by original index values: keep specific rows and compare group counts
Filter pandas grouped data by original index values, keep only needed rows, and count surviving groups—without misusing GroupBy.filter or doing a second groupby
Filtering grouped data in pandas by a subset of original index values sounds simple until you need both the unfiltered grouping and the filtered view, and you also want to compare how many groups survive after the filter. The catch is that DataFrameGroupBy.filter removes whole groups based on an aggregate predicate, while here the goal is to keep only specific rows inside each group by index and then reason about the resulting groups.
Problem setup
Assume a DataFrame with multiple grouping columns and a separate data column. The data is grouped by those columns, and there is a list of index labels that must be kept. The objective is to obtain the grouped view after filtering by index and, in some workflows, compare the number of groups before and after the filter without paying for a second groupby unless strictly necessary.
import pandas as pd
# Example frame: two grouping columns, one data column, and string labels as the index.
frame = pd.DataFrame(
    data={
        "g0": ["foo", "foo", "bar", "bar"],
        "g1": ["baz", "baz", "baz", "qux"],
        "data": [0.1, 0.3, 0.4, 0.2],
    },
    index=["a", "b", "c", "d"],
)
# The unfiltered grouping and the index labels that must be kept.
bunches = frame.groupby(by=["g0", "g1"], sort=False)
keep_idx = ["a", "b", "d"]
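Before any filtering, a quick sanity check (plain pandas, nothing specific to the solutions below) shows the three groups this setup produces:
print(bunches.ngroups)              # 3 groups before filtering
print([key for key, _ in bunches])  # [('foo', 'baz'), ('bar', 'baz'), ('bar', 'qux')]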
Why DataFrameGroupBy.filter is not the right tool here
DataFrameGroupBy.filter evaluates a predicate on each group and either keeps the entire group or drops it. In this task the requirement is different: retain only those rows whose original index is in a given list, potentially leaving some groups partially populated or empty. That rules out GroupBy.filter for this scenario.
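To see the mismatch concretely, here is a small illustration, not part of the solution: a size-based predicate passed to filter keeps or drops entire groups, so there is no way to express per-row retention by index.
whole_groups_only = bunches.filter(lambda part: len(part) > 1)
print(whole_groups_only.index.tolist())  # ['a', 'b']: whole groups survive or vanish, never individual rows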
Solution 1: Filter first, then group
If a filtered grouped view is all that’s needed, the most direct route is to subset the DataFrame by index and group the result. This yields the correct group counts after filtering because groups with no remaining rows simply don’t exist in the grouped object.
filtered_bunches = frame[frame.index.isin(keep_idx)].groupby(by=["g0", "g1"], sort=False)
This approach is straightforward and gives a clean grouped object that reflects only the retained indices.
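With the example data, the filtered grouped object reflects exactly two surviving groups, which can be confirmed directly:
print(filtered_bunches.ngroups)              # 2: the group that only held "c" is gone
print([key for key, _ in filtered_bunches])  # [('foo', 'baz'), ('bar', 'qux')]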
Solution 2: Keep the original groups and derive a filtered view without regrouping
When the unfiltered grouping must be preserved and you want to avoid a second groupby, iterate over the existing groups and slice each group by the desired indices. This keeps the original grouping intact and produces per-group DataFrames filtered by index.
filtered_parts = [part[part.index.isin(keep_idx)] for _, part in bunches]
If you specifically need the number of groups that remain non-empty after filtering, discard the empty slices; the resulting count matches what regrouping the filtered frame would report.
nonempty_parts = [
    kept
    for _, part in bunches
    # Slice each group only once (walrus assignment, Python 3.8+) and keep non-empty results.
    if not (kept := part[part.index.isin(keep_idx)]).empty
]
original_group_count = len(bunches)
filtered_group_count = len(nonempty_parts)
Understanding the group counts after filtering
Consider the example above. There are three groups before filtering. With keep_idx = ["a", "b", "d"], the group containing only index "c" disappears after filtering, so two groups remain. With keep_idx = ["a", "c", "d"], the group shared by "a" and "b" still exists because "a" is present, and the group for "c" is also present, so the total stays at three.
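The same arithmetic can be verified against the original GroupBy without regrouping. The surviving_groups helper below is written just for this check; it is not a pandas API, only a thin wrapper around the non-empty-slice idea from Solution 2.
def surviving_groups(grouped, keep):
    # Count groups that still contain at least one row whose index label is in `keep`.
    return sum(1 for _, part in grouped if not part[part.index.isin(keep)].empty)

print(surviving_groups(bunches, ["a", "b", "d"]))  # 2: the group that only held "c" disappears
print(surviving_groups(bunches, ["a", "c", "d"]))  # 3: every group keeps at least one row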
Why this matters
Index-based subsetting within grouped data comes up in pipelines where partial retention of groups is required, or when comparing the structure of the data before and after a filter. Using boolean indexing against the original index preserves the intent precisely. It also avoids the semantics of group-level filtering that would otherwise drop entire groups.
Takeaways
If you only need the filtered grouped view, filter by index first and group once. If you need to keep both the unfiltered groups and a filtered perspective without performing a second groupby, reuse the original GroupBy to slice each chunk. When the task is to compare the number of groups that survive the filter, count only the non-empty slices to match what a regroup on the filtered DataFrame would produce.