https://pytroubles.com/en/posts/id388-pandas-pivot-table-includes-unused-categories-how-to-fix-ghost-columns-with-observed-true

Pandas pivot_table includes unused categories: how to fix ghost columns with observed=True

Stop extra pivot columns in Pandas: exclude unused categorical values with observed=True

Pandas pivot_table includes unused categories: how to fix ghost columns with observed=True

Learn why Pandas pivot_table shows unused categories and how observed=True removes ghost columns. Understand categorical dtype behavior and the default change.

2025-10-02T11:00:05+03:00

2025-10-02T11:00:06+03:00

When you filter a pandas DataFrame to a subset of categorical values and then build a pivot table, it’s easy to be surprised by columns for categories you thought were gone. This often shows up with game platform data: you keep only a handful of platforms, but pivot_table still generates columns for every platform ever seen in the original data.Reproducing the issueSuppose you start from a games dataset with many platforms and reduce it to the six you care about. Everything looks correct on the filtered DataFrame, yet the pivot still brings back the whole zoo of categories.# Original dataframe df_games_src['platform'].unique() # New dataframe with only the six target platforms df_games_filtered = df_games_src[df_games_src['platform'].isin( ['3DS', 'PSV', 'WiiU', 'PS4', 'XOne', 'PC'])].sort_values(by=['year', 'platform']).reset_index(drop=True) df_games_filtered['platform'].unique() # Pivot table creation (unexpected extra columns appear) subset_sales = df_games_filtered[['name', 'platform', 'total_sales']] sales_wide = subset_sales.pivot_table( values='total_sales', index='name', columns='platform', aggfunc='sum') sales_wideThe result shows a table with columns for platforms far beyond the six selected, even though the filtered DataFrame’s unique values list only those six.What’s going onThis behavior is tied to pandas’ categorical dtype. By default, various methods, including pivot_table, will include unused categories in the output. That means categories that exist in the dtype but don’t appear in the actual rows of your filtered data still make it into the result. The pandas user guide covers this in the section on categorical operations: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#operations.The fixThe most direct way to tell pivot_table to ignore unused categories is to set observed=True. With that flag, only categories actually present in the filtered data are considered when building the output.subset_sales = df_games_filtered[['name', 'platform', 'total_sales']] sales_wide = subset_sales.pivot_table( values='total_sales', index='name', columns='platform', aggfunc='sum', observed=True) sales_wideThis produces a pivot that includes only the six platforms in your filter.Minimal illustrationHere’s a compact example that shows the difference between the default behavior and observed=True.import pandas as pd # Base table with four categories base_tbl = pd.DataFrame({ 'name': ['a', 'b', 'c', 'd'], 'platform': ['w', 'x', 'y', 'z'], 'total_sales': [1, 2, 3, 4] }) # Make 'platform' categorical base_tbl = base_tbl.astype({'platform': 'category'}) # Keep only rows where platform is x or z narrow_tbl = base_tbl[base_tbl['platform'].isin(['x', 'z'])] # Default: unused categories appear in the pivot narrow_tbl.pivot_table(index='name', columns='platform', values='total_sales', aggfunc='sum') # platform w x y z # name # b 0 2 0 0 # d 0 0 0 4 # With observed=True: only used categories are included narrow_tbl.pivot_table(index='name', columns='platform', values='total_sales', aggfunc='sum', observed=True) # platform x z # name # b 2.0 NaN # d NaN 4.0Why this mattersBeyond making outputs cleaner and easier to read, there is also an impending behavioral change. Your code should actually warn (on at least v2.3.2) as the default will change. The message is explicit:FutureWarning: The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behaviorBeing explicit with observed=True aligns your code with the future default and prevents surprises as pandas evolves.Wrap-upIf you filter a categorical column and still see ghost categories in pivot outputs, it’s not your filter, it’s the default categorical handling. Make pivot_table consider only the categories that actually occur by passing observed=True. This keeps your tables focused on the data you intentionally kept and prepares your code for the upcoming default change.

pandas pivot_table, observed=True, unused categories, categorical dtype, ghost columns, filter categorical values, pivot table pandas, FutureWarning, observed default change, pandas categories

2025

2025, Oct 02 11:00

Stop extra pivot columns in Pandas: exclude unused categorical values with observed=True

Learn why Pandas pivot_table shows unused categories and how observed=True removes ghost columns. Understand categorical dtype behavior and the default change.

Reproducing the issue

Suppose you start from a games dataset with many platforms and reduce it to the six you care about. Everything looks correct on the filtered DataFrame, yet the pivot still brings back the whole zoo of categories.

# Original dataframe
df_games_src['platform'].unique()
# New dataframe with only the six target platforms
df_games_filtered = df_games_src[df_games_src['platform'].isin(
    ['3DS', 'PSV', 'WiiU', 'PS4', 'XOne', 'PC'])].sort_values(by=['year', 'platform']).reset_index(drop=True)
df_games_filtered['platform'].unique()
# Pivot table creation (unexpected extra columns appear)
subset_sales = df_games_filtered[['name', 'platform', 'total_sales']]
sales_wide = subset_sales.pivot_table(
    values='total_sales', index='name', columns='platform', aggfunc='sum')
sales_wide

The result shows a table with columns for platforms far beyond the six selected, even though the filtered DataFrame’s unique values list only those six.

What’s going on

This behavior is tied to pandas’ categorical dtype. By default, various methods, including pivot_table, will include unused categories in the output. That means categories that exist in the dtype but don’t appear in the actual rows of your filtered data still make it into the result. The pandas user guide covers this in the section on categorical operations: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#operations.

The fix

The most direct way to tell pivot_table to ignore unused categories is to set observed=True. With that flag, only categories actually present in the filtered data are considered when building the output.

subset_sales = df_games_filtered[['name', 'platform', 'total_sales']]
sales_wide = subset_sales.pivot_table(
    values='total_sales', index='name', columns='platform', aggfunc='sum', observed=True)
sales_wide

This produces a pivot that includes only the six platforms in your filter.

Minimal illustration

Here’s a compact example that shows the difference between the default behavior and observed=True.

import pandas as pd
# Base table with four categories
base_tbl = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'platform': ['w', 'x', 'y', 'z'],
    'total_sales': [1, 2, 3, 4]
})
# Make 'platform' categorical
base_tbl = base_tbl.astype({'platform': 'category'})
# Keep only rows where platform is x or z
narrow_tbl = base_tbl[base_tbl['platform'].isin(['x', 'z'])]
# Default: unused categories appear in the pivot
narrow_tbl.pivot_table(index='name', columns='platform', values='total_sales', aggfunc='sum')
# platform  w  x  y  z
# name
# b         0  2  0  0
# d         0  0  0  4
# With observed=True: only used categories are included
narrow_tbl.pivot_table(index='name', columns='platform', values='total_sales', aggfunc='sum', observed=True)
# platform    x    z
# name
# b         2.0  NaN
# d         NaN  4.0

Why this matters

Beyond making outputs cleaner and easier to read, there is also an impending behavioral change. Your code should actually warn (on at least v2.3.2) as the default will change. The message is explicit:

FutureWarning: The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior

Being explicit with observed=True aligns your code with the future default and prevents surprises as pandas evolves.

Wrap-up

If you filter a categorical column and still see ghost categories in pivot outputs, it’s not your filter, it’s the default categorical handling. Make pivot_table consider only the categories that actually occur by passing observed=True. This keeps your tables focused on the data you intentionally kept and prepares your code for the upcoming default change.

The article is based on a question from StackOverflow by RicardoDLM and an answer by jqurious.

categories dataframe isin pandas python