2025, Oct 04 05:00

Flatten a pandas column of dicts into a clean id-schedule_name mapping with a simple stack pipeline

Learn how to flatten nested JSON dict columns in pandas into a tidy two-column table (id, schedule_name) using DataFrame, stack, and deduplication. fast.

Flattening nested JSON into a clean, analysis-ready table is a recurring task. A common scenario: a pandas column holds dictionaries where keys are identifiers and values are human-readable labels. The goal is to turn every key–value pair across all rows into a tidy two-column dataframe with id and schedule_name. Doing this correctly across the entire column, not just for a single row, is what often trips people up.

The starting point

Suppose a dataframe column contains dictionaries with schedule mappings per row. Conceptually it looks like this:

tbl['schedules']
0  {'3263524': 'Group 1 CORE DAYS', '3263525': 'Group 1 CORE NIGHTS', '3263526': 'Group 1 EDUCATION', '3263527': 'Group 1 ROUNDING'}
1  {'3263524': 'Group 1 CORE DAYS', '3881368': 'VS Days', '3881370': 'VS Education Shift A', '3881455': 'VS Education Shift B'}

The desired long-form dataframe contains one row per pair drawn from any row’s dictionary:

id        schedule_name
3263524   Group 1 CORE DAYS
3263525   Group 1 CORE NIGHTS
3263526   Group 1 EDUCATION
3263527   Group 1 ROUNDING
3881368   VS Days
3881370   VS Education Shift A
3881455   VS Education Shift B

A naïve attempt that falls short

It’s tempting to point from_dict at one of the dictionaries, but that only handles a single row and doesn’t shape the column names or index the way we need:

sched_map = pd.DataFrame.from_dict(tbl['schedules'].iloc[0], orient='index')

This approach ignores all other rows and still requires extra work to get the final column names and a consistent index.

What’s really going on

The clean solution starts by aligning all dictionary keys across rows as columns. Converting the series of dicts into a wide dataframe does exactly that, placing every unique key into its own column and using NaN when a key is missing in a particular row. From there, stacking pivots the data into a long format so each key–value pair becomes its own row. Bringing the stacked index level down into a column exposes the dictionary keys as real data. Renaming the two columns finishes the tidy shape, and removing duplicates ensures repeated pairs are collapsed.

The solution

This pandas-only pipeline puts it all together:

flat_schedules = (
    pd.DataFrame(tbl['schedules'].tolist())
      .stack().reset_index(level=1)
      .set_axis(['id', 'schedule_name'], axis=1)
      .drop_duplicates(ignore_index=True)
)

Output:

id         schedule_name
3263524    Group 1 CORE DAYS
3263525    Group 1 CORE NIGHTS
3263526    Group 1 EDUCATION
3263527    Group 1 ROUNDING
3881368    VS Days
3881370    VS Education Shift A
3881455    VS Education Shift B

Why this works, step by step

The first step, DataFrame(tbl['schedules'].tolist()), materializes a wide table by aligning keys as columns. You can see the effect clearly when inspecting that intermediate:

3263524              3263525            3263526  \
0  Group 1 CORE DAYS  Group 1 CORE NIGHTS  Group 1 EDUCATION   
1  Group 1 CORE DAYS                  NaN                NaN   

            3263527  3881368               3881370               3881455  
0  Group 1 ROUNDING      NaN                   NaN                   NaN  
1               NaN  VS Days  VS Education Shift A  VS Education Shift B  

Applying stack() flips this wide matrix into a long index–value series where each row corresponds to a single key–value pair, including pairs from every original row:

0  3263524       Group 1 CORE DAYS
   3263525     Group 1 CORE NIGHTS
   3263526       Group 1 EDUCATION
   3263527        Group 1 ROUNDING
1  3263524       Group 1 CORE DAYS
   3881368                 VS Days
   3881370    VS Education Shift A
   3881455    VS Education Shift B

reset_index(level=1) moves the dict keys from the stacked index into a regular column. set_axis(['id', 'schedule_name'], axis=1) labels the two columns. drop_duplicates(ignore_index=True) removes any repeated pairs and reindexes the result from zero.

Why this detail matters

When ingesting nested JSON, landing it in a tidy, predictable schema upfront prevents downstream friction. Properly aligning keys across rows before pivoting avoids silent data loss and weird corner cases during joins, filters, or aggregations. A compact pipeline that converts from a column of dicts to a clean long-form table is easier to reason about, test, and reuse.

Takeaways

Use a two-phase reshape: first go wide by aligning dictionary keys as columns, then go long with stack to emit one row per key–value pair. Finish by exposing keys as a column, applying meaningful column names, and deduplicating repeated pairs. This sequence yields a reliable id and schedule_name mapping you can feed into the rest of your pandas workflow with confidence.

The article is based on a question from StackOverflow by skohrs and an answer by PaulS.