2025, Nov 14 01:00

Align two pandas DataFrames with duplicate MultiIndex keys: a diff-like view with cumcount, concat, compare

Get diff-like row alignment in pandas for duplicate MultiIndex DataFrames: add per-key cumcount, outer concat, and compare to reveal gaps and mismatches.

Data wrangling often needs a diff-like view: two tables with mostly similar rows, but with gaps, repeats, and partial overlaps. The goal is to align them row-for-row, the way a text comparison tool would, so matching entries line up and extra rows are visibly offset with NaN. Standard joins don’t produce this alignment when the indices are duplicated; instead, you get cartesian expansions or mismatched placement of missing values.

Minimal example: the setup and why naive joins fall short

The following sample builds two pandas DataFrames with a triple index of year, pos, and score. They share some keys and differ on others, and both contain duplicate index entries.

import pandas as pd

payload_a = {"year": [2023] * 7,
             "pos": [1, 2, 4, 4, 4, 8, 8],
             "score": [15, 20, 30, 30, 30, 60, 60],
             "value": ["a", "b", "c", "c", "c", "d", "d"]}
left_frame = pd.DataFrame(payload_a).set_index(["year", "pos", "score"])

payload_b = {"year": [2023] * 9,
             "pos": [1, 1, 1, 3, 3, 4, 4, 8, 10],
             "score": [15, 15, 15, 25, 25, 30, 30, 60, 80],
             "value": ["v", "v", "v", "w", "w", "x", "x", "y", "z"]}
right_frame = pd.DataFrame(payload_b).set_index(["year", "pos", "score"])

joined = left_frame.join(right_frame, how="outer", lsuffix="_l", rsuffix="_r")
print(joined)

l_aligned, r_aligned = left_frame.align(right_frame)
print(l_aligned)
print(r_aligned)

A plain outer join duplicates entries in the presence of repeated index keys and does not create the diff-like row alignment with empty slots where appropriate. align also isn’t meant for this—it only aligns on keys, not by expanding to match counts of duplicates on both sides.

What’s actually happening

pandas operations assume that an index uniquely identifies a row. When keys repeat, joins and aligns behave correctly by their definitions, but they don’t mimic a text diff. You need a way to make each repeated key distinguishable so the Nth occurrence on the left lines up with the Nth occurrence on the right.

Solution 1: add a per-key counter to make rows uniquely addressable, then concat

The strategy is straightforward. Within each key tuple (year, pos, score), assign a 0-based sequence number to each occurrence. Append that sequence as a new index level. Now both sides have unique index keys, and a simple outer concat produces the desired alignment. Finally, sort for readability.

left_unique = left_frame.set_index(
    left_frame.groupby(level=[0, 1, 2]).cumcount(),
    append=True
)

right_unique = right_frame.set_index(
    right_frame.groupby(level=[0, 1, 2]).cumcount(),
    append=True
)

aligned_view = pd.concat([left_unique, right_unique], axis=1, join='outer').sort_index()
print(aligned_view)

Output:

                 value value
year pos score              
2023 1   15    0     a     v
               1   NaN     v
               2   NaN     v
     2   20    0     b   NaN
     3   25    0   NaN     w
               1   NaN     w
     4   30    0     c     x
               1     c     x
               2     c   NaN
     8   60    0     d     y
               1     d   NaN
     10  80    0   NaN     z

This is the diff-like alignment we want: repeated keys are expanded in order, matching positions line up, and extras are surfaced as NaN on the opposite side.

Solution 2: diff-style display using DataFrame.compare with unified indexing

If you prefer an explicit side-by-side difference view, reindex both uniquely keyed frames to the union of their multi-indices and then call compare.

full_index = left_unique.index.union(right_unique.index)
result = left_unique.reindex(full_index).compare(right_unique.reindex(full_index))
print(result)

Output:

                 value      
                  self other
year pos score              
2023 1   15    0     a     v
               1   NaN     v
               2   NaN     v
     2   20    0     b   NaN
     3   25    0   NaN     w
               1   NaN     w
     4   30    0     c     x
               1     c     x
               2     c   NaN
     8   60    0     d     y
               1     d   NaN
     10  80    0   NaN     z

This keeps the alignment and highlights where values differ or are missing between the two inputs.

Why this matters

When you are auditing pipelines, reconciling outputs, or building regression checks, a human-readable, row-aligned comparison reduces mental overhead. You can spot mismatches, extras, and missing entries at a glance without reverse-engineering cartesian expansions caused by duplicate keys.

Wrap-up

To get a diff-like alignment with duplicate multi-index keys, assign a per-key occurrence counter as an additional index level and then combine frames with an outer concat. For an inspection-friendly view, reindex both sides to the same full multi-index and use compare. In quick checks, the first approach tends to be faster, though this can vary with dataframe size. The key idea is simple: make each repeated key unique in order, then let pandas do what it does best—aligned operations on well-defined indices.