Nov 15, 2025 01:00

Fix Pandas 'Unable to parse string "nan"' after upgrade: add regex=True in str.replace and drop NA after melt

Resolve Pandas 2.x ValueError 'Unable to parse string "nan"' by restoring regex behavior in str.replace(regex=True) and dropping NA after melt for to_numeric.

Turning a quarterly spreadsheet into a clean, two-column Pandas dataframe should be a routine step. Yet after a library upgrade, the same code can suddenly crash with a ValueError about parsing "nan". If that sounds familiar, the root cause is not your data; it is a subtle change in Pandas 2.0 that flips the behavior of a key string operation.

What we are trying to do

The source is a wide table with numbered columns. The goal is a long, tidy dataframe with two columns, LIST_NUM and ITEM_ID, with one row per value. After reshaping, any empty entries must be discarded so everything converts cleanly to numeric.
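
For concreteness, the snippets below can be run against a toy stand-in for the real sheet. The names and values here are invented for illustration; the actual workbook will differ.

import numpy as np
import pandas as pd

# Hypothetical stand-in for the spreadsheet: numbered columns holding
# digit strings, gaps stored as NaN, and one entirely empty column.
source_df = pd.DataFrame({
    '1': ['101', '102', '103'],
    '2': ['201', '202', np.nan],
    '3': [np.nan, np.nan, np.nan],
})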

The code that started failing

Here’s a distilled version of the transformation pipeline as it was originally used. It reshapes the dataframe, strips non-digits, and converts to numbers.

import pandas as pd

frame_raw = source_df
# Drops only columns that are entirely NA; individual gaps survive. Note that
# inplace=True also mutates source_df, since frame_raw is the same object.
frame_raw.dropna(axis=1, how='all', inplace=True)
frame_long = pd.melt(frame_raw, var_name='LIST_NUM', value_name='ITEM_ID')
# str.replace is called without regex=True, relying on the old default.
frame_long['LIST_NUM'] = pd.to_numeric(
    frame_long['LIST_NUM'].astype(str).str.replace(r'[^\d]', ''),
    errors='raise')
frame_long['ITEM_ID'] = pd.to_numeric(
    frame_long['ITEM_ID'].astype(str).str.replace(r'[^\d]', ''),
    errors='raise')

After an upgrade to Pandas 2.1.4 (from a pre-2.0 version), this began raising ValueError: Unable to parse string "nan". Interestingly, the same code still worked for a colleague on Pandas 1.4.4.

What actually broke and why

The failure is caused by a backward-incompatible default change in Pandas 2.0.0. In that release, the default for the regex parameter of Series.str.replace() switched from True to False. Because the code doesn’t specify regex=True, the pattern r"[^\d]" is no longer treated as a regular expression. That means the replacement step stops removing non-digits, and critically, after astype(str), a missing value becomes the literal string "nan". With the unwanted text still present, to_numeric(..., errors='raise') hits "nan" and throws.
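
The flip is easy to see in isolation. This small check, independent of the pipeline above, contrasts the two modes on a series containing a missing value:

import numpy as np
import pandas as pd

as_text = pd.Series(['101', np.nan]).astype(str)      # becomes ['101', 'nan']
print(as_text.str.replace(r'[^\d]', ''))              # 2.x default (regex=False): the literal text '[^\d]' never occurs, so nothing changes
print(as_text.str.replace(r'[^\d]', '', regex=True))  # ['101', '']: every non-digit character is stripped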

There’s a second source of confusion. The dropna call in the original pipeline uses axis=1, how='all', which removes only columns that are entirely NA. It doesn’t eliminate individual NA entries in the melted item column. After reshaping, those entries are still there, and after astype(str) they become "nan" strings. In other words, melt isn’t the problem; the cleaning order and the str.replace default are.
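
Continuing with the toy source_df from above, the column-wise dropna removes only the all-NA column '3', and the single gap in column '2' rides along through the melt:

trimmed = source_df.dropna(axis=1, how='all')  # removes only column '3'
melted = pd.melt(trimmed, var_name='LIST_NUM', value_name='ITEM_ID')
print(melted['ITEM_ID'].isna().sum())          # 1: the gap from column '2' is still present
print(melted['ITEM_ID'].astype(str).iloc[-1])  # 'nan': astype(str) turns it into literal text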

The fix: make regex behavior explicit

The minimal, targeted fix is to pass regex=True to str.replace so the pattern is interpreted as intended in all Pandas versions.

import pandas as pd

frame_raw = source_df
frame_raw.dropna(axis=1, how='all', inplace=True)  # still removes only all-NA columns
frame_long = pd.melt(frame_raw, var_name='LIST_NUM', value_name='ITEM_ID')
frame_long['LIST_NUM'] = pd.to_numeric(
    frame_long['LIST_NUM'].astype(str).str.replace(r'[^\d]', '', regex=True),
    errors='raise')
frame_long['ITEM_ID'] = pd.to_numeric(
    frame_long['ITEM_ID'].astype(str).str.replace(r'[^\d]', '', regex=True),
    errors='raise')

This change restores the pre-2.0 semantics of str.replace and prevents "nan" strings from slipping into the numeric conversion.
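
On the toy frame, the pipeline now runs end to end. One detail worth knowing: the regex collapses the "nan" strings to empty strings, and pd.to_numeric treats an empty string as missing, so the gap comes back as NaN instead of an error:

print(frame_long.dtypes)                   # LIST_NUM and ITEM_ID are both numeric
print(frame_long['ITEM_ID'].isna().sum())  # 1: the gap survives as NaN rather than raising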

An order that avoids stray NaNs downstream

If the intent is to drop empty entries from the item column after reshaping, do that right after melt. This aligns the cleaning step with where the missing values actually appear.

import pandas as pd

base_df = source_df
reshaped = (pd.melt(base_df, var_name='LIST_NUM', value_name='ITEM_ID')
              .dropna(subset=['ITEM_ID']))  # discard empty item entries right after melt
reshaped['LIST_NUM'] = pd.to_numeric(
    reshaped['LIST_NUM'].astype(str).str.replace(r'[^\d]', '', regex=True),
    errors='raise')
reshaped['ITEM_ID'] = pd.to_numeric(
    reshaped['ITEM_ID'].astype(str).str.replace(r'[^\d]', '', regex=True),
    errors='raise')

This sequencing removes the problematic NA rows before any string conversion, which keeps them from turning into the literal text "nan".
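
A quick check on the toy frame confirms the effect. The rows from the all-NA column disappear along with the single gap, which also makes the separate column-wise dropna unnecessary here:

print(len(reshaped))                     # 5: nine melted rows minus the four NA entries
print(reshaped['ITEM_ID'].isna().sum())  # 0: nothing left to become 'nan' text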

Why this matters

Version upgrades can silently flip defaults on widely used APIs, and pipelines built on implicit behavior can start producing errors or, worse, subtle data issues. Making intent explicit in keyword arguments, especially around text processing and type conversion, is a straightforward way to keep code stable across environments. Clear variable naming also helps; distinguishing the original wide dataframe from its melted version avoids confusion when reading and debugging transformation chains.

Takeaways

If a once-stable Pandas workflow suddenly fails after an upgrade with a ValueError around parsing "nan", check any use of .str.replace() that relies on regex syntax and add regex=True. Ensure that NA handling happens at the stage where the missing values actually exist (in this case, after melt) so they don't get converted into strings that break numeric parsing. Finally, when investigating environment-specific behavior, isolate the pipeline into a minimal reproducible example; that makes version-related differences much easier to spot and correct.
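
A sketch of such a minimal reproduction, assuming nothing beyond pandas and numpy themselves; it fails on 2.x and passes on 1.x, which pins the difference to the upgrade:

import numpy as np
import pandas as pd

print(pd.__version__)

df = pd.DataFrame({'1': ['101', np.nan]})
long_df = pd.melt(df, var_name='LIST_NUM', value_name='ITEM_ID')
# On pandas >= 2.0 this raises ValueError: Unable to parse string "nan",
# because str.replace now defaults to regex=False; on 1.x it succeeds
# (possibly with a FutureWarning about the upcoming default change).
print(pd.to_numeric(
    long_df['ITEM_ID'].astype(str).str.replace(r'[^\d]', ''),
    errors='raise'))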