2026, Jan 01 05:00

How to merge pandas DataFrames safely when files are missing: avoid UnboundLocalError with a clean list-based pattern

Learn a robust pandas pattern to merge DataFrames by common columns, skip missing files, and avoid UnboundLocalError using list-based loading and concatenation.

When you merge multiple pandas dataframes by intersecting their columns, the smallest slip—like a missing file—can surface as an UnboundLocalError. The pattern is familiar: you guard assignments with os.path.exists, but still reference a dataframe that was never created. The goal is to avoid brittle if-else chains, gracefully skip absent inputs, and keep the merge logic concise.

Problem setup

The following snippet illustrates the situation. If any file is missing, a dataframe might never be assigned, yet is still referenced downstream, causing a crash.

import os
import pandas as pd

# Each read is guarded, so a missing file silently leaves its variable unassigned.
if os.path.exists(path_a):
    frame_a = pd.read_csv(path_a, header=None, names=colnames_a, sep=",", index_col=None)
if os.path.exists(path_b):
    frame_b = pd.read_csv(path_b, header=None, names=colnames_b, sep=",", index_col=None)
if os.path.exists(path_c):
    frame_c = pd.read_csv(path_c, header=None, names=colnames_c, sep=",", index_col=None)

# The lines below reference all three frames unconditionally; if any file
# was absent, the corresponding name was never bound and the lookup fails.
shared_fields = frame_a.columns.intersection(frame_b.columns).intersection(frame_c.columns)
trim_a = frame_a[shared_fields]
trim_b = frame_b[shared_fields]
trim_c = frame_c[shared_fields]
stacked = pd.concat([trim_a, trim_b, trim_c], ignore_index=True)

What’s going wrong and why

The error UnboundLocalError: local variable 'frame_b' referenced before assignment surfaces when a file is absent and the corresponding dataframe variable is never bound. Despite guarding the reads with os.path.exists, the later operations unconditionally access those variables. The core issue isn’t the intersection itself; it’s referencing names that might not exist because an earlier conditional branch was skipped. Repetition across three nearly identical blocks makes this more error-prone and harder to extend.
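The mechanism is easy to reproduce in isolation. Here is a minimal, hypothetical sketch; load_and_use and missing_file.csv are placeholders, not names from the snippet above:

import os
import pandas as pd

def load_and_use(path):
    if os.path.exists(path):
        df = pd.read_csv(path)
    # When the file is missing, df is never assigned inside this function,
    # so the return line trips over an unbound local name.
    return df.head()

try:
    load_and_use("missing_file.csv")  # assumed not to exist
except UnboundLocalError as exc:
    print(exc)  # exact wording varies by Python version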

Two questions point toward the fix. First, do you actually need the intermediate data as separate dataframes, or only the final result? Second, rather than juggling separate variables, append each successfully loaded dataframe to a list, then loop over that list for the downstream work.
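In loop form, that advice looks roughly like this, assuming paths and column schemas are defined as in the snippet above; the next section condenses it into the full pattern:

import os
import pandas as pd

tables = []
for p, s in zip([path_a, path_b, path_c], [colnames_a, colnames_b, colnames_c]):
    if os.path.exists(p):
        # Only files that exist ever produce a dataframe, so nothing
        # downstream can reference an unassigned variable.
        tables.append(pd.read_csv(p, header=None, names=s, sep=","))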

A cleaner pattern that avoids the trap

Instead of managing separate variables, load only the files that actually exist into a single list. Then compute the common columns across whatever was read, and concatenate only those aligned slices. This removes conditional gaps and keeps the flow linear and safe.

import os
import pandas as pd

paths = [path_a, path_b, path_c]
schemas = [colnames_a, colnames_b, colnames_c]

# Read only the files that exist; missing paths are simply skipped.
tables = [pd.read_csv(p, header=None, names=s, sep=",") for p, s in zip(paths, schemas) if os.path.exists(p)]

if tables:
    # Columns shared by every dataframe that was actually loaded.
    # Sets are unordered, so reorder or sort the columns if order matters.
    overlap = set.intersection(*(set(t.columns) for t in tables))
    result = pd.concat([t[list(overlap)] for t in tables], ignore_index=True)
else:
    result = pd.DataFrame()  # no input files at all

This approach eliminates UnboundLocalError by never referencing a dataframe that wasn’t created. The intersection is computed across the dataframes that actually exist, and an empty input naturally yields an empty result.
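As a quick sanity check of the empty-input case, here is a stripped-down variant with hypothetical filenames assumed not to exist:

import os
import pandas as pd

paths = ["reports_a.csv", "reports_b.csv"]  # hypothetical; assume neither exists
tables = [pd.read_csv(p) for p in paths if os.path.exists(p)]

result = pd.concat(tables, ignore_index=True) if tables else pd.DataFrame()
print(result.empty)  # True: nothing was read, and nothing crashed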

Why this detail matters

In data workflows that depend on variable file availability, defensive structure prevents fragile code paths. Reading into a list, deriving the intersection from what’s present, and concatenating in one step make the behavior predictable and easier to reason about. It also reduces boilerplate and repetition, which directly lowers the chance of subtle runtime errors.

Practical takeaways

Gate file reads by existence once, collect the loaded dataframes in a list, derive the common columns via a single set.intersection across that list, and handle the no-input case explicitly. The end result remains the same—a concatenated dataframe aligned on shared columns—without the risk of referencing variables that were never assigned.
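If you want the whole pattern in one place, it fits in a small helper. This is a sketch, and the name load_common_columns is only illustrative:

import os
import pandas as pd

def load_common_columns(paths, schemas):
    """Read the CSVs that exist, keep their shared columns, and stack them."""
    tables = [pd.read_csv(p, header=None, names=s, sep=",")
              for p, s in zip(paths, schemas)
              if os.path.exists(p)]
    if not tables:
        return pd.DataFrame()  # no inputs at all
    overlap = set.intersection(*(set(t.columns) for t in tables))
    return pd.concat([t[list(overlap)] for t in tables], ignore_index=True)

# Usage mirrors the earlier snippets:
# merged = load_common_columns([path_a, path_b, path_c],
#                              [colnames_a, colnames_b, colnames_c])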

With this structure in place, missing inputs become a first-class case in your pipeline rather than a surprise crash at runtime.