2025, Oct 23 13:00

Prevent pyodbc fast_executemany failures: normalize Pandas NaN and pd.NA to None for SQL Server batch inserts

pyodbc fast_executemany to SQL Server can fail with invalid float errors when NaN or pd.NA slip through. Convert missing values to None before inserts.

Batch inserts from Pandas to SQL Server are a great way to push large datasets quickly, until they suddenly aren’t. A common stumbling block appears when the data contains missing values in numeric columns: a batch with multiple rows fails, while a batch size of one slips through. The errors complain about invalid float values, and it looks like the driver is being overly picky. In reality, the issue is how missing values are represented on the Python side before they hit pyodbc.

Repro in a nutshell

The following minimal dataset shows the behavior. The integer-like column is made nullable via Pandas’ Int64 dtype, while other numeric columns are float with missing values represented as np.nan.

import pandas as pd
import numpy as np

sample = {
    "unique_id": [
        "String_3", "String_5", "String_10", "String_9", "String_4",
        "String_7", "String_2", "String_6", "String_1", "String_8"
    ],
    "entity_name": [
        "Alice", None, "Eve", "Alice", "Bob",
        "Alice", "Alice", None, "Charlie", "Eve"
    ],
    "entity_address": [
        "456 Elm St", "789 Oak St", "123 Main St", "456 Elm St", None,
        "456 Elm St", "123 Main St", None, "789 Oak St", "789 Oak St"
    ],
    "entity_age_code": [
        763.0, 349.0, np.nan, np.nan, 888.0,
        999.0, 711.0, 574.0, 963.0, 300.0
    ],
    "entity_height": [
        93.357616, 48.408745, 79.978718, 94.953377, 11.094891,
        np.nan, 33.282917, np.nan, 82.714043, np.nan
    ],
    "entity_weight": [
        19.158688, np.nan, 73.853124, 54.005774, 70.846664,
        70.996657, np.nan, 26.325328, 11.360588, 51.324372
    ]
}

frame = pd.DataFrame(sample)
frame["entity_age_code"] = cast_series_type(frame["entity_age_code"], "Int64")

The insert path builds a parameterized statement and uses pyodbc with fast_executemany. The problematic part is how the values are prepared per batch:

import sys

import pyodbc

pk_cols = ["unique_id"]
target_tbl = "DscEntityTable"
chunk_len = 3
cn_str = "xxxxxxx"
cnx = pyodbc.connect(cn_str)
cur = cnx.cursor()

frame = frame.drop_duplicates()
sql_types = deduce_mssql_types(frame)  # user-defined helper from the question: maps each column to a SQL Server type string

cur.execute(f"SELECT OBJECT_ID('{target_tbl}', 'U')")
exists = cur.fetchone()[0] is not None

if exists:
    key_pairs = list(frame[pk_cols].dropna().itertuples(index=False, name=None))
    prior = fetch_existing_keys_chunked(cur, target_tbl, pk_cols, key_pairs)  # user-defined helper: returns key values already present in the table
    prior = pd.DataFrame(prior, columns=pk_cols)
    for k in pk_cols:
        kind = str(frame[k].dtype)
        prior[k] = cast_series_type(prior[k], kind)
    frame = pd.merge(frame, prior, how="outer", indicator=True)
    frame = frame[frame["_merge"] == "left_only"].drop(columns="_merge")
    if frame.empty:
        sys.exit("No new records to insert.")
else:
    cols_sql = ", ".join([
        f"[{c}] {sql_types[c]} COLLATE Latin1_General_100_CI_AS_SC_UTF8" if sql_types[c].startswith("VARCHAR") else f"[{c}] {sql_types[c]}"
        for c in frame.columns
    ])
    pk_sql = ", ".join(f"[{c}]" for c in pk_cols)
    create_stmt = f"CREATE TABLE {target_tbl} ({cols_sql}, PRIMARY KEY ({pk_sql}))"
    cur.execute(create_stmt)
    cnx.commit()

cols = list(frame.columns)
placeholders = ", ".join(["?"] * len(cols))
insert_stmt = f"INSERT INTO {target_tbl} ({', '.join(cols)}) VALUES ({placeholders})"

for idx, start in enumerate(range(0, len(frame), chunk_len)):
    block = frame.iloc[start:start + chunk_len]
    cur.fast_executemany = True

    # Problematic conversion: sometimes still leaves np.nan in the payload
    # (DataFrame.map is the element-wise map; use applymap on Pandas < 2.1)
    payload = block.map(lambda v: None if pd.isna(v) else v).values.tolist()

    try:
        cur.executemany(insert_stmt, payload)
        cnx.commit()
    except Exception as exc:
        # Surface the driver error instead of silently swallowing it
        print(f"Batch starting at row {start} failed: {exc}")

cur.close()
cnx.close()

When chunk_len is greater than one, some batches fail with errors like:

The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter N: The supplied value is not a valid instance of data type float.

Or:

Fractional truncation (0)

What’s actually wrong

Two facts drive the failure. First, pyodbc fast_executemany expects SQL NULL to be represented by Python’s None. It does not treat np.nan as NULL. Second, while pd.isna() correctly detects np.nan, the combination of DataFrame.map over mixed dtypes followed by .values.tolist() does not always yield None in place of every np.nan. The end result is that some batches still contain np.nan for float columns when they are sent to SQL Server, and the driver rejects them as “not a valid instance of data type float.”
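
A quick check makes the first point concrete: np.nan is an ordinary Python float, so pyodbc has no reason to interpret it as NULL.

import numpy as np
import pandas as pd

print(type(np.nan))     # <class 'float'> - a real float, not a NULL marker
print(np.nan is None)   # False - it is sent to the driver as a float value
print(pd.isna(np.nan))  # True - Pandas detects it as missing, but that alone
                        # does not change what pyodbc receives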

There’s a related nuance with nullable integers: Pandas’ Int64 dtype uses pd.NA to represent missing values, not np.nan. So both placeholders for missingness can appear across your DataFrame simultaneously and need to be normalized before the batch is handed to pyodbc.
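
With the sample frame from the repro (before any filtering), both markers are present at the same time:

print(frame["entity_age_code"].iloc[2])  # <NA>  - pd.NA from the nullable Int64 column
print(frame["entity_height"].iloc[5])    # nan   - np.nan from the float column
print(pd.NA is np.nan)                   # False - two distinct missing-value markers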

As one observer put it, this isn’t about a mystical sensitivity to batch size; it’s about the values inside the rows. A single-row insert might pass simply because that particular row happens not to contain np.nan, while a multi-row batch that includes one will fail on its float parameters.

The fix: normalize all missing values to Python None

The simplest and most robust approach is to replace the missing-value markers directly on the batch DataFrame before converting it to a list of lists. Explicitly map both np.nan and pd.NA to None so pyodbc can emit proper SQL NULL:

records = block.replace({np.nan: None, pd.NA: None}).values.tolist()

This targets float np.nan values (like those in entity_height and entity_weight) and pd.NA values used by nullable integer or object dtypes (like entity_age_code after switching to Int64). Doing the replacement at the DataFrame level avoids the inconsistent outcomes seen with map on heterogeneous columns.
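
As a spot check, assuming the repro frame above and a recent Pandas version, no NaN or NA marker should survive the replacement in a batch:

block = frame.iloc[0:chunk_len]
records = block.replace({np.nan: None, pd.NA: None}).values.tolist()

# No float NaN and no pd.NA should remain; missing values are plain Python None
assert not any(isinstance(v, float) and v != v for row in records for v in row)
assert not any(v is pd.NA for row in records for v in row)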

Corrected batch insert loop

Below is the same loop with the single change applied to how rows are materialized:

cols = list(frame.columns)
placeholders = ", ".join(["?"] * len(cols))
insert_stmt = f"INSERT INTO {target_tbl} ({', '.join(cols)}) VALUES ({placeholders})"

for idx, start in enumerate(range(0, len(frame), chunk_len)):
    block = frame.iloc[start:start + chunk_len]
    cur.fast_executemany = True

    # Normalize all missing values to Python None before executemany
    records = block.replace({np.nan: None, pd.NA: None}).values.tolist()

    try:
        cur.executemany(insert_stmt, records)
        cnx.commit()
    except Exception as exc:
        print(f"Batch starting at row {start} failed: {exc}")

Why it’s worth knowing

When you push millions of rows, small inconsistencies in how missing data is represented can have an outsized impact. With fast_executemany, the driver is strict about types and nulls; letting np.nan leak into float parameters leads to hard failures that look intermittent if you vary batch sizes. Normalizing to None at the DataFrame level removes that variability and makes inserts predictable and performant.

Wrap-up

If your batch inserts fail with messages about invalid float values or fractional truncation, and your data contains missing values, ensure every NaN and NA is converted to Python None before calling executemany. Pandas’ replace with {np.nan: None, pd.NA: None} applied on the batch is a concise way to achieve that. Keep nullable Int64 columns as Int64, but remember they use pd.NA under the hood, so include that mapping alongside np.nan. Once all nulls are normalized to None, fast, reliable bulk inserts follow.

The article is based on a question from StackOverflow by Abhishek Sourabh and an answer by jei.