2025, Dec 06 01:00

Faster PostgreSQL ingestion from pandas DataFrames: use COPY, staging tables, and ON CONFLICT upserts instead of df.to_sql chunking

Learn how to load pandas DataFrames into PostgreSQL the right way: COPY for bulk inserts, staging tables, and ON CONFLICT upserts. Skip df.to_sql chunking.

If you routinely push a pandas DataFrame into PostgreSQL and start splitting operations to keep an aging server alive, you are not alone. A common pattern is to chunk df.to_sql() uploads and to probe for existing rows using a massive WHERE chain on a primary key column. That feels safe, but it burns CPU cycles you don’t need to spend. PostgreSQL can handle bulk ingestion and conflict resolution much more efficiently if you lean on the right primitives.

What the problematic pattern looks like

One frequent bottleneck is a pre-check for existing rows using an ever-growing disjunction on a primary key of type timestamp. It resembles this:

SELECT ts_col FROM audit_log
WHERE ts_col = ts_row1_df
   OR ts_col = ts_row2_df
   ...
   OR ts_col = ts_row3500_df
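
On the Python side, that probe is often assembled by string-building the OR chain from the DataFrame's key column and then filtering locally. A sketch of the antipattern only, with hypothetical names (df, engine, ts_col):

import pandas as pd

# NOT recommended: builds one huge OR chain per batch and filters client-side
existing = pd.read_sql(
    "SELECT ts_col FROM audit_log WHERE "
    + " OR ".join(f"ts_col = '{ts}'" for ts in df["ts_col"]),
    engine,  # SQLAlchemy engine, assumed to exist already
)
new_rows = df[~df["ts_col"].isin(existing["ts_col"])]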

You might also try to lower the CPU impact of the write path by chunking df.to_sql() uploads, for example with chunks of 150 rows. Chunking feels intuitive, but it’s not the most effective lever for PostgreSQL ingestion.
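
For reference, the chunked write being replaced typically looks like this (the table name and SQLAlchemy engine are illustrative):

df.to_sql("audit_log", engine, if_exists="append", index=False, chunksize=150)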

Why this hurts

Breaking existence checks into many small queries forces the database to repeat work. Each subset triggers another scan and evaluation of a long OR chain, which adds overhead without improving the actual logic. The database is designed to resolve “already present vs new” more directly, provided you declare the correct key and let the engine enforce it. For modest volumes like a few thousand rows, PostgreSQL can resolve conflicts very quickly if you hand it the data in a set-based way rather than row-by-row checks from Python.

On the write side, chunking df.to_sql() is convenient but not the optimal way to reduce CPU on the server. PostgreSQL has a native bulk-loading mechanism that is both faster and lighter on resources for large inserts.

The better approach: COPY and let the database handle conflicts

The recommended ingestion path is COPY. It moves data into PostgreSQL with minimal overhead, provided your DataFrame is sanitized and its column order and names match the target table.

COPY core_events FROM '/app/batch_records.csv' CSV HEADER;

Prepare your DataFrame in Python, align the schema, and write out a CSV that COPY can consume. Do the data validation and sanitization on the DataFrame before ingestion.
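
A rough sketch of that preparation step, assuming a DataFrame df and hypothetical column names that mirror core_events:

# df is assumed to exist; reorder and rename so columns mirror core_events exactly
df = df[["event_ts", "event_name", "payload"]]            # hypothetical column names
df = df.dropna(subset=["event_ts"]).drop_duplicates(subset=["event_ts"])
df.to_csv("/app/batch_records.csv", index=False)          # header line is skipped by CSV HEADER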

Streaming rows with psycopg

You can also feed rows programmatically using psycopg’s copy interface. As documented:

Using a copy operation you can load data into the database from any Python iterable (a list of tuples, or any iterable of sequences): the Python values are adapted as they would be in normal querying. To perform such operation use a COPY ... FROM STDIN with Cursor.copy() and use write_row() on the resulting object in a with block. On exiting the block the operation will be concluded. If an exception is raised inside the block, the operation is interrupted and the records inserted so far are discarded.

payloads = [(10, 20, "hello"), (40, None, "world")]
with db_cur.copy("COPY demo_sink (c1, c2, c3) FROM STDIN") as cp:
    for item in payloads:
        cp.write_row(item)
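
The same interface can stream DataFrame rows directly; a small sketch, assuming df's columns are already ordered to match demo_sink:

with db_cur.copy("COPY demo_sink (c1, c2, c3) FROM STDIN") as cp:
    for row in df.itertuples(index=False, name=None):  # plain tuples, one per DataFrame row
        cp.write_row(row)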

If you work with psycopg2, use copy_expert for COPY. One practical way to stream a DataFrame is to serialize it in-memory to CSV and pass the buffer to COPY:

from io import StringIO
import csv

csv_buf = StringIO()
writer = csv.writer(csv_buf)
writer.writerows(df_values)    # df_values: the DataFrame rows as sequences, e.g. df.to_numpy().tolist()
csv_buf.seek(0)                # rewind so COPY reads the buffer from the start
# copy_sql: a "COPY target_table (col1, ...) FROM STDIN WITH CSV" statement for your table
cur.copy_expert(sql=copy_sql, file=csv_buf)

Looping over a large DataFrame in Python and inserting one row at a time is not recommended; it also wastes a lot of processing time on both the client and the server.

Handling duplicates: staging table and ON CONFLICT

If your incoming batch may collide with existing rows, load into a temporary staging table first, then perform a single set-based insert with ON CONFLICT to upsert only the affected rows. This avoids the expensive “does it exist?” probing from Python.

sql_block = """
CREATE TEMPORARY TABLE stage_load (
    ts_key    timestamp,   -- column types are illustrative; match prod_main's schema
    full_name text,
    details   text
);
COPY stage_load FROM STDIN WITH CSV;
INSERT INTO prod_main (ts_key, full_name, details)
SELECT ts_key, full_name, details
FROM stage_load
ON CONFLICT (ts_key)
DO UPDATE SET full_name = EXCLUDED.full_name, details = EXCLUDED.details;
DROP TABLE stage_load;
"""

Remember that the ON CONFLICT clause relies on indexing: conflict detection needs a unique index or constraint on the conflict target (here, ts_key on prod_main), and it must be in place when the upsert runs.
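
If prod_main does not already declare ts_key as its primary key, one way to provide that arbiter is a unique index; a one-line sketch with a hypothetical index name:

CREATE UNIQUE INDEX IF NOT EXISTS prod_main_ts_key_idx ON prod_main (ts_key);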

When indexes slow you down, and when you can disable them

Indexes can dominate cost during massive inserts because the database updates each index entry for every row. In cases where the staging data is guaranteed not to conflict with the destination, it can be faster to temporarily disable indexes during the insert and rebuild them afterward. As a practitioner summarized:

When you run a large query (insert/update) on a huge table with several indexes, these indexes can seriously slow the query execution. With PostgreSQL it can be much faster to disable the indexes before running the query and reindex the whole table afterwards.

And from an experience report:

The problem was with indexes. The history table had 160M indexed rows. By running either COPY FROM or INSERT INTO ... SELECT it was taking a lot of time not to insert rows, but to update indexes. When I disabled the indexes, it imported 3M rows in 10 seconds. Now I need to find a faster way of reindexing the big table.

Note the trade-off: you cannot use INSERT ... ON CONFLICT while the relevant indexes are disabled, because conflict detection depends on them. Keep the unique index in place when you perform the upsert from the staging table to the main table.
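
PostgreSQL does not offer a per-statement switch to disable a regular index, so in practice "disabling" means dropping the index before the conflict-free bulk load and recreating it afterwards. A rough sketch with a hypothetical index name:

DROP INDEX IF EXISTS prod_main_details_idx;
-- ... run the bulk COPY / INSERT INTO ... SELECT here ...
CREATE INDEX prod_main_details_idx ON prod_main (details);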

Answering the two core questions

First, using df.to_sql() with chunksize is not the best method to reduce CPU load on the PostgreSQL server. Prefer COPY FROM for ingestion; it is the recommended path for loading large datasets quickly, provided you sanitize your DataFrame and align column names and order with the table.

Second, splitting a massive WHERE clause into multiple smaller queries generally increases work. Instead of probing for existence from Python, rely on the database to handle conflicts. Define the appropriate primary/alternate keys and let PostgreSQL enforce them with ON CONFLICT. If you have conflicts, use a staging table plus INSERT ... ON CONFLICT DO UPDATE. If you do not have conflicts, consider disabling indexes during the bulk insert and reindex afterward to minimize overhead.

Why this matters

Shifting from Python-driven per-row logic and many small queries to set-based operations keeps CPU usage in check and reduces the risk of lockups on modest hardware. COPY-based ingestion is designed for throughput. Conflict handling with ON CONFLICT is what PostgreSQL does well when keys are defined properly. These patterns are more resilient, simpler to reason about, and easier to operate.

Practical wrap-up

Prepare and clean your DataFrame first, matching the PostgreSQL schema exactly. Load with COPY FROM to get near-instant ingestion. If you need upserts, copy into a temporary table and perform a single INSERT ... ON CONFLICT DO UPDATE into the main table with the required index enabled. If the batch is guaranteed to be conflict-free, you can improve insert speed by temporarily disabling indexes and reindexing after the load. Avoid Python loops over large DataFrames and skip ad hoc existence checks from the client; let PostgreSQL handle that logic in one pass.