2025, Dec 14 15:00

Stop Writing DDL: Load Feather Files to PostgreSQL with pandas to_sql and SQLAlchemy

Learn how to skip manual DDL: use pandas DataFrame.to_sql and SQLAlchemy to create PostgreSQL tables and bulk load data from Feather files quickly and reliably.

When you have a directory full of Feather files with dozens of columns and hundreds of thousands of rows, handcrafting CREATE TABLE statements for each file is both tedious and error-prone. The goal is obvious: generate tables and load the data without manually defining every column and its type over and over.

What the initial approach looks like

Below is a compact example that scans a folder, reads each .feather file into a DataFrame, infers PostgreSQL types from pandas dtypes, and assembles a CREATE TABLE statement. It mirrors the common impulse to build DDL programmatically and then figure out how to bulk insert the data.

import os
import glob
import pandas as pd

base_dir = "E:\\file_path"
feather_items = glob.glob(os.path.join(base_dir, '*.feather'))

def pd_kind_to_pg(kind):
    # Map a pandas dtype to a rough PostgreSQL column type
    if pd.api.types.is_integer_dtype(kind):
        return "INTEGER"
    elif pd.api.types.is_float_dtype(kind):
        return "FLOAT"
    elif pd.api.types.is_bool_dtype(kind):
        return "BOOLEAN"
    elif pd.api.types.is_datetime64_any_dtype(kind):
        return "TIMESTAMP"
    else:
        return "TEXT"

for asset_path in feather_items:
    print(f"File: {asset_path}")
    frame = pd.read_feather(asset_path)
    # Use the file name (without extension) as the table name, not the full path
    rel_name = os.path.splitext(os.path.basename(asset_path))[0]
    type_map = frame.dtypes.to_dict()
    sql_parts = []
    for col_name, kind in type_map.items():
        pg_kind = pd_kind_to_pg(kind)
        sql_parts.append(f'"{col_name}" {pg_kind}')
    # Column definitions are comma-separated inside the parentheses
    ddl_stmt = f'CREATE TABLE "{rel_name}" ({", ".join(sql_parts)});'
    print(ddl_stmt)

Why this becomes a roadblock

Even if you successfully generate DDL, you still need a reliable way to push actual rows into the database with the same schema. Doing it column by column quickly turns into an unwieldy insertion routine. That’s exactly where many pipelines stall: the table definition is there, but the bulk load is missing.
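To make the gap concrete, here is a minimal sketch of the kind of hand-rolled insert routine that tends to follow. It assumes psycopg2 plus the ddl_stmt, rel_name, and frame variables from the script above; it is not part of the original approach, only an illustration of the boilerplate that still needs writing.

import psycopg2

# Hypothetical continuation of the DDL script: create the table, then push rows in batches
conn = psycopg2.connect(host="localhost", dbname="database", user="user", password="pass")
with conn, conn.cursor() as cur:
    cur.execute(ddl_stmt)
    cols = ", ".join(f'"{c}"' for c in frame.columns)
    placeholders = ", ".join(["%s"] * len(frame.columns))
    insert_stmt = f'INSERT INTO "{rel_name}" ({cols}) VALUES ({placeholders})'
    # executemany sends rows one at a time under the hood, so it crawls on large files
    cur.executemany(insert_stmt, frame.to_numpy().tolist())
conn.close()

Every new file means keeping this statement, the column list, and the batching logic in sync with the generated DDL, which is exactly the upkeep the next section avoids.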

The straightforward path: let pandas write the table and the data

pandas provides a built-in method to store a DataFrame in a SQL database. In practice, this lets you avoid hand-rolled DDL and separate insert logic for each file. The DataFrame drives the schema and the data load in one call.

from sqlalchemy import create_engine
import pandas as pd
import os
import glob

root_dir = "E:\\file_path"
feathers = glob.glob(os.path.join(root_dir, '*.feather'))

# Create the engine once and reuse it for every file
engine_obj = create_engine("postgresql+psycopg2://user:pass@localhost/database")

for asset in feathers:
    print(f"File: {asset}")
    frame = pd.read_feather(asset)
    # Table name comes from the file name, not the full path
    dest_name = os.path.splitext(os.path.basename(asset))[0]
    # to_sql creates the table from the DataFrame's dtypes and inserts all rows
    frame.to_sql(name=dest_name, con=engine_obj)

This uses the SQLAlchemy engine as the database connection and writes each DataFrame directly to a table. The table name is derived from the file name, so each Feather file lands in its own table.
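In practice you will usually want a bit more control over that call. The variant below is a sketch using documented to_sql options; frame, dest_name, and engine_obj are the names from the loop above, and the column names in the dtype override are hypothetical placeholders for your own columns.

from sqlalchemy.types import TIMESTAMP, Text

frame.to_sql(
    name=dest_name,
    con=engine_obj,
    if_exists="replace",   # recreate the table on reruns; use "append" to add rows instead
    index=False,           # skip writing the DataFrame index as an extra column
    chunksize=10_000,      # insert in batches rather than all rows at once
    method="multi",        # pack multiple rows into each INSERT for fewer round trips
    dtype={"event_time": TIMESTAMP, "comment": Text},  # hypothetical columns: force explicit types
)

Note that if_exists="replace" drops and recreates the table, so use it deliberately; the default ("fail") raises an error when the table already exists.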

Why it matters

Replacing manual DDL generation and ad-hoc insert loops with a single call simplifies maintenance and reduces the chance of schema mismatches. When you are iterating through many files with large column sets, cutting down on bespoke SQL is a practical win that keeps the workflow consistent from file discovery to data load.

Takeaways

If you are already using pandas to read Feather files, lean on DataFrame.to_sql to handle both schema materialization and data insertion. It saves time compared to writing CREATE TABLE statements and separate insert logic for each file and keeps your pipeline focused on data rather than boilerplate SQL.