2025, Oct 19 13:00

How to Store NumPy Arrays in PostgreSQL: choose tobytes + shape/dtype or msgpack-numpy

Learn practical ways to persist NumPy arrays in PostgreSQL: tobytes with separate shape and dtype columns, or self-describing msgpack-numpy blobs. Compare speed, size, and simplicity, and weigh the trade-offs.

Persisting NumPy arrays in PostgreSQL sounds straightforward until you hit the wall of metadata: raw bytes don’t carry shape or dtype. A common workaround is to serialize with NumPy’s tobytes and stash shape and dtype in separate columns. It works, but the friction adds up once you repeat it across multiple tables and call sites.

Baseline: storing arrays as raw bytes

The approach below uses psycopg, numpy.tobytes, and a whitelist of target tables. It inserts or updates, and reconstructs arrays on read by combining the stored bytes with saved shape and dtype.
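
The original post doesn’t show the table definitions, but the queries imply a layout roughly like the sketch below. The column names match the code; the key type, and the choice of bytea for np_array_shape (orjson.dumps returns bytes), are assumptions on my part.

import os
import psycopg

# Hypothetical DDL matching the columns the functions below expect; array_id needs a
# unique constraint so that ON CONFLICT (array_id) can upsert.
DDL = """
CREATE TABLE IF NOT EXISTS table_a (
    array_id       text PRIMARY KEY,
    np_array_bytes bytea NOT NULL,   -- raw buffer from arr.tobytes()
    np_array_shape bytea,            -- JSON-encoded shape (orjson.dumps returns bytes)
    np_array_dtype text              -- dtype string such as 'float64'
);
"""

with psycopg.connect(dbname=os.getenv("PG_DATABASE"), user=os.getenv("PG_USERNAME"),
                     password=os.getenv("PG_PASSWORD"), host=os.getenv("PG_HOST")) as cxn:
    cxn.execute(DDL)

With a table like that in place, the write and read functions look like this: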

import os

import numpy as np
import orjson
import psycopg
from psycopg import sql

# Connection options are read from the environment.
pg_conn_opts = {
    'dbname': os.getenv("PG_DATABASE"),
    'user': os.getenv("PG_USERNAME"),
    'password': os.getenv("PG_PASSWORD"),
    'host': os.getenv("PG_HOST"),
    'connect_timeout': os.getenv("PG_CONNECT_TIMEOUT"),
}


def persist_ndarray(arr, arr_key, target_table):
    # Only write to an explicit allowlist of tables, since the table name is
    # interpolated into the query as an identifier.
    if target_table in ['table_a', 'table_b', 'table_c']:
        with psycopg.connect(**pg_conn_opts) as cxn:
            with cxn.cursor() as cur:
                try:
                    # Upsert the raw buffer plus the metadata needed to rebuild the array.
                    cur.execute(sql.SQL(
                        """
                        INSERT INTO {tbl}(array_id, np_array_bytes, np_array_shape, np_array_dtype)
                        VALUES (%s, %s, %s, %s)
                        ON CONFLICT (array_id) DO UPDATE SET
                        np_array_bytes = EXCLUDED.np_array_bytes,
                        np_array_shape = EXCLUDED.np_array_shape,
                        np_array_dtype = EXCLUDED.np_array_dtype
                        """
                    ).format(tbl=sql.Identifier(target_table)),
                    [arr_key, arr.tobytes(), orjson.dumps(arr.shape), str(arr.dtype)])
                    cxn.commit()
                except Exception as exc:
                    # Report the actual error instead of swallowing it with a bare except.
                    print(f"Error while saving NumPy array to database: {exc}")
                # No explicit close() is needed: the connection context manager handles it.
    else:
        print("You're attempting to write arrays to a non-approved table.")


def fetch_ndarray(arr_key, target_table):
    if target_table in ['table_a', 'table_b', 'table_c']:
        with psycopg.connect(**pg_conn_opts) as cxn:
            with cxn.cursor() as cur:
                cur.execute(sql.SQL(
                    """
                    SELECT np_array_bytes, np_array_shape, np_array_dtype
                    FROM {tbl}
                    WHERE array_id = %s
                    """
                ).format(tbl=sql.Identifier(target_table)), [arr_key])
                row = cur.fetchone()
                if row:
                    blob, shp, dt = row
                    # Rebuild the array from its parts. frombuffer returns a read-only
                    # view, so call .copy() on the result if you need to mutate it.
                    return np.frombuffer(blob, dtype=dt).reshape(orjson.loads(shp))
    return None

What makes this feel heavier than it should

tobytes is fast, but raw bytes know nothing about the array they came from. That forces you to track shape and dtype separately, wire them through SQL, and reconstruct manually on read. It’s reliable but verbose, and every call site repeats the same choreography. If you need to persist arrays in several tables, the overhead grows with each insert and select.
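
The metadata problem is easy to demonstrate without a database. A minimal sketch, using nothing beyond NumPy, of what happens when the dtype and shape are not stored:

import numpy as np

# A small array whose dtype and shape must be known to round-trip correctly.
arr = np.arange(12, dtype=np.int32).reshape(3, 4)
blob = arr.tobytes()                                 # 48 raw bytes, no metadata attached

# Without the saved metadata, frombuffer falls back to float64 and a flat shape:
wrong = np.frombuffer(blob)                          # shape (6,), dtype float64 -- garbage values

# With the stored dtype and shape, the original array comes back exactly:
restored = np.frombuffer(blob, dtype=np.int32).reshape(3, 4)
assert (restored == arr).all()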

A simpler alternative: msgpack-numpy

msgpack-numpy can serialize a NumPy array as a compact blob that already includes shape and dtype. That means you can persist a single binary value and decode it back into the exact array without juggling extra columns. As a data point:

It is slightly slower than arr.tobytes(), but makes up for it in convenience by storing shape and dtype alongside the data. It is also smaller and faster than NumPy's native np.save() format.

There is also a practical observation from testing with large arrays:

tobytes is roughly 2x faster at packing arrays of around a million elements, and dramatically faster (on the order of 1000–2000x) at unpacking. Both are fast enough, though, that the difference rarely matters unless the arrays are huge.

In other words, if you want convenience with minimal overhead, msgpack-numpy is a strong fit. If you’re chasing absolute speed on very large arrays, the raw tobytes route can still be compelling.
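
Those figures come from the linked answer; if you want to sanity-check them on your own hardware, a rough timeit comparison along these lines works (array size, dtype, and machine will all shift the numbers):

import timeit

import msgpack
import msgpack_numpy as mpn
import numpy as np
import orjson

arr = np.random.default_rng(0).random(1_000_000)  # ~8 MB of float64

def pack_tobytes():
    return arr.tobytes(), orjson.dumps(arr.shape), str(arr.dtype)

def unpack_tobytes(blob, shp, dt):
    return np.frombuffer(blob, dtype=dt).reshape(orjson.loads(shp))

def pack_msgpack():
    return msgpack.packb(arr, default=mpn.encode)

def unpack_msgpack(payload):
    return msgpack.unpackb(payload, object_hook=mpn.decode)

blob, shp, dt = pack_tobytes()
payload = pack_msgpack()

print("pack   tobytes:", timeit.timeit(pack_tobytes, number=100))
print("pack   msgpack:", timeit.timeit(pack_msgpack, number=100))
print("unpack tobytes:", timeit.timeit(lambda: unpack_tobytes(blob, shp, dt), number=100))
print("unpack msgpack:", timeit.timeit(lambda: unpack_msgpack(payload), number=100))
print("payload sizes :", len(blob), "bytes raw vs", len(payload), "bytes packed")

The large gap on the unpack side is mostly because frombuffer can wrap the existing buffer without copying, while msgpack has to parse the payload before handing the array back.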

Revised code: packing and unpacking with msgpack-numpy

The snippet below shows how to store and load using msgpack-numpy while keeping the database interaction model the same. The bytes go into np_array_bytes; shape and dtype are still written as before, though the packed payload already contains them.

import os

import msgpack
import msgpack_numpy as mpn
import numpy as np
import orjson
import psycopg
from psycopg import sql

# Connection options are read from the environment, as before.
pg_conn_opts = {
    'dbname': os.getenv("PG_DATABASE"),
    'user': os.getenv("PG_USERNAME"),
    'password': os.getenv("PG_PASSWORD"),
    'host': os.getenv("PG_HOST"),
    'connect_timeout': os.getenv("PG_CONNECT_TIMEOUT"),
}


def persist_ndarray_msgpack(arr, arr_key, target_table):
    if target_table in ['table_a', 'table_b', 'table_c']:
        with psycopg.connect(**pg_conn_opts) as cxn:
            with cxn.cursor() as cur:
                try:
                    # msgpack-numpy embeds shape and dtype in the packed payload.
                    packed = msgpack.packb(arr, default=mpn.encode)
                    # Shape and dtype are still written to keep the existing schema,
                    # even though they are now redundant with the packed payload.
                    cur.execute(sql.SQL(
                        """
                        INSERT INTO {tbl}(array_id, np_array_bytes, np_array_shape, np_array_dtype)
                        VALUES (%s, %s, %s, %s)
                        ON CONFLICT (array_id) DO UPDATE SET
                        np_array_bytes = EXCLUDED.np_array_bytes,
                        np_array_shape = EXCLUDED.np_array_shape,
                        np_array_dtype = EXCLUDED.np_array_dtype
                        """
                    ).format(tbl=sql.Identifier(target_table)),
                    [arr_key, packed, orjson.dumps(arr.shape), str(arr.dtype)])
                    cxn.commit()
                except Exception as exc:
                    print(f"Failed to persist NumPy array via msgpack: {exc}")
                # The connection context manager handles close(); no finally block needed.
    else:
        print("Refusing to write to unexpected table.")


def fetch_ndarray_msgpack(arr_key, target_table):
    if target_table in ['table_a', 'table_b', 'table_c']:
        with psycopg.connect(**pg_conn_opts) as cxn:
            with cxn.cursor() as cur:
                cur.execute(sql.SQL(
                    """
                    SELECT np_array_bytes, np_array_shape, np_array_dtype
                    FROM {tbl}
                    WHERE array_id = %s
                    """
                ).format(tbl=sql.Identifier(target_table)), [arr_key])
                row = cur.fetchone()
                if row:
                    packed, shp, dt = row
                    # The stored shape and dtype are ignored here: the packed blob
                    # already carries them, so unpacking restores the full array.
                    return msgpack.unpackb(packed, object_hook=mpn.decode)
    return None

This keeps your data model untouched and removes the need to coordinate shape and dtype on the read path, since they come back embedded in the payload. If you control the schema, you can simplify further by storing only a single binary column.
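
As a sketch of that simplification, the whole round trip collapses to one binary value. The array_store table and payload column below are hypothetical, not part of the original schema:

import msgpack
import msgpack_numpy as mpn
import psycopg
from psycopg import sql

# Hypothetical single-payload table:
#   CREATE TABLE array_store (array_id text PRIMARY KEY, payload bytea NOT NULL);

def save_array(cxn, arr, arr_key, table="array_store"):
    # The packed blob already carries shape and dtype, so nothing else is stored.
    cxn.execute(sql.SQL(
        "INSERT INTO {tbl} (array_id, payload) VALUES (%s, %s) "
        "ON CONFLICT (array_id) DO UPDATE SET payload = EXCLUDED.payload"
    ).format(tbl=sql.Identifier(table)),
    [arr_key, msgpack.packb(arr, default=mpn.encode)])

def load_array(cxn, arr_key, table="array_store"):
    row = cxn.execute(sql.SQL(
        "SELECT payload FROM {tbl} WHERE array_id = %s"
    ).format(tbl=sql.Identifier(table)), [arr_key]).fetchone()
    return msgpack.unpackb(row[0], object_hook=mpn.decode) if row else None

These helpers take an open connection so that connection handling and the table allowlist from the earlier functions can stay wherever you already manage them.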

Why this matters

Choosing the right serialization path determines how much glue code you will maintain. Raw bytes push that complexity into your application with extra metadata handling; msgpack-numpy centralizes it in the serialization step. The trade-off is one of performance nuance: tobytes can be faster, especially when decoding very large arrays, while msgpack-numpy optimizes for ergonomics and is smaller and faster than np.save.

Takeaways

If your priority is a clean and maintainable pipeline across multiple PostgreSQL tables, msgpack-numpy is a practical option that carries shape and dtype for you. If you are bandwidth-constrained and need the absolute fastest encode/decode on very large arrays, continuing with tobytes plus explicit shape and dtype is a valid choice. Either way, keep table access constrained to known destinations and be consistent about dtype handling so that what you store is exactly what you read.

The article is based on a question from StackOverflow by hmmmmmmasdmakjd and an answer by Nick ODell.