2025, Dec 09 15:00
Store a pandas DataFrame with variable-length (ragged) arrays in HDF5 using the h5py vlen dtype
Persist a pandas DataFrame with ragged arrays to HDF5 using the h5py variable-length dtype: an interoperable, column-wise layout with no pandas table-format errors.
Persisting a pandas DataFrame where one column holds arrays of different lengths sounds simple until you hit HDF5. The common routes either lock you into a pandas-only layout or crash on ragged shapes. Below is a concise guide to store such data in a reusable HDF5 file without giving up the variable-length nature of your arrays.
Problem setup
You have a DataFrame with one column containing variable-length NumPy arrays and another column of regular numeric values. A direct attempt to save this in an interoperable way via pandas or plain h5py fails.
import numpy as np
import pandas as pd

payload = {
    'alpha': [np.array([1., 2.]), np.array([6., 7., .6]), np.array([np.nan])],
    'beta': np.array([99., 66., 88.])
}
frame = pd.DataFrame(payload)
# Pandas HDFStore, table format
frame.to_hdf('store.h5', mode='w', key='tbl', format='table')
This raises an error because the array column has object dtype with mixed content.
TypeError: Cannot serialize the column [alpha] because its data contents are not [string] but [mixed] object dtype
Switching to a naive h5py write with a list of rows also fails, because HDF5 datasets expect rectangular shapes by default.
import h5py

with h5py.File('store.h5', 'w') as f:
    f.create_dataset('tbl', data=frame.to_numpy().tolist())
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (3, 2) + inhomogeneous part.
What’s actually going on
HDF5 datasets are not ragged by default. Pandas “table” format wants homogeneously typed, columnar data, and an object-dtype column that holds NumPy arrays of varying length does not satisfy that requirement. Meanwhile, writing the entire 2D representation with h5py forces HDF5 to infer a single rectangular shape, which breaks because one column contains sequences of different sizes. The “fixed” pandas format does write, but the result is tailored to pandas internals and is not intended as a generic layout for reuse elsewhere.
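To make that last point concrete, here is a short sketch of the "fixed" route: pandas does write the object column (falling back to pickling it, and emitting a PerformanceWarning along the way), which is precisely why the result only round-trips through pandas itself.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'alpha': [np.array([1., 2.]), np.array([6., 7., .6]), np.array([np.nan])],
    'beta': np.array([99., 66., 88.]),
})

# "fixed" succeeds where "table" failed, but the ragged column is stored
# as pickled Python objects, readable only back through pandas.
frame.to_hdf('fixed_store.h5', mode='w', key='tbl', format='fixed')
roundtrip = pd.read_hdf('fixed_store.h5', key='tbl')
```

Opening fixed_store.h5 with another tool (h5py, C, MATLAB) would expose pandas' internal block layout rather than the two logical columns, which is what rules this route out for interoperable storage.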
The solution with variable-length dtype
HDF5 supports variable-length elements via a dedicated dtype. Using h5py, you can declare a variable-length float dtype and store the column with arrays as a dataset where each element is its own 1D float array. The other column remains a regular numeric dataset.
# build the same data
import h5py
import numpy as np
import pandas as pd

payload = {
    'alpha': [np.array([1., 2.]), np.array([6., 7., .6]), np.array([np.nan])],
    'beta': np.array([99., 66., 88.])
}
frame = pd.DataFrame(payload)

# write to HDF5 with a variable-length dtype for the ragged column
varlen_f64 = h5py.vlen_dtype(np.dtype('float64'))
with h5py.File('archive.h5', mode='w') as store:
    store.create_dataset('alpha', data=frame['alpha'], dtype=varlen_f64)
    store['beta'] = frame['beta']
This keeps the arrays in the first column as variable-length float vectors and writes the second column as a standard numeric dataset.
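To confirm the layout really is reusable, the datasets can be read back with plain h5py and reassembled into a DataFrame. This sketch rewrites the same example file first so it is self-contained; the file and column names match the example above.

```python
import h5py
import numpy as np
import pandas as pd

# Write the example file (same layout as above).
frame = pd.DataFrame({
    'alpha': [np.array([1., 2.]), np.array([6., 7., .6]), np.array([np.nan])],
    'beta': np.array([99., 66., 88.]),
})
varlen_f64 = h5py.vlen_dtype(np.dtype('float64'))
with h5py.File('archive.h5', 'w') as store:
    store.create_dataset('alpha', data=frame['alpha'], dtype=varlen_f64)
    store['beta'] = frame['beta']

# Read it back: each 'alpha' element comes out as its own 1D float64 array.
with h5py.File('archive.h5', 'r') as store:
    restored = pd.DataFrame({
        'alpha': list(store['alpha'][:]),  # object column of ragged arrays
        'beta': store['beta'][:],
    })
```

Because the file holds two plain datasets, the same read works from any HDF5-aware tool, not just Python.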
Why this matters
Choosing the right HDF5 representation is crucial when your DataFrame mixes regular columns with ragged arrays. The pandas “table” format is oriented toward uniform, serializable columns and will reject mixed object-dtype data. Dumping the entire 2D structure in one shot also fails because HDF5 attempts to make it rectangular. Declaring a variable-length dtype aligns the in-memory structure with an HDF5-native construct, preserving the shape differences row by row.
Practical takeaways
If you need to archive pandas data that contains variable-length arrays per row, avoid relying on a pandas-specific layout when interoperability is important. Store each column explicitly and use a variable-length dtype for the ragged one. This pattern preserves the data’s structure while staying within HDF5’s data model.
In short, keep the DataFrame, but separate the concerns during serialization: regular numeric columns become standard datasets, and the ragged column becomes a variable-length dataset. That’s all you need to get a robust and reusable HDF5 file.
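The separation of concerns above can be packaged into a small pair of helpers. This is a minimal sketch, assuming every ragged cell is a 1D float array and the remaining columns are numeric; the names save_frame and load_frame are illustrative, not a library API.

```python
import h5py
import numpy as np
import pandas as pd

def save_frame(frame, path, ragged_cols):
    """Write each column as its own dataset; ragged columns get a vlen dtype."""
    varlen_f64 = h5py.vlen_dtype(np.dtype('float64'))
    with h5py.File(path, 'w') as f:
        for col in frame.columns:
            if col in ragged_cols:
                f.create_dataset(col, data=frame[col], dtype=varlen_f64)
            else:
                f[col] = frame[col].to_numpy()

def load_frame(path, ragged_cols):
    """Rebuild a DataFrame; ragged datasets become object columns of arrays."""
    with h5py.File(path, 'r') as f:
        data = {}
        for col in f:
            vals = f[col][:]
            data[col] = list(vals) if col in ragged_cols else vals
    return pd.DataFrame(data)
```

Usage mirrors the example from the article: `save_frame(frame, 'roundtrip.h5', {'alpha'})` followed by `load_frame('roundtrip.h5', {'alpha'})` recovers both the regular and the ragged column.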