2025, Sep 23 07:00
Build adjacency from a large pandas DataFrame without OOM: vectorized crosstab, PyTorch edge lists, sparse tensors
Avoid dense 132k x 132k matrices in pandas. Learn a memory-efficient workflow: vectorized crosstab, direct PyTorch edge list, and sparse tensor construction.
Building a dense adjacency matrix from a large pandas DataFrame can look straightforward until it explodes in memory. When your index spans range(132000), materializing a 132k × 132k square matrix becomes infeasible, especially if the end goal is an edge list for PyTorch. There is a cleaner, efficient path that avoids the dense intermediate entirely.
Reproducing the setup
Consider a DataFrame whose rows each contain a small set of integer values (with some NaNs). The aim is to record, for each row, which indices appear as values in that row — adjacency data, in effect.
import pandas as pd
import numpy as np
from numpy.random import default_rng
# sample frame
df_demo = pd.DataFrame(index=range(10), columns=list('abcd'))
rng = default_rng()  # create the generator once, not per row
for ridx in df_demo.index:
    df_demo.loc[ridx] = rng.choice(10, size=4, replace=False)
# inject NaNs
df_demo.loc[1, 'b'] = np.nan
df_demo.loc[3, 'd'] = np.nan
A naive approach is to loop through rows and populate a square 0/1 matrix whose rows and columns are the same index domain.
# naive square adjacency: set 1s row by row, then fill the gaps with 0
adj_square = pd.DataFrame(index=df_demo.index, columns=df_demo.index)
for ridx in df_demo.index:
    # cast to int: the injected NaNs upcast the row's values to float
    adj_square.loc[ridx, df_demo.loc[ridx].dropna().astype(int)] = 1
adj_square = adj_square.fillna(0).astype(int)
Why this runs into trouble
This works for tiny data, but it does not scale. A 132k × 132k dense matrix is massive and will raise a MemoryError on typical hardware long before any useful work happens. The root cause isn’t the loop; it’s the attempt to realize a full square adjacency that most of the time remains very sparse.
For scale: 132_000 ** 2 is about 17.4 billion cells — roughly 130 GiB at pandas' default float64, and still about 16 GiB even at one byte per cell.
If your final deliverable is an edge list tensor with shape (2, number_of_edges), filling a dense DataFrame is unnecessary and wasteful.
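As a quick sanity check, the memory footprint of the dense representation can be estimated with plain arithmetic:

```python
# Estimated memory for a dense 132k x 132k adjacency matrix
n = 132_000
cells = n * n                       # 17,424,000,000 entries
gib_float64 = cells * 8 / 2**30     # pandas' default float64: 8 bytes per cell
gib_int8 = cells * 1 / 2**30        # even a 1-byte dtype barely helps
print(f"{cells:,} cells")
print(f"{gib_float64:.0f} GiB as float64, {gib_int8:.0f} GiB as int8")
```

No amount of dtype tuning closes a gap that large; the representation itself has to change.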
A vectorized pandas route (when you truly need a matrix)
If you do need a square adjacency representation in pandas, use a vectorized method that avoids Python loops. Flatten the frame with stack and let crosstab build the indicator matrix directly.
# vectorized adjacency via crosstab
series_flat = df_demo.stack().dropna()  # drop NaNs explicitly; newer pandas keeps them on stack()
adj_compact = (
    pd.crosstab(series_flat.index.get_level_values(0), series_flat.values)
      .rename_axis(index=None, columns=None)
)
This produces a 0/1 indicator matrix whose rows are the original indices and whose columns are the distinct values observed anywhere in the frame. It is compact and fast compared to manual loops. Still, if the index domain is 132k, a dense 132k × 132k object can exceed memory limits.
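Note that crosstab only emits columns for values that actually occur, so the result is generally not square. If you need both axes to span the full index domain (feasible only when that domain is small), reindex the result. A self-contained sketch with a toy frame:

```python
import pandas as pd

# toy frame: each row lists the indices it connects to
df = pd.DataFrame({'a': [1, 3], 'b': [2, 0]}, index=[0, 1])

flat = df.stack()
adj = pd.crosstab(flat.index.get_level_values(0), flat.values)

# reindex both axes to the full domain for a true square 0/1 matrix;
# rows/columns that never occur are filled with 0
domain = range(4)
adj_square = adj.reindex(index=domain, columns=domain, fill_value=0)
```

Here `adj_square` is 4 × 4 even though only rows 0 and 1 appear in the frame.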
The efficient path: build the edge list directly for PyTorch
Skip the square matrix altogether and produce exactly what PyTorch expects. From the stacked Series you already have row indices and their corresponding values; those pairs are your edges.
import torch

row_ids = series_flat.index.get_level_values(0).to_numpy()
col_ids = series_flat.to_numpy().astype(np.int64)  # NaN handling upcast the values to float
# PyTorch requires int64 indices for sparse COO tensors
coords = torch.tensor(np.stack([row_ids, col_ids]), dtype=torch.int64)
The resulting coords is a tensor of shape (2, number_of_edges). If you later want a sparse square tensor, you can construct it directly without forming a dense intermediate.
n = len(df_demo.index)  # pass an explicit size; otherwise the shape is inferred from the largest index
sparse_mat = torch.sparse_coo_tensor(coords, torch.ones(len(series_flat)), size=(n, n))
If the input contains duplicate coordinates, also call sparse_mat.coalesce(), which merges duplicates by summing their values.
Why this matters
Working with large-scale adjacency data requires acknowledging sparsity. Constructing a full dense matrix for 132k indices is not just slow; it is practically impossible in memory-constrained environments. A direct edge-list representation aligns with the intended downstream format and stays within feasible memory and compute budgets. Even if a square view is necessary, a sparse tensor is the right abstraction.
Takeaways
When the index space is large, prefer operating on stacked, long-form data and emit the minimal representation required by your target library. Use crosstab only if a matrix is absolutely needed, and prefer sparse tensors over dense DataFrames for adjacency at scale. This small change turns an out-of-memory failure into a compact, GPU-friendly pipeline.