2025, Nov 26 17:00

Stop Missing Nodes and Edges in graph-tool: Build a Proper COO Sparse Adjacency from pandas and SciPy

Learn why graph-tool drops vertices after pandas to SciPy conversion and fix it: build a COO sparse adjacency with set shape, symmetrize, deduplicate edges.

When you build an adjacency matrix from a pandas DataFrame and pass it to graph-tool, it’s easy to end up with a visualization that drops edges or vertices. A common symptom is a node that should have multiple connections appearing with only one, or a vertex index present in the data but missing in the rendered graph. The root cause is typically indexing and shape handling during the DataFrame → matrix → graph conversion.

Problem demonstration

The dataset is a two-column edge list. The adjacency matrix is made symmetric with pandas, then converted to a SciPy sparse matrix and drawn with graph-tool. The following reproduces the issue pattern:

import numpy as np
import scipy.sparse
import pandas as pd
from graph_tool.all import *

edges_df = pd.DataFrame({
    'p1': [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'p2': [2, 4, 3, 4, 5, 14, 4, 5, 14, 17]
})

def make_square_adj(tbl, c_left, c_right):
    tbl_ct = pd.crosstab(tbl[c_left], tbl[c_right])
    unified = tbl_ct.columns.union(tbl_ct.index)
    out = tbl_ct.reindex(index=unified, columns=unified, fill_value=0)
    return out

m_ab = make_square_adj(edges_df, 'p1', 'p2')
m_ba = make_square_adj(edges_df, 'p2', 'p1')

sym_df = m_ab + m_ba

arr_dense = sym_df.to_numpy()
G_bad = Graph(scipy.sparse.lil_matrix(arr_dense), directed=False)
graph_draw(G_bad, vertex_text=G_bad.vertex_index)

What’s actually going wrong

There are two subtle but critical details. First, graph-tool numbers vertices from 0 when it imports a SciPy matrix: row and column i become vertex i. Second, the matrix only knows positions, not labels. If you do not set its shape explicitly from the maximum vertex id, its dimensions come either from the coordinates that happen to be present or from the dimensions of the dense array you passed in. Neither guarantees alignment with your original labels when they are 1-indexed or non-contiguous. The result is a mismatch between your semantic node ids and the matrix row/column positions, which then looks like missing nodes or edges in the layout.

If your edge list contains the labels 1, 2, 3, 4, 5, 14, 17, as in the example above, the crosstab-based matrix is assembled over the set of unique labels, then converted to a dense array whose indices run from 0 to len(unique)-1. Feeding that into graph-tool yields vertices 0..N-1, not your original labels: with seven unique labels you get vertices 0 to 6, and the vertex drawn as 5 is really label 14. Gaps in the labels make the mismatch worse. The fix is to construct the sparse adjacency directly in coordinate (COO) format with an explicit shape that spans up to the largest id, symmetrize it for an undirected graph, and, if needed, deduplicate multi-edges.
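The renumbering can be made visible with pandas alone. The following is a minimal sketch that reuses the edge list from the demonstration; the names `edges`, `square`, and `label_to_pos` are illustrative and not part of the original code.

```python
import pandas as pd

# Same edge list as in the problem demonstration.
edges = pd.DataFrame({
    'p1': [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'p2': [2, 4, 3, 4, 5, 14, 4, 5, 14, 17],
})

ct = pd.crosstab(edges['p1'], edges['p2'])
labels = ct.columns.union(ct.index)  # sorted unique labels from both columns
square = ct.reindex(index=labels, columns=labels, fill_value=0)

# Position of each original label in the dense array that to_numpy() returns.
label_to_pos = {int(label): pos for pos, label in enumerate(labels)}

print(square.shape)   # (7, 7), although the largest label is 17
print(label_to_pos)   # label 14 sits at position 5, label 17 at position 6
```

Once `to_numpy()` is called, only the positions survive, so graph-tool has no way to know that vertex 5 was originally labelled 14.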

Solution

Build the adjacency in COO form from the edge list, enforce integer dtypes, define the shape as max label + 1, symmetrize by adding the transpose, and collapse duplicates by thresholding.

import numpy as np
import scipy.sparse
import pandas as pd
from graph_tool.all import *

edge_pairs = pd.DataFrame({
    'p1': [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'p2': [2, 4, 3, 4, 5, 14, 4, 5, 14, 17]
})

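def build_coo_undirected(edge_tab, src_col, dst_col):
    src_vals, dst_vals = edge_tab[src_col].values, edge_tab[dst_col].values
    # Labels are used directly as row/column positions, so they must be integers.
    assert pd.api.types.is_integer_dtype(src_vals), "src_col must have integer type"
    assert pd.api.types.is_integer_dtype(dst_vals), "dst_col must have integer type"
    # Size the matrix by the largest label so that position i always means label i.
    n_vertices = max(np.max(src_vals), np.max(dst_vals)) + 1
    # One unit entry per listed edge.
    unit_weights = np.ones_like(src_vals)
    coo_mat = scipy.sparse.coo_matrix(
        (unit_weights, (src_vals, dst_vals)), shape=(n_vertices, n_vertices)
    )
    # Symmetrize: mirror every (i, j) entry to (j, i).
    coo_mat = coo_mat + coo_mat.T
    # Collapse repeated edges: any positive count becomes a single 1.
    coo_mat = (coo_mat >= 1).astype('int8')
    return coo_mat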

adj_coo = build_coo_undirected(edge_pairs, 'p1', 'p2')
G_ok = Graph(adj_coo, directed=False)
print(adj_coo.toarray())
graph_draw(G_ok, vertex_text=G_ok.vertex_index)
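Before rendering anything, it can help to check the adjacency against the raw edge list. The sketch below rebuilds the same matrix with NumPy and SciPy only (no graph-tool required) and compares per-vertex degrees. The variable names are illustrative, and the comparison assumes the edge list contains no repeated pairs and no self-loops, which holds for this example.

```python
import numpy as np
import pandas as pd
import scipy.sparse

edges = pd.DataFrame({
    'p1': [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'p2': [2, 4, 3, 4, 5, 14, 4, 5, 14, 17],
})

src = edges['p1'].to_numpy()
dst = edges['p2'].to_numpy()
n = int(max(src.max(), dst.max())) + 1

# Same construction as in the solution: explicit shape, symmetrize, threshold.
adj = scipy.sparse.coo_matrix((np.ones_like(src), (src, dst)), shape=(n, n))
adj = ((adj + adj.T) >= 1).astype('int8')

# Degree according to the matrix: sum of each row of the 0/1 adjacency.
deg_matrix = np.asarray(adj.sum(axis=1)).ravel()

# Degree according to the edge list: how often each label appears on either side.
# (Valid only without duplicate pairs or self-loops.)
deg_edges = np.bincount(np.concatenate([src, dst]), minlength=n)

assert np.array_equal(deg_matrix, deg_edges)
print(deg_matrix[2], deg_matrix[3], deg_matrix[17])  # 5 5 1
```

If the assertion fails on your own data, the mismatch usually points at duplicated edges, self-loops, or labels that were remapped somewhere upstream.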

Why this works

The COO representation mirrors the edge list structure: coordinates and values. Creating it directly from the two columns avoids any reindexing side effects from dense intermediate structures. Providing the matrix shape forces inclusion of empty rows/columns for any missing labels between 0 and the maximum id, which stabilizes vertex indexing when the graph is imported. In this example the unused ids (0, 6 to 13, 15, and 16) therefore appear as vertices with no edges, while every labelled vertex keeps its own number. Adding the transpose guarantees symmetry for an undirected graph. Thresholding to values greater than or equal to one collapses duplicates from both directions into single edges.
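Two of these points, shape inference and duplicate collapsing, are easy to see on a small case. The following sketch uses a hypothetical toy edge list (not the article's data) in which the pair (1, 2) is listed twice and its reverse (2, 1) once.

```python
import numpy as np
import scipy.sparse

# Hypothetical toy edge list: (1, 2) appears twice, (2, 1) once, plus (3, 5).
src = np.array([1, 1, 2, 3])
dst = np.array([2, 2, 1, 5])
vals = np.ones_like(src)

# Without shape=, SciPy sizes rows from max(src) and columns from max(dst).
inferred = scipy.sparse.coo_matrix((vals, (src, dst)))
print(inferred.shape)   # (4, 6): not even square

# With an explicit shape, every id from 0 to the maximum gets a row and a column.
n = int(max(src.max(), dst.max())) + 1
explicit = scipy.sparse.coo_matrix((vals, (src, dst)), shape=(n, n))

# Symmetrizing sums duplicate coordinates, so entries can exceed 1.
sym = explicit + explicit.T
counts = sym.toarray()
print(counts[1, 2], counts[2, 1])   # 3 3

# Thresholding turns any positive count into a single 0/1 entry.
binary = (sym >= 1).astype('int8').toarray()
print(binary[1, 2], binary[2, 1])   # 1 1
```

The thresholding step discards multiplicities on purpose. If repeated edges carry meaning in your data, keep `counts` and treat the values as weights instead.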

Why you want this on your radar

Graph pipelines that hop between pandas, NumPy, SciPy, and graph-tool are sensitive to indexing conventions and shape inference. A visually “wrong” graph often stems from a harmless-looking conversion that silently renumbers nodes. Getting the sparse shape and indexing right at the start saves hours of debugging layout artifacts that are in fact data alignment issues.

Takeaways

Feed graph-tool a sparse adjacency matrix that respects your original integer labels and explicitly specifies the matrix shape. Keep node ids integral, build in COO format from the edge list, symmetrize for undirected graphs, and collapse duplicates if necessary. With that in place, the visualization will reflect the true connectivity of your data rather than the side effects of intermediate index remapping.
