2025, Nov 03 09:00
Assigning Subgraph IDs Without Hub Collapse: traverse related_user_id -> actor_user_id to group edges
Prevent over-grouping in user-to-user interactions: assign subgraph IDs by traversing related_user_id -> actor_user_id, not weakly connected components.
When you label user-to-user interactions with a subgraph ID, the intuitive approach is to compute connected components over a graph of actor_user_id → related_user_id. The catch is that shared actors often act like hubs, collapsing many unrelated edges into one giant component. If the goal is to join only those rows that truly share connection paths on the related_user_id side, the standard weakly connected components over the actor → related topology will over-group your data.
Reproducing the pitfall
The following minimal example mirrors the setup where a recurring actor creates a single, overly large component, even though the related_user_id branches do not overlap:
import pandas as pd
import networkx as nx
records = pd.DataFrame({
'properties_id': ['A', 'A', 'A', 'A'],
'global_journey_id': ['A1', 'A1', 'A1', 'A1'],
'actor_user_id': ['abc', 'abc', 'pat', 'abc'],
'related_user_id': ['def', 'efg', 'def', 'lal'],
})
net = nx.DiGraph()
net.add_edges_from(zip(records['actor_user_id'], records['related_user_id']))
parts = list(nx.weakly_connected_components(net))
Because the node abc appears in several edges, weak connectivity ties everything together. That is the opposite of the intent, where you only want to group interactions if their related_user_id nodes are connected, directly or through co-occurrence paths, and not merely because they share the same actor_user_id.
What counts as a shared connection here
The grouping rule is stricter than generic connectivity. Two rows belong to the same subgraph if their related_user_id values are linked through an actual chain of shared occurrences under actors. If two edges share only the same actor_user_id but their related_user_id branches never touch, they must remain in different subgraphs. In other words, actor nodes should not behave like highways that merge unrelated branches; the related side must overlap for grouping to happen.
A practical way to enforce the rule
The robust way is to flip the perspective. Instead of wiring actor_user_id → related_user_id and taking components, look from the related side toward actors and connect only where explicit paths exist. This can be achieved by traversing edges in the direction related_user_id → actor_user_id and grouping only those edge pairs that are actually reachable through that direction. It prevents a shared actor from merging branches unless the related_user_id side also connects.
Traverse from related_user_id → actor_user_id and assign subgraphs only where explicit connections exist; do not group by shared actors alone.
Implementation
The snippet below groups within properties_id and global_journey_id, builds a directional hop map from related_user_id to actor_user_id, explores reachable edge chains, and maps a sub_graph label back to the original rows. The program logic preserves the intended behavior without collapsing unrelated related_user_id branches under a common actor.
import pandas as pd
from collections import defaultdict, deque
sample = {
'properties_id': ['A', 'A', 'A', 'A'],
'global_journey_id': ['A1', 'A1', 'A1', 'A1'],
'actor_user_id': ['abc', 'abc', 'pat', 'abc'],
'related_user_id': ['def', 'efg', 'def', 'lal'],
}
frame = pd.DataFrame(sample)
out_rows = []
batches = frame.groupby(['properties_id', 'global_journey_id'])
for (pid, jid), chunk in batches:
hop_map = defaultdict(list)
for _, rec in chunk.iterrows():
hop_map[rec['related_user_id']].append(rec['actor_user_id'])
pair_keys = [frozenset([rec['actor_user_id'], rec['related_user_id']]) for _, rec in chunk.iterrows()]
pair_set = set(pair_keys)
seen_pairs = set()
pair2sg = {}
sg_id = 1
for pair in pair_set:
if pair in seen_pairs:
continue
a, b = list(pair)
seed = b if b in hop_map else a
if seed not in hop_map:
continue
q = deque([seed])
cluster_pairs = set()
while q:
cur = q.popleft()
for nxt in hop_map.get(cur, []):
candidate = frozenset([cur, nxt])
if candidate in pair_set and candidate not in seen_pairs:
seen_pairs.add(candidate)
cluster_pairs.add(candidate)
if nxt in hop_map:
q.append(nxt)
for pr in cluster_pairs:
pair2sg[pr] = sg_id
sg_id += 1
for _, rec in chunk.iterrows():
k = frozenset([rec['actor_user_id'], rec['related_user_id']])
label = pair2sg.get(k, None)
out_rows.append({
'properties_id': rec['properties_id'],
'global_journey_id': rec['global_journey_id'],
'actor_user_id': rec['actor_user_id'],
'related_user_id': rec['related_user_id'],
'sub_graph': label
})
result_df = pd.DataFrame(out_rows)
result_df = result_df.sort_values(by=['properties_id', 'global_journey_id', 'sub_graph'])
print(result_df)
This keeps the grouping faithful to the requirement: two interactions end up in the same subgraph only when their related_user_id nodes are connected through explicit paths of co-occurrence under actors. A shared actor alone does not trigger a merge.
Why it is worth the extra care
For graph-based analysis of user journeys or interaction logs, component assignments directly affect downstream aggregation, alerting, and reporting. If hubs like actor_user_id collapse unrelated branches, metrics blur across independent interactions and incident triage becomes noisy. By reorienting traversal toward related_user_id and restricting grouping to explicit connectivity, the resulting subgraphs align with the semantics of shared context rather than with arbitrary topology artifacts.
Takeaways
Choose the graph topology that expresses your grouping criterion. When the rule is about overlap on related_user_id rather than shared actors, do not compute components over the naive actor → related graph. Instead, traverse related_user_id → actor_user_id, derive connected edge sets under that direction, and map sub_graph labels back to the original rows. This prevents over-grouping, keeps branches isolated when they should be, and yields subgraphs that match the intended relationships.
The article is based on a question from StackOverflow by patrick and an answer by Daniel Raphael.