https://pytroubles.com/en/posts/id996-vectorized-binary-indicator-matrix-in-pandas-dataframes-reshape-one-hot-encode-groupby-transpose

Vectorized Binary Indicator Matrix in Pandas DataFrames: Reshape, One-Hot Encode, Groupby, Transpose

Build a Binary Indicator Matrix from DataFrame Rows in Pandas with stack, get_dummies, and groupby

Vectorized Binary Indicator Matrix in Pandas DataFrames: Reshape, One-Hot Encode, Groupby, Transpose

Learn to convert rows of integers into a binary indicator matrix in Pandas using stack, get_dummies, groupby and transpose for a fast, loop-free solution.

2025-10-18T12:00:06+03:00

Turning rows of integers into a binary indicator matrix looks like one-hot encoding at first glance, but the goal here is different: for each original row, we want a single binary column that marks positions (index values) present in that row and zeros everywhere else. The challenge is doing this for many rows at once without loops.Example and a naive attemptConsider a simple Pandas DataFrame where each row contains integer positions we want to flag:import pandas as pd data = pd.DataFrame({ "A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9], }, index=[0, 1, 2]) A straightforward try for a single row is to create a range of all possible positions and check membership against that row. It does work for one row, but it doesn’t scale across the entire frame:src_df = data out_grid = pd.DataFrame(range(src_df.max(axis=None) + 1)).isin(list(src_df.iloc[0])) This approach only handles the first row and doesn’t generalize to all rows at once.What’s actually happeningThe structure we want emerges if we switch perspective. If we pivot the wide table to a long series, we can one-hot encode the integer values, then aggregate by the original rows. That preserves the positions and makes it trivial to build a binary indicator matrix for every row in one pass.Vectorized solution in PandasHere is a compact, fully vectorized pipeline that performs the transformation end-to-end:result = pd.get_dummies(data.stack()).groupby(level=0).sum().T It proceeds in four steps. First, stack reshapes the data to a long Series with a MultiIndex consisting of the original row index and the former column labels. Next, get_dummies converts the integer values into one-hot columns. Then groupby(level=0).sum() aggregates these indicators by the original rows (the first level of the MultiIndex). Finally, .T transposes so that integer positions become the row index and the original rows become columns.The resulting output matches the intended binary indicator layout: 0 1 2 1 1 0 0 2 0 1 0 3 0 0 1 4 1 0 0 5 0 1 0 6 0 0 1 7 1 0 0 8 0 1 0 9 0 0 1 Peeking under the hoodIf you want to see the intermediate objects for clarity, break it into named steps. This helps verify how the data flows through the pipeline without writing any loops:s_long = data.stack() ohe = pd.get_dummies(s_long) agg = ohe.groupby(level=0).sum() final = agg.T The long series produced by stack pairs the original row index with the original column name. get_dummies converts the integer values into indicator columns. Grouping by the first index level sums indicators within each original row. Transposing rotates the table into the desired orientation with positions as the index.Why this mattersReshaping then one-hot encoding avoids custom loops and per-row logic while keeping everything in idiomatic Pandas. The transformation is easy to read, debuggable, and applies to any number of rows. If you were thinking about a for-loop or .apply() over rows, this pattern eliminates that need. Also, sticking with numeric indexing keeps the operations predictable and clear.ConclusionWhen you need a binary indicator matrix that marks integer positions from each row, reshape to long with stack, one-hot encode with get_dummies, aggregate by the original rows, and transpose. The single expression pd.get_dummies(data.stack()).groupby(level=0).sum().T solves the problem in a vectorized way and scales naturally across the entire DataFrame.

pandas, python, binary indicator matrix, one-hot encoding, get_dummies, stack, groupby, transpose, DataFrame, vectorized solution, no loops, reshape, wide to long, one-hot encode

2025

2025, Oct 18 12:00

Build a Binary Indicator Matrix from DataFrame Rows in Pandas with stack, get_dummies, and groupby

Learn to convert rows of integers into a binary indicator matrix in Pandas using stack, get_dummies, groupby and transpose for a fast, loop-free solution.

Example and a naive attempt

Consider a simple Pandas DataFrame where each row contains integer positions we want to flag:

import pandas as pd
data = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6],
    "C": [7, 8, 9],
}, index=[0, 1, 2])

A straightforward try for a single row is to create a range of all possible positions and check membership against that row. It does work for one row, but it doesn’t scale across the entire frame:

src_df = data
out_grid = pd.DataFrame(range(src_df.max(axis=None) + 1)).isin(list(src_df.iloc[0]))

This approach only handles the first row and doesn’t generalize to all rows at once.

What’s actually happening

The structure we want emerges if we switch perspective. If we pivot the wide table to a long series, we can one-hot encode the integer values, then aggregate by the original rows. That preserves the positions and makes it trivial to build a binary indicator matrix for every row in one pass.

Vectorized solution in Pandas

Here is a compact, fully vectorized pipeline that performs the transformation end-to-end:

result = pd.get_dummies(data.stack()).groupby(level=0).sum().T

It proceeds in four steps. First, stack reshapes the data to a long Series with a MultiIndex consisting of the original row index and the former column labels. Next, get_dummies converts the integer values into one-hot columns. Then groupby(level=0).sum() aggregates these indicators by the original rows (the first level of the MultiIndex). Finally, .T transposes so that integer positions become the row index and the original rows become columns.

The resulting output matches the intended binary indicator layout:

Peeking under the hood

If you want to see the intermediate objects for clarity, break it into named steps. This helps verify how the data flows through the pipeline without writing any loops:

s_long = data.stack()
ohe = pd.get_dummies(s_long)
agg = ohe.groupby(level=0).sum()
final = agg.T

The long series produced by stack pairs the original row index with the original column name. get_dummies converts the integer values into indicator columns. Grouping by the first index level sums indicators within each original row. Transposing rotates the table into the desired orientation with positions as the index.

Why this matters

Reshaping then one-hot encoding avoids custom loops and per-row logic while keeping everything in idiomatic Pandas. The transformation is easy to read, debuggable, and applies to any number of rows. If you were thinking about a for-loop or .apply() over rows, this pattern eliminates that need. Also, sticking with numeric indexing keeps the operations predictable and clear.

Conclusion

When you need a binary indicator matrix that marks integer positions from each row, reshape to long with stack, one-hot encode with get_dummies, aggregate by the original rows, and transpose. The single expression pd.get_dummies(data.stack()).groupby(level=0).sum().T solves the problem in a vectorized way and scales naturally across the entire DataFrame.

The article is based on a question from StackOverflow by lane-h-rogers and an answer by PaulS.

dataframe pandas python