https://pytroubles.com/en/posts/id1374-pandas-method-chaining-for-publisher-sales-shares-vectorized-other-bucketing-and-pie-chart

Pandas method chaining for publisher sales shares: vectorized 'Other' bucketing and pie chart

Compute publisher sales shares in pandas without loops: chaining groupby, assign, mask, and a pie chart

Pandas method chaining for publisher sales shares: vectorized 'Other' bucketing and pie chart

Learn to replace loop-heavy pandas code with a vectorized pipeline: groupby, assign, mask, bucket small publishers as 'Other', then plot sales share as a pie.

2025-10-30T13:00:08+03:00

2025-10-30T13:00:09+03:00

Turning exploratory data work into clear, reproducible code often starts with getting rid of unnecessary intermediates and loops. A common case in pandas is aggregating metrics, computing percentages, bucketing small contributors under “Other”, and drawing a chart. Below is a compact walkthrough of how to replace a loop-based approach with a chained, vectorized pipeline while preserving the exact behavior.Problem overviewThe task is to calculate each publisher’s share of global video game sales from the Kaggle videogamesales dataset, collapse publishers contributing less than 2% into a single “Other” group, and render a pie chart of the result. The initial implementation works, but it creates intermediate objects, mutates state across steps, and uses a manual loop to relabel items below the threshold.Baseline code that needs simplificationThe following snippet demonstrates the original approach with explicit intermediates and a for loop. The logic is intact, but names are adjusted for clarity.pub_sales_tbl = games_df.groupby('Publisher')[['Global_Sales']].sum().sort_values('Global_Sales', ascending=False) all_sales = pub_sales_tbl['Global_Sales'].sum() pub_sales_tbl['Share'] = pub_sales_tbl['Global_Sales'] / all_sales pos = 0 cutoff = 0.02 publisher_labels = list(pub_sales_tbl.index) for ratio in pub_sales_tbl['Share']: if ratio < cutoff: publisher_labels[pos] = 'Other' pos = pos + 1 pub_sales_tbl.index = publisher_labels pub_sales_tbl = pub_sales_tbl.groupby(pub_sales_tbl.index).sum().sort_values('Global_Sales', ascending=False) plt.title("most profitable publisehr") plt.ylabel("") pie_ax = pub_sales_tbl['Global_Sales'].plot(kind='pie', figsize=(10, 5), legend=False) pie_ax.set_ylabel("") plt.show() What’s going on and why it’s suboptimalThis workflow computes totals, proportion, and a boolean condition correctly, but then walks the index positions with a counter to relabel entries below the threshold. That pattern is both error-prone and unnecessary in pandas, which already offers vectorized operations to transform columns and reassign labels in one pass. Even a small improvement like using enumerate for the index would still leave the core issue untouched: looping row by row is not needed here. The end result should be identical, but it can be expressed declaratively and composed step by step without creating and mutating intermediate variables in the global namespace.A chained, vectorized solutionThe following pipeline uses method chaining to compute the same result from the same dataset. It produces the pie chart and avoids explicit loops and extra variables. The returned object is a matplotlib figure; you could also adapt it to return the aggregated data used for plotting. The dataset is relatively small, so the pipeline favors clarity over micro-optimizations.import pandas as pd records = pd.read_csv('vgsales.csv') ( records .groupby('Publisher', as_index=False)['Global_Sales'].sum() .rename(columns={'Global_Sales': 'alltime_sales'}) .assign( alltime_sales_percent=lambda frame: frame['alltime_sales'].div(frame['alltime_sales'].sum()).mul(100) ) .assign( major_publisher_name=lambda frame: frame['Publisher'].mask(frame['alltime_sales_percent'] < 2, 'Other') ) .pipe(lambda frame: display(frame) or frame) .groupby('major_publisher_name')['alltime_sales_percent'].sum() .sort_values(ascending=False) .plot.pie( title='All-time sales share of major publishers', ylabel='', figsize=(5, 5), cmap='tab20b' ) ); Why this mattersChaining keeps the full data story in one readable pipeline: aggregate, express as percent, relabel by condition, roll up, sort, and plot. It reduces mental overhead, avoids namespace clutter, and makes each step auditable. When you need to adjust a threshold or inspect an intermediate transformation, it’s a one-line change rather than a multi-variable refactor. The approach also aligns with idiomatic pandas patterns and scales better conceptually than manual loops for row-wise edits.Practical takeawaysWhen you need to bucket small categories into an “Other” slice and visualize their share, prefer vectorized transformations such as assign and mask paired with groupby. If you ever stick to a loop, replacing manual index counters with enumerate is still a small quality-of-life improvement, but in this case eliminating the loop entirely is simpler and more reliable. The figure returned by the chained call can be used as-is, or you can tweak the last step to return the aggregated series for further reuse.A concise pipeline reduces the surface area for bugs and makes the intent obvious: compute shares once, label by threshold, aggregate, and plot. That is often the difference between processing code that merely works and code that is easy to maintain and revisit.

pandas method chaining, vectorized pandas, groupby assign mask, bucket to Other, publisher sales share, Kaggle video game sales, pie chart, data visualization, replace loops, reproducible code

2025

2025, Oct 30 13:00

Compute publisher sales shares in pandas without loops: chaining groupby, assign, mask, and a pie chart

Learn to replace loop-heavy pandas code with a vectorized pipeline: groupby, assign, mask, bucket small publishers as 'Other', then plot sales share as a pie.

Problem overview

The task is to calculate each publisher’s share of global video game sales from the Kaggle videogamesales dataset, collapse publishers contributing less than 2% into a single “Other” group, and render a pie chart of the result. The initial implementation works, but it creates intermediate objects, mutates state across steps, and uses a manual loop to relabel items below the threshold.

Baseline code that needs simplification

The following snippet demonstrates the original approach with explicit intermediates and a for loop. The logic is intact, but names are adjusted for clarity.

pub_sales_tbl = games_df.groupby('Publisher')[['Global_Sales']].sum().sort_values('Global_Sales', ascending=False)
all_sales = pub_sales_tbl['Global_Sales'].sum()
pub_sales_tbl['Share'] = pub_sales_tbl['Global_Sales'] / all_sales
pos = 0
cutoff = 0.02
publisher_labels = list(pub_sales_tbl.index)
for ratio in pub_sales_tbl['Share']:
    if ratio < cutoff:
        publisher_labels[pos] = 'Other'
    pos = pos + 1
pub_sales_tbl.index = publisher_labels
pub_sales_tbl = pub_sales_tbl.groupby(pub_sales_tbl.index).sum().sort_values('Global_Sales', ascending=False)
plt.title("most profitable publisehr")
plt.ylabel("")
pie_ax = pub_sales_tbl['Global_Sales'].plot(kind='pie', figsize=(10, 5), legend=False)
pie_ax.set_ylabel("")
plt.show()

What’s going on and why it’s suboptimal

This workflow computes totals, proportion, and a boolean condition correctly, but then walks the index positions with a counter to relabel entries below the threshold. That pattern is both error-prone and unnecessary in pandas, which already offers vectorized operations to transform columns and reassign labels in one pass. Even a small improvement like using enumerate for the index would still leave the core issue untouched: looping row by row is not needed here. The end result should be identical, but it can be expressed declaratively and composed step by step without creating and mutating intermediate variables in the global namespace.

A chained, vectorized solution

The following pipeline uses method chaining to compute the same result from the same dataset. It produces the pie chart and avoids explicit loops and extra variables. The returned object is a matplotlib figure; you could also adapt it to return the aggregated data used for plotting. The dataset is relatively small, so the pipeline favors clarity over micro-optimizations.

import pandas as pd
records = pd.read_csv('vgsales.csv')
(
    records
    .groupby('Publisher', as_index=False)['Global_Sales'].sum()
    .rename(columns={'Global_Sales': 'alltime_sales'})
    .assign(
        alltime_sales_percent=lambda frame: frame['alltime_sales'].div(frame['alltime_sales'].sum()).mul(100)
    )
    .assign(
        major_publisher_name=lambda frame: frame['Publisher'].mask(frame['alltime_sales_percent'] < 2, 'Other')
    )
    .pipe(lambda frame: display(frame) or frame)
    .groupby('major_publisher_name')['alltime_sales_percent'].sum()
    .sort_values(ascending=False)
    .plot.pie(
        title='All-time sales share of major publishers',
        ylabel='', figsize=(5, 5), cmap='tab20b'
    )
);

Why this matters

Chaining keeps the full data story in one readable pipeline: aggregate, express as percent, relabel by condition, roll up, sort, and plot. It reduces mental overhead, avoids namespace clutter, and makes each step auditable. When you need to adjust a threshold or inspect an intermediate transformation, it’s a one-line change rather than a multi-variable refactor. The approach also aligns with idiomatic pandas patterns and scales better conceptually than manual loops for row-wise edits.

Practical takeaways

When you need to bucket small categories into an “Other” slice and visualize their share, prefer vectorized transformations such as assign and mask paired with groupby. If you ever stick to a loop, replacing manual index counters with enumerate is still a small quality-of-life improvement, but in this case eliminating the loop entirely is simpler and more reliable. The figure returned by the chained call can be used as-is, or you can tweak the last step to return the aggregated series for further reuse.

A concise pipeline reduces the surface area for bugs and makes the intent obvious: compute shares once, label by threshold, aggregate, and plot. That is often the difference between processing code that merely works and code that is easy to maintain and revisit.

The article is based on a question from StackOverflow by just a tw highschooler and an answer by MuhammedYunus SaveGaza.

matplotlib pandas python