2025, Oct 22 15:00
Why RGB Histograms Don’t Match KMeans Dominant Color Palettes: Counting Exact Pixels vs Clustering
Learn why top colors from an RGB histogram differ from dominant colors via KMeans. Compare exact pixel counts with palette clustering and Pillow getcolors.
Two scripts can look like they solve the same problem and still produce wildly different answers. That’s exactly what happens when you compare “top 10 colors” from a raw RGB histogram with the “dominant colors” some online tools return. The former counts exact pixel triplets. The latter typically compresses the image into a 10‑color palette using clustering. Those are not the same question.
Problem setup
You read an image with Pillow, enumerate all colors, and keep the ten most common ones. Running the same image through an online service yields a very different list. The suspicion is that PIL may be processing colors differently.
Reproducible code that counts exact colors
The following snippet takes the unsorted list from getcolors, tracks a running “top 10” by frequency, then sorts that shortlist by count. The logic deliberately counts exact RGB triplets; it does not group similar shades.
from PIL import Image
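pic = Image.open(image_url)  # image_url must be a local path or file object; Image.open does not fetch URLs
color_bag = pic.getcolors(maxcolors=100000)  # unsorted (count, color) pairs; None if the image has more than maxcolors distinct colors
best_ten = color_bag[:10]
for entry in color_bag[10:]:  # start after the ten entries already in best_ten, so none is considered twice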
    tally = entry[0]
    current_counts = []
    for chosen in best_ten:
        current_counts.append(chosen[0])
    if tally > min(current_counts):
        replace_at = current_counts.index(min(current_counts))
        best_ten[replace_at] = entry
ranked_best = sorted(best_ten, key=lambda item: item[0], reverse=True)
print(ranked_best)
print(best_ten)
What’s really causing the discrepancy
The two outputs disagree because they answer different questions. Counting exact triplets returns the most frequent pixel values in the image, which can easily be many close shades of the same gray. The online result instead behaves like palette extraction: it selects ten representative colors that summarize the whole image rather than the ten most repeated exact RGB values. That behavior is consistent with KMeans clustering on pixel values.
There’s also a separate detail: getcolors returns an unsorted list of (count, color) pairs, so a color’s position in that list says nothing about its frequency. A hand-maintained running top-k like the one above is easy to get subtly wrong; sorting by count is simpler and safer. But even with a correct ranking, a frequency list still won’t match a clustering-based palette, because the goals differ.
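As a minimal sketch of ranking getcolors output explicitly, the helper below uses heapq.nlargest instead of a hand-rolled loop. The function name top_colors and the tiny synthetic image are assumptions made for illustration; they are not part of the original script.

```python
import heapq

from PIL import Image


def top_colors(img, k=10):
    """Return the k most frequent (count, color) pairs, highest count first."""
    rgb = img.convert("RGB")
    # An image cannot have more distinct colors than pixels, so with this
    # maxcolors value getcolors never returns None.
    pairs = rgb.getcolors(maxcolors=rgb.width * rgb.height)
    return heapq.nlargest(k, pairs, key=lambda pair: pair[0])


# Synthetic 4x4 image: 11 black pixels, 4 red, 1 blue.
demo = Image.new("RGB", (4, 4), (0, 0, 0))
for x in range(4):
    demo.putpixel((x, 0), (255, 0, 0))
demo.putpixel((0, 1), (0, 0, 255))

result = top_colors(demo, k=2)
print(result)  # [(11, (0, 0, 0)), (4, (255, 0, 0))]
```

Because the ranking is done on the counts themselves, the result does not depend on the order in which getcolors happens to list the colors.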
Two ways to compute “top colors,” depending on the goal
If the goal is literally “the 10 most frequent exact RGB triplets,” a simple histogram is enough. If the goal is “the 10 colors you’d keep if you compressed the image to a palette of 10,” then you want KMeans cluster centers.
Exact frequencies using numpy + Counter:
from collections import Counter
import numpy as np
from PIL import Image
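pixels = np.asarray(Image.open('image.jpg').convert('RGB'))  # force 3 channels so reshape(-1, 3) is valid
freqs = Counter(map(tuple, pixels.reshape(-1, 3).tolist()))  # keys are plain-int (r, g, b) tuples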
print(freqs.most_common(10))
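For large images, the same exact count can probably be done without a Python-level loop over pixels. The sketch below assumes numpy's unique with axis=0 and return_counts=True; the helper name top_exact_colors and the synthetic array are illustrative, not from the original.

```python
import numpy as np


def top_exact_colors(pixels, k=10):
    """Return the k most frequent exact RGB triplets as ((r, g, b), count)."""
    flat = np.asarray(pixels).reshape(-1, 3)
    # Each distinct row is one exact color; counts holds its frequency.
    colors, counts = np.unique(flat, axis=0, return_counts=True)
    order = np.argsort(counts)[::-1][:k]  # indices of the largest counts first
    return [(tuple(int(c) for c in colors[i]), int(counts[i])) for i in order]


# Synthetic 4x4 RGB array: 11 black pixels, 4 red, 1 blue.
demo = np.zeros((4, 4, 3), dtype=np.uint8)
demo[0, :] = (255, 0, 0)
demo[1, 0] = (0, 0, 255)

top = top_exact_colors(demo, k=2)
print(top)  # [((0, 0, 0), 11), ((255, 0, 0), 4)]
```

This still counts exact triplets only; it changes the speed, not the question being answered.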
Cluster-based palette using KMeans:
from sklearn.cluster import KMeans
import numpy as np
from PIL import Image
pixels = np.asarray(Image.open('image.jpg').convert('RGB'))  # force 3 channels so reshape(-1, 3) is valid
km = KMeans(n_clusters=10, random_state=0, n_init="auto")
km.fit(pixels.reshape(-1, 3))
print(km.cluster_centers_)
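The cluster centers are centroids, meaning the per-channel mean of each group of pixels. They are floating-point values and are not guaranteed to coincide with any actual pixel in the image. That is expected; round them if you need displayable RGB values. Also, reshaping the H×W×3 array to N×3 returns a view rather than a copy when the data are contiguous, so this step is effectively free in typical cases.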
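A minimal sketch of turning the centers into a usable palette follows: round them to 8-bit values and rank them by how many pixels each cluster holds. The pixel data is invented for illustration, and n_init=10 is used instead of "auto" only so the sketch also runs on older scikit-learn releases.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented pixels on the gray axis: 300 black, 100 dark gray, 200 near-white.
pixels = np.vstack([
    np.full((300, 3), 0.0),
    np.full((100, 3), 40.0),
    np.full((200, 3), 250.0),
])

k = 2
km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(pixels)

# Round the float centers to displayable 8-bit RGB values.
centers = np.clip(np.rint(km.cluster_centers_), 0, 255).astype(np.uint8)
# Count how many pixels were assigned to each cluster.
sizes = np.bincount(km.labels_, minlength=k)

# Largest cluster first, as (rgb, pixel_count) pairs.
palette = sorted(
    ((tuple(int(c) for c in centers[i]), int(sizes[i])) for i in range(k)),
    key=lambda item: item[1],
    reverse=True,
)
print(palette)  # [((10, 10, 10), 400), ((250, 250, 250), 200)]
```

Note that (10, 10, 10) never occurs in the input: it is the mean of the 300 black and 100 dark gray pixels, which is exactly why a palette entry may not match any exact pixel value.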
Why this matters
Choosing the method changes the answer. A strict histogram can put multiple nearly identical dark tones in the top 10 if the scene is dominated by them. A KMeans palette groups similar colors together and represents each group by its center, so those near-duplicates collapse into one entry and the remaining entries are free to cover lighter tones and other visually distinct hues. If you plan to quantize an image to 10 colors, cluster centers are the right target. If you’re auditing for exact repeat use of colors, a histogram is correct.
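To make the contrast concrete, here is a small synthetic sketch. The pixel data is invented for illustration: 90% of the pixels are five near-identical dark grays and 10% are a single red. As an assumption for portability, n_init=10 replaces "auto" so the code also runs on older scikit-learn releases.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

# Invented data: 5 near-identical dark grays (180 pixels each) + 100 red pixels.
grays = [np.full((180, 3), v, dtype=np.uint8) for v in range(10, 15)]
red = np.tile(np.array([200, 0, 0], dtype=np.uint8), (100, 1))
pixels = np.vstack(grays + [red])

# Exact histogram: every gray shade (180) outnumbers red (100),
# so the top 3 exact colors are all grays.
top3 = Counter(map(tuple, pixels.tolist())).most_common(3)
print(top3)

# KMeans with 2 clusters: one center summarizes the grays, the other is the red.
km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(pixels.astype(np.float64))
centers = {tuple(int(c) for c in np.rint(row)) for row in km.cluster_centers_}
print(centers)
```

The histogram never reports red, while the two-color palette consists of one averaged gray and the red. Neither result is wrong; they answer different questions.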
Takeaways
Define “top colors” before coding. If you mean most frequent exact RGB values, use a frequency count. If you mean dominant palette colors, use KMeans. Don’t expect both approaches to match; they intentionally optimize for different outcomes. And if you rely on getcolors, don’t assume its output is ordered: sort explicitly by count if your logic needs consistent behavior.