2026, Jan 05 11:00

Wheel factorization in the Sieve of Eratosthenes: why a 210-based wheel isn't much faster than a 30-based one in Python with Numba

Benchmarks of Numba-compiled Python sieves show that a 210-based wheel offers only marginal gains over a 30-based wheel. Learn why slice-based composite marking dominates runtime.

Wheel factorization looks like an easy win on top of the classic Sieve of Eratosthenes: if you skip non-coprime residues, you shrink the search space dramatically. Moving from a 30-based wheel to a 210-based wheel cuts candidates further, so you might expect a meaningful speedup. Yet benchmarks show almost no improvement and sometimes a slight slowdown. Let’s break down why that happens, what exactly the code is doing, and where the time actually goes.

Prime residues and wheels, in practice

Every prime other than 2, 3, and 5 is coprime to 30, so modulo 30 only 8 of the 30 residues can hold prime candidates. In Python, that set comes straight from gcd:

import math

residues30 = [x for x in range(1, 30) if math.gcd(x, 30) == 1]
# [1, 7, 11, 13, 17, 19, 23, 29]

If you take the differences between consecutive residues, starting the cycle at 7 and wrapping around after 29, you get the step increments [4, 2, 4, 2, 4, 6, 2, 6]. That’s the 30-wheel pattern. Extending the idea to 2*3*5*7 = 210, every prime from 11 upward is coprime to 210, which yields 48 residues and a corresponding increment pattern of length 48. The 210-based wheel therefore processes 48 candidates out of every 210 numbers instead of 56 out of 210 (8 per block of 30) with the 30-based wheel. That is where the expectation of a roughly 14% speedup comes from: 8 fewer candidates out of 56 is 1/7 ≈ 14.3%.
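
If you’d rather derive those patterns than hard-code them, a small helper built on gcd is enough. The wheel_jumps function below is just one illustrative way to write it, rotating the gap cycle to whatever starting residue the sieve uses:

import math

def wheel_jumps(modulus: int, start: int) -> list[int]:
    # Gaps between consecutive numbers coprime to `modulus`, with the cycle
    # rotated to begin at `start` (which must itself be coprime to `modulus`).
    residues = [r for r in range(1, modulus + 1) if math.gcd(r, modulus) == 1]
    gaps = [(b - a) % modulus for a, b in zip(residues, residues[1:] + residues[:1])]
    i = residues.index(start % modulus)
    return gaps[i:] + gaps[:i]

print(wheel_jumps(30, 7))         # [4, 2, 4, 2, 4, 6, 2, 6]
print(len(wheel_jumps(210, 11)))  # 48 entries, matching the table used below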

Code example: the two sieve variants

Below are two Numba-compiled implementations of the sieve that differ only in their wheel. The first uses the 30-based pattern and starts at 7. The second uses the 210-based pattern and starts at 11. Both versions pre-handle the small primes needed for their start point and then roll the wheel to iterate over candidate primes.

import math
import numpy as np
import numba as nb

@nb.njit(cache=True)
def sieve_wheel30(limit: int) -> np.ndarray:
    # Gaps between consecutive numbers coprime to 30, cycle starting at 7.
    jumps = [4, 2, 4, 2, 4, 6, 2, 6]
    flags = np.ones(limit + 1, dtype=np.uint8)
    flags[:2] = False
    # Pre-mark multiples of 2 and the odd multiples of 3 and 5, from their squares.
    for sq, step in ((4, 2), (9, 6), (25, 10)):
        flags[sq::step] = False
    p = 7
    cap = int(math.sqrt(limit) + 0.5)  # only primes up to sqrt(limit) need sieving
    idx = 0
    while p <= cap:
        if flags[p]:
            flags[p * p :: 2 * p] = False  # cross off odd multiples of p from p*p
        p += jumps[idx]  # roll the wheel to the next candidate
        idx = (idx + 1) % 8
    return np.nonzero(flags)[0]

@nb.njit(cache=True)
def sieve_wheel210(limit: int) -> np.ndarray:
    # Gaps between consecutive numbers coprime to 210, cycle starting at 11 (48 entries).
    jumps = [
        2, 4, 2, 4, 6, 2, 6, 4, 2, 4, 6, 6,
        2, 6, 4, 2, 6, 4, 6, 8, 4, 2, 4, 2,
        4, 8, 6, 4, 6, 2, 4, 6, 2, 6, 6, 4,
        2, 4, 6, 2, 6, 4, 2, 4, 2, 10, 2, 10,
    ]
    flags = np.ones(limit + 1, dtype=np.uint8)
    flags[:2] = False
    # Pre-mark multiples of 2 and the odd multiples of 3, 5, and 7, from their squares.
    for sq, step in ((4, 2), (9, 6), (25, 10), (49, 14)):
        flags[sq::step] = False
    p = 11
    cap = int(math.sqrt(limit) + 0.5)
    idx = 0
    while p <= cap:
        if flags[p]:
            flags[p * p :: 2 * p] = False
        p += jumps[idx]
        idx = (idx + 1) % 48
    return np.nonzero(flags)[0]
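
To compare the two yourself, a minimal harness along these lines is enough. The limit of 10**8 and the use of timeit are arbitrary choices for illustration, and the small warm-up calls keep Numba’s JIT compilation out of the measurement:

import timeit

limit = 10**8
sieve_wheel30(100)    # warm up: trigger JIT compilation before timing
sieve_wheel210(100)

assert np.array_equal(sieve_wheel30(limit), sieve_wheel210(limit))
print("wheel30 :", min(timeit.repeat(lambda: sieve_wheel30(limit), number=1, repeat=5)))
print("wheel210:", min(timeit.repeat(lambda: sieve_wheel210(limit), number=1, repeat=5)))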

What’s actually going on

The experimental results show that both functions return the same primes and run in almost the same time across a wide range of input sizes. In initial tests, the 210-based wheel is sometimes even slightly slower. After controlling for incidental differences and making the two versions structurally identical except for the wheel itself, the 210-based version comes out a tiny bit faster, but still nowhere near the expected ~14% improvement.

The speedup is insignificant because you only sped up an insignificant part of the overall work. Most of the time goes into the flags[...] = False slice assignments, and those are identical for both wheels: one pass for each prime up to sqrt(limit). It doesn't matter that the larger wheel finds those primes a bit faster.

In other words, the dominant cost in these implementations is crossing off composites with slice assignments. That work is done for each prime up to the square root of the limit. The wheel size doesn’t change how many such slice writes occur; it only affects how you step through candidate positions to test if they are prime. Because the composite-marking slices dominate runtime, making candidate stepping slightly cheaper barely moves the needle.
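
A back-of-the-envelope count makes that imbalance concrete. The snippet below is plain Python, separate from the sieves above, and the limit of 10**8 is an arbitrary example: it counts how many candidate positions each wheel visits below sqrt(limit) versus how many array cells the slice assignments write in total.

import math

limit = 10**8
cap = math.isqrt(limit)

# Candidate positions each wheel steps through up to sqrt(limit).
steps_30 = sum(1 for x in range(7, cap + 1) if math.gcd(x, 30) == 1)
steps_210 = sum(1 for x in range(11, cap + 1) if math.gcd(x, 210) == 1)

# Cells written by flags[p*p::2*p] = False, summed over the primes p >= 7 up to
# sqrt(limit); both versions do essentially this same marking work.
small_primes = [p for p in range(7, cap + 1)
                if all(p % q for q in range(2, math.isqrt(p) + 1))]
cells_marked = sum((limit - p * p) // (2 * p) + 1 for p in small_primes)

print(steps_30, steps_210, cells_marked)

At this size the wheels step through only a few thousand candidates, while the slice assignments write tens of millions of cells, so trimming the candidate count by a seventh barely changes the total.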

The fix that isn’t needed: why there’s no “corrected” code

There’s nothing functionally wrong with either implementation. The larger wheel reduces the number of candidate checks, but the code path that consumes most time remains unchanged and equally frequent. That is why there is no meaningful wall-clock benefit. The controlled rewrite that aligns loop structure and indexing between the two versions confirms the same conclusion: the 210-based wheel runs just a hair faster, yet the improvement is negligible.
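
As a sketch of what such a controlled rewrite might look like (an illustration under the same assumptions as above, not the benchmarked code itself), one option is a single compiled kernel that takes the wheel as data, so both variants share every instruction except the jump table and the pre-pass:

@nb.njit(cache=True)
def sieve_wheel(limit, start, jumps, presieve):
    # Same body as the functions above, but the wheel (start, jumps) and the
    # small-prime pre-pass are passed in as arrays instead of being hard-coded.
    flags = np.ones(limit + 1, dtype=np.uint8)
    flags[:2] = False
    for k in range(presieve.shape[0]):
        flags[presieve[k, 0] :: presieve[k, 1]] = False
    p = start
    cap = int(math.sqrt(limit) + 0.5)
    idx = 0
    n = jumps.shape[0]
    while p <= cap:
        if flags[p]:
            flags[p * p :: 2 * p] = False
        p += jumps[idx]
        idx = (idx + 1) % n
    return np.nonzero(flags)[0]

jumps30 = np.array([4, 2, 4, 2, 4, 6, 2, 6], dtype=np.int64)
pre30 = np.array([[4, 2], [9, 6], [25, 10]], dtype=np.int64)
assert np.array_equal(sieve_wheel(10**6, 7, jumps30, pre30), sieve_wheel30(10**6))

Timing both wheels through one shared kernel removes incidental code-generation differences, which is exactly the point of the controlled comparison.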

Why this matters

This is a practical reminder that theoretical reductions in search space do not automatically translate into proportional runtime wins. If most of the cost sits elsewhere—in this case, in the composite marking via slice assignments—then optimizing the candidate iteration will not produce the gains you expect. Measurement and bottleneck awareness are essential, even for algorithms that look obviously optimizable on paper.

Takeaways

Wheel factorization is useful for narrowing candidates, but end-to-end performance depends on where time is actually spent. Here, both the 30-based and 210-based wheels perform the same dominant work: marking composites for each prime up to sqrt(limit). That makes the larger wheel’s advantage marginal in practice. Expect correctness and minor improvements, not a double-digit speedup, unless you change the work that really dominates the runtime.