2026, Jan 01 21:00
Memory-efficient Python mapping for millions of int IDs using a 32-bit array-backed open-addressing hash table
Learn how to cut Python dict memory by 10x for millions of int IDs using a 32-bit array.array open-addressing hash table, with code and practical tips.
When you store millions of opaque IDs as Python int in dict-based structures, the time complexity might look fine, but memory use does not. Each int is a full Python object, and each dict entry expects Python objects as keys and values. The result is an order-of-magnitude gap between the payload you intend to store and the memory Python actually allocates.
Minimal example that triggers high memory usage
The following illustrates the pattern: millions of integer IDs mapped to integers via a regular dict. It runs quickly, but the memory footprint is dominated by Python object overhead:
from pympler.asizeof import asizeof
orig_map = dict(zip(range(1500000), range(1500000)))
print(asizeof(orig_map)) # Example footprint measurement
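To put that measurement in perspective: on one 64-bit CPython build the dict above measures 179877936 bytes (the same figure appears in the comparison below; exact numbers vary by build and pympler version). That works out to roughly 120 bytes per key/value pair, against the 8 bytes two 32-bit integers actually need:

```python
# Illustrative arithmetic on an example asizeof reading; the exact total
# depends on the Python build and pympler version.
total_bytes = 179_877_936
entries = 1_500_000
print(total_bytes // entries)  # -> 119 bytes per entry vs 8 bytes of payload
```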
What actually hurts memory
In CPython, dict keys and values are Python objects. That means every int carries object headers plus the dict’s own pointers and bookkeeping. The numeric payload is tiny, but the per-object overhead is not. If your IDs are opaque and fit in 32 bits, there is no intrinsic need to pay for a full Python object per ID; the cost appears because the standard dict works with Python objects by design. To drop the overhead, you must also avoid containers that require Python objects as elements.
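You can see the per-object cost directly with sys.getsizeof; the exact figure depends on the CPython build, but the gap to a 4-byte payload is always large:

```python
import sys

# A raw 32-bit ID needs 4 bytes, but a Python int is a heap object with
# its own header. On a typical 64-bit CPython build even a small int
# reports 28 bytes, roughly seven times the payload.
print(sys.getsizeof(123456789))
```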
A lean approach: an array-backed open-addressing hash table
The way to actually store IDs in about 4 bytes each is to implement a custom hash table that keeps raw 32-bit integers in contiguous arrays. Using array.array for both keys and values removes Python object overhead for each entry. Empty and deleted slots can be represented with sentinel values stored in the value array: 0 marks a never-used slot and 2**32 - 1 a deleted one, which means those two numbers cannot appear as real stored values. The following implementation uses linear probing to resolve collisions and keeps the interface close to a mapping.
from array import array

class Int32OpenAddressMap:
    # Sentinels live in the value array: whether a slot is empty or
    # deleted is decided by its *value*, so 0 and 2**32 - 1 cannot be
    # stored as real values.
    VACANT = 0
    TOMBSTONE = (1 << 32) - 1

    def __init__(self, seed=None, capacity=8, max_load=0.75):
        self._cap = capacity
        self._max_load = max_load
        self._used = 0
        # Contiguous storage with no per-entry Python objects; see the
        # note below on the 'L' vs 'I' type-code size.
        self._kbuf = array('L', [self.VACANT]) * capacity
        self._vbuf = array('L', [self.VACANT]) * capacity
        if seed is not None:
            self.update(seed)

    def _scan(self, k):
        # Linear probing: start at the home slot and walk at most one
        # full cycle of the table.
        idx = hash(k) % self._cap
        for _ in range(self._cap):
            yield idx, self._kbuf[idx], self._vbuf[idx]
            idx = (idx + 1) % self._cap

    def __setitem__(self, k, v):
        while self._used >= self._max_load * self._cap:
            # Rebuild into a twice-as-large table and steal its buffers;
            # rebuilding also discards accumulated tombstones.
            grown = Int32OpenAddressMap(self, self._cap * 2, self._max_load)
            self._cap = grown._cap
            self._kbuf = grown._kbuf
            self._vbuf = grown._vbuf
        free = None  # first tombstone seen; reusable if the key is absent
        for idx, pk, pv in self._scan(k):
            if pv == self.TOMBSTONE:
                if free is None:
                    free = idx
                continue
            if pv == self.VACANT:
                slot = idx if free is None else free
                self._kbuf[slot] = k
                self._vbuf[slot] = v
                self._used += 1
                return
            if pk == k:
                self._vbuf[idx] = v
                return
        if free is not None:
            # The probe cycle saw only live entries and tombstones;
            # reuse the first tombstone instead of dropping the item.
            self._kbuf[free] = k
            self._vbuf[free] = v
            self._used += 1

    def __getitem__(self, k):
        for _, pk, pv in self._scan(k):
            if pv == self.VACANT:
                break  # a never-used slot ends the probe chain
            if pv == self.TOMBSTONE:
                continue  # deleted slot: keep probing
            if pk == k:
                return pv
        raise KeyError(k)

    def __delitem__(self, k):
        for idx, pk, pv in self._scan(k):
            if pv == self.VACANT:
                raise KeyError(k)
            if pv == self.TOMBSTONE:
                continue
            if pk == k:
                self._vbuf[idx] = self.TOMBSTONE
                self._used -= 1
                return
        raise KeyError(k)

    def items(self):
        for kk, vv in zip(self._kbuf, self._vbuf):
            if vv not in (self.VACANT, self.TOMBSTONE):
                yield kk, vv

    def keys(self):
        for kk, _ in self.items():
            yield kk

    def values(self):
        for _, vv in self.items():
            yield vv

    def __iter__(self):
        yield from self.keys()

    def __len__(self):
        return self._used

    def __eq__(self, other):
        return set(self.items()) == set(other.items())

    def __contains__(self, k):
        try:
            self[k]
        except KeyError:
            return False
        return True

    def get(self, k, default=None):
        try:
            return self[k]
        except KeyError:
            return default

    def __repr__(self):
        return repr(dict(self.items()))

    def __str__(self):
        return repr(self)

    def copy(self):
        return Int32OpenAddressMap(self, self._cap, self._max_load)

    def update(self, other):
        for kk, vv in other.items():
            self[kk] = vv
This structure stores both keys and values as 32-bit unsigned integers. It uses sentinels for empty and deleted slots, and it resizes when the configured load factor is reached.
Using it and measuring the footprint
When measured recursively with pympler.asizeof, the savings are drastic because you remove the per-object overhead of Python ints. The following snippet demonstrates the difference on a large mapping of int to int:
from pympler.asizeof import asizeof
baseline = dict(zip(range(1, 1500001), range(1, 1500001)))  # start at 1: value 0 is the VACANT sentinel
compact = Int32OpenAddressMap(baseline)
print(asizeof(baseline)) # 179877936
print(asizeof(compact)) # 16777920
On some platforms, array('L') allocates 8 bytes per element instead of 4, because the underlying C unsigned long is 8 bytes there (typically 64-bit Linux and macOS). Check array('L').itemsize; if it reports 8, use array('I'), whose C unsigned int is 4 bytes on common platforms.
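A quick check of the item sizes on the current platform; the portable selection at the end assumes that one of the two unsigned codes is 4 bytes wide, which holds on mainstream platforms:

```python
from array import array

# Type-code sizes are platform-dependent: 'L' (C unsigned long) is often
# 8 bytes on 64-bit Linux/macOS but 4 on Windows; 'I' (C unsigned int)
# is 4 bytes on common platforms.
print("I:", array('I').itemsize, "L:", array('L').itemsize)

# Pick a 4-byte unsigned type code portably.
code32 = next(c for c in ('I', 'L') if array(c).itemsize == 4)
print(code32)
```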
Why this matters for large-scale ID processing
When the algorithm fundamentally needs all IDs in memory and uses dict-like lookups, reducing the storage type from Python objects to raw 32-bit integers cuts the memory footprint by an order of magnitude. Average-case lookups stay constant-time thanks to open addressing with linear probing, though each probe runs interpreted Python code, so raw speed will trail the built-in dict; the main win is avoiding per-object metadata for every key and value pair.
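The probe mechanics are easy to see in isolation. This is a minimal standalone sketch of linear probing over array.array storage with the same sentinel convention; the put and get helpers here are illustrative, not part of the class above, and put is deliberately simplified (it reuses any sentinel slot immediately):

```python
from array import array

VACANT, TOMBSTONE = 0, (1 << 32) - 1
cap = 8
keys = array('I', [0]) * cap       # raw 32-bit key slots
vals = array('I', [VACANT]) * cap  # value slots double as slot state

def put(k, v):
    idx = k % cap
    for _ in range(cap):  # walk at most one full cycle
        if vals[idx] in (VACANT, TOMBSTONE) or keys[idx] == k:
            keys[idx], vals[idx] = k, v
            return
        idx = (idx + 1) % cap  # collision: probe the next slot

def get(k):
    idx = k % cap
    for _ in range(cap):
        if vals[idx] == VACANT:
            break  # a never-used slot ends the probe chain: key absent
        if vals[idx] != TOMBSTONE and keys[idx] == k:
            return vals[idx]
        idx = (idx + 1) % cap
    raise KeyError(k)

put(3, 30)
put(11, 110)    # 11 % 8 == 3: collides with key 3, lands in the next slot
print(get(11))  # -> 110
```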
Conclusion
If your workload is dominated by millions of opaque IDs, Python’s object model is the source of the memory blow-up, not the numerical value itself. Storing IDs as raw 32-bit integers in an array-backed hash table avoids object overhead and keeps the mapping interface you rely on. Keep sentinels reserved for empty and deleted slots, choose an appropriate load factor, and be aware of the array type code size on your platform. With these adjustments, you retain hash-table semantics while bringing memory use much closer to the true payload size.