2026, Jan 01 21:00
Memory-efficient Python mapping for millions of int IDs using a 32-bit array-backed open-addressing hash table
Learn how to cut Python dict memory by 10x for millions of int IDs using a 32-bit array.array open-addressing hash table, with code and practical tips.
When you store millions of opaque IDs as Python int in dict-based structures, the time complexity might look fine, but memory use does not. Each int is a full Python object, and each dict entry expects Python objects as keys and values. The result is an order-of-magnitude gap between the payload you intend to store and the memory Python actually allocates.
Minimal example that triggers high memory usage
The following illustrates the pattern: millions of integer IDs mapped to integers via a regular dict. It runs quickly, but the memory footprint is dominated by Python object overhead:
from pympler.asizeof import asizeof
orig_map = dict(zip(range(1500000), range(1500000)))
print(asizeof(orig_map)) # Example footprint measurement
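To put that measurement in perspective: on one 64-bit CPython build the dict above measures 179877936 bytes (the same figure appears in the comparison below; exact numbers vary by build and pympler version). That works out to roughly 120 bytes per key/value pair, against the 8 bytes two 32-bit integers actually need:

```python
# Illustrative arithmetic on an example asizeof reading; the exact total
# depends on the Python build and pympler version.
total_bytes = 179_877_936
entries = 1_500_000
print(total_bytes // entries)  # -> 119 bytes per entry vs 8 bytes of payload
```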
What actually hurts memory
In CPython, dict keys and values are Python objects. That means every int carries object headers plus the dict’s own pointers and bookkeeping. The numeric payload is tiny, but the per-object overhead is not. If your IDs are opaque and fit in 32 bits, there is no intrinsic need to pay for a full Python object per ID; the cost appears because the standard dict works with Python objects by design. To drop the overhead, you must also avoid containers that require Python objects as elements.
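You can see the per-object cost directly with sys.getsizeof; the exact figure depends on the CPython build, but the gap to a 4-byte payload is always large:

```python
import sys

# A raw 32-bit ID needs 4 bytes, but a Python int is a heap object with
# its own header. On a typical 64-bit CPython build even a small int
# reports 28 bytes, roughly seven times the payload.
print(sys.getsizeof(123456789))
```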
A lean approach: an array-backed open-addressing hash table
The way to actually store IDs in about 4 bytes each is to implement a custom hash table that keeps raw 32-bit integers in contiguous arrays. Using array.array for both keys and values removes Python object overhead for each entry. Empty and deleted slots can be represented with sentinel values stored in the value array: 0 marks a never-used slot and 2**32 - 1 a deleted one, which means those two numbers cannot appear as real stored values. The following implementation uses linear probing to resolve collisions and keeps the interface close to a mapping.
from array import array

class Int32OpenAddressMap:
    # Sentinels live in the value array: whether a slot is empty or
    # deleted is decided by its *value*, so 0 and 2**32 - 1 cannot be
    # stored as real values.
    VACANT = 0
    TOMBSTONE = (1 << 32) - 1

    def __init__(self, seed=None, capacity=8, max_load=0.75):
        self._cap = capacity
        self._max_load = max_load
        self._used = 0
        # Contiguous storage with no per-entry Python objects; see the
        # note below on the 'L' vs 'I' type-code size.
        self._kbuf = array('L', [self.VACANT]) * capacity
        self._vbuf = array('L', [self.VACANT]) * capacity
        if seed is not None:
            self.update(seed)

    def _scan(self, k):
        # Linear probing: start at the home slot and walk at most one
        # full cycle of the table.
        idx = hash(k) % self._cap
        for _ in range(self._cap):
            yield idx, self._kbuf[idx], self._vbuf[idx]
            idx = (idx + 1) % self._cap

    def __setitem__(self, k, v):
        while self._used >= self._max_load * self._cap:
            # Rebuild into a twice-as-large table and steal its buffers;
            # rebuilding also discards accumulated tombstones.
            grown = Int32OpenAddressMap(self, self._cap * 2, self._max_load)
            self._cap = grown._cap
            self._kbuf = grown._kbuf
            self._vbuf = grown._vbuf
        free = None  # first tombstone seen; reusable if the key is absent
        for idx, pk, pv in self._scan(k):
            if pv == self.TOMBSTONE:
                if free is None:
                    free = idx
                continue
            if pv == self.VACANT:
                slot = idx if free is None else free
                self._kbuf[slot] = k
                self._vbuf[slot] = v
                self._used += 1
                return
            if pk == k:
                self._vbuf[idx] = v
                return
        if free is not None:
            # The probe cycle saw only live entries and tombstones;
            # reuse the first tombstone instead of dropping the item.
            self._kbuf[free] = k
            self._vbuf[free] = v
            self._used += 1

    def __getitem__(self, k):
        for _, pk, pv in self._scan(k):
            if pv == self.VACANT:
                break  # a never-used slot ends the probe chain
            if pv == self.TOMBSTONE:
                continue  # deleted slot: keep probing
            if pk == k:
                return pv
        raise KeyError(k)

    def __delitem__(self, k):
        for idx, pk, pv in self._scan(k):
            if pv == self.VACANT:
                raise KeyError(k)
            if pv == self.TOMBSTONE:
                continue
            if pk == k:
                self._vbuf[idx] = self.TOMBSTONE
                self._used -= 1
                return
        raise KeyError(k)

    def items(self):
        for kk, vv in zip(self._kbuf, self._vbuf):
            if vv not in (self.VACANT, self.TOMBSTONE):
                yield kk, vv

    def keys(self):
        for kk, _ in self.items():
            yield kk

    def values(self):
        for _, vv in self.items():
            yield vv

    def __iter__(self):
        yield from self.keys()

    def __len__(self):
        return self._used

    def __eq__(self, other):
        return set(self.items()) == set(other.items())

    def __contains__(self, k):
        try:
            self[k]
        except KeyError:
            return False
        return True

    def get(self, k, default=None):
        try:
            return self[k]
        except KeyError:
            return default

    def __repr__(self):
        return repr(dict(self.items()))

    def __str__(self):
        return repr(self)

    def copy(self):
        return Int32OpenAddressMap(self, self._cap, self._max_load)

    def update(self, other):
        for kk, vv in other.items():
            self[kk] = vv
This structure stores both keys and values as 32-bit unsigned integers. It uses sentinels for empty and deleted slots, and it resizes when the configured load factor is reached.
Using it and measuring the footprint
When measured recursively with pympler.asizeof, the savings are drastic because you remove the per-object overhead of Python ints. The following snippet demonstrates the difference on a large mapping of int to int:
from pympler.asizeof import asizeof
baseline = dict(zip(range(1, 1500001), range(1, 1500001)))  # start at 1: value 0 is the VACANT sentinel
compact = Int32OpenAddressMap(baseline)
print(asizeof(baseline)) # 179877936
print(asizeof(compact)) # 16777920
On some platforms, array('L') allocates 8 bytes per element instead of 4, because the underlying C unsigned long is 8 bytes there (typically 64-bit Linux and macOS). Check array('L').itemsize; if it reports 8, use array('I'), whose C unsigned int is 4 bytes on common platforms.
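A quick check of the item sizes on the current platform; the portable selection at the end assumes that one of the two unsigned codes is 4 bytes wide, which holds on mainstream platforms:

```python
from array import array

# Type-code sizes are platform-dependent: 'L' (C unsigned long) is often
# 8 bytes on 64-bit Linux/macOS but 4 on Windows; 'I' (C unsigned int)
# is 4 bytes on common platforms.
print("I:", array('I').itemsize, "L:", array('L').itemsize)

# Pick a 4-byte unsigned type code portably.
code32 = next(c for c in ('I', 'L') if array(c).itemsize == 4)
print(code32)
```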
Why this matters for large-scale ID processing
When the algorithm fundamentally needs all IDs in memory and uses dict-like lookups, reducing the storage type from Python objects to raw 32-bit integers cuts the memory footprint by an order of magnitude. Average-case lookups stay constant-time thanks to open addressing with linear probing, though each probe runs interpreted Python code, so raw speed will trail the built-in dict; the main win is avoiding per-object metadata for every key and value pair.
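The probe mechanics are easy to see in isolation. This is a minimal standalone sketch of linear probing over array.array storage with the same sentinel convention; the put and get helpers here are illustrative, not part of the class above, and put is deliberately simplified (it reuses any sentinel slot immediately):

```python
from array import array

VACANT, TOMBSTONE = 0, (1 << 32) - 1
cap = 8
keys = array('I', [0]) * cap       # raw 32-bit key slots
vals = array('I', [VACANT]) * cap  # value slots double as slot state

def put(k, v):
    idx = k % cap
    for _ in range(cap):  # walk at most one full cycle
        if vals[idx] in (VACANT, TOMBSTONE) or keys[idx] == k:
            keys[idx], vals[idx] = k, v
            return
        idx = (idx + 1) % cap  # collision: probe the next slot

def get(k):
    idx = k % cap
    for _ in range(cap):
        if vals[idx] == VACANT:
            break  # a never-used slot ends the probe chain: key absent
        if vals[idx] != TOMBSTONE and keys[idx] == k:
            return vals[idx]
        idx = (idx + 1) % cap
    raise KeyError(k)

put(3, 30)
put(11, 110)    # 11 % 8 == 3: collides with key 3, lands in the next slot
print(get(11))  # -> 110
```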
Conclusion
If your workload is dominated by millions of opaque IDs, Python’s object model is the source of the memory blow-up, not the numerical value itself. Storing IDs as raw 32-bit integers in an array-backed hash table avoids object overhead and keeps the mapping interface you rely on. Keep sentinels reserved for empty and deleted slots, choose an appropriate load factor, and be aware of the array type code size on your platform. With these adjustments, you retain hash-table semantics while bringing memory use much closer to the true payload size.