2025, Dec 22 09:00
Why equal Python lists can yield different pickle.dumps bytes: serialization vs equality for deduplication
Equal Python lists may compare True yet yield different pickle.dumps bytes. See why serialization preserves types, and how to avoid pitfalls when using pickled bytes for equality checks and deduplication.
When you rely on pickle for persistence or deduplication, a natural question arises: if two Python lists compare equal, will their pickle.dumps outputs also be identical? This matters when you consider storing serialized blobs behind a unique constraint or when using byte-level equality as a proxy for value equality. The short answer is no: even when lists are equal under Python’s ==, their pickled bytes aren’t guaranteed to match.
Problem setup
Consider homogeneous lists where the element type T is one of bool, int, Decimal, time, date, datetime, timedelta, or str. The question is whether two equal list[T] values will always serialize to identical bytes with pickle.dumps. Initial ad-hoc tests might suggest consistency, but there are concrete counterexamples.
Minimal reproduction
The simplest counterexample uses the fact that values of different types can compare equal in Python. For instance, 0 == False is True. That equality does not imply identical pickled bytes, because pickle preserves the original types so it can reconstruct them precisely.
import pickle
seq_int = [0] # T is int
seq_bool = [False] # T is bool
print(seq_int == seq_bool) # True
print(pickle.dumps(seq_int) == pickle.dumps(seq_bool)) # False
Why this happens
Equality in Python can cross type boundaries. Classic examples include 1 == 1.0 and 0 == False being True. Lists that compare equal do so because their element-wise comparisons return True, but pickle’s purpose is not to canonicalize values; it aims to round-trip objects with their original types intact. As a result, two equal lists can serialize to different byte sequences if their element types differ.
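As a small illustration of the round-trip behavior described above, the sketch below pickles three lists that compare equal and checks the element type that comes back. The variable names are made up for this example.

```python
import pickle

# Equality crosses type boundaries, so these three lists compare equal.
as_int = [1]
as_float = [1.0]
as_bool = [True]
assert as_int == as_float == as_bool

# Round-tripping through pickle restores each element with its original type.
restored_types = [
    type(pickle.loads(pickle.dumps(seq))[0])
    for seq in (as_int, as_float, as_bool)
]
print(restored_types)  # [<class 'int'>, <class 'float'>, <class 'bool'>]
```

Because the type must survive the round trip, it has to be recorded in the byte stream, which is why the streams cannot all be the same.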
This isn’t limited to bool vs int. With Decimal, it’s possible to construct lists that are equal yet produce different pickles. For example, Decimal('0'), Decimal('-0'), and Decimal('0.0') can yield equal lists while producing distinct serialized outputs. The mismatch also runs the other way: two separately constructed lists that each contain Decimal('NaN') compare unequal, because NaN never equals NaN, yet pickle produces the same bytes for both.
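The NaN case can be checked with a short sketch. It assumes the two lists are built separately, since comparing a list object with itself short-circuits on identity and returns True.

```python
import pickle
from decimal import Decimal

# Two separately built lists, each holding its own quiet NaN.
nan_a = [Decimal('NaN')]
nan_b = [Decimal('NaN')]

lists_equal = nan_a == nan_b
bytes_equal = pickle.dumps(nan_a) == pickle.dumps(nan_b)

print(lists_equal)  # False
print(bytes_equal)  # True
```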
Another illustrative case with Decimal
Equal lists can still produce different dumps when their elements are Decimal values that compare equal but differ in sign or exponent.
import pickle
from decimal import Decimal
vals_a = [Decimal('0')]
vals_b = [Decimal('-0')]
vals_c = [Decimal('0.0')]
print(vals_a == vals_b == vals_c) # True
print(pickle.dumps(vals_a) == pickle.dumps(vals_b)) # False
print(pickle.dumps(vals_b) == pickle.dumps(vals_c)) # False
What this means for your code and data
The implication is straightforward: byte-for-byte equality of pickle.dumps should not be used as a surrogate for Python-level equality. If your goal is to store serialized data with a uniqueness guarantee that mirrors list equality, pickle’s output is the wrong key. Equal lists can map to different byte strings, and unequal lists can map to identical byte strings. Treat pickle as a serialization mechanism, not a canonicalizer.
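To make the uniqueness problem concrete, here is a minimal sketch in which a plain dict keyed by the pickled bytes stands in for a table with a unique constraint on the blob column. The dict and its name are hypothetical stand-ins, not part of any real storage API.

```python
import pickle

# Hypothetical stand-in for a table with a unique constraint on the blob.
rows_by_blob = {}
for seq in ([0], [False], [0]):
    # setdefault keeps the first row stored under each distinct byte string.
    rows_by_blob.setdefault(pickle.dumps(seq), seq)

stored = list(rows_by_blob.values())
print(stored)                  # [[0], [False]]
print(stored[0] == stored[1])  # True
```

The repeated [0] is rejected as expected, but [False] slips through even though it compares equal to [0].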
Practical guidance
If your system depends on equality semantics, don’t make correctness depend on pickled bytes being identical. Equality in Python is about value comparison; pickle is about reconstructing original objects with the correct types. If you must deduplicate by value, compare values directly or redesign storage so that uniqueness doesn’t depend on serialized byte identity. Redesigning the persistence layer is often the right move when these edge cases matter.
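One way to compare values directly is sketched below, assuming the lists fit in memory and their elements are hashable, which holds for the element types discussed here. The helper name dedupe_by_value is invented for this example. It relies on Python's rule that objects comparing equal also hash equal, which applies across bool, int, and Decimal.

```python
from decimal import Decimal


def dedupe_by_value(lists):
    """Keep the first list from each group of lists that compare equal."""
    seen = set()
    unique = []
    for seq in lists:
        # Tuples are hashable; equal lists give equal tuples with equal hashes.
        key = tuple(seq)
        if key not in seen:
            seen.add(key)
            unique.append(seq)
    return unique


result = dedupe_by_value([[0], [False], [Decimal('0.0')], [1]])
print(result)  # [[0], [1]]
```

This follows == exactly, so lists containing NaN are never treated as duplicates of one another. It is an in-memory illustration only and does not address how to enforce the same rule inside a database.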
Why you should care
This distinction affects database schemas, cache keys, and any pipeline where you might be tempted to use raw pickle bytes as unique identifiers. Seemingly benign cases like mixing bool and int or equal-looking decimals can lead to subtle duplicates or misses. Relying on pickle for canonical identity can thus introduce correctness issues that are hard to diagnose.
Conclusion
Equal Python lists are not guaranteed to produce identical pickle.dumps outputs, even when restricted to types like bool, int, Decimal, time, date, datetime, timedelta, and str. Python’s equality can span types, while pickle preserves types for accurate round-tripping. Keep equality and serialization responsibilities separate: use equality for value comparison and pickle for transport or storage, and don’t rely on pickled bytes for uniqueness or deduplication.