2025, Dec 22 07:00
Replace overlapping tokens in Python strings: fix standalone 0 to 00 using regex word boundaries and lookarounds
Learn why Python str.replace skips overlapping matches and how to turn hex token 0 into 00 using regex, word boundaries, and lookaround for precise fixes.
Replacing overlapping patterns in strings can be counterintuitive if you expect regex-like behavior from plain string methods. A common example is turning a sequence of space-separated hex tokens where single 0 must become 00, without disturbing the spacing. Using Python’s str.replace looks obvious, but it doesn’t yield the expected result when matches touch each other.
Problem
The goal is to turn the text
28 5A 31 34 0 0 0 F0into
28 5A 31 34 00 00 00 F0Using str.replace on " 0 " feels natural, yet it misses one occurrence in the middle when patterns are consecutive.
sample = '28 5A 31 34 0 0 0 F0'
first_pass = sample.replace(' 0 ', ' 00 ')
second_pass = first_pass.replace(' 0 ', ' 00 ')
print(first_pass) # 28 5A 31 34 00 0 00 F0
print(second_pass) # 28 5A 31 34 00 00 00 F0Why it happens
The behavior follows from how replacement scans the string. Once a match is found, the engine replaces it, then continues searching immediately after the end of that match. When matches can overlap, the first replacement “consumes” the characters that would otherwise start the next match. In this case, each match is the three-character substring " 0 ". After one such segment is replaced, the next search resumes after the trailing space of that consumed segment, skipping the middle zero in a run like " 0 0 0 ".
A concrete analogy helps. Consider mapping " 0 " to ABA and " 00 " to X__X. Then replacing ABA in ABABABA produces this:
'ABABABA'.replace('ABA', 'X__X')
# yields 'X__XBX__X'If overlapping matches were also replaced independently, you would end up with extra characters from repeatedly inserting the replacement for overlapping segments, conceptually like X__XX__XX__X. That outcome would be undesirable here because it would introduce extra spaces in the original problem.
There’s another implication. If replacement didn’t advance after a consumed match, trivial code like 'A'.replace('A', 'AA') would never terminate, because each replacement would create another valid match immediately.
Solution
A robust approach is to use regex substitution with precise matching semantics. Matching a standalone 0 using word boundaries \b will hit zeros at the start, middle, or end of the string as long as they behave like a token.
import re
text_line = '28 5A 31 34 0 0 0 F0'
fixed = re.sub(r'\b0\b', '00', text_line)
print(fixed) # 28 5A 31 34 00 00 00 F0This also handles a 0 at the beginning or end of the string without needing surrounding spaces.
If you specifically want to match only zeros surrounded by spaces and keep the spaces untouched by the match, use lookaround so the space characters are required but not consumed:
import re
payload = '28 5A 31 34 0 0 0 F0'
patched = re.sub(r'(?<= )0(?= )', '00', payload)
print(patched) # 28 5A 31 34 00 00 00 F0What the documentation means
The documentation for str.replace states that all occurrences of the old substring are replaced. The key is that matching proceeds left-to-right, and each successful match consumes the matched span before continuing. Under that rule, all occurrences are indeed replaced, but only the non-overlapping ones that remain visible as the scan progresses. This is consistent with how both str.replace and re.sub operate unless lookarounds are used to constrain consumption.
Why this matters
String preprocessing pipelines are full of token-like edits where naive substring replacement looks sufficient. When patterns can touch, replacement semantics determine whether you miss cases, accidentally duplicate spacing, or worse, drift into unintended transformations. Knowing that simple replacement operates on non-overlapping matches—and that regex lookaround lets you control which characters are consumed—prevents subtle bugs.
Practical takeaways
If you must expand a standalone token across a text stream, prefer regex with \b boundaries. If you need to require, yet preserve, context such as spaces, use positive lookbehind and lookahead. Chaining replace twice can work for short inputs, but it is brittle and obscures intent. The regex approach makes the rule explicit and handles edge positions cleanly.