2025, Oct 15 13:44

Understanding Python int() with Unicode Digits: Parsing Behavior, CPython Normalization, and ASCII‑Only Validation

Learn why Python int() accepts Unicode decimal digits—so int('᪐᭒') == 2—how CPython normalizes input, and how to enforce ASCII‑only validation when needed.

When parsing strings to integers in Python, most of us assume only ASCII digits are involved. That assumption breaks the moment you bump into Unicode digits from other scripts. A small, surprising example: converting a string like ᪐᭒ with int() yields 2. To an eye trained on ASCII, the input appears to contain no digits at all, yet Python happily returns a valid integer.

Reproducing the surprise

The sequence below captures the behavior end to end: the raw bytes, the code points, and the int() result.

sample = '᪐᭒'
raw_bytes = bytes(sample, 'utf-8')                   # UTF-8 encoding of the string
hex_points = [f'U+{ord(ch):04X}' for ch in sample]   # code point of each character
value = int(sample)                                  # the surprising conversion
print(raw_bytes)
print(hex_points)
print(value)

This will print the UTF-8 bytes, the code points U+1A90 and U+1B52, and finally the number 2.
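
For reference, running the snippet should print something like:

b'\xe1\xaa\x90\xe1\xad\x92'
['U+1A90', 'U+1B52']
2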

What is going on

The string is not random text. Both characters are Unicode decimal digits. The first, ᪐ (U+1A90), is Tai Tham Tham Digit Zero. The second, ᭒ (U+1B52), is Balinese Digit Two. Per Python’s documentation for int(), “The values 0–9 can be represented by any Unicode decimal digit.” In other words, Python recognizes digits from the Unicode decimal-number category (Nd), not just ASCII 0–9.

That is why int('᪐᭒') is effectively the same as int('02'), which evaluates to 2.
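
A quick check with the unicodedata module confirms the classification; both characters fall in the Nd (decimal digit) category and carry a decimal value:

import unicodedata as ud

for ch in '᪐᭒':
    # Name, general category, and decimal value from the Unicode database
    print(f'U+{ord(ch):04X}', ud.name(ch), ud.category(ch), ud.decimal(ch))

This prints TAI THAM THAM DIGIT ZERO and BALINESE DIGIT TWO, both with category Nd and decimal values 0 and 2.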

How CPython handles it internally

Under the hood, CPython routes string-to-integer conversion through a routine that normalizes the input before parsing. The normalization maps any code point with the decimal-digit property to the corresponding ASCII digit, maps Unicode whitespace to the ASCII space, and replaces non-ASCII characters that are neither digits nor spaces (starting from the first such character) with ?. Only after that normalization does the actual parsing happen in the chosen base.

You can approximate the digit mapping step with the Unicode database in Python:

import unicodedata as ud
example = '᪐᭒'
print([ud.digit(ch, "?") for ch in example])  # [0, 2]

This yields 0 and 2, illustrating how those characters are interpreted as decimal digits.
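
Going a step further, a rough pure-Python sketch of that pre-parse normalization might look like the following. This is an illustration of the behavior described above, not the actual C routine, and the helper name is made up:

import unicodedata as ud

def normalize_like_int(s):
    # Illustrative approximation: keep ASCII as-is, replace any Unicode
    # decimal digit with its ASCII counterpart, map other whitespace to a
    # plain space, and turn everything else into '?' so the later parse
    # fails. (The real routine stops at the first such character.)
    out = []
    for ch in s:
        if ord(ch) < 128:
            out.append(ch)
        elif ud.digit(ch, None) is not None:
            out.append(str(ud.digit(ch)))
        elif ch.isspace():
            out.append(' ')
        else:
            out.append('?')
    return ''.join(out)

print(normalize_like_int('᪐᭒'))       # 02
print(int(normalize_like_int('᪐᭒')))  # 2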

Does int() scan all languages?

It does not brute-force a search across every numbering system. Instead, each character is looked up against its Unicode properties. If a character is classified as a decimal digit, its decimal value is known and mapped accordingly; if it is not, there is nothing to map and the conversion fails. This also means int() works for decimal digits across scripts, but it will reject characters from non-decimal number systems in Unicode.
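
For instance, the Roman numeral Ⅻ (U+216B) and the vulgar fraction ½ (U+00BD) both carry numeric values in the Unicode database, but neither is a decimal digit, so int() refuses them:

import unicodedata as ud

for ch in 'Ⅻ½':
    # A numeric value exists, but there is no decimal digit value
    print(ch, ud.numeric(ch), ud.digit(ch, None))

try:
    int('Ⅻ')
except ValueError as exc:
    print(exc)  # int() only accepts decimal digits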

Solution: adjust expectations, not the parser

If you see int('᪐᭒') returning 2 and expect a failure, the fix is to align expectations with Python’s documented behavior. The language accepts any Unicode decimal digit for 0–9. Conceptually, Python first translates those digits to their ASCII counterparts and then parses the result. If you want to observe the same mapping explicitly, the quick inspection with unicodedata shown above will make the transformation visible.

For clarity, here is a compact demonstration from input to value:

text = '᪐᭒'
print([f'U+{ord(c):04X}' for c in text])  # ['U+1A90', 'U+1B52']
print(int(text))                           # 2

Why this matters for engineers

First, correctness: code that assumes ASCII-only digits will behave differently in the presence of Unicode input. Python’s behavior is deliberate and standards-driven, which is especially relevant when processing international text.

Second, there is a performance cost. Testing for digits across Unicode involves consulting Unicode metadata and is naturally slower than a tight ASCII check such as 48 <= ord(c) < 58. This is one of the ways in which Python 3’s full-Unicode strings can be slower than ASCII-centric logic. You get correctness for global text handling, at a cost that is typically well optimized under the hood but cannot match a pure ASCII fast path.
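
If the cost matters for your workload, it is easy to measure on your own interpreter. The sketch below compares parsing an ASCII digit string with parsing the equivalent Balinese digit string; absolute numbers will vary by build and platform, so treat it as a measurement template rather than a benchmark:

from timeit import timeit

ascii_digits = '1234567890' * 5
# Map each ASCII digit to the corresponding Balinese digit (U+1B50-U+1B59)
balinese_digits = ''.join(chr(0x1B50 + int(d)) for d in ascii_digits)

assert int(ascii_digits) == int(balinese_digits)

print(timeit('int(s)', globals={'s': ascii_digits}, number=100_000))
print(timeit('int(s)', globals={'s': balinese_digits}, number=100_000))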

Takeaways

Python’s int() accepts any Unicode decimal digit, not just ASCII. Characters like ᪐ and ᭒ are valid digits in their respective scripts and are mapped to 0 and 2 during parsing, so int('᪐᭒') behaves like int('02'). The interpreter does not search all languages; it relies on Unicode properties to determine digit values. For parsing logic that must be script-agnostic and correct on international data, this is a feature. For code that must reject anything but ASCII digits, be explicit about input constraints and validation. Awareness of this behavior helps prevent surprises in both functionality and performance.
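
When that stricter contract is required, a small pre-check before calling int() is enough. The helper below is one possible sketch (the name parse_ascii_int is made up here), combining str.isascii() and str.isdigit() so that anything beyond an optional sign plus ASCII 0–9 is rejected:

def parse_ascii_int(text):
    # Strict parser: optional surrounding whitespace, an optional sign,
    # then ASCII digits 0-9 only.
    stripped = text.strip()
    unsigned = stripped[1:] if stripped[:1] in '+-' else stripped
    if not (unsigned and unsigned.isascii() and unsigned.isdigit()):
        raise ValueError(f'expected ASCII digits, got {text!r}')
    return int(stripped)

print(parse_ascii_int(' 42 '))   # 42
print(parse_ascii_int('-7'))     # -7

try:
    parse_ascii_int('᪐᭒')
except ValueError as exc:
    print(exc)                   # expected ASCII digits, got '᪐᭒'

Relying on the built-in string predicates keeps the check short and avoids touching unicodedata at all; adapt the accepted shape (signs, underscores, whitespace) to whatever your input contract actually allows.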

The article is based on a question from StackOverflow by Wör Du Schnaffzig and an answer by jonrsharpe.