2025, Oct 19 08:00

Extract Roman Numerals with a Single Regex: No Empty Matches in Python and PCRE2

A precise regex for extracting Roman numerals from prose without zero-length matches, for Python and PCRE2 in case-insensitive mode.

Extracting Roman numerals from running text looks straightforward until a regular expression quietly produces zero-length matches. Those empty matches pollute results and are awkward to handle downstream, especially if you must solve the problem with a single regex in case-insensitive mode for Python or PCRE2. Here is a precise fix that eliminates empty matches while keeping the intended behavior.

Problem setup

Consider this sample text used to test Roman numeral matching inside normal sentences and standalone lines:

Charles I was a bad king, I was not.
Charles X was a good one.
Who was Louis XVI?
The year is MCMXCIX, the month is June.
Do you need an X-ray, do you think?
My friends Cil and Cleo met me for coffee.
MCMLXIX

One of the stronger patterns for canonical Roman numerals is:

(?=\b[MCDXLVI]+\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})(?!-)\b

Used case-insensitively, it finds the proper Roman numerals here, but it also yields two empty matches: right before the X in X-ray, and right before Cil. The lookahead accepts those positions because the word starting there consists only of Roman letters (the X before the hyphen, and Cil once case is ignored), yet the core numeral subpattern can legally consume nothing, so the engine reports a zero-length match at those positions.
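
To reproduce this, here is a minimal Python sketch using the standard re module with re.IGNORECASE. The names SAMPLE and NAIVE are illustrative only and do not come from the original question.

```python
import re

# Sample text from the problem setup.
SAMPLE = """\
Charles I was a bad king, I was not.
Charles X was a good one.
Who was Louis XVI?
The year is MCMXCIX, the month is June.
Do you need an X-ray, do you think?
My friends Cil and Cleo met me for coffee.
MCMLXIX
"""

# The original pattern; its numeral body is allowed to match nothing.
NAIVE = re.compile(
    r"(?=\b[MCDXLVI]+\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})"
    r"(?:IX|IV|V?I{0,3})(?!-)\b",
    re.IGNORECASE,
)

# No capturing groups, so findall returns the full text of each match.
matches = NAIVE.findall(SAMPLE)
print(matches)
# ['I', 'I', 'X', 'XVI', 'MCMXCIX', '', '', 'MCMLXIX']
```

The two empty strings in the list are the zero-length matches before X-ray and before Cil.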

Why the empty matches happen

The leading assertion checks only that Roman letters are coming up, not that the pattern consumes them. The numeral body is composed of optional pieces like M{0,4}, D?C{0,3}, and so on, each of which can match an empty string, so the whole body can match nothing at all. The engine does try the non-empty options first: in X-ray, consuming X leaves a hyphen next and (?!-) rejects it; in Cil, consuming C or Ci stops in the middle of the word and the trailing \b rejects it. It then backtracks to the all-empty choice, where (?!-) and \b are tested at the starting position itself. That position is the start of a word, so both pass: the lookahead “sees” valid letters, the body chooses all empty branches, and the pattern succeeds with a zero-length match.
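
The mechanism can be checked in isolation. The sketch below assumes Python's re module; BODY, NAIVE, and zero_length are hypothetical names chosen for illustration. It first confirms that the numeral body alone accepts the empty string, then records where the original pattern matches without consuming anything.

```python
import re

# The numeral body on its own, with no assertions around it.
BODY = r"M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"

# Every piece is optional, so the body accepts the empty string.
body_accepts_empty = re.fullmatch(BODY, "", re.IGNORECASE) is not None
print(body_accepts_empty)  # True

NAIVE = re.compile(r"(?=\b[MCDXLVI]+\b)" + BODY + r"(?!-)\b", re.IGNORECASE)

zero_length = []
for text in ("an X-ray", "friends Cil and Cleo"):
    for m in NAIVE.finditer(text):
        # A zero-length match has identical start and end offsets.
        if m.start() == m.end():
            zero_length.append((text, m.span()))

print(zero_length)
# [('an X-ray', (3, 3)), ('friends Cil and Cleo', (8, 8))]
```

Offsets 3 and 8 are the positions just before X and just before C, which is where the lookahead succeeds.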

The fix

The key is to make the end-of-match constraint prove that the Roman letters seen up front were actually consumed. Replace the boundary-and-dash check with a negative lookahead that forbids a word character or a hyphen immediately after the match. A match can then end only where the word itself ends, so it can no longer stop in front of the letters the leading lookahead recognized.

Change the tail from (?!-)\b to (?![\w-]). With that in place, the initial lookahead can be simplified, because we only need to assert we are at a word boundary before a word character.

Revised pattern:

\b(?=\w)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})(?![\w-])

This prevents the engine from reporting a match unless the Roman numeral components consumed the word characters that were observed ahead, eliminating those zero-length results.
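
As a quick check, the revised pattern can be run against the same sample text. This is a sketch assuming Python's re module; FIXED and SAMPLE are illustrative names.

```python
import re

# Revised pattern: starts at a word start, and must end where the word ends.
FIXED = re.compile(
    r"\b(?=\w)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})"
    r"(?:IX|IV|V?I{0,3})(?![\w-])",
    re.IGNORECASE,
)

SAMPLE = "\n".join([
    "Charles I was a bad king, I was not.",
    "Charles X was a good one.",
    "Who was Louis XVI?",
    "The year is MCMXCIX, the month is June.",
    "Do you need an X-ray, do you think?",
    "My friends Cil and Cleo met me for coffee.",
    "MCMLXIX",
])

fixed_matches = FIXED.findall(SAMPLE)
print(fixed_matches)
# ['I', 'I', 'X', 'XVI', 'MCMXCIX', 'MCMLXIX']
```

The X in X-ray and the word Cil no longer contribute anything, and no empty strings appear in the result.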

Optional tidy-up

The alternations for subtractive pairs can be written more compactly without changing behavior. You can shorten CM|CD to C[MD], XC|XL to X[CL], and IX|IV to I[XV]. If you prefer that style, a compact equivalent is:

\b(?=\w)M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})(?![\w-])
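
One way to gain confidence that the compact form behaves like the longer one is to compare them directly. The sketch below assumes Python's re module; to_roman is a hypothetical helper written only for this check, and the probe strings are arbitrary examples.

```python
import re

LONG = re.compile(
    r"\b(?=\w)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})"
    r"(?:IX|IV|V?I{0,3})(?![\w-])",
    re.IGNORECASE,
)
COMPACT = re.compile(
    r"\b(?=\w)M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})"
    r"(?:I[XV]|V?I{0,3})(?![\w-])",
    re.IGNORECASE,
)


def to_roman(n: int) -> str:
    """Hypothetical helper: canonical numeral for 1 <= n <= 3999."""
    pairs = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in pairs:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


# Both patterns should accept every canonical numeral in full.
numerals = [to_roman(n) for n in range(1, 4000)]
both_accept = all(LONG.fullmatch(r) and COMPACT.fullmatch(r) for r in numerals)

# Both patterns should return identical matches on arbitrary probe strings.
probes = ["an X-ray", "Cil and Cleo", "Louis XVI?", "IIII", "VX", "mcmxcix"]
agree = all(LONG.findall(p) == COMPACT.findall(p) for p in probes)

print(both_accept, agree)  # True True
```

This is a spot check rather than a proof; the equivalence itself follows from CM|CD and C[MD] describing the same two strings, and likewise for the other two pairs.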

Why this matters

Zero-length matches are subtle. They inflate match counts, complicate downstream processing, and can trigger unexpected behavior in scanning loops. When you rely on a single regular expression to extract Roman numerals from prose, ensuring that the pattern cannot succeed without consuming text saves you from hard-to-spot bugs.

Takeaways

If a regex starts with a permissive lookahead and its main body is built entirely from optional pieces, it can succeed without consuming input. Pair the start and end assertions so the end explicitly rules out a following word character (and here, a hyphen), forcing the body to consume what the start saw. For this Roman numeral case in Python or PCRE2 with case-insensitive mode, using \b(?=\w) at the front and (?![\w-]) at the end cleanly removes empty matches while keeping intended hits inside normal text.

The article is based on a question from StackOverflow by vr8ce and an answer by trincot.