2025, Dec 02 21:00

Tokenize math expressions by top-level +, -, *, / and keep all balanced parentheses intact

Learn how to write a simple stateful tokenizer that splits math expressions by top-level operators without breaking balanced parentheses. Avoid fragile regex splits. Fast.

Split a math expression by top-level operators without breaking parentheses

Tokenizing a mathematical string by operators sounds easy until you need to keep everything inside matching parentheses intact. The goal is to break on +, -, *, / only when they appear at the top level, while leaving nested subexpressions and names like math.sqrt or symbols like π untouched inside their parentheses.

Problem demonstration

Consider these expressions and the desired segmentation. Operators outside parentheses should act as separators; everything inside balanced parentheses must stay as a single piece.

import math
s1 = "5+5*10"
expected1 = ["5", "+", "5", "*", "10"]
s2 = "(2*2)-5*(math.sqrt(9)+2)"
expected2 = ["(2*2)", "-", "5", "*", "(math.sqrt(9)+2)"]
s3 = "(((5-3)/2)*0.5)+((2*2))*(((math.log(5)+2)-2))"
expected3 = ["(((5-3)/2)*0.5)", "+", "((2*2))", "*", "(((math.log(5)+2)-2))"]

A straight regex split is tempting, for example:

import re
expr = "(((5-3)/2)*0.5)+((2*2))*(((math.log(5)+2)-2))"
parts = re.split(r"([\+|\-|\*|\/]|\(.*\))", expr)

But this does not respect balanced parentheses and fails with nested structures.
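
A quick check makes the failure concrete: the greedy \(.*\) branch swallows everything from the first opening parenthesis to the last closing one, so the whole expression comes back as a single captured chunk (as a side note, the | characters inside the character class are literal there, not alternation):

print(parts)
# ['', '(((5-3)/2)*0.5)+((2*2))*(((math.log(5)+2)-2))', '']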

Why the naive regex breaks

Matching on operators is fine, but keeping subexpressions with arbitrarily nested parentheses intact requires tracking how deep you are inside parentheses. Simple regex patterns like (.*) are greedy and do not understand balance, and plain split logic cannot see nesting. The result is either over-splitting inside parentheses or capturing too much.
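
The over-splitting case is just as easy to reproduce: a pattern that matches only the operators, for example [+\-*/] (not part of the original attempt), cuts straight through the parenthesized subexpressions:

print(re.split(r"([+\-*/])", "(2*2)-5*(math.sqrt(9)+2)"))
# ['(2', '*', '2)', '-', '5', '*', '(math.sqrt(9)', '+', '2)']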

A simple stateful tokenizer

A small custom tokenizer solves this cleanly. The idea is to stream through the string, maintain a counter of current parentheses depth, and only split on operators when the depth is zero. As long as you assume syntactically correct input, this is straightforward.

OPS_SET = set("+-*/")
PAREN_SHIFT = {"(": 1, ")": -1}

def split_top_level(expr: str) -> list[str]:
    """Split expr on +, -, *, / that appear outside any parentheses."""
    tokens = [""]
    depth = 0
    for ch in expr:
        # Track how deep we currently are inside parentheses.
        depth += PAREN_SHIFT.get(ch, 0)
        if ch in OPS_SET and depth == 0:
            # Top-level operator: emit it and start a new token.
            tokens.extend([ch, ""])
        else:
            # Everything else, including operators inside parentheses,
            # belongs to the current token.
            tokens[-1] += ch
    return tokens

samples = [
    "5+5*10",
    "(2*2)-5*(math.sqrt(9)+2)",
    "(((5-3)/2)*0.5)+((2*2))*(((math.log(5)+2)-2))",
]
for item in samples:
    print(split_top_level(item))

This produces the intended tokens, preserving inner structure and names within parentheses.

['5', '+', '5', '*', '10']
['(2*2)', '-', '5', '*', '(math.sqrt(9)+2)']
['(((5-3)/2)*0.5)', '+', '((2*2))', '*', '(((math.log(5)+2)-2))']
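
As a quick sanity check, the results can also be compared against the expected lists from the problem demonstration:

assert split_top_level(s1) == expected1
assert split_top_level(s2) == expected2
assert split_top_level(s3) == expected3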

Why this matters

Once you can reliably split by top-level operators, you can build higher-level parsing or interpretation on top, and you avoid the pitfalls of regex-based approaches that don’t understand nesting. If the end goal is to evaluate expressions, it is worth looking at the shunting yard algorithm. Also, this stateful approach makes a single linear pass over the input, which keeps it fast for this use case compared to a regex-driven split.
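
As one small step toward that kind of higher-level processing, the flat token list can be expanded into a nested structure by recursing into parenthesized tokens. The sketch below reuses split_top_level and PAREN_SHIFT from above; the helper names to_tree and fully_wrapped are hypothetical, and it still assumes syntactically correct input.

def fully_wrapped(token: str) -> bool:
    # Hypothetical helper: True only if a single outermost pair of parentheses
    # spans the whole token, i.e. the depth never returns to zero before the
    # final character.
    depth = 0
    for i, ch in enumerate(token):
        depth += PAREN_SHIFT.get(ch, 0)
        if depth == 0 and i < len(token) - 1:
            return False
    return token.startswith("(") and token.endswith(")")

def to_tree(expr: str) -> list:
    # Split at the top level, then recurse into tokens that are one
    # parenthesized group, dropping their outer parentheses.
    tree = []
    for part in split_top_level(expr):
        if fully_wrapped(part):
            tree.append(to_tree(part[1:-1]))
        else:
            tree.append(part)
    return tree

print(to_tree("(2*2)-5*(math.sqrt(9)+2)"))
# [['2', '*', '2'], '-', '5', '*', ['math.sqrt(9)', '+', '2']]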

Conclusion

When you need to segment mathematical strings while keeping fragments with balanced parentheses intact, avoid regex splitting and track parenthesis depth explicitly. Stream the input once, split only when the current depth is zero, and keep everything inside matching parentheses together, including alphabetic tokens like math.sqrt or π. This yields predictable tokens and a solid foundation for further processing or evaluation.