2025, Dec 07 19:00
Extract Name-Value Pairs from Semi-Structured Text with a Robust Python Regex
Learn how to parse semi-structured text in Python with a regex that reliably captures name-value pairs over 3+ dot leaders, ignores trailing noise, and outputs tidy data.
Parsing semi-structured text is fun until the delimiters stop behaving. A common case: names and values are paired, separated by a variable number of periods, with multiple pairs on a line and arbitrary junk tacked on at the end. The goal is to consistently extract clean (name, value) pairs without overfitting to a single line layout.
Problem setup
Consider the input where each name is followed by 3 or more dots, then a numeric value. Lines can contain multiple pairs, some lines have no useful data at all, and any unwanted text appears only after the last value on a line.
raw_blob = 'apples, red .... 0.15 apples, green ... 0.99\nbananas (bunch).......... 0.111\nfruit salad, small........1.35 [unwanted stuff #1.11 here]\nunwanted line here\nfruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here'
Splitting on newlines and on runs of 3+ periods looks tempting, but it breaks the association between a name and its value.
import re
raw_blob = 'apples, red .... 0.15 apples, green ... 0.99\nbananas (bunch).......... 0.111\nfruit salad, small........1.35 [unwanted stuff #1.11 here]\nunwanted line here\nfruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here'
pieces = re.split(r"\.{3,}|\n", raw_blob)
print(pieces)
['apples, red ', ' 0.15 apples, green ', ' 0.99', 'bananas (bunch)', ' 0.111', 'fruit salad, small', '1.35 [unwanted stuff #1.11 here]', 'unwanted line here', 'fruit salad, large ', ' 1.77 strawberry ', ' 0.66 unwanted 00-11info here']
The result is close, but not usable as-is: the split happens between the name and its value, and leftover garbage remains after some values.
What actually defines the structure
The useful pattern is stable even if the surrounding text isn’t. Each pair looks like a name that does not contain digits, dots, or newlines, followed by optional spaces, then 3 or more dots, more optional spaces, and then a number. Any trailing content after that number is irrelevant and should be ignored.
A regex that captures only what we need
Instead of splitting, match the pairs directly and capture both parts:
([^\d.\n]+)[^\S\n]*\.{3,}[^\S\n]*(\d+\.\d+)
The first group matches the name by excluding digits, dots, and newlines and consuming everything else until the delimiter zone. Because spaces are allowed in a name, this group can pick up leading or trailing spaces, which are stripped afterwards. The middle section tolerates any amount of horizontal whitespace on either side of the dot run and requires at least three dots. The second group captures a decimal number that appears after the dots; the decimal point is escaped as \. so that it matches only a literal dot rather than any character. Since unwanted text always follows the value and sits at the end of the line, it never gets captured.
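The same structure can be spelled out piece by piece with re.VERBOSE, which lets each part of the pattern carry its own comment. This is only a sketch: the name PAIR_RX and the one-line sample are illustrative, and it assumes the dot inside the number is meant as a literal decimal point.

```python
import re

# Commented form of the name / dot leader / number pattern.
# PAIR_RX is an illustrative name, not part of the original code.
PAIR_RX = re.compile(
    r"""
    ([^\d.\n]+)     # group 1: name (no digits, dots, or newlines)
    [^\S\n]*        # optional horizontal whitespace
    \.{3,}          # dot leader: three or more literal dots
    [^\S\n]*        # optional horizontal whitespace
    (\d+\.\d+)      # group 2: decimal number such as 0.15
    """,
    re.VERBOSE,
)

sample = "apples, red .... 0.15 apples, green ... 0.99"
matches = PAIR_RX.findall(sample)
print(matches)
# [('apples, red ', '0.15'), (' apples, green ', '0.99')]
```

Note the spaces left around each name: the name group accepts spaces, so it absorbs them before the whitespace and dot sections get a chance to. Stripping each captured name afterwards removes them.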
Solution in code
import re
raw_blob = 'apples, red .... 0.15 apples, green ... 0.99\nbananas (bunch).......... 0.111\nfruit salad, small........1.35 [unwanted stuff #1.11 here]\nunwanted line here\nfruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here'
rx = r"([^\d.\n]+)[^\S\n]*\.{3,}[^\S\n]*(\d+\.\d+)"
pairs = re.findall(rx, raw_blob)
print("\n".join(" | ".join(z.strip() for z in grp) for grp in pairs))
The output, formatted for downstream tools such as R or Excel, looks like this:
apples, red | 0.15
apples, green | 0.99
bananas (bunch) | 0.111
fruit salad, small | 1.35
fruit salad, large | 1.77
strawberry | 0.66
Why this works
The approach avoids the primary pitfall of splitting on delimiters that occur between the two halves of a pair. By matching full pairs and capturing the two parts, you preserve the association between name and value, tolerate multiple pairs on the same line, and ignore anything after the last value on a line. Using findall returns exactly the pairs you care about, so the post-processing is minimal and predictable.
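To illustrate how little post-processing is needed, the sketch below turns the findall result into typed rows and writes them as CSV, one possible hand-off format for R or Excel. The header names, the in-memory buffer, and the shortened sample line are assumptions made for the example, and the pattern here escapes the decimal point.

```python
import csv
import io
import re

sample = "fruit salad, large .... 1.77 strawberry ........ 0.66 unwanted 00-11info here"
rx = r"([^\d.\n]+)[^\S\n]*\.{3,}[^\S\n]*(\d+\.\d+)"

# Strip padding from each name and convert each value to a float.
rows = [(name.strip(), float(value)) for name, value in re.findall(rx, sample)]

# Write to an in-memory buffer; for a real file, use
# open("pairs.csv", "w", newline="") instead (the file name is hypothetical).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "value"])  # header names are an assumption
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The csv module quotes names that contain a comma, such as "fruit salad, large", so the two columns survive a round trip through a spreadsheet.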
Practical notes
Alternative techniques such as a lookahead assertion are worth experimenting with, but the pattern shown here already exploits the clear structure in the data, and findall yields the clean dataset directly. It is also worth validating the approach on a small subsample of your text before running it across the full corpus.
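One way to run that validation on a subsample is to compare, line by line, how many dot leaders appear against how many pairs the pattern extracts. A mismatch flags a line worth inspecting by hand. This is a rough sketch: the helper unmatched_lines and the "kiwi" line are hypothetical, and the pattern here escapes the decimal point.

```python
import re

rx = re.compile(r"([^\d.\n]+)[^\S\n]*\.{3,}[^\S\n]*(\d+\.\d+)")
leader = re.compile(r"\.{3,}")  # any run of three or more dots


def unmatched_lines(text):
    """Return lines whose dot-leader count differs from the number of pairs found."""
    suspicious = []
    for line in text.splitlines():
        if len(leader.findall(line)) != len(rx.findall(line)):
            suspicious.append(line)
    return suspicious


# Hypothetical subsample: the "kiwi" line has a dot leader but no numeric value.
subsample = "apples, red .... 0.15\nkiwi ..... n/a\nunwanted line here"
flagged = unmatched_lines(subsample)
print(flagged)
# ['kiwi ..... n/a']
```

Lines without any dot leader pass silently, which matches the assumption that such lines carry no useful data.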
Wrap-up
When delimiters vary and noise creeps in, prefer matching the structure you want rather than splitting and stitching pieces back together. Here, a compact regex that targets name, dots, and number reliably extracts pairs even in the presence of multiple entries per line and trailing junk. Keep the delimiter zone explicit, capture only what you need, and finish with a lightweight join to produce a tidy export-ready format.