2025, Dec 29 23:00
Parsing Semi-Structured Text into a Python Dictionary: cleaner regex pipeline with CRLF, quotes, casing, and first-colon split
Parsing semi-structured text into a dictionary is a common task, but it quickly gets messy when the source includes inconsistent casing, quoted lines, mixed line endings, and values that contain colons. The goal here is to keep the logic intact while reducing redundant string processing and making the flow easier to reason about.
The input we have to work with
The raw payload mixes CRLF markers, quoted lines, inconsistent key casing, keys with spaces, and values like time that contain additional colons. Only the lines following the literal token "text" are relevant.
\r\n; Count of Something: 3\r\ntext\r\n"Key1: 9999999, Key2: mnkhkljh213, Key3: 593, Key4: 66666"\r\n"Key5 something: sample, Desc: , Date: 4/28/2025, Time: 4:15 PM"\r\n"ANOTHERKEY: 622523, KEY1: 9999999, KEY6: 160305, KEY7: 0, KEY8: 10, KEY11: 1, DATE: 4/28/2025, TIME: 16:15:50"

Key duplication and case differences are acceptable for downstream logic. The split between key and value must handle values like Time: 16:15:50 without breaking, which means splitting on the first colon only.
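The first-colon constraint is easy to verify in isolation. A minimal sketch, assuming spaces have already been stripped as the pipeline does:

```python
# A token after space removal looks like 'TIME:16:15:50'.
token = 'TIME:16:15:50'

# Capping the split at the first colon keeps the value intact:
key, value = token.split(':', 1)
# key == 'TIME', value == '16:15:50'

# An uncapped split would shatter the time value:
parts = token.split(':')
# parts == ['TIME', '16', '15', '50']
```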
Original approach
The working expression below extracts the part after text, normalizes it by stripping carriage returns, quotes, and spaces, splits lines, removes empties, rejoins them by commas, and finally constructs a dictionary by splitting each token at the first colon.
dict(
    chunk.split(':', 1)
    for chunk in re.sub(' ', '', re.sub('"', '', ','.join(
        list(filter(None, re.sub('\r', '', payload.split('text')[1]).split('\n')))
    ))).split(',')
)

This works, but it repeats similar transformations and allocates an unnecessary list. It also removes \r and then splits by \n, two steps that can be folded into a single operation when line endings are consistent.
What’s actually going on
The pipeline follows a predictable sequence. First, it discards everything before the literal text and keeps what follows. Next, it strips carriage returns and splits on newlines, which may produce empty elements that are filtered out. Afterwards, it rejoins the cleaned lines with commas to get a single comma-separated string. Then it removes double quotes and spaces. Finally, it splits by commas to get key-value tokens and splits each token into a pair at the first colon. The use of split(':', 1) is crucial to ensure values that contain additional colons, like time, are preserved as-is. The remaining pain points are redundant substitutions, unnecessary list materialization before join, and a two-step CR/LF cleanup where a single split would suffice under the stated constraints.
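The sequence above can be unpacked into explicit, named steps. This sketch uses a shortened stand-in payload (an assumption for illustration, not the full sample):

```python
import re

# Stand-in payload, shortened for illustration (the real one has more keys)
payload = '\r\n; Count of Something: 3\r\ntext\r\n"Key1: 9999999"\r\n"TIME: 16:15:50"\r\n'

# Step 1: discard everything before the literal token "text"
tail = payload.split('text')[1]

# Step 2: strip carriage returns, split on newlines, drop empty elements
lines = list(filter(None, re.sub('\r', '', tail).split('\n')))

# Step 3: rejoin with commas, then remove double quotes and spaces
joined = re.sub(' ', '', re.sub('"', '', ','.join(lines)))

# Step 4: split into tokens and pair each at the first colon
result = dict(chunk.split(':', 1) for chunk in joined.split(','))
# result == {'Key1': '9999999', 'TIME': '16:15:50'}
```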
A cleaner pass with the same behavior
The same logic can be expressed more concisely by merging the outer substitutions, eliminating list wrapping, and splitting directly on \r\n if that’s the consistent line ending.
dict(
    entry.split(':', 1)
    for entry in re.sub(' |"', '', ','.join(
        filter(None, payload.split('text')[1].split('\r\n'))
    )).split(',')
)

Using a single regular expression re.sub(' |"', '', ...) removes spaces and double quotes in one pass. The return value of filter is already an iterable, so join does not need a list. If \n always appears with \r, then splitting by \r\n is equivalent to removing \r and splitting by \n, but with fewer steps. The key-value split remains constrained to the first colon, which correctly handles values like 16:15:50.
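Running the cleaner expression against the sample payload (reconstructed here as a Python string literal, assuming the CRLF placement shown in the raw dump) confirms the behavior end to end. One caveat: because all spaces are removed, a key like Key5 something becomes Key5something, a side effect the original pipeline shares:

```python
import re

# The sample payload, reconstructed as a literal (CRLF placement assumed)
payload = (
    '\r\n; Count of Something: 3\r\ntext\r\n'
    '"Key1: 9999999, Key2: mnkhkljh213, Key3: 593, Key4: 66666"\r\n'
    '"Key5 something: sample, Desc: , Date: 4/28/2025, Time: 4:15 PM"\r\n'
    '"ANOTHERKEY: 622523, KEY1: 9999999, KEY6: 160305, KEY7: 0, KEY8: 10, '
    'KEY11: 1, DATE: 4/28/2025, TIME: 16:15:50"'
)

result = dict(
    entry.split(':', 1)
    for entry in re.sub(' |"', '', ','.join(
        filter(None, payload.split('text')[1].split('\r\n'))
    )).split(',')
)

print(result['TIME'])  # '16:15:50', extra colons preserved
print(result['Desc'])  # '', empty value survives
```

Note that Key1 and KEY1 remain distinct entries because casing is not normalized, which matches the stated requirement that case differences are acceptable downstream.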
Why this matters
When you parse text like this at scale or in tight loops, each extra pass over the data accumulates cost and increases cognitive load. Consolidating transformations makes the pipeline easier to audit and reason about. It also reduces the chance of subtle mismatches between separate clean-up steps, such as handling of line endings. Simplifying the iterable handling avoids unnecessary allocations without changing behavior.
Takeaways
Keep normalization steps close together and eliminate duplicates when possible. Remove needless conversions that don’t change semantics, such as wrapping filter output in a list just to feed join. Be explicit about line endings; if \n always arrives with \r, split by \r\n and move on. Most importantly, when delimiters can appear in values, always cap the split at the first occurrence, as with split(':', 1) here. With these adjustments, the result stays faithful to the original logic while being easier to maintain and extend.