2025, Nov 06 07:00

How to Parse Semi-Formatted Headers and Paragraphs with Regex for Clean WordPress API Posts

Learn to parse semi-formatted text with regex: extract bold H-level and Markdown headers with their content, and format clean sections for WordPress API posts.

Parsing semi-formatted content into something you can post via the WordPress API boils down to extracting structure reliably. If your source has headers and paragraphs embedded as simple markers, a careful regular expression is often enough to segment the text into headers and their associated content.

Example input and a first attempt

Consider a source that mixes header markers and paragraphs like this:

'**H1: Some text**\n\nSome text as paragraph.\n\n**H2: A subheader**\n\nText from the subheader.\n\nA line break with some more text.\n\n**H2: Another sub hearder**\n\n**H3: A sub sub header

A naive extraction might start with patterns that hunt for a specific header and try to pull some nearby content:

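import re

payload = myFullText  # myFullText holds the raw text shown above

# First attempt: run from "H1" to a paragraph ending in "ph."
h1_hits = re.findall(r'H1.*?ph\.', payload)

# Second attempt: run from "H1" across a blank line to a literal "."
h1_hits = re.findall(r'H1.*?\n\n\.', payload)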

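Both calls return an empty list for this input. In the first, . does not match a newline by default, so the match can never get from H1 on the header line to the ph. two lines further down. In the second, the pattern demands a literal dot immediately after the blank line, where the paragraph text actually begins. Empty results like these are a good signal that the pattern does not align with the markers and line structure present in the text.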
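To see the failure concretely, here is a minimal sketch that runs both patterns against a shortened sample. The sample string is hypothetical: it only imitates the shape of the input above and is not the original text.

```python
import re

# Hypothetical, shortened sample in the same shape as the input above.
sample = (
    "**H1: Some text**\n\n"
    "Some text as paragraph.\n\n"
    "**H2: A subheader**\n\n"
    "Text from the subheader."
)

# '.' stops at newlines by default, so 'H1' and 'ph.' on different
# lines can never be joined by '.*?'.
first_try = re.findall(r"H1.*?ph\.", sample)

# Requires a literal '.' right after the blank line; the paragraph
# starts with 'S', so nothing matches.
second_try = re.findall(r"H1.*?\n\n\.", sample)

# With re.DOTALL the first pattern does match, but it still depends on
# the paragraph happening to end in 'ph.'.
with_dotall = re.findall(r"H1.*?ph\.", sample, re.DOTALL)

print(first_try)
print(second_try)
print(with_dotall)
```

The DOTALL variant shows that the flag alone is not a fix: the pattern is still tied to one particular paragraph ending rather than to the header markers.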

What’s really going on

Regular expressions are fine for this job, but only if they target the real structure. In the sample above, headers are marked as bold segments with a level token and a title, for example **H2: A subheader**. Matching an arbitrary run from H1 to a vague ending like ph., or assuming a specific combination of line breaks, won’t generalize and will often miss. Focusing the pattern on the header markers themselves is the key. When content should span multiple lines until the next header or the end, you also need a pattern that stops at the next header boundary rather than guessing where a paragraph ends.

Targeting bold H-level headers

If headers appear as bold markers such as **H1: Title**, capture the level and the title directly:

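import re

source_blob = myFullText  # the raw text to parse
# Group 1 captures the level token (H1, H2, ...); group 2 captures the title.
hdr_expr = r"\*\*(H\d): (.*?)\*\*"
# With two groups, findall returns a list of (level, title) tuples.
found_headers = re.findall(hdr_expr, source_blob)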

This extracts each header level token like H1 together with its title between the asterisks. It is a precise way to collect all headers before mapping them to the content that follows.
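One possible way to map each bold header to the content that follows is to use re.finditer instead of re.findall, so that match positions are available for slicing. The sketch below is an illustration under assumptions: the helper name split_bold_sections and the sample string are hypothetical, and any bold **Hn: ...** segment is treated as a header.

```python
import re

# Hypothetical sample shaped like the input above (not the original text).
source_blob = (
    "**H1: Some text**\n\n"
    "Some text as paragraph.\n\n"
    "**H2: A subheader**\n\n"
    "Text from the subheader.\n\n"
    "A line break with some more text.\n\n"
    "**H2: Another subheader**\n\n"
    "**H3: A sub sub header**\n\n"
    "Closing text."
)

hdr_expr = re.compile(r"\*\*(H\d): (.*?)\*\*")


def split_bold_sections(text):
    """Return (level, title, body) tuples; body is '' when a header has no text."""
    matches = list(hdr_expr.finditer(text))
    sections = []
    for idx, m in enumerate(matches):
        # The body runs from the end of this header to the start of the
        # next one, or to the end of the text for the last header.
        body_end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
        body = text[m.end():body_end].strip()
        sections.append((m.group(1), m.group(2), body))
    return sections


sections = split_bold_sections(source_blob)
for level, title, body in sections:
    print(level, "|", title, "|", repr(body))
```

Slicing between match positions keeps multi-paragraph bodies intact and yields an empty body for a header that is followed directly by another header.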

Capturing Markdown-style # headers with their paragraphs

If the text uses Markdown-like # markers, you can match headers and the content that follows them in one pass. The following pattern collects the header marker, its text, and everything up to the next header or the end of the string:

import re

text_data = """
# Header 1
This is the first paragraph under header 1.

## Header 2
This is some text under header 2.
Another paragraph under the same header.

### Header 3
More content here.
"""

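# Group 1: the run of # marks, group 2: the heading text, group 3: the body.
# The lookahead stops the lazy body at the next header line or at end of string.
block_pat = r"^(#{1,6})\s+(.*?)\n(.*?)(?=\n#{1,6}\s+|\Z)"
chunks = re.findall(block_pat, text_data, re.DOTALL | re.MULTILINE)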

for level_marks, heading, block in chunks:
    depth = len(level_marks)
    section_body = block.strip()
    print(f"H{depth}: {heading}")
    print(f"Paragraph:\n{section_body}")

This extracts the header level as the count of #, the header text itself, and the contiguous body beneath it, which is useful when you want to keep paragraphs together with their header.
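One case worth checking against your own data is a header followed directly by another header with no blank line in between. The lookahead in the pattern above expects a newline before the next # run, so in that layout the first header's body appears to absorb the second header. The sketch below reproduces this and shows a line-by-line alternative; parse_markdown_sections is a hypothetical helper written for this illustration, not part of the original answer.

```python
import re

block_pat = r"^(#{1,6})\s+(.*?)\n(.*?)(?=\n#{1,6}\s+|\Z)"
header_line = re.compile(r"^(#{1,6})\s+(.*)$")

# Two headers with no blank line between them.
tight = "# A\n## B\ntext\n"

# The single-pass pattern returns one section whose body contains '## B'.
regex_result = re.findall(block_pat, tight, re.DOTALL | re.MULTILINE)


def parse_markdown_sections(text):
    """Walk the text line by line and return (level, heading, body) tuples."""
    sections = []
    current = None
    for line in text.splitlines():
        m = header_line.match(line)
        if m:
            # A header line always starts a new section.
            current = {"level": len(m.group(1)), "heading": m.group(2).strip(), "lines": []}
            sections.append(current)
        elif current is not None:
            # Any other line belongs to the most recent header.
            current["lines"].append(line)
    return [(s["level"], s["heading"], "\n".join(s["lines"]).strip()) for s in sections]


line_result = parse_markdown_sections(tight)
print(regex_result)
print(line_result)
```

If your source always separates sections with blank lines, the single-pass pattern is sufficient. The line-based version trades a few more lines of code for not depending on that spacing.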

To read only the header lines and strip out the # characters, match the header line and use the length of the marker to infer level:

import re

md_text = text_data
line_pat = r"^(#{1,6})\s+(.*)$"
header_lines = re.findall(line_pat, md_text, re.MULTILINE)

for marks, title_text in header_lines:
    print(f"H{len(marks)}: {title_text}")

Why this matters when posting to WordPress

WordPress expects well-structured content. Whether you convert these captures into HTML headings with paragraphs or map them into fields for your API call, extracting the exact header level and the associated text prevents malformed posts and avoids losing sections during import.
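As a rough sketch of the conversion step, the extracted sections can be rendered as HTML headings and paragraphs and placed in a payload dictionary. Everything named here is an assumption for illustration: sections_to_html is a hypothetical helper, the sample sections are made up, and the title, content and status keys should be checked against the endpoint you actually post to.

```python
import html


def sections_to_html(sections):
    """Render (level, heading, body) tuples as HTML headings and paragraphs."""
    parts = []
    for level, heading, body in sections:
        parts.append(f"<h{level}>{html.escape(heading)}</h{level}>")
        # Treat blank lines inside a body as paragraph breaks.
        for para in body.split("\n\n"):
            para = para.strip()
            if para:
                parts.append(f"<p>{html.escape(para)}</p>")
    return "\n".join(parts)


# Hypothetical sections in the (level, heading, body) shape used above.
sections = [
    (1, "Header 1", "This is the first paragraph under header 1."),
    (2, "Header 2", "First paragraph.\n\nSecond paragraph."),
]

content_html = sections_to_html(sections)

# Assumed payload shape; adjust the keys to what your WordPress endpoint expects.
post_payload = {"title": sections[0][1], "content": content_html, "status": "draft"}

print(content_html)
```

Escaping the text with html.escape keeps stray angle brackets or ampersands in the source from breaking the markup. For the bold H-level format, the level token (for example H2) would first need converting to an integer, for instance with int(token[1:]).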

Wrap-up and practical advice

Use re.findall with patterns that mirror your actual markers. For bold H-level headers, match the **Hn: Title** segments precisely. For Markdown-style input, rely on an anchored, multiline expression with a boundary to the next header, and compute the level from the number of # signs. With that structure in hand, transforming the result into the shape the WordPress API expects becomes a straightforward formatting step.

The article is based on a question from StackOverflow by hacking_mike and an answer by Arpan Gautam.