2025, Sep 27 11:00

Split text only on br in BeautifulSoup: stop stripped_strings breaks with unwrap() and smooth()

Learn how to keep lines intact when extracting text with BeautifulSoup: split only on br, and avoid the fragmentation caused by font and other tags by using unwrap() and smooth().

When extracting structured text with BeautifulSoup, it’s easy to accidentally split content on markup boundaries you don’t care about. A common case: you want to break text on HTML line breaks only, but formatting tags such as font end up fragmenting the content and polluting the output with extra splits. Below is a concise walkthrough of the issue and a clean way to get reliable line-based extraction.

Problem overview

Suppose you’re iterating over blockquote sections and want each line to be split only where there is a br. If a font tag appears mid-line, BeautifulSoup’s stripped_strings can split that line into multiple fragments, even though visually it’s a single line. The goal is to keep lines intact and split solely on br.

Reproducible input HTML

Here is a simplified fragment that demonstrates the issue. A font tag interrupts the same logical line and causes an unwanted split:

<blockquote>
<p>RT CLOAK<a href="https://terminology.collectionstrust.org.uk/British-Museum-objects/Obthesc3.htm#1057"><img src="./British Museum Object Names Theasarus_ Terms AB-AM_files/link.gif" border="0"></a><br><font color="FFFFFF">RT </font>COAT<a href="https://terminology.collectionstrust.org.uk/British-Museum-objects/Obthesc4.htm#1082"><img src="./British Museum Object Names Theasarus_ Terms AB-AM_files/link.gif" border="0"></a><br>BT GARMENT<a href="https://terminology.collectionstrust.org.uk/British-Museum-objects/Obthesg1.htm#2099"><img src="./British Museum Object Names Theasarus_ Terms AB-AM_files/link.gif" border="0"></a><br><i>NP ABBA</i><a href="https://terminology.collectionstrust.org.uk/British-Museum-objects/Obthesa1.htm#3"><img src="./British Museum Object Names Theasarus_ Terms AB-AM_files/link.gif" border="0"></a></p>
<hr width="90%"></blockquote>
<p><a name="3"></a><i>ABBA NP</i></p>
<blockquote>

Code that exposes the problem

The following script walks a page, finds blockquote sections, and prints the associated text lines. The output, however, gets split within a single logical line due to the font tag. The program iterates using stripped_strings, which is where the fragmentation occurs.

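from bs4 import BeautifulSoup
import requests
sources = ['https://terminology.collectionstrust.org.uk/British-Museum-objects/Obthesa1.htm']
counter = 1  # running index of the blockquote being printed
for url in sources:
    resp = requests.get(url)
    markup = resp.text
    dom = BeautifulSoup(markup, "lxml")
    blocks = dom.find_all('blockquote')
    for blk in blocks:
        # Closest preceding p, used as the heading for this blockquote
        header = blk.find_previous('p')
        # stripped_strings yields every text node separately,
        # so "RT " inside font and "COAT" after it come out as two items
        for chunk in blk.stripped_strings:
            print(counter, header.text, chunk)
        counter += 1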

In the output, “RT” and “COAT” appear as two separate items, even though they sit on the same line, between the same pair of br tags.

Why this happens

stripped_strings yields one string per text node, so every tag boundary separates the fragments it emits, not just br. The br tags only look like the split points because they happen to sit between text nodes. A font element keeps its text in a node of its own, so “RT ” inside the font and “COAT” after it are emitted independently, even though they render as one continuous line.
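The fragmentation can be reproduced on a two-line snippet. This is a minimal sketch, assuming the built-in "html.parser" backend instead of lxml so it runs without extra dependencies; the shortened HTML string is an illustrative cut of the sample above, not the full page.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the sample: one br, one font tag mid-line
html = '<p>RT CLOAK<br><font color="FFFFFF">RT </font>COAT</p>'
soup = BeautifulSoup(html, "html.parser")

# Each text node is yielded on its own, so the second line arrives in two pieces
fragments = list(soup.p.stripped_strings)
print(fragments)  # ['RT CLOAK', 'RT', 'COAT']
```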

Solution approach

If you want to split only on br and ignore formatting, remove the intervening formatting nodes so they no longer cut the text. A direct way to do this is to unwrap the font tags so they leave just their textual content in place, then merge adjacent text nodes before iterating with stripped_strings.

Unwrapping is done with .unwrap(). Merging neighboring text nodes created by the unwrapping is done with .smooth(). It’s important to call .smooth() after the unwrapping so that the newly adjacent strings get combined.
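To see why the order matters, the following sketch compares the strings before and after smooth(). It assumes the "html.parser" backend and a one-line illustrative snippet; as far as I know, smooth() requires Beautiful Soup 4.8.0 or newer.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><font color="FFFFFF">RT </font>COAT</p>', "html.parser")

# unwrap() removes the font tag but leaves its text as a separate node
soup.font.unwrap()
before = list(soup.p.stripped_strings)
print(before)  # ['RT', 'COAT']

# smooth() merges the two adjacent text nodes into one
soup.smooth()
after = list(soup.p.stripped_strings)
print(after)  # ['RT COAT']
```

Unwrapping alone does not help: the tag is gone, but the two strings stay separate until smooth() joins them.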

Minimal working demonstration

Here is a compact example that operates on the sample HTML above and prints the expected list of lines:

from bs4 import BeautifulSoup
sample = """
<blockquote>
<p>RT CLOAK<a href="https://terminology.collectionstrust.org.uk/British-Museum-objects/Obthesc3.htm#1057"><img src="./British Museum Object Names Theasarus_ Terms AB-AM_files/link.gif" border="0"></a><br><font color="FFFFFF">RT </font>COAT<a href="https://terminology.collectionstrust.org.uk/British-Museum-objects/Obthesc4.htm#1082"><img src="./British Museum Object Names Theasarus_ Terms AB-AM_files/link.gif" border="0"></a><br>BT GARMENT<a href="https://terminology.collectionstrust.org.uk/British-Museum-objects/Obthesg1.htm#2099"><img src="./British Museum Object Names Theasarus_ Terms AB-AM_files/link.gif" border="0"></a><br><i>NP ABBA</i><a href="https://terminology.collectionstrust.org.uk/British-Museum-objects/Obthesa1.htm#3"><img src="./British Museum Object Names Theasarus_ Terms AB-AM_files/link.gif" border="0"></a></p>
<hr width="90%"></blockquote>
<p><a name="3"></a><i>ABBA NP</i></p>
"""
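doc = BeautifulSoup(sample, "lxml")
# Replace every font tag with its own contents
for node in doc.find_all("font"):
    node.unwrap()
# Merge the text nodes that unwrapping left side by side
doc.smooth()
for bq in doc.find_all("blockquote"):
    print(list(bq.stripped_strings))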

The output is a clean list of lines:

['RT CLOAK', 'RT COAT', 'BT GARMENT', 'NP ABBA']

Fixed end-to-end script

Applying the same idea to the earlier page walker, unwrapping and smoothing ahead of iteration makes stripped_strings honor only the actual line breaks. The logic stays the same; the DOM has fewer formatting boundaries to split on.

from bs4 import BeautifulSoup
import requests
sources = ['https://terminology.collectionstrust.org.uk/British-Museum-objects/Obthesa1.htm']
counter = 1  # running index of the blockquote being printed
for url in sources:
    resp = requests.get(url)
    markup = resp.text
    dom = BeautifulSoup(markup, "lxml")
    # Drop the font tags but keep their text in place
    for f in dom.find_all("font"):
        f.unwrap()
    # Merge the now-adjacent text nodes; must run after the unwrapping
    dom.smooth()
    quotes = dom.find_all('blockquote')
    for quote in quotes:
        # Closest preceding p, used as the heading for this blockquote
        title = quote.find_previous('p')
        for line in quote.stripped_strings:
            print(counter, title.text, line)
        counter += 1

Why this detail matters

When you normalize HTML for downstream tasks, consistent line segmentation is critical. If you intend to split only on br, letting inline formatting break your text will fragment tokens and compromise any logic that relies on contiguous strings per line. Eliminating irrelevant structural boundaries before extracting text makes the output predictable and easier to process.

If you must split exclusively on br, another practical approach is to replace br with a unique separator in the HTML, collect the concatenated text, then split on that separator. This is particularly useful when you cannot or do not want to remove other tags.
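The separator approach could look roughly like this. It is a sketch under stated assumptions: the function name split_on_br and the sentinel "[[BR]]" are hypothetical choices for illustration, the sentinel must not occur in the real page text, and "html.parser" stands in for lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sentinel; pick any string that cannot occur in the page text
SEPARATOR = "[[BR]]"

def split_on_br(html):
    """Return the text of html split only where a br tag was."""
    soup = BeautifulSoup(html, "html.parser")
    # Swap every br for the sentinel so it survives text extraction
    for br in soup.find_all("br"):
        br.replace_with(SEPARATOR)
    # get_text() concatenates all text nodes, ignoring the remaining tags
    text = soup.get_text()
    # Split on the sentinel and drop empty or whitespace-only pieces
    return [part.strip() for part in text.split(SEPARATOR) if part.strip()]

html = '<p>RT CLOAK<br><font color="FFFFFF">RT </font>COAT<br><i>NP ABBA</i></p>'
print(split_on_br(html))  # ['RT CLOAK', 'RT COAT', 'NP ABBA']
```

Because get_text() ignores every remaining tag, no unwrap() or smooth() call is needed here; the trade-off is that the result is one flat list for whatever markup was passed in.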

Takeaways

Use stripped_strings to iterate clean text, but remember it follows the DOM boundaries. If inline formatting breaks your lines, unwrap the offending tags first and call smooth to merge adjacent text nodes. That way, your splitting logic is driven by actual line breaks, not by styling artifacts.
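If more than one inline tag breaks your lines, the unwrap-then-smooth pattern can be wrapped in a small helper. This is only a sketch: lines_per_blockquote and its inline_tags parameter are hypothetical names, the b tag in the snippet is invented for illustration, and "html.parser" is assumed as the backend.

```python
from bs4 import BeautifulSoup

def lines_per_blockquote(html, inline_tags=("font",)):
    """Return one list of text lines per blockquote, ignoring inline_tags."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names; unwrap each match in place
    for tag in soup.find_all(list(inline_tags)):
        tag.unwrap()
    # Merge the text nodes that unwrapping left next to each other
    soup.smooth()
    return [list(bq.stripped_strings) for bq in soup.find_all("blockquote")]

# Illustrative snippet: a font tag and a b tag both interrupt the second line
html = ('<blockquote><p>RT CLOAK<br>'
        '<font color="FFFFFF">RT </font><b>CO</b>AT</p></blockquote>')
result = lines_per_blockquote(html, inline_tags=("font", "b"))
print(result)  # [['RT CLOAK', 'RT COAT']]
```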

The article is based on a question from StackOverflow by James Brian and an answer by jqurious.