2025, Dec 22 21:00

How to Extract All Text from HTML in Python ElementTree: Use itertext(), Not .text, and Normalize Whitespace

Learn how to extract all text from HTML in Python ElementTree using itertext(). Fix nested content missed by .text and normalize whitespace for complete output.

Extracting text from HTML with Python’s xml.etree.ElementTree can feel deceptively simple until nested tags enter the picture. Accessing .text on an element returns only the text that precedes its first child element, skipping anything nested deeper. If you’re aggregating content from a subtree, that behavior is a problem, but you don’t need to manually walk all descendants to fix it.

Problem

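Parsing an HTML snippet and reading .text from a target element seems like an obvious approach, yet it doesn’t collect text from nested nodes. Keep in mind that ElementTree is an XML parser, so fromstring() only accepts well-formed markup like the snippet below. Here’s a minimal example that shows why the result is incomplete when you rely on .text: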

import xml.etree.ElementTree as XET
markup_src = """<html>
    <head>
        <title>Example page</title>
    </head>
    <body>
        <p>Moved to <a href="http://example.org/">example.org</a>
        or <a href="http://example.com/">example.com</a>.</p>
    </body>
</html>"""
dom_root = XET.fromstring(markup_src)
body_node = dom_root.find(".//body")
# .text holds only the text before the first child element.
# For <body> that is just the newline and indentation before <p>.
immediate_text = body_node.text
print(repr(immediate_text))

Why it happens

The .text attribute holds only the text between an element’s start tag and its first child element. Text inside a child belongs to that child’s own .text, and text that follows a child’s closing tag is stored in that child’s .tail. When the structure contains child tags, relying on .text alone means you’ll miss everything past the first child.
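To make the split concrete, here is a small sketch that parses a one-line version of the paragraph from the example and prints each piece separately. ElementTree keeps the text that follows a child’s closing tag in that child’s .tail attribute, which is why no single .text lookup returns the whole sentence. The variable names are illustrative only.

```python
import xml.etree.ElementTree as XET

# One-line version of the <p> element from the example markup
para = XET.fromstring(
    '<p>Moved to <a href="http://example.org/">example.org</a>'
    ' or <a href="http://example.com/">example.com</a>.</p>'
)
first_link, second_link = para.findall("a")

print(repr(para.text))         # 'Moved to '   (text before the first child)
print(repr(first_link.text))   # 'example.org' (text inside the first <a>)
print(repr(first_link.tail))   # ' or '        (text after the first </a>)
print(repr(second_link.text))  # 'example.com'
print(repr(second_link.tail))  # '.'           (text after the second </a>)
```

Read in document order, these five pieces spell out the full sentence.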

Solution

Use itertext() to iterate over all text content in the subtree and concatenate it. If you want a compact version without indentation, line breaks, or extra spacing, normalize the whitespace afterward. You can also use strip() when you simply want to trim leading and trailing whitespace.

import xml.etree.ElementTree as XET
markup_src = """<html>
    <head>
        <title>Example page</title>
    </head>
    <body>
        <p>Moved to <a href="http://example.org/">example.org</a>
        or <a href="http://example.com/">example.com</a>.</p>
    </body>
</html>"""
doc_root = XET.fromstring(markup_src)
focus_node = doc_root.find(".//body")
# Collect all text in the subtree
text_aggregate = ''.join(focus_node.itertext())
# Compact it by collapsing whitespace (alternative: text_aggregate.strip())
text_compact = ' '.join(text_aggregate.split())
print(text_aggregate)
print(text_compact)

Output:

        Moved to example.org
        or example.com.
Moved to example.org or example.com.
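If you need this in more than one place, the two steps can be wrapped in a small function. The sketch below is only one way to do it: collect_text and its normalize flag are hypothetical names, not part of ElementTree, and the sample markup is made up for illustration.

```python
import xml.etree.ElementTree as XET

def collect_text(element, normalize=True):
    """Hypothetical helper: join all text in the subtree of `element`."""
    raw = "".join(element.itertext())
    if normalize:
        # Collapse every run of whitespace (spaces, tabs, newlines) to one space
        return " ".join(raw.split())
    # Otherwise trim only the leading and trailing whitespace
    return raw.strip()

snippet = XET.fromstring("<div>\n  <p>Hello <b>nested</b>\n  world</p>\n</div>")

print(collect_text(snippet))                         # Hello nested world
print(repr(collect_text(snippet, normalize=False)))  # 'Hello nested\n  world'
```

The second call shows the difference between the two options: strip() leaves the inner line break and indentation in place, while split() followed by join() removes them.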

Why it matters

Text extraction is often a precursor to downstream tasks like indexing, matching, or simple rendering. Missing nested text means losing meaningful content. itertext() provides a straightforward, built-in way to gather everything without writing a manual traversal.
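For comparison, here is roughly what a hand-written traversal would look like. It is a simplified sketch of the idea, not the library’s actual implementation, and walk_text is a hypothetical name. It yields each element’s .text, recurses into the children, and yields each child’s .tail.

```python
import xml.etree.ElementTree as XET

def walk_text(element):
    """Hypothetical manual traversal: yield text pieces in document order."""
    if element.text:
        yield element.text
    for child in element:
        # Text inside the child (and its descendants) comes first...
        yield from walk_text(child)
        # ...followed by the text after the child's closing tag
        if child.tail:
            yield child.tail

node = XET.fromstring(
    '<p>Moved to <a href="http://example.org/">example.org</a>'
    ' or <a href="http://example.com/">example.com</a>.</p>'
)

print("".join(walk_text(node)))      # Moved to example.org or example.com.
print("".join(node.itertext()))      # Moved to example.org or example.com.
```

Both lines print the same sentence, so for element-only trees like this one the built-in method saves you from writing and maintaining the recursion yourself.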

Takeaways

When you need the full textual content of an element and all its descendants, iterate over the subtree with itertext() and then optionally normalize whitespace. This avoids incomplete results from .text and keeps your parsing logic concise, predictable, and easier to maintain.