https://pytroubles.com/en/posts/id2389-preserve-newlines-in-xml-attribute-values-with-python-avoid-parser-normalization-in-lxml-xml-etree

Preserve Newlines in XML Attribute Values with Python: Avoid Parser Normalization in lxml/xml.etree

How to Preserve Literal Line Breaks in XML Attribute Values in Python (lxml, xml.etree, BeautifulSoup)

Preserve Newlines in XML Attribute Values with Python: Avoid Parser Normalization in lxml/xml.etree

Learn why lxml and xml.etree collapse line breaks in XML attribute values and how to preserve them using raw text capture, regex, and BeautifulSoup methods.

2025-12-10T17:00:11+03:00

Preserving literal line breaks inside XML attribute values can be surprising: parse a document, write it back, and suddenly those newlines are gone. If you touch the file with xml.etree or lxml.etree, you’ll see the parser normalize attribute whitespace, including line breaks. The goal here is to modify the XML while keeping those embedded newlines intact in the final output.Problem demonstrationTake a minimal XML with a multiline attribute value:<?xml version='1.0' encoding='UTF-8'?> <xml> <test value="This is a test with line breaks "/> </xml> Now run a straightforward parse-and-write with lxml:import lxml.etree as LX with open("input.xml", "r", encoding="utf-8") as fh: tree = LX.parse(fh) top = tree.getroot() out_doc = LX.ElementTree(top) out_doc.write("output.xml", encoding="utf-8", xml_declaration=True) The resulting file loses the line breaks inside the attribute and replaces them with spaces:<?xml version='1.0' encoding='UTF-8'?> <xml> <test value="This is a test with line breaks "/> </xml> Why it happensThis behavior follows the XML specification: attribute value normalization collapses certain whitespace and converts line breaks. In practice, a conforming XML parser replaces the line breaks in attribute values during parsing. Both xml.etree and lxml.etree follow this rule, so the original newlines are not preserved once the parser has processed the document.Workable approach to keep the newlinesTo avoid losing the raw attribute text, read the XML as plain text first, capture the attribute value before any XML parsing happens, and then reapply this raw value onto the element after parsing. One way to do that is using a small regex to pull the attribute value as-is and then use BeautifulSoup to parse and update the XML.import re from bs4 import BeautifulSoup # Read XML as raw text with open("input.xml", encoding="utf-8") as fh: doc_text = fh.read() # Capture the multiline attribute value unchanged m = re.search(r'value="(.*?)"', doc_text, re.DOTALL) preserved_val = m.group(1) if m else None print("Raw string: ", repr(preserved_val)) print() # Parse XML and re-apply the raw attribute value dom = BeautifulSoup(doc_text, "xml") node = dom.find("test") if preserved_val is not None: node["value"] = preserved_val # Show the result print(dom.prettify()) Output:Raw string: 'This is\n a test\n\n with line breaks\n ' <?xml version="1.0" encoding="utf-8"?> <xml> <test value="This is a test with line breaks "/> </xml> This preserves the newlines in the attribute value. According to the discussion around the spec, another practical angle is that if you write XML yourself, you can use the entity &#10; to represent a line break. In the same vein, after capturing the raw multiline value and assigning it as an attribute, a serializer may emit &#10; for those breaks; if needed, replace those with your preferred newline sequence afterward. In some cases, you can rely solely on regex extraction and your existing XML library to set the attribute value, skipping BeautifulSoup entirely.Why this is worth knowingWhen you batch-edit XML, subtle serialization rules can silently change data representation. If downstream tooling expects literal line breaks inside attributes or you need to preserve the exact original formatting, parser normalization will get in your way. Understanding that this is mandated behavior helps you choose a workflow that keeps the raw text when it matters.Conclusion and practical tipsIf you need to modify an XML document without flattening newlines in attribute values, capture those values directly from the raw text before any parsing takes place, then put them back after you build or traverse the tree. If you are generating XML yourself and must encode a line break reliably inside an attribute, use the &#10; entity. Test your pipeline end to end to ensure the output matches your expectations and, when necessary, perform a final substitution to align with the newline style you require.

preserve newlines in XML attributes, XML attribute normalization, lxml, xml.etree, BeautifulSoup, Python, regex extraction, line breaks, XML parsing, XML serialization, 
 entity

2025

2025, Dec 10 17:00

How to Preserve Literal Line Breaks in XML Attribute Values in Python (lxml, xml.etree, BeautifulSoup)

Learn why lxml and xml.etree collapse line breaks in XML attribute values and how to preserve them using raw text capture, regex, and BeautifulSoup methods.

Problem demonstration

Take a minimal XML with a multiline attribute value:

<?xml version='1.0' encoding='UTF-8'?>
<xml>
    <test value="This is
    a test
    with line breaks
    "/>
</xml>

Now run a straightforward parse-and-write with lxml:

import lxml.etree as LX
with open("input.xml", "r", encoding="utf-8") as fh:
    tree = LX.parse(fh)
    top = tree.getroot()
    out_doc = LX.ElementTree(top)
    out_doc.write("output.xml", encoding="utf-8", xml_declaration=True)

The resulting file loses the line breaks inside the attribute and replaces them with spaces:

<?xml version='1.0' encoding='UTF-8'?>
<xml>
    <test value="This is  a test   with line breaks  "/>
</xml>

Why it happens

This behavior follows the XML specification: attribute value normalization collapses certain whitespace and converts line breaks. In practice, a conforming XML parser replaces the line breaks in attribute values during parsing. Both xml.etree and lxml.etree follow this rule, so the original newlines are not preserved once the parser has processed the document.

Workable approach to keep the newlines

To avoid losing the raw attribute text, read the XML as plain text first, capture the attribute value before any XML parsing happens, and then reapply this raw value onto the element after parsing. One way to do that is using a small regex to pull the attribute value as-is and then use BeautifulSoup to parse and update the XML.

import re
from bs4 import BeautifulSoup
# Read XML as raw text
with open("input.xml", encoding="utf-8") as fh:
    doc_text = fh.read()
# Capture the multiline attribute value unchanged
m = re.search(r'value="(.*?)"', doc_text, re.DOTALL)
preserved_val = m.group(1) if m else None
print("Raw string: ", repr(preserved_val))
print()
# Parse XML and re-apply the raw attribute value
dom = BeautifulSoup(doc_text, "xml")
node = dom.find("test")
if preserved_val is not None:
    node["value"] = preserved_val
# Show the result
print(dom.prettify())

Output:

Raw string:  'This is\n    a test\n\n    with line breaks\n    '
<?xml version="1.0" encoding="utf-8"?>
<xml>
 <test value="This is
    a test
    with line breaks
    "/>
</xml>

This preserves the newlines in the attribute value. According to the discussion around the spec, another practical angle is that if you write XML yourself, you can use the entity 
 to represent a line break. In the same vein, after capturing the raw multiline value and assigning it as an attribute, a serializer may emit 
 for those breaks; if needed, replace those with your preferred newline sequence afterward. In some cases, you can rely solely on regex extraction and your existing XML library to set the attribute value, skipping BeautifulSoup entirely.

Why this is worth knowing

When you batch-edit XML, subtle serialization rules can silently change data representation. If downstream tooling expects literal line breaks inside attributes or you need to preserve the exact original formatting, parser normalization will get in your way. Understanding that this is mandated behavior helps you choose a workflow that keeps the raw text when it matters.

Conclusion and practical tips

If you need to modify an XML document without flattening newlines in attribute values, capture those values directly from the raw text before any parsing takes place, then put them back after you build or traverse the tree. If you are generating XML yourself and must encode a line break reliably inside an attribute, use the 
 entity. Test your pipeline end to end to ensure the output matches your expectations and, when necessary, perform a final substitution to align with the newline style you require.

elementtree lxml python