2026, Jan 08 05:00
How to Parse Namespaced XML with Selenium and BeautifulSoup: avoid None using an XML parser and requests
BeautifulSoup returns None for XML in Selenium? Learn to get the XML URL, fetch it with requests, parse with the XML parser, and handle namespaces reliably.
When you click through to an XML document with Selenium and then try to scrape it with BeautifulSoup, it’s easy to end up with an empty result. A common case: you fetch the page source from the new tab, call find("PhoneNum"), and get None. The issue isn’t the element’s absence, but how you retrieve and parse the content.
Reproducing the issue
The sequence below opens a nonprofit profile, clicks the XML link, switches to the new tab, prints the final URL, and attempts to parse the document. The search for the PhoneNum element returns nothing.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
cfg = Options()
# cfg.add_argument('--headless=new')
cfg.add_argument("start-maximized")
cfg.add_argument('--log-level=3')
cfg.add_experimental_option("prefs", {"profile.default_content_setting_values.notifications": 1})
cfg.add_experimental_option("excludeSwitches", ["enable-automation"])
cfg.add_experimental_option('excludeSwitches', ['enable-logging'])
cfg.add_experimental_option('useAutomationExtension', False)
cfg.add_argument('--disable-blink-features=AutomationControlled')
svc = Service()
browser = webdriver.Chrome(service=svc, options=cfg)
# browser.minimize_window()
waiter = WebDriverWait(browser, 10)
start_url = "https://projects.propublica.org/nonprofits/organizations/830370609"
browser.get(start_url)
browser.execute_script("arguments[0].click();", waiter.until(EC.element_to_be_clickable((By.XPATH, '(//a[text()="XML"])[1]'))))
browser.switch_to.window(browser.window_handles[1])
time.sleep(3)
print(browser.current_url)
dom = BeautifulSoup(browser.page_source, 'lxml')  # HTML rules: tag names get lowercased
phone_node = dom.find("PhoneNum")  # exact-case lookup finds nothing
print(phone_node)

The observed output shows a signed S3 URL for the XML document, followed by None.
What’s actually going on
There are two key friction points. First, the parser. Using 'lxml' with BeautifulSoup triggers HTML parsing rules, and HTML parsers convert tag names to lowercase; XML is case-sensitive and won’t match as expected under those rules. Using an XML parser such as 'xml' or 'lxml-xml' preserves original tag names. Second, the data itself. The PhoneNum element appears with a namespace, for example {http://www.irs.gov/efile}PhoneNum. That means a naive tag lookup against HTML-normalized output won’t match. There is also the practical matter that Selenium is overkill for downloading raw XML once you have its URL; fetching the document directly is faster and more reliable.
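The parser difference is easy to demonstrate in isolation. The snippet below is a minimal, made-up document (not the actual filing), but it shows the behavior: under HTML rules the tag name is lowercased and the exact-case lookup fails, while the XML parser keeps it intact.

from bs4 import BeautifulSoup

snippet = '<Return xmlns="http://www.irs.gov/efile"><PhoneNum>6025551234</PhoneNum></Return>'

# HTML rules: the tag is stored as 'phonenum', so the exact-case lookup fails
html_dom = BeautifulSoup(snippet, 'lxml')
print(html_dom.find("PhoneNum"))  # None
print(html_dom.find("phonenum"))  # <phonenum>6025551234</phonenum>

# XML rules: casing and the namespace are preserved, so the lookup succeeds
xml_dom = BeautifulSoup(snippet, 'xml')
print(xml_dom.find("PhoneNum"))   # <PhoneNum>6025551234</PhoneNum>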
There’s an additional pragmatic point. If you’re waiting for the XML tab to finish loading, a hard sleep can be flaky in production; an explicit wait for a condition is a steadier approach.
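For the tab switch itself, Selenium ships a ready-made condition for the window count, so you can wait for the second tab to exist before switching instead of sleeping. A minimal sketch, reusing the waiter and browser objects from the snippet above:

# Wait until the click has actually opened a second tab, then switch to it
waiter.until(EC.number_of_windows_to_be(2))
browser.switch_to.window(browser.window_handles[1])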
The fix
Navigate with Selenium only to obtain the XML URL, then close the browser. Download the XML with requests.get(), parse with BeautifulSoup using the 'xml' parser, and handle missing tags safely. If you want all occurrences of the element, use find_all().
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
chrome_cfg = Options()
chrome_cfg.add_argument("start-maximized")
chrome_cfg.add_argument('--log-level=3')
chrome_cfg.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_cfg.add_argument('--disable-blink-features=AutomationControlled')
svc = Service()
agent = webdriver.Chrome(service=svc, options=chrome_cfg)
wdwait = WebDriverWait(agent, 10)
landing_url = "https://projects.propublica.org/nonprofits/organizations/830370609"
agent.get(landing_url)
# Click the first "XML" link via JavaScript and switch to the tab it opens
xml_link = wdwait.until(EC.element_to_be_clickable((By.XPATH, '(//a[text()="XML"])[1]')))
agent.execute_script("arguments[0].click();", xml_link)
agent.switch_to.window(agent.window_handles[1])
time.sleep(3)
# Selenium's job is done: capture the signed XML URL and close the browser
feed_url = agent.current_url
agent.quit()
# Fetch the raw XML directly; a timeout keeps the request from hanging indefinitely
resp = requests.get(feed_url, timeout=30)
if resp.status_code != 200:
    print("Failed to download XML")
    exit()
# Parse with the XML parser so tag casing and namespaces survive
tree = BeautifulSoup(resp.content, 'xml')
phones = tree.find_all('PhoneNum')
if phones:
    print(f"Found {len(phones)} phone numbers:")
    for i, ph in enumerate(phones, start=1):
        print(f"{i}. {ph.text.strip()}")
else:
    print("No <PhoneNum> tags found in the XML.")
with open("propublica_data.xml", "wb") as fh:
    fh.write(resp.content)
print("XML saved to 'propublica_data.xml'")

If you prefer a condition instead of a fixed delay while switching to the XML tab, you can wait for the DOM to appear with a generic presence check.
# In place of the time.sleep(3) above:
wdwait.until(EC.presence_of_element_located((By.XPATH, '//*')))

The output for the dataset used above demonstrates that the elements are now discoverable and text can be extracted.
Found 4 phone numbers:
1. 6023146022
2. 6022687502
3. 6028812483
4. 6023146022
XML saved to 'propublica_data.xml'

Why this matters
XML is case-sensitive and often namespaced. HTML-oriented parsing silently reshapes those constraints and makes exact tag lookups fail. Switching to an XML parser preserves structure and casing, and going straight to the XML endpoint avoids the variability of browser-driven page sources. Finally, checking that elements exist before reading .text makes your code resilient to changes in source documents.
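If you ever need fully qualified, namespace-aware lookups rather than BeautifulSoup's local-name matching, lxml's etree API accepts an explicit namespace map. A minimal sketch, reusing resp.content from above and the {http://www.irs.gov/efile} namespace observed on PhoneNum (the efile prefix here is an arbitrary choice):

from lxml import etree

# Map a prefix of our choosing to the IRS e-file namespace seen in the document
ns = {"efile": "http://www.irs.gov/efile"}
root = etree.fromstring(resp.content)
for ph in root.iterfind(".//efile:PhoneNum", namespaces=ns):
    print(ph.text)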
Takeaways
Use Selenium only to surface the final document URL, then close it and fetch the XML with requests. Parse with BeautifulSoup using 'xml' so tag names and namespaces are preserved. Prefer find_all() if you expect multiple matches, and guard accesses with existence checks. Where timing is involved, lean on explicit waits rather than fixed sleeps. Keeping to these practices will make your XML scraping predictable and fast.