2025, Dec 15 01:00

Headless Selenium Blocked by Cloudflare on BFI What's On: Why It Times Out and How to Fix with a User-Agent

Selenium in headless mode times out on the BFI What's On schedule because Cloudflare blocks the session. See why it happens and how to fix it by setting a real User-Agent in ChromeOptions.

Headless Selenium sometimes behaves differently from an interactive browser: a page loads fine with a visible Chrome window, but times out in headless mode. That’s exactly what happens when scraping the BFI schedule at https://whatson.bfi.org.uk/Online/default.asp — the headless session gets blocked and the script hits a TimeoutException.

What fails and how it looks in code

The flow is straightforward: open the page, wait for an element by class, then read page_source. In a visible browser it works; in headless, it stalls on waiting for content.

from selenium import webdriver
from selenium.common import TimeoutException, WebDriverException
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def fetch_events_page(target_url):
    opts = ChromeOptions()
    opts.add_argument("--headless=new")
    opts.add_argument("--disable-gpu")
    opts.add_argument("--no-sandbox")
    opts.add_argument("--window-size=1920,1080")
    browser = None
    try:
        browser = webdriver.Chrome(options=opts)
        browser.get(target_url)
        # Wait for the schedule content to render; this is where the
        # headless session stalls when the page blocks it.
        WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "Highlight"))
        )
        return browser.page_source
    except TimeoutException:
        print(f"Timed out waiting for content on {target_url}")
    except WebDriverException as exc:
        print(f"Selenium WebDriver error on {target_url}: {exc}")
    finally:
        # Guard against a NameError if webdriver.Chrome() itself failed.
        if browser is not None:
            browser.quit()

# Example usage
# fetch_events_page("https://whatson.bfi.org.uk/Online/default.asp")

Why it happens

The page is protected and will block automation in headless mode. If you dump driver.page_source right after get() and before any waits, you’ll see the protection page rather than the expected HTML. As one observation puts it,

it is protected by Cloudflare, which will detect headless mode and block it

This also explains why a plain requests fetch parsed with BeautifulSoup returns 403, and why the same Selenium code behaves differently when headless is off. Headless Chrome is still Chrome, but some sites treat it differently and gate content behind bot checks.
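That first diagnostic step, inspecting page_source right after get(), can be turned into a small check. A minimal sketch; the marker strings below are common examples seen on Cloudflare interstitials, not an exhaustive or guaranteed list:

```python
# Heuristic: does this HTML look like a Cloudflare protection page
# rather than the real content? Marker strings are illustrative examples.
CHALLENGE_MARKERS = (
    "Just a moment",
    "Checking your browser",
    "cf-browser-verification",
)

def looks_like_challenge(html: str) -> bool:
    """Return True if the HTML resembles a bot-protection interstitial."""
    return any(marker in html for marker in CHALLENGE_MARKERS)

# After browser.get(target_url), inspect the raw snapshot:
# if looks_like_challenge(browser.page_source):
#     print("Blocked: received a protection page, not the schedule")
```

Running this check before the WebDriverWait makes the failure mode explicit instead of surfacing as a generic timeout.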

The fix

Provide a real browser User-Agent in ChromeOptions so the headless session isn’t flagged immediately. With the BFI page, adding a UA is sufficient to get the content to render and the wait to pass.

from selenium import webdriver
from selenium.common import TimeoutException, WebDriverException
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def fetch_events_page(target_url):
    opts = ChromeOptions()
    opts.add_argument("--headless=new")
    opts.add_argument("--disable-gpu")
    opts.add_argument("--no-sandbox")
    opts.add_argument("--window-size=1920,1080")
    # Present a regular desktop Chrome User-Agent so the headless
    # session is not flagged immediately.
    opts.add_argument(
        "user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
    )
    browser = None
    try:
        browser = webdriver.Chrome(options=opts)
        browser.get(target_url)
        WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "Highlight"))
        )
        return browser.page_source
    except TimeoutException:
        print(f"Timed out waiting for content on {target_url}")
    except WebDriverException as exc:
        print(f"Selenium WebDriver error on {target_url}: {exc}")
    finally:
        # Guard against a NameError if webdriver.Chrome() itself failed.
        if browser is not None:
            browser.quit()

# Example usage
# fetch_events_page("https://whatson.bfi.org.uk/Online/default.asp")

Why this matters

Relying on headless automation is common in CI, containers, and servers where no display is available. When a site blocks headless sessions, pipelines fail silently with timeouts rather than obvious errors. Checking driver.page_source right after navigation tells you whether you are seeing real content or a protection page, and including the actual URL when troubleshooting lets others reproduce and confirm the behavior. Finally, be aware that Chrome can act differently in headless mode: what works with a visible window might fail when headless is on.
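Part of why the User-Agent override helps is that headless Chrome's default User-Agent advertises itself with a "HeadlessChrome" token, which a site can flag with a trivial string check. A sketch of that idea; both UA strings below are illustrative, not values captured from this setup:

```python
# Illustrative default UA reported by headless Chrome; note "HeadlessChrome".
headless_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/122.0.0.0 Safari/537.36"
)
# The override passed via ChromeOptions substitutes a normal Chrome token.
real_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
)

def is_flagged_as_headless(ua: str) -> bool:
    # The simplest check a server-side bot gate might run.
    return "HeadlessChrome" in ua

print(is_flagged_as_headless(headless_ua))  # True
print(is_flagged_as_headless(real_ua))      # False
```

Real bot protection uses many more signals than the UA string, but on this page the UA override alone was enough.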

Takeaways

If a page loads fine with a visible Chrome window but times out in headless mode, examine the immediate page_source to detect protection pages, then pass a real User-Agent via ChromeOptions. On https://whatson.bfi.org.uk/Online/default.asp this adjustment allows the content to load and the expected elements to appear, restoring the Selenium flow without changing the rest of the logic.