2025, Oct 18 20:00
Fixing 403 Forbidden and Blank Viewer When Downloading NYSCEF PDFs: Use SeleniumBase CDP Mode
Learn why NYSCEF PDFs return 403 or show a blank viewer and how to reliably download protected files using SeleniumBase in CDP Mode with external_pdf enabled.
Downloading a protected PDF from the NYSCEF portal looks deceptively simple until you run into the classic 403 Forbidden or a blank viewer page. If you’ve already tried mimicking browser headers with requests or scraping the page with Selenium, only to find no <embed> tag and an empty UI, you’ve seen the same roadblocks. Below is a concise walkthrough of why this happens and how to make the download work reliably with SeleniumBase.
Reproducing the failure
The first instinct is to fetch the document over plain HTTP, or to fall back to browser automation: render the page, extract the PDF URL from an <embed>, and then stream the binary with the session cookies. Here's a minimal example that does exactly that, but fails on this site:
from seleniumbase import SB
import requests
import os
import time
def pull_pdf_via_browser_then_http():
    doc_link = "https://iapps.courts.state.ny.us/nyscef/ViewDocument?docIndex=cdHe_PLUS_DaUdFKcTLzBtSo6zw=="
    out_dir = os.path.join(os.getcwd(), "downloads")
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "NYSCEF_Document.pdf")
    # Step 1: render the page in a real (headless) browser session.
    with SB(headless=True) as browser:
        browser.open(doc_link)
        time.sleep(5)  # Give the viewer time to load.
        try:
            # Step 2: look for the <embed> node that normally carries the PDF URL.
            node_embed = browser.find_element("embed")
            pdf_link = node_embed.get_attribute("src")
            print(f"PDF URL detected: {pdf_link}")
        except Exception as err:
            print(f"Embed tag not found: {err}")
            return
        # Step 3: copy the browser's session cookies into a requests session.
        cookies_from_driver = browser.driver.get_cookies()
        http_sess = requests.Session()
        for c in cookies_from_driver:
            http_sess.cookies.set(c["name"], c["value"])
        # Step 4: replay the download over plain HTTP with browser-like headers.
        net_headers = {
            "User-Agent": "Mozilla/5.0",
            "Referer": doc_link,
        }
        resp = http_sess.get(pdf_link, headers=net_headers)
        if resp.status_code == 200 and "application/pdf" in resp.headers.get("Content-Type", ""):
            with open(out_path, "wb") as fh:
                fh.write(resp.content)
            print(f"Saved to: {out_path}")
        else:
            print(f"Download failed. Status: {resp.status_code}")
            print(f"Content-Type: {resp.headers.get('Content-Type')}")
            print(f"Resolved URL: {resp.url}")
if __name__ == "__main__":
    pull_pdf_via_browser_then_http()
The typical outcome is either a 403 on the HTTP request or no <embed> node present at all when the page renders. Even a direct requests.get with a browser-like User-Agent and Referer consistently returns 403.
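For reference, here is a minimal sketch of that direct attempt; the User-Agent string is illustrative, and the response is still a 403 rather than a PDF:
import requests
doc_link = "https://iapps.courts.state.ny.us/nyscef/ViewDocument?docIndex=cdHe_PLUS_DaUdFKcTLzBtSo6zw=="
# Browser-like headers alone do not satisfy the server's checks.
resp = requests.get(doc_link, headers={"User-Agent": "Mozilla/5.0", "Referer": doc_link})
print(resp.status_code)  # 403
print(resp.headers.get("Content-Type"))  # Typically not application/pdf on a block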
Why it fails here
On this endpoint, replaying headers alone is not enough. The server can distinguish real browsers from scripts through mechanisms such as TLS fingerprinting and JavaScript-based runtime checks. That is why a static HTTP request that looks identical on the surface still gets blocked, and why a browser session may show a blank page without ever exposing an <embed> or a direct PDF URL. The key point is that the file is not freely available: access is controlled, and non-human clients are screened out. If you believe you should have access but remain blocked, contact the site owners.
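You can observe the browser-side symptom directly by dumping what the session actually rendered. A minimal sketch, reusing the same document URL, that confirms the blank page carries no <embed>:
from seleniumbase import SB
doc_link = "https://iapps.courts.state.ny.us/nyscef/ViewDocument?docIndex=cdHe_PLUS_DaUdFKcTLzBtSo6zw=="
with SB(headless=True) as sb:
    sb.open(doc_link)
    sb.sleep(5)
    # On a screened session the rendered DOM is effectively empty:
    print("embed present:", sb.is_element_present("embed"))
    print(sb.get_page_source()[:300])  # Peek at whatever did render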
The working approach
For this case, using SeleniumBase in a specific mode makes the difference. Running in CDP Mode with parameters that enable automated PDF downloading lets the session avoid bot detection and save the file directly to the default download location. This removes the need to scrape an <embed> or reissue the download via requests.
from seleniumbase import SB
# uc=True launches undetected mode; external_pdf=True tells the browser to
# download PDFs instead of rendering them in the built-in viewer.
with SB(uc=True, test=True, external_pdf=True) as driver:
    resource_url = "https://iapps.courts.state.ny.us/nyscef/ViewDocument?docIndex=cdHe_PLUS_DaUdFKcTLzBtSo6zw=="
    # CDP Mode drives the page through the Chrome DevTools Protocol, which is
    # what gets past the bot checks on this endpoint.
    driver.activate_cdp_mode(resource_url)
    driver.sleep(10)  # Allow time for the download to complete.
This saves the document into the ./downloaded_files/ directory. In practice, enabling CDP Mode is what gets past the protections that prevent non-humans from fetching the resource, and configuring the session to handle PDFs as downloads rather than inline content avoids the blank-viewer problem.
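To verify the result programmatically, a small follow-up check, assuming the default ./downloaded_files/ location and that the server-assigned filename is not known in advance, could look like this:
import glob
import os
# Match any PDF in the default download folder, since the server names the file.
pdf_paths = glob.glob(os.path.join("downloaded_files", "*.pdf"))
if pdf_paths:
    newest = max(pdf_paths, key=os.path.getmtime)
    print(f"Downloaded: {newest} ({os.path.getsize(newest)} bytes)")
else:
    print("No PDF yet; consider waiting longer before the session closes.")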
Why this matters
Modern public endpoints frequently gate content behind bot detection. Cloning headers is not enough, and sometimes even a headless browser won't surface the expected DOM nodes. Knowing when to switch from raw HTTP to a browser-automation stack that can pass fingerprinting checks will save hours of dead-end debugging. It also helps to recognize when a resource is intentionally protected and cannot be fetched without meeting those checks.
Takeaways
When a direct request returns 403 and the page offers no <embed> to harvest, assume you're blocked by anti-bot controls rather than missing a selector. Use a browser-automation tool that can operate in a mode designed to evade bot detection, and configure it to download PDFs instead of rendering them inline. If you have a legitimate right to the file but still can't retrieve it, coordinate with the site maintainers for proper access.
This article is based on a Stack Overflow question by Daremitsu and an answer by Michael Mintz.