2025, Oct 19 21:00

How to Reliably Download 1969 Gazzetta Ufficiale Serie Generale PDFs with Selenium

Learn why scraping 1969 Gazzetta Ufficiale Serie Generale PDFs fails with plain requests, and how Selenium in headless Chrome reliably downloads the PDFs through session-established links

Scraping historical PDFs from the Italian Gazzetta Ufficiale – Serie Generale for 1969 looks deceptively simple: there is a public archive, detail pages per issue, and a familiar-looking download endpoint. In practice, static approaches often surface no links at all, and hand-crafted download URLs return HTML error pages instead of binary PDFs. Below is a concise walkthrough of why this happens and how to reliably automate the downloads with Selenium.

Problem overview

Direct HTTP requests to the 1969 archive index work fine and the detail pages are easy to enumerate. The trouble begins when trying to extract the actual “pubblicazione completa non certificata” link. In many live sessions the expected anchors like a.download_pdf don’t exist in the server-rendered HTML, even though they may appear in other saved copies. Attempting to manufacture a /do/gazzetta/downloadPdf URL from date and issue number yields HTML with the message “Il pdf selezionato non è stato trovato”. Selenium navigation to the year picker is also fragile when using naïve element selection because the year option can be hidden, framed, or covered by overlays; however, driving the UI with keyboard navigation and then parsing the resulting page source reliably exposes the real download links.

Minimal failing approach (requests + BeautifulSoup)

The following example gathers all 1969 detail pages and then tries to discover a working PDF download link on one of them. It matches the behavior described above: detail pages are found, but there is no download link in the live DOM.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, parse_qs
ORIGIN = "https://www.gazzettaufficiale.it"
TARGET_YEAR = 1969
YEAR_INDEX = f"{ORIGIN}/ricercaArchivioCompleto/serie_generale/{TARGET_YEAR}"
http = requests.Session()
http.headers.update({
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Referer": ORIGIN,
})
# 1) Collect detail pages (date + issue number)
resp = http.get(YEAR_INDEX, timeout=60)
resp.raise_for_status()
dom = BeautifulSoup(resp.text, "html.parser")
detail_pages = []
for node in dom.find_all("a", href=True):
    link = node["href"]
    if ("/gazzetta/serie_generale/caricaDettaglio" in link
        and "dataPubblicazioneGazzetta=" in link
        and "numeroGazzetta=" in link):
        detail_pages.append(urljoin(ORIGIN, link))
print("Detail pages found:", len(detail_pages))
print("Sample:", detail_pages[:3])
# 2) For one detail page, try to discover a real "download PDF" link
detail_url = detail_pages[0]
resp = http.get(detail_url, timeout=60, headers={"Referer": YEAR_INDEX})
resp.raise_for_status()
dom = BeautifulSoup(resp.text, "html.parser")
# Try common selectors / texts
download_anchor = (dom.select_one('a.download_pdf[href]')
                   or dom.select_one('a[href*="/do/gazzetta/downloadPdf"]'))
if not download_anchor:
    for node in dom.find_all("a", href=True):
        if "scarica il pdf" in (node.get_text() or "").lower():
            download_anchor = node
            break
print("Download link found on detail page?", bool(download_anchor))
if download_anchor:
    print("Download href:", urljoin(ORIGIN, download_anchor["href"]))

This produces a complete list of detail pages but consistently no download link on those pages for 1969. Manually constructing the download URL for those issues returns HTML with the “not found” message, not a binary PDF.
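
For completeness, this is roughly what the hand-crafted attempt looks like, continuing from the script above. The endpoint path and the two query parameters come straight from the detail URLs; the exact form the server actually expects is unknown, which is the point of the exercise. Checking the response makes the failure obvious: the body is HTML, not a PDF.

from urllib.parse import urlparse, parse_qs
# Guess a downloadPdf URL from the date and issue number in the detail link.
params = parse_qs(urlparse(detail_url).query)
guessed_url = (f"{ORIGIN}/do/gazzetta/downloadPdf"
               f"?dataPubblicazioneGazzetta={params['dataPubblicazioneGazzetta'][0]}"
               f"&numeroGazzetta={params['numeroGazzetta'][0]}")
resp = http.get(guessed_url, timeout=60, headers={"Referer": detail_url})
# A real PDF starts with the %PDF- magic bytes; the error page is HTML.
print("Content-Type:", resp.headers.get("Content-Type", ""))
print("Looks like a PDF?", resp.content[:5] == b"%PDF-")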

Why this happens

On these historical pages the effective download anchors are not reliably present in the server-rendered HTML that requests-based scraping sees. UI elements like the year picker can be obscured or embedded, and in some sessions the expected download_pdf links never materialize. Constructing the downloadPdf URL by hand produces HTML responses such as “Il pdf selezionato non è stato trovato”, which means the endpoint rejects the request as formed. In this situation a browser-driven approach is appropriate: it lets the site execute whatever client-side logic it needs and establishes the exact session context the server expects. As practitioners note, this is often necessary when the server depends on specific cookies or headers, or when pages differ structurally across the archive.
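
If you want to gauge how much of the failure is session context, one diagnostic (a sketch, not part of the original workflow; the target URL is a placeholder) is to let a real browser establish the session and then replay its cookies through requests:

import requests
from selenium import webdriver

# Sketch: establish the session in a real browser, then reuse its cookies.
# Whether this suffices depends on which cookies and headers the server checks.
driver = webdriver.Chrome()
driver.get("https://www.gazzettaufficiale.it")
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0",
                        "Referer": "https://www.gazzettaufficiale.it"})
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"],
                        domain=cookie.get("domain"))
driver.quit()
some_pdf_url = "..."  # placeholder: a candidate download URL to retest
resp = session.get(some_pdf_url, timeout=60)
print(resp.headers.get("Content-Type"), resp.content[:5])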

Working approach with Selenium

The script below opens the official “Formato grafico PDF” search, selects the year via keyboard navigation, submits the form, parses the resulting page to extract the real a.download_pdf anchors, and then requests each download URL in the same browser session. Set save_dir to the directory where the PDFs should be stored. Headless mode is enabled, and a simple fixed sleep waits for each download; an optional helper (shown further below) detects completion via .crdownload files.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import time
from lxml import html
SEARCH_URL = "https://www.gazzettaufficiale.it/ricerca/pdf/foglio_ordinario2/2/0/0?reset=true"
save_dir = "/home/lmc/tmp/test-ws/gaz"
chrome_cfg = webdriver.ChromeOptions()
chrome_cfg.add_argument("--start-maximized")
chrome_cfg.add_argument("--window-size=2880,1620")  # Chrome expects width,height
chrome_cfg.add_argument("--headless")
chrome_cfg.set_capability("pageLoadStrategy", "normal")
chrome_cfg.add_argument("--enable-javascript")
cfg_prefs = {
    # Skip images and stylesheets to speed up page loads
    "profile.managed_default_content_settings.images": 2,
    "permissions.default.stylesheet": 2,
    # Save PDFs to save_dir without prompting
    "download.default_directory": save_dir,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
}
chrome_cfg.add_experimental_option("prefs", cfg_prefs)
browser = webdriver.Chrome(options=chrome_cfg)
browser.implicitly_wait(30)  # wait up to 30s when locating elements
browser.get(SEARCH_URL)
try:
    # Open the year dropdown and pick 1969 with the keyboard instead of
    # clicking the option directly, which is flaky when the option is obscured.
    year_select = browser.find_element(By.ID, 'annoPubblicazione')
    year_select.click()
    time.sleep(2)
    chain = ActionChains(browser)
    for _ in range(17):
        chain.send_keys(Keys.ARROW_DOWN)
    chain.send_keys(Keys.ENTER)
    chain.perform()
    # Submit the search form and let the results page render
    search_btn = browser.find_element(By.XPATH, '//input[@name="cerca"]')
    search_btn.click()
    time.sleep(2)
    # Parse the rendered page source to extract the real download anchors
    markup = browser.page_source
    tree = html.fromstring(markup)
    links = tree.xpath('//a[@class="download_pdf"]/@href')
    print(f"first link: {links[0] if links else 'none'}")
    print(f"total links: {len(links)}")
    # Fetch each PDF in the same browser session so cookies and headers match
    for path in links:
        pdf_url = f"https://www.gazzettaufficiale.it{path}"
        print(f"Downloading: {pdf_url}")
        browser.get(pdf_url)
        time.sleep(8)  # crude wait; see the polling helper below
except Exception:
    print("Unexpected error during processing")
    raise
finally:
    browser.quit()

If downloads appear incomplete on your machine, use a simple polling helper to wait until Chrome finishes writing files by checking for temporary extensions. The snippet below uses a timeout and a configurable polling interval.

def monitor_downloads(dir_path, timeout=60, poll_each=1):
    """Block until Chrome has no in-progress .crdownload files left in dir_path."""
    import glob
    import time
    start = time.time()
    while glob.glob(f"{dir_path}/*.crdownload") and time.time() - start < timeout:
        time.sleep(poll_each)
    if glob.glob(f"{dir_path}/*.crdownload"):
        print(f"Timed out after {timeout} seconds with downloads still in progress.")
    else:
        print(f"Download complete in {time.time() - start:.2f} seconds.")

Why this matters

Working directly against the download endpoint without the right session context can silently return HTML error pages that, if saved with a .pdf extension, turn into hundreds of corrupted files. A browser-controlled flow avoids this pitfall by reproducing the same steps a real user performs: selecting the year within the official UI, letting the site populate the results, and following the anchored links generated for that session.
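
A quick post-run sanity check catches exactly this failure mode: a genuine PDF starts with the %PDF- magic bytes, while a saved error page starts with HTML markup. A minimal sketch, assuming the files landed in save_dir:

import glob

def find_fake_pdfs(dir_path):
    # Report .pdf files that are actually HTML (or anything else) in disguise.
    bad = []
    for name in glob.glob(f"{dir_path}/*.pdf"):
        with open(name, "rb") as fh:
            if fh.read(5) != b"%PDF-":
                bad.append(name)
    return bad

print("Suspicious files:", find_fake_pdfs(save_dir))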

Practical takeaways

When the archive view shows no download_pdf anchors to a plain HTTP client, don’t spend time reverse-engineering ad hoc URLs that respond with “Il pdf selezionato non è stato trovato”. Prefer a Selenium-driven workflow that relies on the site’s own UI to reveal the valid links and maintains the cookies and headers the server expects. If direct selection of dropdown options is flaky, use keyboard navigation and Enter to commit the choice. Adjust the waiting strategy for downloads to your environment, either by increasing the sleep or by polling for temporary files until the browser completes the writes.

With this approach you can reproducibly fetch the full “pubblicazione completa non certificata” PDFs for 1969 Serie Generale issues, headless and without manual intervention.

The article is based on a question from StackOverflow by Mark and an answer by LMC.