2025, Dec 17 17:00

Why Your Selenium Google Scraper Returns an Empty List and How a Single CSS Selector Fix Restores It

Learn why Selenium returns empty Google results: a child vs descendant CSS selector mismatch. See the fix, plus waits and user-agent tips for SERP scraping.

Scraping Google result links to collect school websites and emails often looks straightforward until the script returns an empty list with no obvious error. Below is a precise breakdown of one such pitfall in Selenium and how to resolve it without changing the overall scraping logic.

Problem statement

The task is to query Google for PSHE pages on .sch.uk domains and extract the result URLs. The initial scraper launches headless Chrome, builds a search URL, and locates anchors in the results container. However, it returns nothing.

Reproducible code that fails

The following snippet mirrors the core behavior of the original approach while keeping the program logic intact and using different names:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
import time

def crawl_google(q, limit=10):
    opts = Options()
    opts.add_argument("--headless")  # Run headless browser
    opts.add_argument("--disable-blink-features=AutomationControlled")
    opts.add_argument("--no-sandbox")
    opts.add_argument("--disable-dev-shm-usage")

    browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=opts)

    target = f"https://www.google.com/search?q={q}&num={limit}"
    browser.get(target)

    time.sleep(2)

    collected = []
    # Child combinator (>) only matches an <a> that is a direct child of div.yuRUbf
    nodes = browser.find_elements(By.CSS_SELECTOR, 'div.yuRUbf > a')
    for node in nodes:
        href = node.get_attribute('href')
        if href:
            collected.append(href)

    browser.quit()
    return collected

phrase = "PSHE site:.sch.uk"
out = crawl_google(phrase, limit=20)

for idx, link in enumerate(out, 1):
    print(f"{idx}. {link}")

What actually goes wrong

The locator uses the child combinator in the CSS selector. The expression div.yuRUbf > a requests an <a> that is a direct child of the given <div>. On the results page, the anchor is not a direct child but a deeper descendant. Because the relationship is not parent → direct child, the selection yields no matches and the script quietly produces an empty list.
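
To see the difference in isolation, here is a minimal sketch that loads a tiny, made-up markup structure (not Google's real HTML) through a data: URL and runs both selectors against it:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from urllib.parse import quote

opts = Options()
opts.add_argument("--headless")
driver = webdriver.Chrome(options=opts)

# Simplified stand-in for a result block: the <a> is nested, not a direct child.
snippet = "<div class='yuRUbf'><div><span><a href='https://example.sch.uk'>School</a></span></div></div>"
driver.get("data:text/html," + quote(snippet))

print(len(driver.find_elements(By.CSS_SELECTOR, "div.yuRUbf > a")))  # 0: no direct <a> child
print(len(driver.find_elements(By.CSS_SELECTOR, "div.yuRUbf a")))    # 1: matches at any depth

driver.quit()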

There is an additional practical concern. Class names such as yuRUbf are obfuscated and subject to change, so relying on them is brittle. It is also possible to encounter a reCAPTCHA challenge after the initial navigation, which prevents the expected markup from loading. In such cases, checking page_source right after navigation or temporarily disabling headless mode helps reveal what the browser actually receives.
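
A minimal, hedged sanity check along those lines — heuristic only, and with an illustrative helper name, since the exact wording of Google's challenge and consent pages varies — could be called right after browser.get(target):

def looks_blocked(driver):
    # Heuristic: scan the raw HTML for hints of a challenge page instead of results.
    html = driver.page_source.lower()
    return "recaptcha" in html or "unusual traffic" in html

If it returns True, dumping page_source to a file or re-running with the --headless argument commented out makes it easy to see what was actually served.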

The fix: adjust the selector

Switching from the child combinator to a descendant combinator resolves the core issue. The selector div.yuRUbf a matches anchors nested at any depth under the specified container.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
import time

def crawl_google(q, limit=10):
    opts = Options()
    opts.add_argument("--headless")
    opts.add_argument("--disable-blink-features=AutomationControlled")
    opts.add_argument("--no-sandbox")
    opts.add_argument("--disable-dev-shm-usage")

    browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=opts)

    target = f"https://www.google.com/search?q={q}&num={limit}"
    browser.get(target)

    time.sleep(2)

    collected = []
    # Descendant combinator: matches anchors nested at any depth inside div.yuRUbf
    nodes = browser.find_elements(By.CSS_SELECTOR, 'div.yuRUbf a')
    for node in nodes:
        href = node.get_attribute('href')
        if href:
            collected.append(href)

    browser.quit()
    return collected

phrase = "PSHE site:.sch.uk"
out = crawl_google(phrase, limit=20)

for idx, link in enumerate(out, 1):
    print(f"{idx}. {link}")

A more robust variant of the same logic

Replacing arbitrary sleeps with explicit waits, using the built-in Selenium Manager instead of an extra driver-manager dependency, and sending a realistic user-agent string all improve stability. The following version keeps the same overall logic while incorporating those changes and the corrected selector:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def fetch_serp(q, limit=10):
    chrome_opts = Options()
    chrome_opts.add_argument("--headless")
    chrome_opts.add_argument("--disable-blink-features=AutomationControlled")
    chrome_opts.add_argument("--no-sandbox")
    chrome_opts.add_argument("--disable-dev-shm-usage")
    chrome_opts.add_argument(
        "user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
    )

    browser = webdriver.Chrome(options=chrome_opts)
    page = f"https://www.google.com/search?q={q}&num={limit}"
    browser.get(page)
    browser.maximize_window()
    wait_for = WebDriverWait(browser, 10)

    hrefs = []
    # Wait up to 10 seconds for at least one result anchor instead of sleeping a fixed time
    elements = wait_for.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.yuRUbf a')))
    for el in elements:
        href = el.get_attribute('href')
        if href:
            hrefs.append(href)

    browser.quit()
    return hrefs

term = "PSHE site:.sch.uk"
serp = fetch_serp(term, limit=20)

for idx, u in enumerate(serp, 1):
    print(f"{idx}. {u}")

Why this knowledge matters

The difference between a child and a descendant combinator is small in syntax but critical in effect. A single character error can collapse an entire extraction step without raising an exception. Aligning the locator with the actual DOM structure restores the flow immediately. On top of that, replacing fixed sleeps with waits leads to more predictable runs, and using the built-in Selenium Manager removes the need for an extra driver manager. Providing a user-agent helps in scenarios where the target rejects a default automated signature. When scraping search results, it is also useful to be aware that the markup may not load as expected due to protective challenges; viewing the retrieved HTML or switching off headless mode is a practical way to verify what the browser is really seeing.
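
For that kind of inspection, a small helper such as the following sketch (the function and file names are arbitrary) can save both the raw HTML and a screenshot for offline review:

def dump_debug_artifacts(driver, prefix="serp_debug"):
    # Persist what the browser received so it can be examined outside the run.
    with open(f"{prefix}.html", "w", encoding="utf-8") as fh:
        fh.write(driver.page_source)
    driver.save_screenshot(f"{prefix}.png")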

Conclusion

If your Google scraper returns an empty list, first validate the selector semantics. Use a descendant selector when the anchor is not a direct child. Prefer explicit waits over arbitrary delays, and simplify the driver setup with the built-in manager while sending a realistic user-agent. Be cautious with obfuscated class names and remember that protective pages can alter what the automation receives. These small, surgical changes are often enough to turn an empty result into a working scrape without redesigning your workflow.