2025, Oct 08 03:00
How to Scrape Paginated DataTables with Selenium in Python: Avoid Skipped Pages Using the Next Button
Use a robust Selenium approach to scrape Vale's dividends table in Python: avoid brittle idx clicks and iterate with the Next button to capture every page.
Scraping paginated tables with Selenium often looks straightforward until a widget's internal behavior gets in the way. On the Vale dividends page at https://investidor10.com.br/acoes/vale3/, the pagination UI includes numeric buttons plus Next/Previous controls. Clicking by numeric index works for the first few pages, but a click intended for idx="5" jumps straight to idx="8" and leaves the data on pages 6 and 7 behind. On top of that, an occasional NoSuchElementException appears even though the target element is visible in the DOM.
Reproducing the issue
The following snippet relies on direct clicks against numeric pager anchors via the data-dt-idx attribute. It scrolls, clicks the chosen page, waits for the table to be present, and hands control to the scraping routine.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
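# "browser" is assumed to be an already-initialized WebDriver pointed at the dividends page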
def loop_pagers():
    tabs = browser.find_elements(By.CSS_SELECTOR, "a[data-dt-idx]")
    total = len(tabs)
    for k in range(total):
        hit_pager(str(k + 1))
def hit_pager(idx):
    try:
        locator = (By.CSS_SELECTOR, f'a[data-dt-idx="{idx}"]')
        pager = WebDriverWait(browser, 10).until(EC.presence_of_element_located(locator))
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(1)
        browser.execute_script("arguments[0].scrollIntoView({behavior:'instant', block:'center' });", pager)
        browser.execute_script("arguments[0].click();", pager)
        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "table-dividends-history")))
        harvest_grid()  # scraping routine
    except Exception as err:
        print(f"Failed to execute function: {err}")
What’s actually going wrong
The pagination control doesn’t behave consistently when targeted by numeric idx: a click intended for idx="5" takes the view to idx="8", so the rows on pages 6 and 7 are never scraped. The code can also raise NoSuchElementException even when the button is in the DOM, which lines up with the widget dynamically re-rendering its pager and with transient clickability issues. In short, the numeric-index strategy is brittle for this widget.
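The intermittent lookup failures can be reduced (though not fully eliminated) by waiting for the pager anchor to be clickable instead of merely present. The sketch below is not part of the original snippet; it reuses browser, By, WebDriverWait, and EC from the code above:
def hit_pager_when_clickable(idx, timeout=10):
    # wait until the numeric anchor is present, visible, and enabled before clicking;
    # this gives DataTables a chance to finish re-rendering the pager
    locator = (By.CSS_SELECTOR, f'a[data-dt-idx="{idx}"]')
    pager = WebDriverWait(browser, timeout).until(EC.element_to_be_clickable(locator))
    browser.execute_script("arguments[0].scrollIntoView({block:'center'});", pager)
    pager.click()
Even with this wait, the idx-to-page mapping itself remains unreliable on this widget, which is why the next section switches strategies entirely.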
A pragmatic way to collect every page
A more resilient pattern on this page is to capture all visible rows on the first page and then advance page-by-page with the Next button until it is no longer present. The approach starts by navigating to https://investidor10.com.br/acoes/vale3/, waiting for the dividends section (id="dividends-section") to become visible, scrolling the dividends table wrapper into view, and reading the header from the dataTables_scroll area.
From there, the code scrapes the first page and then enters a loop. Each iteration attempts to locate the Next control inside #table-dividends-history_paginate using an exact class match (a[class="paginate_button next"]). After a successful click, it pauses briefly to allow the table to update and then scrapes the newly visible rows. On the last page, DataTables typically adds a disabled class to the Next anchor, so the exact-match selector no longer finds it; the resulting NoSuchElementException ends the loop, meaning all pages have been processed. If the click is intercepted, the code silently retries on the next loop iteration. Finally, all rows are assembled into a pandas DataFrame with the discovered header.
import time
import pandas as pd
from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
rows_agg = []
def harvest_rows(tbl):
    # collect the cell text of every row currently visible in the table body
    table_rows = tbl.find_elements(By.CSS_SELECTOR, "div.dataTables_scrollBody>table>tbody>tr")
    for row in table_rows:
        rows_agg.append([d.text for d in row.find_elements(By.TAG_NAME, 'td')])
chrome_cfg = ChromeOptions()
chrome_cfg.add_argument("--start-maximized")
chrome_cfg.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_cfg.add_experimental_option("useAutomationExtension", False)
browser = Chrome(options=chrome_cfg)
waiter = WebDriverWait(browser, 10)
target_url = "https://investidor10.com.br/acoes/vale3/"
browser.get(target_url)
waiter.until(EC.visibility_of_element_located((By.ID, "dividends-section")))
widget_wrap = browser.find_element(By.ID, "table-dividends-history_wrapper")
browser.execute_script("arguments[0].scrollIntoView(true);", widget_wrap)
grid_box = widget_wrap.find_element(By.CSS_SELECTOR, "div.dataTables_scroll")
headers = grid_box.find_element(By.CSS_SELECTOR, "div.dataTables_scrollHead").text.split('\n')
print(f"Table Header {headers}")
print("Extracting Page 1...")
harvest_rows(grid_box)
page_counter = 2
has_next = True
while has_next:
    try:
        # exact class match: once DataTables marks the button "paginate_button next disabled",
        # this lookup fails and the loop ends
        next_btn = widget_wrap.find_element(By.CSS_SELECTOR, '#table-dividends-history_paginate>a[class="paginate_button next"]')
        try:
            next_btn.click()
            time.sleep(1)  # best-effort pause for the table to re-render
            print(f"Extracting Page {page_counter}...")
            harvest_rows(grid_box)
            page_counter += 1
        except ElementClickInterceptedException:
            pass  # something covered the button; retry on the next iteration
    except NoSuchElementException:
        print("Reached End Page")
        has_next = False
# assemble and show the table
df = pd.DataFrame(rows_agg, columns=headers)
print(df)
A sample run logs each page extraction and reports when the final page is reached. It then prints a single DataFrame containing all rows from all pages under the discovered columns.
Table Header ['TIPO', 'DATA COM', 'PAGAMENTO', 'VALOR']
Extracting Page 1...
Extracting Page 2...
Extracting Page 3...
Extracting Page 4...
Extracting Page 5...
Extracting Page 6...
Extracting Page 7...
Extracting Page 8...
Reached End Page
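If the aggregated table should be kept for later analysis, a simple follow-up (not part of the original answer; the filename is arbitrary) is to persist the DataFrame:
# write the combined table to disk
df.to_csv("vale3_dividends.csv", index=False)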
Why this matters
Pagination UIs that dynamically reflow, relabel, or re-render are hostile to index-based clicking. Iterating via Next avoids misaligned idx jumps and ensures that every page is visited exactly once. It also streamlines scraping by reusing a single element context to harvest rows. Be aware that the fixed one-second pause after each click is a best-effort delay: under slow network or rendering conditions the table may not have updated yet, and the same rows can be scraped twice or a page missed.
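One alternative, sketched below under the assumption that DataTables replaces the tbody rows on every draw, is to wait for a row from the previous page to go stale instead of sleeping. It reuses browser, widget_wrap, grid_box, By, WebDriverWait, and EC from the script above:
from selenium.common.exceptions import TimeoutException

def click_next_and_wait(timeout=10):
    # keep a handle on a row from the page that is currently displayed
    old_row = grid_box.find_element(By.CSS_SELECTOR, "div.dataTables_scrollBody>table>tbody>tr")
    next_btn = widget_wrap.find_element(By.CSS_SELECTOR, '#table-dividends-history_paginate>a[class="paginate_button next"]')
    next_btn.click()
    try:
        # the old row detaching from the DOM signals that the table has been redrawn
        WebDriverWait(browser, timeout).until(EC.staleness_of(old_row))
    except TimeoutException:
        pass  # fall back to scraping whatever is currently visible
Dropping this in place of the next_btn.click() and time.sleep(1) pair keeps the traversal logic the same while tying the wait to the table actually changing rather than to a fixed delay.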
Takeaways
When a paginator doesn’t respond predictably to numeric index clicks, switch to a sequential traversal strategy: load the section you need, bring the table into view, collect headers and rows from the first page, and advance through the Next control until it disappears. Aggregate results as you go and build the final DataFrame at the end. Keep a close eye on static sleeps during pagination; they are the weak link when page load timing fluctuates.
The article is based on a question from StackOverflow by user30126350 and an answer by Ajeet Verma.