2025, Nov 14 21:00

How to Build Memory-Lean CSV and Excel (XLSX) Previews: Pandas vs Polars, openpyxl, Early-Exit Tactics

Learn how to generate fast, memory-efficient previews of CSV and Excel files using pandas, polars, and openpyxl. Compare engines, nrows and early-exit tactics.

Generating a lightweight preview for user-uploaded spreadsheets sounds simple until memory blows up under real-world files. When the CSV is around 30k rows and the XLSX reaches 1 million rows, the choice of reader and settings becomes the difference between snappy previews and a process hitting hundreds of megabytes. Below is a practical walkthrough of what happens with pandas and polars, how to reproduce it, and which options reliably keep memory usage in check for a “first N rows” preview.

Reproducing the problem

The following test file compares pandas and polars on CSV and XLSX inputs under pytest-memray, covering both full reads and first-rows reads for a side-by-side view. The logic is unchanged from the baseline; only the names differ, for clarity.

import io

import pytest
import pandas as pd
import polars as pl

@pytest.fixture
def csv_path() -> str:
    return "files/rows.csv"

@pytest.fixture
def xlsx_path() -> str:
    return "files/bankdataset.xlsx"

def test_pandas_csv_full(csv_path: str):
    pd.read_csv(csv_path)

def test_polars_csv_full(csv_path: str):
    pl.scan_csv(csv_path, low_memory=True).collect()

def test_pandas_xlsx_full(xlsx_path: str):
    pd.read_excel(xlsx_path, sheet_name=None)

def test_polars_xlsx_full(xlsx_path: str):
    pl.read_excel(xlsx_path, sheet_name=None)

def test_pandas_csv_head(csv_path: str):
    frame = pd.read_csv(csv_path, nrows=20)
    assert len(frame) == 20

def test_polars_csv_head(csv_path: str):
    frame = pl.scan_csv(csv_path).head(n=20).collect()
    assert len(frame) == 20

def test_pandas_xlsx_head(xlsx_path: str):
    frame = pd.read_excel(xlsx_path, nrows=20)
    assert len(frame) == 20

def test_polars_xlsx_head(xlsx_path: str):
    frame = pl.read_excel(xlsx_path).head(n=20)
    assert len(frame) == 20

def test_polars_csv_head_via_buffer(csv_path: str):
    with io.BytesIO() as buf:
        pl.scan_csv(csv_path).limit(20).sink_csv(buf)
        frame = pl.read_csv(buf.getvalue())
        assert len(frame) == 20
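If you want a quick, dependency-free way to compare readers outside pytest-memray, the standard library's tracemalloc can help, with a big caveat: tracemalloc only sees allocations routed through Python's allocator, so native allocations (for example, polars' Rust side) are largely invisible to it. That limitation is exactly why the numbers in this article were recorded with pytest-memray, which hooks the allocator itself. The helper below is a hypothetical sketch for rough, Python-level comparisons only:

```python
import tracemalloc

def measure_python_peak(fn, *args, **kwargs):
    """Run fn and return (result, peak_bytes) as tracked by tracemalloc.

    Only allocations made through Python's allocator are counted;
    native (C/Rust) allocations escape this measurement.
    """
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

Treat the returned peak as a lower bound, useful for relative comparisons between pure-Python strategies rather than absolute watermarks.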

In the recorded runs, pandas showed lower allocations than polars for both CSV and XLSX in these specific scenarios. The XLSX cases were especially telling: polars with the default Excel engine hit roughly 473 MiB at the high watermark, while pandas was about 426 MiB. CSV results were closer, but pandas still showed favorable numbers in those tests.

What’s actually going on

If the target is a preview of the first N rows, the most important behavior is whether the reader stops early when it has enough data. For CSV, both ecosystems can stream effectively, but details of how and when materialization happens matter. For Excel, the engine and its early-exit behavior are critical.
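The early-exit idea can be illustrated with plain file iteration, independent of any dataframe library. In this stdlib-only sketch (hypothetical helper names), the lazy reader stops pulling from the file as soon as it has n lines, while the eager one materializes every line before slicing:

```python
from itertools import islice

def first_n_lines_lazy(path: str, n: int):
    # islice stops consuming the file iterator after n lines.
    with open(path, encoding="utf-8") as fh:
        return list(islice(fh, n))

def first_n_lines_eager(path: str, n: int):
    # readlines() materializes the whole file before slicing --
    # the pattern a preview feature wants to avoid.
    with open(path, encoding="utf-8") as fh:
        return fh.readlines()[:n]
```

Both return the same preview, but only the lazy version keeps memory proportional to n rather than to the file size.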

Pandas read_excel uses openpyxl by default and stops parsing early when nrows is provided. This early break is a key reason pandas avoids pulling the entire sheet when you only need a preview; the short-circuit is visible in pandas/io/excel/_openpyxl.py:

if file_rows_needed is not None and len(data) >= file_rows_needed:
    break

Polars read_excel uses fastexcel as its default engine. fastexcel supports its own n_rows via read_options — for example, pl.read_excel(..., read_options={"n_rows": 20}) — but in testing that did not reduce memory usage. Polars can also switch to engine="openpyxl", but its implementation reads the whole sheet, which defeats the early-exit optimization for previews. You can inspect the behavior in polars/io/spreadsheet/functions.py.

openpyxl itself offers an efficient way to cap reads at the worksheet level. Passing max_row to iter_rows limits how far the reader traverses the sheet and, in testing, used slightly less memory than iterating without a cap and breaking out of the loop once enough rows had been collected.

A practical path to memory‑lean previews

If all you need is to render the first N rows, two choices consistently minimized allocations in testing. For CSVs, rely on the standard library csv module to read exactly the number of rows you want. For Excel, use openpyxl directly and limit the iteration with max_row.

Here is a minimal pattern for CSV previews with the standard library:

import csv

def preview_csv_head(csv_file: str, n: int = 20):
    rows = []
    with open(csv_file, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        for row in reader:
            rows.append(row)
            # Stop as soon as n rows are collected; the rest of the
            # file is never read.
            if len(rows) >= n:
                break
    return rows
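In practice a preview endpoint usually wants the header separated from the data rows. A small variant (hypothetical name, same stdlib-only approach) handles that split:

```python
import csv

def preview_csv_table(csv_file: str, n: int = 20) -> dict:
    # Read the header line, then up to n data rows, stopping early.
    with open(csv_file, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader, [])
        rows = []
        for row in reader:
            rows.append(row)
            if len(rows) >= n:
                break
    return {"columns": header, "rows": rows}
```

The returned dict maps directly onto the typical JSON shape of a preview response.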

For Excel, openpyxl’s read-only mode with iter_rows(max_row=...) keeps memory usage tight and avoids reading the entire sheet. The example below mirrors the approach that measured slightly lower memory usage than an early break:

from openpyxl import load_workbook

def preview_xlsx_head(xlsx_file: str, limit: int = 20):
    opts = {"read_only": True, "data_only": True, "keep_links": False}
    wb = load_workbook(xlsx_file, **opts)
    try:
        previews = {}
        for ws_name in wb.sheetnames:
            ws = wb[ws_name]
            # max_row caps traversal; +1 accounts for the header row.
            sample = [
                [cell.value for cell in row]
                for row in ws.iter_rows(max_row=limit + 1)
            ]
            previews[ws_name] = sample
        return previews
    finally:
        # Read-only workbooks hold the file handle open until closed.
        wb.close()

If you prefer to stay inside pandas for convenience, you can leverage the early break with nrows:

import pandas as pd

def preview_xlsx_with_pandas(xlsx_file: str, n: int = 20):
    # nrows triggers openpyxl's early break inside read_excel.
    df = pd.read_excel(xlsx_file, nrows=n)
    return df

With polars, you can try the fastexcel n_rows option for completeness. The call looks like this, though tests indicated it did not improve memory in this scenario:

import polars as pl

def preview_xlsx_with_polars(xlsx_file: str, n: int = 20):
    # read_options is forwarded to the fastexcel engine.
    frame = pl.read_excel(xlsx_file, read_options={"n_rows": n})
    return frame

Why this matters for production previews

Preview features run frequently and at the edges of user input size. Small inefficiencies compound into higher memory pressure and slower response times. XLSX is particularly sensitive because readers may parse entire sheets before you can slice down to the first rows, which is exactly what a preview does not need. Choosing a reader that stops as soon as it has the requested rows is the simplest and most impactful optimization for this use case.

Takeaways and recommendations

When your objective is a memory-thrifty “first N rows” preview, the safest defaults are the standard library csv reader for CSVs and openpyxl with iter_rows(max_row=...) for Excel files. If you already use pandas, read_excel(..., nrows=...) benefits from openpyxl’s early break and fits the same goal. With polars, be aware that the default fastexcel path and the openpyxl engine differ in early-exit behavior; in testing, the fastexcel n_rows option did not change memory usage and the openpyxl engine read all data. For large uploads and responsive previews, pick the path that reads no more than you need.