2025, Oct 26 15:00
Polars Copy-on-Write in Practice: DataFrame immutability, column cloning, and memory reuse
Learn how Polars copy-on-write works: DataFrames stay independent while columns share or clone memory. Explore with_columns, n_chunks, and performance tips.
Polars introduces copy-on-write semantics that feel familiar to systems developers, yet they can be counterintuitive when you expect fine-grained in-place updates. The subtlety is not about whether objects look independent, but about when memory is actually cloned and when buffers are reused. Below is a hands-on walkthrough that clarifies how DataFrames and their columns behave as you branch and transform them.
Reproducing the scenario
The example starts with a simple DataFrame, then creates views and modifies a single column. Every step uses new variable names so it’s clear when a new object is produced, and what memory can still be shared under copy-on-write.
import polars as pl
base_df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
view_df = base_df  # Plain Python assignment: a second name for the same object; no data is copied.
view_df = view_df.with_columns(
    pl.Series([7, 8, 9]).alias("b")
)  # Returns a new DataFrame: b is a freshly built Series, while the chunk for a is reused.
final_df = view_df  # Again just a new name binding; the underlying buffers stay shared.
final_df = final_df.with_row_index("rid")  # Add a temporary row index for conditional edits.
final_df = final_df.with_columns(
    pl.when(pl.col("rid") == 0)
    .then(10)
    .otherwise(pl.col("b"))
    .alias("b")
)  # Produces a new Series for b with the first value changed to 10.
final_df = final_df.with_columns(
    pl.when(pl.col("rid") == 1)
    .then(11)
    .otherwise(pl.col("b"))
    .alias("b")
)  # Produces another new Series for b with the second value set to 11.
final_df = final_df.drop("rid")  # Remove the helper index column.
What actually happens
The essential model is that Polars uses copy-on-write. As long as a column has not been modified, different DataFrames created via assignment or transformations may continue to reference the exact same memory chunk. In this sequence, column a is never changed, so base_df, view_df, and final_df still share the same chunk for a. That’s the heart of the optimization: no unnecessary copying for untouched data.
Columns behave like atomic units. Any change to a column produces a new Series and therefore a new chunk, and the DataFrame you assign it to becomes a new object that now references that new chunk. This also means with_columns always returns a new DataFrame; even if only one element changes, the modified Series is new memory. In other words, updates are not in-place.
Clearing up a common misconception
A frequent misreading is the claim that “only one column a exists right now.” If that means there is a single column chunk shared across all these DataFrames, the statement matches what Polars does in this example: there is one chunk for a, and each DataFrame points to it. The distinction matters because the DataFrames are semantically independent, yet they can reuse identical buffers until a mutation touches them.
How to reason about copy-on-write here
First, observe that you never touched a after the initial DataFrame creation. That’s why its memory is still shared. Second, every time you assign a new b—either by replacing it with a fresh Series or by producing it via a conditional expression—you get a new chunk for b. The previous b stays intact for any DataFrame still referencing it. Polars tracks ownership at the chunk level; if an expression would impact the original buffer, the chunk is cloned, otherwise it is reused.
You can make this more tangible by trying the same workflow with very large integer or float columns. With just a few rows, memory usage differences are negligible. With on the order of 100M rows of 64-bit values, you will visibly notice when roughly 800 MB is copied or not, depending on whether a column is cloned or shared. To inspect how many chunks each column currently has, use DataFrame.n_chunks (pass strategy="all" to get the count for every column instead of just the first), which helps verify what is being reused and what has been cloned.
Corrected view of the process
Summarizing the lifecycle in the example yields a consistent mental model: DataFrame assignment creates a new handle that still references original buffers; modifying b produces a new Series and a new DataFrame; further conditional updates to b keep producing new Series; a remains shared because it was never modified. The independence is semantic at the DataFrame level, while physical memory reuse happens at the column-chunk level until a mutation requires cloning.
Why this matters
Understanding this model lets you reason about performance and memory behavior without resorting to guesswork. It explains why some operations are fast—no data is copied if you don’t touch a column—and why selective updates are efficient: only the affected columns are cloned. It also clarifies why you should not expect single-element, in-place mutation semantics; the result of each transformation is a new DataFrame with new Series where necessary, which is consistent with immutability and copy-on-write.
Practical advice
Treat DataFrames and Series as immutable, and think in terms of column-level chunks. Expect with_columns to return a new DataFrame and expect modified columns to be new Series. If a column remains untouched across clones, it remains shared. When in doubt, check the number of chunks per column and try large-scale tests to observe memory reuse versus cloning in a way that’s easy to spot.
The article is based on a question from StackOverflow by user2961927 and an answer by Aren.