2025, Oct 30 05:00

Normalize person names in Polars using coalesce to merge split columns with parsed Last, First strings

Learn how to normalize person names in Polars using coalesce: prefer existing split columns and fall back to parsing 'Last, First' strings; fewer conditionals.

Polars makes it straightforward to normalize string fields, but mixing mutually exclusive inputs often leads to verbose branching. A common case is person names: some rows carry a single comma-delimited string, others already have first and last name split across columns. The goal is to converge on consistent columns without a tangle of repeated conditionals.

Problem setup

We have a dataset where some rows provide a full name as a single string in the format “Last, First”, while others already separate last and first names. We want to end up with two clean columns, using the original split columns when present and falling back to parsing the full name otherwise.

import polars as pl

rows = [
    {"name_full": "McCartney, Paul"},
    {"name_last": "Lennon", "name_first": "John"},
    {"name_full": "Starr, Ringo"},
    {"name_last": "Harrison", "name_first": "George"}
]
people_df = pl.DataFrame(rows)

Why the naive approach hurts

One way is to split the full name, strip whitespace, and then branch per target column. That works, but each destination column requires its own conditional, which quickly becomes unwieldy as the schema grows.

(
    people_df.with_columns(
        pl.col("name_full")
          .str.split(",")
          .list.eval(pl.element().str.strip_chars())
          .alias("pieces")
    ).with_columns(
        pl.when(pl.col("name_last").is_null())
          .then(pl.col("pieces").list.get(0, null_on_oob=True))
          .otherwise(pl.col("name_last")).alias("name_last"),
        pl.when(pl.col("name_first").is_null())
          .then(pl.col("pieces").list.get(1, null_on_oob=True))
          .otherwise(pl.col("name_first")).alias("name_first")
    ).select(pl.all().exclude("name_full", "pieces"))
)

Functionally it’s fine, but every additional field introduces another branch, adding boilerplate and visual noise.

A concise pattern with coalesce

When the full name follows the comma pattern, it’s enough to keep a parsed list of tokens and then prefer existing split columns when they’re already filled. The key is to rely on coalesce, which returns the first non-null value. That eliminates repetitive conditionals while preserving intent.

people_df.with_columns(
    pl.col("name_full")
      .str.split(",")
      .list.eval(pl.element().str.strip_chars())
      .alias("pieces")
).select(
    pl.coalesce(
        pl.col("name_last"),
        pl.col("pieces").list.get(0, null_on_oob=True)
    ).alias("name_last"),
    pl.coalesce(
        pl.col("name_first"),
        pl.col("pieces").list.get(1, null_on_oob=True)
    ).alias("name_first")
)

This keeps the parsing step explicit and small, then folds the merge logic into a pair of readable expressions. The list.get calls use null_on_oob=True to avoid errors when a row doesn’t carry a full name.

Why this matters

Data engineering pipelines often need to reconcile heterogeneous inputs into a canonical schema. Compact expressions reduce the maintenance burden, lower the chance of mistakes, and keep intent clear. In Polars, leaning on coalesce helps you express “prefer this column, otherwise take the parsed fallback” without duplicating branching logic for each target field.

Takeaways

When a column like name_full uses a consistent delimiter, parse it once, keep the intermediate array, and then use coalesce to select the first available value between the original split columns and the parsed tokens. This scales cleanly as you add fields and keeps your transformation steps straightforward and easy to audit.

The article is based on a question from StackOverflow by dewser_the_board and an answer by Jonathan.

python python-polars