2025, Dec 06 13:00

Fill Missing Event Dates in Pandas by Averaging Treatment-to-Event Delta and Updating Only NaT

Learn how to impute missing event dates in pandas by averaging treatment-event time deltas and safely updating only NaT values with update overwrite=False.

Imputing event dates from reliable treatment timestamps is a common task in EDA and regression workflows, especially when the event field has gaps. If you already have both columns in pandas datetime and have computed the average gap between them, you can fill only the missing entries without touching the valid ones.

Problem setup

There are two datetime columns: one for the event that may be missing, and another for the treatment that is accurate. The goal is to compute the average time difference between treatment and event where both exist, then use that average to populate the missing event dates, leaving existing values intact.

Minimal example

Consider a small dataframe with several missing event dates:

>>> tbl
  incident_dt therapy_dt
0   2025-02-20  2025-04-29
1          NaT  2025-05-22
2          NaT  2025-05-27
3   2025-03-14  2025-04-26
4          NaT  2025-05-11
5   2025-02-16  2025-04-23
6          NaT  2025-05-04
7          NaT  2025-05-20
8   2025-03-18  2025-05-28
9   2025-02-22  2025-05-21

What’s going on and why it matters

The dataset mixes complete rows with rows missing the event date. You don’t want to drop those rows because the column is important for analysis. Instead, compute an average time delta from rows that do have both dates, then apply that delta consistently to fill only the missing values. Using pandas for this lets you keep the original values and operate vectorially on thousands of rows.

Solution

First compute the average difference on rows where both dates are present. Then build a working copy of the two columns, set the event date in that copy to the treatment date plus the average delta, and finally push those filled values back into the original dataframe, updating only the missing entries.

>>> subset = tbl[tbl["incident_dt"].notna()]
>>> mean_delta = (subset["therapy_dt"] - subset["incident_dt"]).mean().round('d')
>>> mean_delta
Timedelta('67 days 00:00:00')
>>> tmp = tbl[["incident_dt", "therapy_dt"]].copy()
>>> tmp["incident_dt"] = tmp["therapy_dt"] + mean_delta
>>> tbl.update(tmp, overwrite=False)
>>> tbl
  incident_dt therapy_dt
0   2025-02-20  2025-04-29
1   2025-07-28  2025-05-22
2   2025-08-02  2025-05-27
3   2025-03-14  2025-04-26
4   2025-07-17  2025-05-11
5   2025-02-16  2025-04-23
6   2025-07-10  2025-05-04
7   2025-07-26  2025-05-20
8   2025-03-18  2025-05-28
9   2025-02-22  2025-05-21

The key piece is update with overwrite set to False. That ensures existing, non-missing incident_dt values remain unchanged while NaT entries are filled from the copy. The round('d') call standardizes the average offset to whole days.

Why this is useful

This approach gives you a simple, reproducible imputation step that plays nicely with downstream EDA and regression modeling. It relies only on the information already in your dataset and does not disturb valid values, which helps maintain the integrity of the data you already trust.

Takeaways

Compute the average time delta from rows that have both dates, apply that delta to generate candidate values in a copy, and use update with overwrite disabled to fill only the missing entries. This pattern is concise, keeps original values safe, and fits naturally into a pandas-based analysis workflow.