2025, Dec 06 13:00
Fill Missing Event Dates in Pandas by Averaging Treatment-to-Event Delta and Updating Only NaT
Learn how to impute missing event dates in pandas by averaging treatment-event time deltas and safely updating only NaT values with update overwrite=False.
Imputing event dates from reliable treatment timestamps is a common task in EDA and regression workflows, especially when the event field has gaps. If you already have both columns in pandas datetime and have computed the average gap between them, you can fill only the missing entries without touching the valid ones.
Problem setup
There are two datetime columns: one for the event that may be missing, and another for the treatment that is accurate. The goal is to compute the average time difference between treatment and event where both exist, then use that average to populate the missing event dates, leaving existing values intact.
Minimal example
Consider a small dataframe with several missing event dates:
>>> tbl
incident_dt therapy_dt
0 2025-02-20 2025-04-29
1 NaT 2025-05-22
2 NaT 2025-05-27
3 2025-03-14 2025-04-26
4 NaT 2025-05-11
5 2025-02-16 2025-04-23
6 NaT 2025-05-04
7 NaT 2025-05-20
8 2025-03-18 2025-05-28
9 2025-02-22 2025-05-21
What’s going on and why it matters
The dataset mixes complete rows with rows missing the event date. You don’t want to drop those rows because the column is important for analysis. Instead, compute an average time delta from rows that do have both dates, then apply that delta consistently to fill only the missing values. Using pandas for this lets you keep the original values and operate vectorially on thousands of rows.
Solution
First compute the average difference on rows where both dates are present. Then build a working copy of the two columns, set the event date in that copy to the treatment date plus the average delta, and finally push those filled values back into the original dataframe, updating only the missing entries.
>>> subset = tbl[tbl["incident_dt"].notna()]
>>> mean_delta = (subset["therapy_dt"] - subset["incident_dt"]).mean().round('d')
>>> mean_delta
Timedelta('67 days 00:00:00')
>>> tmp = tbl[["incident_dt", "therapy_dt"]].copy()
>>> tmp["incident_dt"] = tmp["therapy_dt"] + mean_delta
>>> tbl.update(tmp, overwrite=False)
>>> tbl
incident_dt therapy_dt
0 2025-02-20 2025-04-29
1 2025-07-28 2025-05-22
2 2025-08-02 2025-05-27
3 2025-03-14 2025-04-26
4 2025-07-17 2025-05-11
5 2025-02-16 2025-04-23
6 2025-07-10 2025-05-04
7 2025-07-26 2025-05-20
8 2025-03-18 2025-05-28
9 2025-02-22 2025-05-21
The key piece is update with overwrite set to False. That ensures existing, non-missing incident_dt values remain unchanged while NaT entries are filled from the copy. The round('d') call standardizes the average offset to whole days.
Why this is useful
This approach gives you a simple, reproducible imputation step that plays nicely with downstream EDA and regression modeling. It relies only on the information already in your dataset and does not disturb valid values, which helps maintain the integrity of the data you already trust.
Takeaways
Compute the average time delta from rows that have both dates, apply that delta to generate candidate values in a copy, and use update with overwrite disabled to fill only the missing entries. This pattern is concise, keeps original values safe, and fits naturally into a pandas-based analysis workflow.