2025, Nov 24 07:00

Hourly Resampling with Safe Interpolation in Pandas: Split Time Series into Contiguous Segments

Learn how to resample to hourly and interpolate only within contiguous blocks in Pandas. Detect gaps with index.diff, group by cumsum, and avoid long-gap fills.

When you resample a time series to hourly frequency, interpolation is a convenient way to fill missing timestamps. The hitch is that interpolation also fills wide gaps, fabricating values across stretches where no data was captured. The goal is to split the data into contiguous hourly segments, interpolate only within each contiguous block, and leave large breaks as separators between segments.

Problem setup

Consider a time-indexed DataFrame with missing hours. A straightforward resample plus interpolate fills every hole, including large ones. For example:
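For concreteness, here is a minimal frame to run the snippets against; the timestamps and values are an assumption, reconstructed to match the segment outputs shown later in the article:

```python
import pandas as pd

# Hypothetical example data (reconstructed from the outputs below):
# three runs separated by two 5-hour breaks, plus one 2-hour hole at 14:00
idx = pd.to_datetime(['2023-03-18 05:00', '2023-03-18 06:00', '2023-03-18 07:00',
                      '2023-03-18 12:00', '2023-03-18 13:00', '2023-03-18 15:00',
                      '2023-03-18 20:00', '2023-03-18 21:00'])
frame = pd.DataFrame({'A': [3.0, 4.0, 24.4, 5.6, 3.4, 4.5, 8.8, 3.2]}, index=idx)
```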

frame = frame.resample('1h').first()        # regular hourly grid; missing hours become NaN
frame = frame.interpolate(method='time')    # fills every NaN, long gaps included
frame

This produces hourly values through the entire range, even over multi-hour gaps, which you might not want.

Why this happens

Resampling creates a regular hourly index, and time-based interpolation computes values between the nearest known points using their timestamps. By design, it does not know that a four- or five-hour break should be treated as a boundary. If the business rule says “do not interpolate across gaps longer than 3 hours,” you need a way to detect those gaps first and only interpolate within contiguous runs.
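A short sketch, on hypothetical data with a five-hour break, makes the fabrication visible: every hour inside the break receives a plausible-looking interpolated value.

```python
import pandas as pd

# Hypothetical data with a 5-hour break between 07:00 and 12:00
idx = pd.to_datetime(['2023-03-18 05:00', '2023-03-18 06:00', '2023-03-18 07:00',
                      '2023-03-18 12:00', '2023-03-18 13:00', '2023-03-18 15:00',
                      '2023-03-18 20:00', '2023-03-18 21:00'])
frame = pd.DataFrame({'A': [3.0, 4.0, 24.4, 5.6, 3.4, 4.5, 8.8, 3.2]}, index=idx)

hourly = frame.resample('1h').first().interpolate(method='time')
# 08:00-11:00 fall inside the outage, yet all of them get values:
# e.g. 09:00 becomes 24.4 + (5.6 - 24.4) * 2/5 = 16.88, a pure fabrication.
```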

Strategy: split by large breaks, then resample and interpolate within each part

The index delta between adjacent rows reveals where the big jumps are. Compute the difference of consecutive timestamps, compare it to your chosen threshold, and accumulate groups with a running sum. Each group is a contiguous segment. Resample and interpolate per segment to fill just the short gaps.
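A sketch of just the detection step, on hypothetical data; frame.index.to_series().diff() is used here as an equivalent of Index.diff that also runs on pandas versions before 2.1:

```python
import pandas as pd

idx = pd.to_datetime(['2023-03-18 05:00', '2023-03-18 06:00', '2023-03-18 07:00',
                      '2023-03-18 12:00', '2023-03-18 13:00', '2023-03-18 15:00',
                      '2023-03-18 20:00', '2023-03-18 21:00'])
frame = pd.DataFrame({'A': [3.0, 4.0, 24.4, 5.6, 3.4, 4.5, 8.8, 3.2]}, index=idx)

limit = pd.Timedelta(hours=3)
deltas = frame.index.to_series().diff()   # gap to the previous row; first entry is NaT
markers = deltas > limit                  # True exactly where a new segment starts
labels = markers.cumsum()                 # contiguous group labels: 0,0,0,1,1,1,2,2
```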

Complete solution

The following snippet creates segments at gaps strictly greater than three hours and returns either a dictionary of hourly-interpolated DataFrames or a list you can concatenate later.

limit = pd.Timedelta(hours=3)
# Index.diff requires pandas >= 2.1; frame.index.to_series().diff() is an equivalent on older versions
markers = (frame.index.diff() > limit)
parts_map = {f'part{ix}': chunk.resample('1h').first().interpolate(method='time')
             for ix, (grp, chunk) in enumerate(frame.groupby(markers.cumsum()))}

If you prefer a list that can be joined back later:

segments = [chunk.resample('1h').first().interpolate(method='time')
            for grp, chunk in frame.groupby(markers.cumsum())]
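An end-to-end sketch of the list-then-concatenate route, on hypothetical data; note that the hours inside the long gaps are simply absent from the result rather than filled:

```python
import pandas as pd

idx = pd.to_datetime(['2023-03-18 05:00', '2023-03-18 06:00', '2023-03-18 07:00',
                      '2023-03-18 12:00', '2023-03-18 13:00', '2023-03-18 15:00',
                      '2023-03-18 20:00', '2023-03-18 21:00'])
frame = pd.DataFrame({'A': [3.0, 4.0, 24.4, 5.6, 3.4, 4.5, 8.8, 3.2]}, index=idx)

limit = pd.Timedelta(hours=3)
markers = frame.index.to_series().diff() > limit
segments = [chunk.resample('1h').first().interpolate(method='time')
            for _, chunk in frame.groupby(markers.cumsum())]
result = pd.concat(segments)  # hourly within segments; long gaps stay empty
```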

This yields the expected three segments for the example data, now hourly and interpolated only within contiguous blocks:

{'part0':                         A
 2023-03-18 05:00:00   3.0
 2023-03-18 06:00:00   4.0
 2023-03-18 07:00:00  24.4,
 'part1':                         A
 2023-03-18 12:00:00  5.60
 2023-03-18 13:00:00  3.40
 2023-03-18 14:00:00  3.95
 2023-03-18 15:00:00  4.50,
 'part2':                        A
 2023-03-18 20:00:00  8.8
 2023-03-18 21:00:00  3.2}

How the grouping works

The core idea is simple. The index difference highlights where breaks exceed the threshold. Converting those booleans into groups with a cumulative sum produces contiguous labels that can be fed into groupby. Within each group, timestamps are close enough to justify interpolation after resampling. Across groups, nothing is interpolated because the boundary is preserved by the segmentation.

Why this is worth knowing

Regularizing time series is a common preprocessing step for modeling, monitoring, and analytics. Blindly interpolating through long outages or capture gaps can create misleading artifacts. Segmenting by gap size lets you keep the convenience of resample and interpolate while respecting data quality constraints. It also simplifies downstream logic: each segment is a clean, hourly-aligned block with realistic fills.

Practical takeaways

Detect large jumps with index.diff and compare to a Timedelta threshold. Turn those jumps into contiguous group labels using cumsum. Apply resample and time-based interpolate inside each group only. If you need to reassemble the series, keep a list and concatenate later; if you need separate artifacts, keep a dictionary keyed by segment index. This keeps interpolation honest and your hourly data trustworthy.
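The recipe condenses into a small helper; the function name and defaults below are assumptions for illustration, not an established API:

```python
import pandas as pd

def resample_in_segments(frame, limit=pd.Timedelta(hours=3), freq='1h'):
    """Resample to `freq`, interpolating only inside runs whose internal
    gaps are all <= `limit`; longer breaks split the data into parts."""
    # to_series().diff() mirrors Index.diff and also works on pandas < 2.1
    markers = frame.index.to_series().diff() > limit
    return {f'part{ix}': chunk.resample(freq).first().interpolate(method='time')
            for ix, (_, chunk) in enumerate(frame.groupby(markers.cumsum()))}

# Demo on hypothetical data
idx = pd.to_datetime(['2023-03-18 05:00', '2023-03-18 06:00', '2023-03-18 07:00',
                      '2023-03-18 12:00', '2023-03-18 13:00', '2023-03-18 15:00',
                      '2023-03-18 20:00', '2023-03-18 21:00'])
frame = pd.DataFrame({'A': [3.0, 4.0, 24.4, 5.6, 3.4, 4.5, 8.8, 3.2]}, index=idx)
parts = resample_in_segments(frame)
```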