2025, Dec 13 15:00
How to drop rows with internal NaNs in pandas time series using vectorized masks (isna + cummin)
Learn a fast, vectorized way to drop rows with internal NaNs in pandas time series using isna, cummin, and interpolate(limit_area='inside'). Boost performance.
Cleaning time series with leading and trailing gaps is routine, but occasional internal np.nan values are trickier. When a nan appears between valid numbers, it corrupts the continuous block and often needs to be removed. The goal is to scan each column, detect such internal gaps, and drop the entire row whenever a nan sits inside the data run rather than in the padding at the edges.
Problem setup
Consider a column that begins and ends with np.nan values, contains a continuous block of numbers, but has an unexpected internal gap. We want to find the np.nan between 9 and 13 and delete the corresponding row if, and only if, all three conditions are met: there is at least one valid value before, at least one valid value after, and the current value is nan.
[np.nan, np.nan, np.nan, 1, 4, 6, 6, 9, np.nan, 13, np.nan, np.nan]
Baseline approach (works, but slow)
The following script satisfies the requirements by checking each cell in a column against the three conditions. It keeps a row unless the cell is an internal nan; however, it relies on Python loops and slicing, which makes it slow on large data.
import pandas as pd
import numpy as np
payload = {
'A': [np.nan, np.nan, np.nan, 1, 4, 6, 6, 9, np.nan, 13, np.nan, np.nan],
'B': [np.nan, np.nan, np.nan, 11, 3, 16, 13, np.nan, np.nan, 12, np.nan, np.nan]
}
frame = pd.DataFrame(payload)
def mark_valid_rows(series_obj):
    kept = []
    for pos in range(len(series_obj)):
        # A cell is an internal nan when valid data exists both before and after it.
        has_before = not series_obj.iloc[:pos].isnull().all()
        has_after = not series_obj.iloc[pos + 1:].isnull().all()
        is_empty = np.isnan(series_obj.iloc[pos])
        # Keep the row unless all three conditions hold.
        kept.append(not (has_before and has_after and is_empty))
    return kept

for col_name in frame.columns:
    frame = frame[mark_valid_rows(frame[col_name])]
print(frame)
What’s really going on
The intended logic is sound: an internal nan is a nan that is not part of the leading or trailing block of nans. The performance issue comes from iterating Python-side over every element and repeatedly slicing the Series to check “anything before/after,” which is costly.
We can express the same idea with vectorized masks. First, identify all nans. Then, flag those nans that belong to the outer padding — i.e., everything from the top until the first non-nan, and from the bottom until the first non-nan. Any nan that is not in those outer regions is an internal gap. Finally, keep rows where, for every column, the cell is either not a nan or is an outer nan.
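The mask algebra can be made concrete on a toy column before touching the full DataFrame. In this sketch, cummin in the forward direction flags the leading run of nans, cummin over the reversed series flags the trailing run, and whatever nan is left over is an internal gap:

```python
import pandas as pd
import numpy as np

# Toy column: two leading nans, a run of numbers with one internal gap,
# and two trailing nans.
s = pd.Series([np.nan, np.nan, 1.0, 2.0, np.nan, 3.0, np.nan, np.nan])

nan_mask = s.isna()                    # every nan
lead = nan_mask.cummin()               # True until the first non-nan
trail = nan_mask[::-1].cummin()[::-1]  # True from the last non-nan onward
outer = lead | trail                   # edge padding
internal = nan_mask & ~outer           # nans inside the data run

print(internal.tolist())
# [False, False, False, False, True, False, False, False]
```

Note that cummin on a boolean Series behaves like a running logical AND: it stays True only while every value seen so far is True, which is exactly the definition of the leading (or, reversed, trailing) nan block.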
Vectorized solution with isna and cummin
Start with a single column to make the masks explicit. We mark nans, then use cumulative minima in forward and reverse directions to label the “external” nans. We keep values that are either non-nan or external nan.
import pandas as pd
import numpy as np
payload = {
'A': [np.nan, np.nan, np.nan, 1, 4, 6, 6, 9, np.nan, 13, np.nan, np.nan],
'B': [np.nan, np.nan, np.nan, 11, 3, 16, 13, np.nan, np.nan, 12, np.nan, np.nan]
}
table = pd.DataFrame(payload)
nan_mask = table['A'].isna()
edge_mask = (nan_mask.cummin() | nan_mask[::-1].cummin())
result_col = table.loc[edge_mask | ~nan_mask, 'A']
print(result_col)
Output:
0 NaN
1 NaN
2 NaN
3 1.0
4 4.0
5 6.0
6 6.0
7 9.0
9 13.0
10 NaN
11 NaN
Name: A, dtype: float64
Apply the same idea to the entire DataFrame. The only extra step is aggregating the per-column masks with all(axis=1) to keep only those rows that pass the rule in every column.
import pandas as pd
import numpy as np
payload = {
'A': [np.nan, np.nan, np.nan, 1, 4, 6, 6, 9, np.nan, 13, np.nan, np.nan],
'B': [np.nan, np.nan, np.nan, 11, 3, 16, 13, np.nan, np.nan, 12, np.nan, np.nan]
}
matrix = pd.DataFrame(payload)
nan_mask_df = matrix.isna()
edge_mask_df = (nan_mask_df.cummin() | nan_mask_df[::-1].cummin())
filtered = matrix.loc[(edge_mask_df | ~nan_mask_df).all(axis=1)]
print(filtered)
Output:
A B
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.0 11.0
4 4.0 3.0
5 6.0 16.0
6 6.0 13.0
9 13.0 12.0
10 NaN NaN
11 NaN NaN
Alternative: interpolate with limit_area='inside'
Another concise route is to use interpolate to detect internal gaps. After interpolation with limit_area='inside', only external nans remain as nans. This lets us build the same keep-mask.
import pandas as pd
import numpy as np
payload = {
'A': [np.nan, np.nan, np.nan, 1, 4, 6, 6, 9, np.nan, 13, np.nan, np.nan],
'B': [np.nan, np.nan, np.nan, 11, 3, 16, 13, np.nan, np.nan, 12, np.nan, np.nan]
}
block = pd.DataFrame(payload)
present_mask = block.notna()
outer_nan_mask = block.interpolate(limit_area='inside').isna()
filtered_alt = block[(present_mask | outer_nan_mask).all(axis=1)]
print(filtered_alt)
Why this matters
Row-level data quality checks are a common bottleneck in pipelines. Expressing “internal gap” detection with vectorized masks not only matches the intended semantics exactly, it also scales naturally to wide tables and long time series. The logic remains readable: identify nans, separate outer padding from internal gaps, and retain rows only when every column is safe.
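To see that both routes agree at scale, here is a sketch on a synthetic column (the sizes and gap counts are arbitrary choices for illustration). It builds a long series with padded edges and a handful of internal gaps, then checks that the cummin mask and the interpolate mask select the same rows:

```python
import numpy as np
import pandas as pd

# Hypothetical larger column: padded edges plus scattered internal gaps.
rng = np.random.default_rng(42)
n = 100_000
col = rng.random(n)
col[:100] = np.nan                                 # leading padding
col[-100:] = np.nan                                # trailing padding
col[rng.integers(100, n - 100, size=50)] = np.nan  # internal gaps
big = pd.DataFrame({'A': col})

# Route 1: isna + directional cummin.
nan_mask = big.isna()
edge_mask = nan_mask.cummin() | nan_mask[::-1].cummin()
via_cummin = big.loc[(edge_mask | ~nan_mask).all(axis=1)]

# Route 2: interpolate(limit_area='inside') leaves only outer nans.
outer_nan = big.interpolate(limit_area='inside').isna()
via_interp = big[(big.notna() | outer_nan).all(axis=1)]

print(via_cummin.equals(via_interp))  # True: both masks select the same rows
```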
Takeaways
When the rule depends on relative position inside a column, look for mask algebra before reaching for Python loops. Combining isna with directional cummin cleanly separates edge padding from internal holes. If interpolation is acceptable in your workflow, limit_area='inside' provides another compact way to expose internal gaps while leaving external ones intact. Both approaches align with boolean indexing and preserve the precise behavior you need: drop rows with nan values that interrupt real data, keep the rest untouched.
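The whole recipe condenses into a small reusable helper. This is a sketch; the function name drop_internal_nan_rows is our own, not a pandas API:

```python
import pandas as pd
import numpy as np

def drop_internal_nan_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows where any column has a nan inside its run of valid data."""
    nan_mask = df.isna()
    # Edge nans: leading block (forward cummin) or trailing block (reverse cummin).
    edge_mask = nan_mask.cummin() | nan_mask[::-1].cummin()
    # Keep a row only if every cell is either a real value or an edge nan.
    return df.loc[(edge_mask | ~nan_mask).all(axis=1)]

demo = pd.DataFrame({'A': [np.nan, 1.0, np.nan, 2.0, np.nan]})
print(drop_internal_nan_rows(demo))  # drops only row 2, the internal nan
```

The helper works unchanged on wide tables, since cummin and the boolean reduction both operate column-wise.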