2025, Nov 07 11:00
Align asynchronous streams in pandas: subtract the last B's timestamp using a vectorized cumsum
Learn a vectorized pandas technique to align asynchronous streams: compute each A row's offset to the previous B with a boolean cumsum, groupby, and transform, with no loops.
When two data streams tick at different rates, aligning them can be deceptively tricky. Imagine a mixed sequence of rows from sources A and B, and for every A event you need the time delta to the most recent preceding B. The data isn’t strictly periodic; there’s jitter, so a fixed stride or simple shift won’t help. The solution should avoid iteration and lean on vectorized pandas operations.
Example dataset
The input contains rows from two sources with timestamps. For each A row we want the offset from the last B that happened before it.
src,idx,ts
B,1,20
A,1,100
A,2,200
A,3,300
B,2,320
A,4,400
A,5,500
A,6,600
B,3,620
The expected result keeps only A rows and shows ts as an offset relative to the previous B:
src,idx,ts
A,1,80
A,2,180
A,3,280
A,4,80
A,5,180
A,6,280
Why this is non-trivial
We need to compute a per-row difference where each A references the timestamp of the nearest earlier B. Because the cadence has jitter, you cannot rely on fixed window sizes or positional arithmetic. What you actually need is to partition the sequence into contiguous sections, each starting at a B and extending up to (but not including) the next B, and then subtract the first timestamp of each section from all timestamps within that section. Finally, drop the B rows to keep only the A offsets.
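To make the partitioning concrete, here is the sample data with an extra section column added purely for illustration; each B row starts a new section, and its timestamp is the reference for every row in that section:
src,idx,ts,section
B,1,20,1
A,1,100,1
A,2,200,1
A,3,300,1
B,2,320,2
A,4,400,2
A,5,500,2
A,6,600,2
B,3,620,3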
Vectorized approach in pandas
The key is to build a grouping key that increments every time a B appears. This can be done by comparing src to 'B' and taking a cumulative sum. Group by that key, take the first timestamp within each group, and subtract it from all timestamps in the same group. This resets the timeline to zero at every B and yields the desired offsets for the A rows. No explicit Python loops, no apply.
Reproducible code
The snippet below constructs the sample frame, performs the grouping, computes offsets, and filters out B rows.
import pandas as pd

# Sample data: B rows are the sync markers, A rows are the events to offset.
records = pd.DataFrame(
    {
        "src": ["B", "A", "A", "A", "B", "A", "A", "A", "B"],
        "idx": [1, 1, 2, 3, 2, 4, 5, 6, 3],
        "ts": [20, 100, 200, 300, 320, 400, 500, 600, 620],
    }
)
# Grouping key: increments at every B row, so each value labels one section
# that starts at a B and runs until just before the next B.
bucket = records["src"].eq("B").cumsum()
# Subtract each section's first timestamp (the B timestamp) from every row in it.
records["ts"] = records["ts"] - records.groupby(bucket)["ts"].transform("first")
# Keep only the A rows; their ts is now the offset from the preceding B.
result = records.loc[records["src"].ne("B"), ["src", "idx", "ts"]]
print(result)
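Running the snippet prints the A rows with ts rewritten as offsets. The original positional index (1, 2, 3, 5, 6, 7) is kept; append .reset_index(drop=True) if you prefer a clean zero-based index. The printed values match the expected table above (exact spacing depends on your pandas display settings):
  src  idx   ts
1   A    1   80
2   A    2  180
3   A    3  280
5   A    4   80
6   A    5  180
7   A    6  280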
How it works
The expression src.eq('B').cumsum() produces an integer Series that increases by one at each B and stays constant for all following rows until the next B. Grouping by that Series partitions the data into sections that begin at B and end right before the next B. transform('first') broadcasts the first timestamp of each partition to all rows in that partition. Subtracting it from ts resets timestamps relative to the last B. Filtering out rows where src is 'B' leaves only the A offsets you want.
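To make the intermediate steps visible for the sample frame, you can inspect the grouping key and the broadcast reference timestamps before the subtraction overwrites ts; the values in the comments below are what the sample data produces:
# Run this before the line that overwrites records["ts"].
bucket = records["src"].eq("B").cumsum()
print(bucket.tolist())
# [1, 1, 1, 1, 2, 2, 2, 2, 3]
reference = records.groupby(bucket)["ts"].transform("first")
print(reference.tolist())
# [20, 20, 20, 20, 320, 320, 320, 320, 620]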
Why this matters
In time-series engineering and stream processing, it’s common to calculate offsets to synchronization markers or control events. Doing this with vectorized pandas operations is both concise and scalable. It avoids row-wise iteration and the overhead of apply, while staying readable and maintainable.
Takeaways
When aligning asynchronous streams in pandas, think in terms of partitions defined by sentinel events. A cumulative sum over a boolean mask gives you a simple, robust grouping key. Within each partition, transform('first') lets you subtract a reference timestamp without leaving the vectorized path. And if you only need certain rows in the end, filter them after the computation rather than complicating the grouping logic.
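If you reuse this pattern often, it can be wrapped in a small helper. The sketch below is illustrative, not part of pandas; the name offset_to_last_sentinel and its parameters are made up here, and it assumes every row you want to offset is preceded by at least one sentinel row:
def offset_to_last_sentinel(df, src_col, ts_col, sentinel):
    """Return non-sentinel rows with ts_col rewritten as the offset
    from the most recent preceding sentinel row."""
    out = df.copy()
    # A new section starts at every sentinel row.
    section = out[src_col].eq(sentinel).cumsum()
    # Broadcast each section's first timestamp (the sentinel's) and subtract it.
    out[ts_col] = out[ts_col] - out.groupby(section)[ts_col].transform("first")
    # Drop the sentinel rows themselves.
    return out.loc[out[src_col].ne(sentinel)]

# For the sample frame above (before its ts column was modified in place):
# offsets = offset_to_last_sentinel(records, "src", "ts", "B")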