2025, Nov 07 11:00
Align asynchronous streams in pandas: subtract the last B's timestamp using a vectorized cumsum
Learn a vectorized pandas technique to align asynchronous streams: compute each A row's offset to the previous B with a boolean cumsum, groupby, and transform, with no loops.
When two data streams tick at different rates, aligning them can be deceptively tricky. Imagine a mixed sequence of rows from sources A and B, and for every A event you need the time delta to the most recent preceding B. The data isn’t strictly periodic; there’s jitter, so a fixed stride or simple shift won’t help. The solution should avoid iteration and lean on vectorized pandas operations.
Example dataset
The input contains rows from two sources with timestamps. For each A row we want the offset from the last B that happened before it.
src,idx,ts
B,1,20
A,1,100
A,2,200
A,3,300
B,2,320
A,4,400
A,5,500
A,6,600
B,3,620
The expected result keeps only A rows and shows ts as an offset relative to the previous B:
src,idx,ts
A,1,80
A,2,180
A,3,280
A,4,80
A,5,180
A,6,280
Why this is non-trivial
We need to compute a per-row difference where each A references the timestamp of the nearest earlier B. Because the cadence has jitter, you cannot rely on fixed window sizes or positional arithmetic. What you actually need is to partition the sequence into contiguous sections, each starting at a B and extending up to (but not including) the next B, and then subtract the first timestamp of each section from all timestamps within that section. Finally, drop the B rows to keep only the A offsets.
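To make the partitioning concrete, here is the sample data with an extra section column added purely for illustration; each B row starts a new section, and its timestamp is the reference for every row in that section:
src,idx,ts,section
B,1,20,1
A,1,100,1
A,2,200,1
A,3,300,1
B,2,320,2
A,4,400,2
A,5,500,2
A,6,600,2
B,3,620,3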
Vectorized approach in pandas
The key is to build a grouping key that increments every time a B appears. This can be done by comparing src to 'B' and taking a cumulative sum. Group by that key, take the first timestamp within each group, and subtract it from all timestamps in the same group. This resets the timeline to zero at every B and yields the desired offsets for the A rows. No explicit Python loops, no apply.
Reproducible code
The snippet below constructs the sample frame, performs the grouping, computes offsets, and filters out B rows.
import pandas as pd

# Sample data: B rows are the sync markers, A rows are the events to offset.
records = pd.DataFrame(
    {
        "src": ["B", "A", "A", "A", "B", "A", "A", "A", "B"],
        "idx": [1, 1, 2, 3, 2, 4, 5, 6, 3],
        "ts": [20, 100, 200, 300, 320, 400, 500, 600, 620],
    }
)
# Grouping key: increments at every B row, so each value labels one section
# that starts at a B and runs until just before the next B.
bucket = records["src"].eq("B").cumsum()
# Subtract each section's first timestamp (the B timestamp) from every row in it.
records["ts"] = records["ts"] - records.groupby(bucket)["ts"].transform("first")
# Keep only the A rows; their ts is now the offset from the preceding B.
result = records.loc[records["src"].ne("B"), ["src", "idx", "ts"]]
print(result)
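Running the snippet prints the A rows with ts rewritten as offsets. The original positional index (1, 2, 3, 5, 6, 7) is kept; append .reset_index(drop=True) if you prefer a clean zero-based index. The printed values match the expected table above (exact spacing depends on your pandas display settings):
  src  idx   ts
1   A    1   80
2   A    2  180
3   A    3  280
5   A    4   80
6   A    5  180
7   A    6  280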
How it works
The expression src.eq('B').cumsum() produces an integer Series that increases by one at each B and stays constant for all following rows until the next B. Grouping by that Series partitions the data into sections that begin at B and end right before the next B. transform('first') broadcasts the first timestamp of each partition to all rows in that partition. Subtracting it from ts resets timestamps relative to the last B. Filtering out rows where src is 'B' leaves only the A offsets you want.
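To make the intermediate steps visible for the sample frame, you can inspect the grouping key and the broadcast reference timestamps before the subtraction overwrites ts; the values in the comments below are what the sample data produces:
# Run this before the line that overwrites records["ts"].
bucket = records["src"].eq("B").cumsum()
print(bucket.tolist())
# [1, 1, 1, 1, 2, 2, 2, 2, 3]
reference = records.groupby(bucket)["ts"].transform("first")
print(reference.tolist())
# [20, 20, 20, 20, 320, 320, 320, 320, 620]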
Why this matters
In time-series engineering and stream processing, it’s common to calculate offsets to synchronization markers or control events. Doing this with vectorized pandas operations is both concise and scalable. It avoids row-wise iteration and the overhead of apply, while staying readable and maintainable.
Takeaways
When aligning asynchronous streams in pandas, think in terms of partitions defined by sentinel events. A cumulative sum over a boolean mask gives you a simple, robust grouping key. Within each partition, transform('first') lets you subtract a reference timestamp without leaving the vectorized path. And if you only need certain rows in the end, filter them after the computation rather than complicating the grouping logic.
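If you reuse this pattern often, it can be wrapped in a small helper. The sketch below is illustrative, not part of pandas; the name offset_to_last_sentinel and its parameters are made up here, and it assumes every row you want to offset is preceded by at least one sentinel row:
def offset_to_last_sentinel(df, src_col, ts_col, sentinel):
    """Return non-sentinel rows with ts_col rewritten as the offset
    from the most recent preceding sentinel row."""
    out = df.copy()
    # A new section starts at every sentinel row.
    section = out[src_col].eq(sentinel).cumsum()
    # Broadcast each section's first timestamp (the sentinel's) and subtract it.
    out[ts_col] = out[ts_col] - out.groupby(section)[ts_col].transform("first")
    # Drop the sentinel rows themselves.
    return out.loc[out[src_col].ne(sentinel)]

# For the sample frame above (before its ts column was modified in place):
# offsets = offset_to_last_sentinel(records, "src", "ts", "B")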