2025, Oct 04 19:00

Parsing Semi-Structured Logs in Python with a Sliding Lookbehind: Extract the Third Token Two Lines Up

Learn a Python method for parsing semi-structured logs: use a sliding lookbehind to extract the third token from the line two lines above each line whose second token is truck.

Parsing semi-structured logs often requires looking backward or forward relative to a match. Here the task is precise: find every line where the second token equals truck, then extract the third token from the line that appears two lines earlier. The input is a simple text file with blocks of values separated by lines of plus signs, and the expected output is the list of marker values two lines above the matching rows.

Problem setup

Given an input like this, we need to return beta and delta because both occurrences of truck appear two lines after those markers:

apples    grapes    alpha   pears
chicago paris london 
yellow    blue      red
+++++++++++++++++++++
apples    grapes    beta   pears
chicago paris london 
car   truck  van
+++++++++++++++++++
apples    grapes    gamma   pears
chicago paris london 
white  purple   black
+++++++++++++++++++
apples    grapes    delta   pears
chicago paris london 
car   truck  van

Baseline attempt (why it falls short)

It’s natural to start by scanning the file and collecting rows where the second token equals truck, then pushing that into a pandas DataFrame. However, this approach only gathers the matching rows, not the values located two lines earlier. It doesn’t maintain the necessary context window.

import pandas as pd
rows_buffer = []
with open('input.txt', 'r') as fh:
    for record in fh:
        tokens = record.split()
        if len(tokens) > 1 and tokens[1] == "truck":
            rows_buffer.append(tokens)
frame = pd.DataFrame(rows_buffer)
print(frame.to_string())

This collects only the matching lines themselves. To extract the third token from two lines above, we need a lookbehind window across the stream of lines while reading.

What’s really happening

The requirement is positional and relative. When the current line’s second token equals truck, the needed value lives in the third token of the line that appeared two lines ago. That means we must preserve a sliding window of the recent lines as we iterate. Directly selecting rows into a DataFrame after the fact won’t help unless we also retain the surrounding context.
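To make the idea of a sliding window concrete, here is a minimal sketch using collections.deque with a maxlen. The sample strings and the names window and snapshots are illustrative only and are not part of the original question.

```python
from collections import deque

# A deque created with maxlen=3 drops its oldest item automatically on append.
window = deque(maxlen=3)
snapshots = []
for line in ["line 1", "line 2", "line 3", "line 4"]:
    window.append(line)
    snapshots.append(list(window))

# Once the window is full, window[0] is the line two positions before window[-1].
print(snapshots[-1])  # ['line 2', 'line 3', 'line 4']
```

With three slots, the current line sits at the end of the window and the line two lines up sits at the front, which is exactly the relationship the task needs.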

A precise solution with a lookbehind window

A compact way to implement this is to keep a fixed-size sliding buffer that always holds the last three lines: the current line and the two before it. When we see a line whose second token equals truck, we look at the oldest entry in the buffer, which is the line two lines up, and extract its third token if present. Lines with fewer than two tokens (such as separator lines made of plus signs) cannot match, so they are skipped before the token comparison, which avoids an IndexError. They still enter the buffer so that the line offsets stay correct.

#!/usr/bin/env python
import sys
from collections import deque
def run():
    if len(sys.argv) < 2:
        print("Usage: script_runner.py inputPath", file=sys.stderr)
        sys.exit(1)
    LOOKBACK = 3
    ring = deque(maxlen=LOOKBACK)
    hits = []
    with open(sys.argv[1], 'r') as src:
        for entry in src:
            parts = entry.split()
            # every line enters the window, separators included, so offsets stay aligned
            ring.append(entry)
            if len(parts) < 2:
                # not enough fields to be relevant for a match
                continue
            if len(ring) == LOOKBACK and parts[1] == "truck":
                target_line = ring[0]
                target_parts = target_line.split()
                if len(target_parts) >= 3:
                    hits.append(target_parts[2])
    if hits:
        print(hits)
        # Optional: convert to a DataFrame
        # import pandas as pd
        # df_out = pd.DataFrame(hits, columns=['lookbehind'])
        # print(df_out)
if __name__ == "__main__":
    run()

On the provided input, this prints ['beta', 'delta'], which matches the required output. The ring buffer maintains exactly the context we need and makes the “two lines above” lookup straightforward and safe.
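One way to check that claim without creating a file is to move the loop into a function that accepts any iterable of lines. The sketch below assumes the same logic as the script; the names SAMPLE and collect_lookbehind are hypothetical and introduced only for this check.

```python
from collections import deque

SAMPLE = """\
apples    grapes    alpha   pears
chicago paris london
yellow    blue      red
+++++++++++++++++++++
apples    grapes    beta   pears
chicago paris london
car   truck  van
+++++++++++++++++++
apples    grapes    gamma   pears
chicago paris london
white  purple   black
+++++++++++++++++++
apples    grapes    delta   pears
chicago paris london
car   truck  van
"""

def collect_lookbehind(lines):
    """Return the third token from two lines above each 'truck' line."""
    ring = deque(maxlen=3)
    hits = []
    for entry in lines:
        ring.append(entry)
        parts = entry.split()
        # Require a full window so that ring[0] really is two lines up.
        if len(parts) >= 2 and parts[1] == "truck" and len(ring) == 3:
            target_parts = ring[0].split()
            if len(target_parts) >= 3:
                hits.append(target_parts[2])
    return hits

print(collect_lookbehind(SAMPLE.splitlines()))  # ['beta', 'delta']
# A match on the first or second line has nothing two lines above it.
print(collect_lookbehind(["car truck van", "apples grapes beta pears"]))  # []
```

The second call shows why the full-window check is there: without it, ring[0] would be the matching line itself and the function would return van.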

Why this matters

Line-oriented processing frequently involves relative positions rather than absolute indexes: headers preceding payloads, trailers summarizing previous sections, or sentinel tokens such as truck in this case. A fixed-size lookbehind buffer keeps the offset between the matching line and its target constant, so there is no need to recompute indexes after the fact. Checking token counts before matching also lets the loop pass over separator or malformed lines without raising an error.

Takeaways

When a requirement depends on relative positioning across lines, keep the context in a small sliding window as you read. This avoids building large in-memory structures prematurely and ensures you can extract exactly what you need in one pass. If you still want a DataFrame later, convert the final list of extracted values after the streaming step.
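The same pattern can be written as a generator so that values are produced lazily during the single pass. This is a sketch under stated assumptions, not the method from the original answer: the function name iter_lookbehind and all of its parameters (needle, lines_up, match_idx, take_idx) are hypothetical.

```python
from collections import deque

def iter_lookbehind(lines, needle="truck", lines_up=2, match_idx=1, take_idx=2):
    """Yield token take_idx from the line lines_up above each line whose
    token match_idx equals needle. Parameter names are illustrative."""
    # The window holds the current line plus lines_up earlier lines.
    window = deque(maxlen=lines_up + 1)
    for line in lines:
        window.append(line)
        parts = line.split()
        # Skip until the window is full, and skip lines too short to match.
        if len(window) <= lines_up or len(parts) <= match_idx:
            continue
        if parts[match_idx] == needle:
            target = window[0].split()
            if len(target) > take_idx:
                yield target[take_idx]

demo = [
    "apples grapes beta pears",
    "chicago paris london",
    "car truck van",
]
values = list(iter_lookbehind(demo))
print(values)  # ['beta']
```

Because the function accepts any iterable of strings, an open file handle can be passed in place of the list, and the resulting list can be handed to a DataFrame constructor afterwards if tabular output is wanted.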

The article is based on a question from StackOverflow by yodish and an answer by ticktalk.