2025, Nov 20 05:00
Generate Compact Glob-Like Patterns for Log Messages with Python difflib, Clustering, and Caching
Learn how to cluster log lines and derive compact, glob-like signatures in Python using difflib, with an exhaustive variant, deduplication, and caching for speed.
When you bucket similar log messages, showing a representative example from the bucket is rarely enough. What you usually want is a compact signature that preserves the shared text and replaces the variable parts with a wildcard. In other words, a glob-like pattern that matches every message in the bucket while preserving as much of the common text as possible.
Example goal
Consider these log lines collected in one bucket:
Error: [MODULE_FOO] File foo has 5 unsolved dependencies and 4 errors.
Error: [MODULE_BLA] Files bar and yaz have 123 unsolved dependencies.
Error: [MODULE_123] File baz has 45 unsolved dependencies and 3 warnings.
A useful common pattern would be:
Error: [MODULE_*] File* * ha* * unsolved dependencies*.
This keeps stable fragments intact and collapses differences into asterisks.
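One caveat when checking such a pattern: fnmatch would misread [MODULE_*] as a character class, so a quick sanity check is a hand-rolled translation to a regular expression. The snippet below is illustrative only; the pattern is the one shown above, hardcoded.

```python
import re

pattern = "Error: [MODULE_*] File* * ha* * unsolved dependencies*."
lines = [
    "Error: [MODULE_FOO] File foo has 5 unsolved dependencies and 4 errors.",
    "Error: [MODULE_BLA] Files bar and yaz have 123 unsolved dependencies.",
    "Error: [MODULE_123] File baz has 45 unsolved dependencies and 3 warnings.",
]

# Escape the literal fragments and join them with ".*" so "*" acts as a wildcard.
regex = ".*".join(re.escape(part) for part in pattern.split("*"))
assert all(re.fullmatch(regex, line) for line in lines)
```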
Clustering first
If you are still at the stage of grouping similar lines, a pragmatic option is to use difflib as a similarity heuristic. Here is a compact routine that groups strings by a quick similarity threshold.
import difflib

def group_lines(items: list[str], cutoff: float = 0.9) -> list[list[str]]:
    """Group strings whose quick_ratio similarity to a pivot exceeds cutoff."""
    items = list(items)  # work on a copy so the caller's list is not consumed
    buckets: list[list[str]] = []
    while items:
        pivot = items.pop()
        bucket = [pivot]
        remaining: list[str] = []
        for peer in items:
            if difflib.SequenceMatcher(None, pivot, peer).quick_ratio() > cutoff:
                bucket.append(peer)
            else:
                remaining.append(peer)
        items = remaining
        buckets.append(bucket)
    return buckets
Once a bucket is formed, the question becomes how to derive a glob-like signature for all its strings.
How to derive a shared pattern with difflib
The core idea is straightforward. Take a seed string as the current signature. For each other string in the bucket, compute the diff opcodes with difflib and replace every non-equal span in the current signature with an asterisk. Iterate this fold across the list. Reversing the opcodes ensures slice indexes remain valid as you edit the string.
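For a feel of what the opcodes look like, here is the fold applied to a minimal pair, "have" against "has":

```python
import difflib

sm = difflib.SequenceMatcher(None, "have", "has")
print(sm.get_opcodes())
# [('equal', 0, 2, 0, 2), ('replace', 2, 4, 2, 3)]

signature = "have"
# Walk the opcodes in reverse so earlier slice indexes stay valid while editing.
for op, a1, a2, _b1, _b2 in reversed(sm.get_opcodes()):
    if op != "equal":
        signature = signature[:a1] + "*" + signature[a2:]
print(signature)  # ha*
```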
Baseline approach
This version is intentionally simple. It is fast and tends to be good enough in practice, but it can miss the best possible pattern in certain corner cases such as ["ab", "ba", "cb", "bc"], where it yields "*" instead of "*b*".
import difflib

def derive_glob(shards: list[str]) -> str:
    if not shards:
        return ""
    signature = shards[0]
    for probe in shards[1:]:
        # Walk opcodes in reverse so slice indexes stay valid while editing.
        for op, a1, a2, _b1, _b2 in reversed(
            difflib.SequenceMatcher(None, signature, probe).get_opcodes()
        ):
            if op != "equal":
                signature = signature[:a1] + "*" + signature[a2:]
    return signature
Applying it to the earlier example produces:
Error: [MODULE_*] File* * ha* * unsolved dependencies*.
More exhaustive variant
The previous method’s result can depend on which string you pick as the seed. To improve robustness, run the same fold multiple times, each time starting from a different element, and pick the longest resulting pattern. This mitigates the corner cases at the cost of extra work.
import difflib

def derive_glob_exhaustive(shards: list[str]) -> str:
    if not shards:
        return ""
    if len(shards) == 1:
        return shards[0]
    candidates: set[str] = set()
    for k, seed in enumerate(shards):
        # Restart the fold with each element as the seed.
        candidates.add(derive_glob([seed] + shards[:k] + shards[k + 1:]))
    # Prefer the longest pattern: it preserves the most shared text.
    return max(candidates, key=len)
This approach is slower, but it fixes cases where a different seed leads to a stronger pattern.
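To see the seed sensitivity concretely, here is a self-contained sketch of both strategies applied to the ["ab", "ba", "cb", "bc"] corner case (the pairwise fold is repeated inline so the snippet runs on its own):

```python
import difflib

def fold(signature: str, probe: str) -> str:
    # Replace every non-equal opcode span in the signature with "*".
    for op, a1, a2, _b1, _b2 in reversed(
        difflib.SequenceMatcher(None, signature, probe).get_opcodes()
    ):
        if op != "equal":
            signature = signature[:a1] + "*" + signature[a2:]
    return signature

shards = ["ab", "ba", "cb", "bc"]

# Baseline: fold from the first element as the seed.
baseline = shards[0]
for probe in shards[1:]:
    baseline = fold(baseline, probe)

# Exhaustive: try every element as the seed, keep the longest pattern.
candidates = set()
for k, seed in enumerate(shards):
    sig = seed
    for probe in shards[:k] + shards[k + 1:]:
        sig = fold(sig, probe)
    candidates.add(sig)
best = max(candidates, key=len)

print(baseline)  # *
print(best)      # *b*
```

The baseline collapses to a blanket "*" because the unlucky seed "ab" aligns on "a" first, while three of the four seeds recover the stronger "*b*".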
Scaling up: deduplication and caching
For larger buckets, two straightforward optimizations help a lot. First, remove duplicate strings before processing. Second, cache the intermediate fold results for pairs of (current signature, next string). In one test with roughly a thousand inputs, this brought runtime down from about 17 seconds to about 0.20 seconds.
import difflib

# Memo for pairwise fold results: (current signature, probe) -> new signature.
_smemo: dict[tuple[str, str], str] = {}

def _derive_glob_cached_inner(seq: list[str]) -> str:
    signature = seq[0]
    for probe in seq[1:]:
        key = (signature, probe)
        if key in _smemo:
            signature = _smemo[key]
        else:
            for op, a1, a2, _b1, _b2 in reversed(
                difflib.SequenceMatcher(None, signature, probe).get_opcodes()
            ):
                if op != "equal":
                    signature = signature[:a1] + "*" + signature[a2:]
            _smemo[key] = signature
    return signature

def derive_glob_exhaustive_cached(shards: list[str]) -> str:
    uniq = list(set(shards))  # duplicates add no information
    if not uniq:
        return ""
    if len(uniq) == 1:
        return uniq[0]
    outcomes: list[str] = []
    for k, seed in enumerate(uniq):
        outcomes.append(_derive_glob_cached_inner([seed] + uniq[:k] + uniq[k + 1:]))
    _smemo.clear()  # the memo is only valid within a single call
    return max(outcomes, key=len)
What the patterns capture—and what they don’t
The pattern generation works at the character level. That means it can align parts of words instead of whole tokens, which is desirable here. For instance, "have" and "has" converge to "ha*". It also does not assume that all common fragments line up at the same positions, which is important when your log templates are not perfectly aligned. In tricky inputs like ["ab", "ba", "cb", "bc"], the seed matters for the baseline, which is why the exhaustive variant can produce a better pattern such as "*b*" instead of a blanket "*".
Why this matters for log analysis
A readable, compact signature makes buckets understandable at a glance. It avoids the unhelpful "*" that technically matches everything but communicates nothing. Keeping the shared substrings verbatim while collapsing the rest pierces through IDs, counts, and filenames to reveal the stable template behind the noise.
Putting it together
Here is a minimal end-to-end demonstration using the earlier sample lines.
lines = [
    "Error: [MODULE_FOO] File foo has 5 unsolved dependencies and 4 errors.",
    "Error: [MODULE_BLA] Files bar and yaz have 123 unsolved dependencies.",
    "Error: [MODULE_123] File baz has 45 unsolved dependencies and 3 warnings.",
]

pattern_simple = derive_glob(lines)
pattern_stronger = derive_glob_exhaustive_cached(lines)
print(pattern_simple)
print(pattern_stronger)
Both produce a pattern that keeps the invariant text and replaces the variable segments with asterisks.
Takeaways
When you need a glob-like signature for a bucket of log messages, difflib provides just enough building blocks to get it done. The quick single-pass version is often sufficient. If you encounter edge cases where the start point biases the result, switch to the exhaustive version. If runtime becomes an issue, eliminate duplicates first and cache the folding steps.
This approach preserves exactly what is stable across the bucket and abstracts the rest—precisely what you need for human-friendly log diagnostics.