2025, Sep 18 17:00

Normalize and Deduplicate Real Estate Addresses in Python with Regex: Clean, Readable Results

Learn a Python regex method for address normalization in real estate scraping: unify delimiters, standardize suffixes, and deduplicate by containment.

Scraping real estate listings often looks simple until you hit address normalization. Minor inconsistencies add up: some records include pipes instead of commas, others repeat the street line with different suffixes, and the same address can appear twice in slightly different forms. A typical example is a record like

"747 Geary Street, 747 Geary St, Oakland, CA 94609"

which should be consolidated into a single, clean address without duplicated parts.

Python example that cleans the sample inputs

import re


def unify_address(text):
    # Map spelled-out street suffixes to their common abbreviations.
    suffix_aliases = {
        r'\bStreet\b': 'St',
        r'\bAvenue\b': 'Ave',
        r'\bRoad\b': 'Rd',
        r'\bBoulevard\b': 'Blvd',
        r'\bDrive\b': 'Dr',
        r'\bLane\b': 'Ln',
        r'\bCourt\b': 'Ct',
    }
    # Step 1: unify delimiters so every component is comma-separated.
    draft = text.replace('|', ',')
    # Step 2: normalize suffixes to a single spelling, ignoring case.
    for rex, short in suffix_aliases.items():
        draft = re.sub(rex, short, draft, flags=re.IGNORECASE)
    # Step 3: split on commas and drop any fragment that is contained in
    # (or identical to) a later fragment, keeping the fuller version.
    segments = [frag.strip() for frag in draft.split(',')]
    pruned = []
    for idx, frag in enumerate(segments):
        if not any(idx < j and frag in other for j, other in enumerate(segments)):
            pruned.append(frag)
    return ', '.join(pruned)


addresses_to_clean = [
    'The Gantry | 1340 3rd St, San Francisco, CA',
    '845 Sutter, 845 Sutter St APT 509, San Francisco, CA',
    '1350 Washington Street | 1350 Washington St, San Francisco, CA',
    'Parkmerced 3711 19th Ave, San Francisco, CA',
    '747 Geary Street, 747 Geary St, Oakland, CA 94609'
]

normalized = [unify_address(item) for item in addresses_to_clean]
for src, dst in zip(addresses_to_clean, normalized):
    print(f"Original: {src}")
    print(f"Cleaned:  {dst}")
    print()

Running the script prints each original address next to its cleaned form; the cleaned lines are

"The Gantry, 1340 3rd St, San Francisco, CA"

"845 Sutter St APT 509, San Francisco, CA"

"1350 Washington St, San Francisco, CA"

"Parkmerced 3711 19th Ave, San Francisco, CA"

"747 Geary St, Oakland, CA 94609"

What makes these addresses hard to clean

Simple string tools struggle because the duplicates are not byte-for-byte identical. One version might have “Street”, the other “St”. If you split by commas and try to remove repeats with a naive equality check, the duplicate stays. If you go aggressive with replace(), you risk removing valid content like the city or state. In other words, it’s not only a delimiter problem; it’s a normalization and containment problem.
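To see the failure mode concretely, here is a minimal sketch (not part of the solution above) of a comma split followed by an equality-only dedup; the shorter and longer spellings of the street line both survive because they are not equal strings:

raw = '845 Sutter, 845 Sutter St APT 509, San Francisco, CA'
parts = [p.strip() for p in raw.split(',')]

naive = []
for p in parts:
    if p not in naive:   # equality check only, no normalization or containment
        naive.append(p)

print(', '.join(naive))
# 845 Sutter, 845 Sutter St APT 509, San Francisco, CA  <- duplicate street line remains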

The approach that works on the examples

The solution proceeds in three deliberate steps. First, it unifies delimiters by converting pipes into commas, which consolidates how the address is split into components. Second, it normalizes common street suffixes to a single form using regular expressions; with “Street” and “St” brought to the same representation, potential duplicates become detectable. Third, it removes repeated parts by checking containment across the comma-separated pieces: if an earlier piece is fully contained in a later piece, the earlier one is dropped and the more informative fragment is kept.
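To make the three steps concrete, here is a short trace (a sketch that mirrors the internals of unify_address rather than new functionality) showing how one sample string changes at each stage:

import re

raw = '1350 Washington Street | 1350 Washington St, San Francisco, CA'

# Step 1: pipes become commas
step1 = raw.replace('|', ',')
# '1350 Washington Street , 1350 Washington St, San Francisco, CA'

# Step 2: suffixes normalized ("Street" -> "St")
step2 = re.sub(r'\bStreet\b', 'St', step1, flags=re.IGNORECASE)
# '1350 Washington St , 1350 Washington St, San Francisco, CA'

# Step 3: containment-based dedup drops the earlier, now-identical fragment
segments = [s.strip() for s in step2.split(',')]
kept = [s for i, s in enumerate(segments)
        if not any(i < j and s in other for j, other in enumerate(segments))]
print(', '.join(kept))
# 1350 Washington St, San Francisco, CA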

This containment rule is the critical piece: in an address like “845 Sutter, 845 Sutter St APT 509, San Francisco, CA”, the shorter “845 Sutter” is entirely contained in the fuller “845 Sutter St APT 509”, so only the latter is kept. Likewise for the “Washington Street / Washington St” case after normalization.
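The dedup hinges on Python’s substring operator; note that an identical string also counts as “contained”, which is why the exact repeat in the Washington example is dropped as well:

print('845 Sutter' in '845 Sutter St APT 509')       # True  -> shorter fragment dropped
print('1350 Washington St' in '1350 Washington St')  # True  -> exact repeat dropped
print('San Francisco' in 'CA')                       # False -> both fragments kept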

Why this is worth knowing

Real-world data isn’t tidy, and web scraping highlights that immediately. Address normalization is a classic example where minor inconsistencies undermine downstream tasks like matching, aggregation, or geocoding. While AI could handle unseen outliers more flexibly, building a clear, deterministic pass like this strengthens your fundamentals and gives you a baseline you can reason about and extend. When your test set grows, you can add more suffix variants or tweak containment rules without changing the overall structure.
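As an illustration, the suffix table can grow without touching the rest of the pass; the extra aliases and the sample address below are hypothetical, chosen only to show the extension point:

import re

# Hypothetical extra suffix variants (not part of the original mapping) that
# the same normalization loop handles with no structural change.
more_suffix_aliases = {
    r'\bPlace\b': 'Pl',
    r'\bTerrace\b': 'Ter',
    r'\bParkway\b': 'Pkwy',
    r'\bHighway\b': 'Hwy',
}

sample = '12 Ocean Parkway | 12 Ocean Pkwy, Brooklyn, NY'
draft = sample.replace('|', ',')
for rex, short in more_suffix_aliases.items():
    draft = re.sub(rex, short, draft, flags=re.IGNORECASE)
print(draft)
# '12 Ocean Pkwy , 12 Ocean Pkwy, Brooklyn, NY'  (ready for the containment step)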

Takeaways

Start by making delimiters consistent, then bring close variants to a single representation, and only then deduplicate by containment. That sequence prevents overzealous replacements and avoids losing important context like the city, state, or ZIP. For the provided samples, this technique yields compact, readable addresses and demonstrates a maintainable way to clean scraped text without relying on a single split() or replace().

This article is based on a Stack Overflow question by Adamzam15 and an answer by André.