https://pytroubles.com/en/posts/id1194-rdkit-replacesubstructs-vs-smiles-order-preserve-atom-indices-and-layout-with-a-regex-fix

RDKit ReplaceSubstructs vs SMILES Order: Preserve Atom Indices and Layout with a Regex Fix

Keep SMILES Atom Order and Mapping When Using RDKit ReplaceSubstructs: A Practical Regex Fix

Learn why RDKit ReplaceSubstructs reorders atoms and mapping, and how to preserve SMILES order, atom indices, and layout with a simple regex swap approach.

2025-10-21T06:00:06+03:00

When you try to stitch two SMILES together with RDKit’s ReplaceSubstructs(), it’s easy to run into an unexpected side effect: the replacement “works”, but the atom order and mapping appear shuffled. If your downstream visualization depends on the exact textual layout or atom indices, that’s a problem. Below is a precise look at what happens and a pragmatic way to keep the original group layout intact for display.

Problem setup

The goal is to replace one methyl carbon in a SMILES with a second SMILES fragment via ReplaceSubstructs(). The function returns one candidate molecule per match of the query, and whichever candidate you pick, the SMILES written out afterwards has its atoms reordered and the indices look reshuffled. Here is a minimal example using RDKit that demonstrates the behavior.

from rdkit import Chem


def fuse_smiles(base_smi, insert_smi, choice_idx=1):
    base_mol = Chem.MolFromSmiles(base_smi)
    frag_mol = Chem.MolFromSmiles(insert_smi)
    if base_mol is None or frag_mol is None:
        raise ValueError("One or both SMILES strings are invalid.")
    tagged_base = add_atom_map_tags(base_mol)
    # One result molecule per [CH3] match; choice_idx selects which methyl is replaced.
    result_mol = Chem.ReplaceSubstructs(
        tagged_base,
        Chem.MolFromSmarts("[CH3]"),
        frag_mol
    )[choice_idx]
    result_smi = Chem.MolToSmiles(result_mol)
    print(result_smi)
    return result_smi


def add_atom_map_tags(m):
    # Tag every atom with its own index so it appears as [X:idx] in the output SMILES.
    for atom in m.GetAtoms():
        atom.SetProp('molAtomMapNumber', str(atom.GetIdx()))
    return m


if __name__ == "__main__":
    base = "CC(C)(C)Cl"
    insert = "CN(C)C"
    fuse_smiles(base, insert)

Varying the option index produces outputs like:

CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4] for option 0
CN(C)C[C:1]([CH3:0])([CH3:3])[Cl:4] for option 1
CN(C)C[C:1]([CH3:0])([CH3:2])[Cl:4] for option 2

The desired display, however, preserves the original grouping and order as typed:

CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4] for option 0
[CH3:0][C:1](CN(C)C)([CH3:3])[Cl:4] for option 1
[CH3:0][C:1]([CH3:2])(CN(C)C)[Cl:4] for option 2

What’s going on

RDKit does replace the expected substructure in each candidate. What changes is the ordering: ReplaceSubstructs() builds a new molecule, and MolToSmiles() writes canonical SMILES by default, so the output string follows RDKit’s own atom ordering rather than the order you typed. That’s fine for cheminformatics tasks, but not if you must preserve the exact textual order and branch layout for visual output where indices and positions matter.

Practical workaround: string-level substitution with regex

If the requirement is display-first and you need to preserve the original SMILES text layout, treat the atom-mapped SMILES as a string and substitute the targeted tokens, instead of relying on RDKit to maintain your original branch order. The idea is to find all occurrences of [CH3:number] and replace each occurrence with the second fragment, producing one distinct variant per replacement position while keeping the rest of the text unchanged.

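import re

# Atom-mapped SMILES, written in the layout you want to keep.
template_str = "[CH3:0][C:1]([CH3:2])([CH3:3])[Cl:4]"

# Every mapped methyl token is a candidate replacement site.
hits = re.findall(r'\[CH3:\d+\]', template_str)
print('hits:', hits)

variants = []
for token in hits:
    # Each token is unique thanks to its map number, so replace() touches one site.
    replaced_str = template_str.replace(token, 'CN(C)C')
    variants.append(replaced_str)
    print('variant:', replaced_str)

choice_idx = 0
print(choice_idx, variants[choice_idx])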

This yields:

hits: ['[CH3:0]', '[CH3:2]', '[CH3:3]']
variant: CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4]
variant: [CH3:0][C:1](CN(C)C)([CH3:3])[Cl:4]
variant: [CH3:0][C:1]([CH3:2])(CN(C)C)[Cl:4]
0 CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4]

To reuse this in your application, wrap it in a small utility:

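import re


def substitute_tokens(src, token_pattern, payload):
    """Return one variant of src per token matched by token_pattern."""
    found = re.findall(token_pattern, src)
    out = []
    for token in found:
        # str.replace() swaps every occurrence of this token text, so tokens must be unique.
        out.append(src.replace(token, payload))
    return out


# --- usage ---
source = "[CH3:0][C:1]([CH3:2])([CH3:3])[Cl:4]"
token_re = r'\[CH3:\d+\]'
payload = 'CN(C)C'
alts = substitute_tokens(source, token_re, payload)
# Iterate over however many variants were found instead of hard-coding the count.
for choice_idx, alt in enumerate(alts):
    print(choice_idx, alt)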

Output:

0 CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4]
1 [CH3:0][C:1](CN(C)C)([CH3:3])[Cl:4]
2 [CH3:0][C:1]([CH3:2])(CN(C)C)[Cl:4]
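
One caveat with the helper above: str.replace() substitutes every occurrence of the matched text, so it only yields one-position-at-a-time variants when each token is unique, as the numbered [CH3:n] tokens are here. A minimal sketch of a position-based alternative uses re.finditer() and slices around each match's span. The helper name substitute_by_span is hypothetical and not from the original answer, and the unnumbered input string is an illustrative assumption:

```python
import re


def substitute_by_span(src, token_pattern, payload):
    """Return one variant per match, replacing only that match's span."""
    variants = []
    for match in re.finditer(token_pattern, src):
        start, end = match.span()
        # Splice the payload into this exact position; other matches stay untouched.
        variants.append(src[:start] + payload + src[end:])
    return variants


# Unnumbered methyl tokens are textually identical, so str.replace() would hit all three.
source = "[CH3]C([CH3])([CH3])Cl"
for idx, variant in enumerate(substitute_by_span(source, r'\[CH3\]', 'CN(C)C')):
    print(idx, variant)
```

For this input the three printed variants are CN(C)CC([CH3])([CH3])Cl, [CH3]C(CN(C)C)([CH3])Cl and [CH3]C([CH3])(CN(C)C)Cl. Each differs from the source at exactly one site.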

Why this matters

In data products where SMILES text is used directly for visualization, preserving the exact order and branches can be as important as chemical correctness. If your renderer or mapping logic depends on atom positions or labels as written, you don’t want the toolkit to reorder them behind the scenes. The string-level approach lets you keep the original layout while still offering multiple replacement choices.

Takeaways

If you need chemically aware edits, ReplaceSubstructs() does the job but may reorder atoms. If you need the original SMILES layout for display and control over which [CH3:number] gets replaced, use a string-based approach with regex and generate the variants you want. Keep an eye on what your code is actually producing with simple print debugging and make sure your script is properly indented and structured so you can trust the outputs you inspect.
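
To pick the replacement site by its atom-map number rather than by its position in a list of variants, the pattern can be built for one specific number. The following is a small sketch under that assumption. The helper replace_mapped_methyl, its name and its signature are illustrative and not part of the original answer:

```python
import re


def replace_mapped_methyl(src, map_num, payload):
    """Replace the [CH3:map_num] token in src with payload."""
    # The closing bracket in the pattern keeps [CH3:2] from matching [CH3:22].
    pattern = r'\[CH3:' + str(int(map_num)) + r'\]'
    # A callable replacement avoids backslash processing of the payload;
    # count=1 guards against accidental duplicates of the same map number.
    new_str, n_subs = re.subn(pattern, lambda _m: payload, src, count=1)
    if n_subs == 0:
        raise ValueError(f"No [CH3:{map_num}] token found in {src!r}")
    return new_str


source = "[CH3:0][C:1]([CH3:2])([CH3:3])[Cl:4]"
print(replace_mapped_methyl(source, 2, 'CN(C)C'))
# prints: [CH3:0][C:1](CN(C)C)([CH3:3])[Cl:4]
```

Raising on a missing map number makes a mistyped index fail loudly instead of silently returning the unchanged string.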

The article is based on a question from StackOverflow by Anh Vu and an answer by furas.

cheminformatics chemistry python rdkit