2025, Oct 21 06:00
Keep SMILES Atom Order and Mapping When Using RDKit ReplaceSubstructs: A Practical Regex Fix
Learn why RDKit ReplaceSubstructs reorders atoms and mapping, and how to preserve SMILES order, atom indices, and layout with a simple regex swap approach.
When you try to stitch two SMILES together with RDKit’s ReplaceSubstructs(), it’s easy to run into an unexpected side effect: the replacement “works”, but the atom order and mapping appear shuffled. If your downstream visualization depends on the exact textual layout or atom indices, that’s a problem. Below is a precise look at what happens and a pragmatic way to keep the original group layout intact for display.
Problem setup
The goal is to replace a carbon atom in one SMILES with a second SMILES fragment via ReplaceSubstructs(). In practice, no matter which option you pick from the returned list, RDKit ends up reordering atoms and the indices look reshuffled. Here is a minimal example using RDKit that demonstrates the behavior.
from rdkit import Chem

def add_atom_map_tags(m):
    """Tag every atom with its own index as the atom map number (modifies m in place)."""
    for atom in m.GetAtoms():
        atom.SetProp('molAtomMapNumber', str(atom.GetIdx()))
    return m

def fuse_smiles(base_smi, insert_smi, choice_idx=1):
    """Replace one [CH3] in base_smi with insert_smi and print the resulting SMILES."""
    base_mol = Chem.MolFromSmiles(base_smi)
    frag_mol = Chem.MolFromSmiles(insert_smi)
    if base_mol is None or frag_mol is None:
        raise ValueError("One or both SMILES strings are invalid.")
    tagged_base = add_atom_map_tags(base_mol)
    # ReplaceSubstructs returns one molecule per match of the query;
    # choice_idx selects which match gets replaced.
    result_mol = Chem.ReplaceSubstructs(
        tagged_base,
        Chem.MolFromSmarts("[CH3]"),
        frag_mol
    )[choice_idx]
    result_smi = Chem.MolToSmiles(result_mol)
    print(result_smi)
    return result_smi

if __name__ == "__main__":
    base = "CC(C)(C)Cl"
    insert = "CN(C)C"
    fuse_smiles(base, insert)
Varying the option index produces outputs like:
CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4] for option 0
CN(C)C[C:1]([CH3:0])([CH3:3])[Cl:4] for option 1
CN(C)C[C:1]([CH3:0])([CH3:2])[Cl:4] for option 2
The desired display, however, preserves the original grouping and order as typed:
CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4] for option 0
[CH3:0][C:1](CN(C)C)([CH3:3])[Cl:4] for option 1
[CH3:0][C:1]([CH3:2])(CN(C)C)[Cl:4] for option 2
What’s going on
RDKit does replace the expected substructure. What changes is the order in which the atoms are written: ReplaceSubstructs() builds a new molecule, and MolToSmiles() writes canonical SMILES by default, so the output follows RDKit’s own atom ordering rather than the order you typed. That’s fine for cheminformatics tasks, but not if you must preserve the exact textual order and branch layout for visual output where indices and positions matter.
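To make the reordering concrete, the sketch below compares the order of atom map numbers in three of the strings quoted earlier: the typed template, RDKit's output for option 1, and the desired layout for option 1. It uses only the standard `re` module; the names `MAPPED_ATOM` and `map_order` are hypothetical helpers introduced for this illustration, not part of RDKit.

```python
import re

# Matches bracket atoms that carry an atom map number, e.g. [CH3:2] or [Cl:4].
MAPPED_ATOM = re.compile(r'\[[^\]:]+:(\d+)\]')


def map_order(smiles):
    """Return the atom map numbers in the order they appear in the text."""
    return [int(n) for n in MAPPED_ATOM.findall(smiles)]


typed = "[CH3:0][C:1]([CH3:2])([CH3:3])[Cl:4]"
rdkit_option_1 = "CN(C)C[C:1]([CH3:0])([CH3:3])[Cl:4]"   # RDKit output quoted earlier
wanted_option_1 = "[CH3:0][C:1](CN(C)C)([CH3:3])[Cl:4]"  # desired display layout

print(map_order(typed))            # [0, 1, 2, 3, 4]
print(map_order(rdkit_option_1))   # [1, 0, 3, 4]
print(map_order(wanted_option_1))  # [0, 1, 3, 4]
```

In the desired layout the surviving map numbers keep their typed order, whereas in the RDKit output atom 1 is written before atom 0. That swap is the reshuffling a position-sensitive renderer would notice.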
Practical workaround: string-level substitution with regex
If the requirement is display-first and you need to preserve the original SMILES text layout, treat the SMILES as a string and substitute the targeted tokens instead of relying on RDKit to maintain your original branch order. The idea is to find all occurrences of [CH3:number] and replace each one in turn with the second fragment. This produces one distinct variant per replacement position while the rest of the text stays unchanged.
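import re
# Atom-mapped SMILES exactly as it should be laid out for display
template_str = "[CH3:0][C:1]([CH3:2])([CH3:3])[Cl:4]"
# Collect every mapped methyl token, e.g. [CH3:0]
hits = re.findall(r'\[CH3:\d+\]', template_str)
print('hits:', hits)
variants = []
for token in hits:
    # Swap one token for the fragment; the rest of the text is untouched.
    # str.replace changes every occurrence, so this relies on unique map numbers.
    replaced_str = template_str.replace(token, 'CN(C)C')
    variants.append(replaced_str)
    print('variant:', replaced_str)
choice_idx = 0
print(choice_idx, variants[choice_idx])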
This yields:
hits: ['[CH3:0]', '[CH3:2]', '[CH3:3]']
variant: CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4]
variant: [CH3:0][C:1](CN(C)C)([CH3:3])[Cl:4]
variant: [CH3:0][C:1]([CH3:2])(CN(C)C)[Cl:4]
0 CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4]
To reuse this in your application, wrap it in a small utility:
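import re

def substitute_tokens(src, token_pattern, payload):
    """Return one variant of src per regex match, with that token replaced by payload.

    Relies on every matched token being unique in src (true when map numbers
    are unique), because str.replace changes every occurrence of the token.
    """
    found = re.findall(token_pattern, src)
    return [src.replace(token, payload) for token in found]

# --- usage ---
source = "[CH3:0][C:1]([CH3:2])([CH3:3])[Cl:4]"
token_re = r'\[CH3:\d+\]'
payload = 'CN(C)C'
alts = substitute_tokens(source, token_re, payload)
for choice_idx, alt in enumerate(alts):
    print(choice_idx, alt)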
Output:
0 CN(C)C[C:1]([CH3:2])([CH3:3])[Cl:4]
1 [CH3:0][C:1](CN(C)C)([CH3:3])[Cl:4]
2 [CH3:0][C:1]([CH3:2])(CN(C)C)[Cl:4]
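If you would rather pick the replacement by atom map number than by list position, one possible variation is sketched below. The function `variants_by_map_number` and its `element` parameter are hypothetical names introduced here. It assumes each map number appears only once in the string, and it splices by match position instead of calling `str.replace`, so only the matched span is touched.

```python
import re


def variants_by_map_number(src, payload, element="CH3"):
    """Map each atom map number to src with that single token replaced.

    Hypothetical helper. Assumes map numbers are unique in src; a repeated
    number would overwrite the earlier entry in the returned dict.
    """
    pattern = re.compile(r'\[' + re.escape(element) + r':(\d+)\]')
    out = {}
    for m in pattern.finditer(src):
        # Splice by position so text outside the match is left exactly as typed.
        out[int(m.group(1))] = src[:m.start()] + payload + src[m.end():]
    return out


source = "[CH3:0][C:1]([CH3:2])([CH3:3])[Cl:4]"
by_map = variants_by_map_number(source, "CN(C)C")
print(sorted(by_map))  # [0, 2, 3]
print(by_map[2])       # [CH3:0][C:1](CN(C)C)([CH3:3])[Cl:4]
```

Keying on the map number means the choice stays stable even if the template is later edited and the tokens shift position in the text.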
Why this matters
In data products where SMILES text is used directly for visualization, preserving the exact order and branches can be as important as chemical correctness. If your renderer or mapping logic depends on atom positions or labels as written, you don’t want the toolkit to reorder them behind the scenes. The string-level approach lets you keep the original layout while still offering multiple replacement choices.
Takeaways
If you need chemically aware edits, ReplaceSubstructs() does the job, but the SMILES it writes may list atoms in a different order. If you need the original SMILES layout for display and control over which [CH3:number] gets replaced, use a string-based approach with regex and generate the variants you want. Print the intermediate strings while you develop, and keep the script cleanly indented and structured, so you can trust the outputs you inspect.