2025, Dec 01 19:00

Regex-Only Guide to Strip JSONC Comments in Python Without Breaking Strings or URLs

Learn a robust regex-only method in Python to remove JSONC // and /* */ comments while preserving quoted strings and URLs, producing clean, valid JSON safely.

Stripping comments from JSONC to obtain valid JSON looks simple until string literals get in the way. URLs, paths, and other values can contain double slashes, so a naive regex that nukes everything from // to the end of a line will corrupt data. Below is a robust, regex-only approach in Python that preserves quoted strings while removing comments.

Problem overview

The input is JSONC, a superset of JSON that allows comments. The task is to remove comments and end up with valid JSON using only regex. The catch is that // may appear inside quoted strings, and a straightforward pattern removes too much.

Example that demonstrates the issue

First, a naive attempt that fails when // appears in string values:

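import re

# `data` is assumed to hold the JSONC text shown below.
# Flawed on purpose: the pattern knows nothing about string literals.
bad_rule = re.compile(r'\s//[^}]*')
result = bad_rule.sub('', data)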

Consider this input. It is valid JSONC, pretty-formatted, and guaranteed to become valid JSON once comments are correctly removed:

//tried this sed -r 's#\s//[^}]*##'
//  also tried this '%^*3//s39()'
[
  {
    "test1" : "http://test.com",
    "test2" : "http://test.com",//test
    // any thing
    "ok" : 3,  //here 2
    "//networkpath1" : true, //whynot
    "//networkpath2" : true 
// ok
  },//eof
  {
    "statement" : "I like test cases"
}//eof
]

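The naive regex damages this input in two ways. Because [^}]* also matches newlines, the first match starts at the line break before the second comment and runs to the first closing brace, deleting the opening bracket and every key of the first object along the way. Because the pattern requires whitespace before //, it also misses comments that directly follow a comma or brace, such as ,//test and }//eof. More generally, a pattern that ignores string literals cannot distinguish // inside a string from a real comment delimiter, so a value such as "see //here" would be truncated.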
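To see how the string-unaware rule behaves in isolation, here is a minimal sketch that runs it on two tiny inputs. The inputs and variable names are made up for illustration and are not part of the original sample.

```python
import re

naive = re.compile(r'\s//[^}]*')

# Hypothetical miniature inputs, not taken from the article's sample.
multi_line = '[\n  {\n    "a": 1, // first\n    "b": 2\n  }\n]'
no_space = '{"u": "http://test.com",//note\n"k": 1}'

# [^}]* also matches newlines, so the match runs to the closing brace
# and takes the "b" key with it.
over_deleted = naive.sub('', multi_line)
print(repr(over_deleted))   # '[\n  {\n    "a": 1,}\n]'

# No whitespace precedes this comment, so the pattern never fires
# and the comment stays in the output.
untouched = naive.sub('', no_space)
print(untouched == no_space)   # True
```

Neither result is valid JSON: the first lost a key and gained a trailing comma, and the second still contains the comment.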

Why the naive approach breaks

A regex that blindly targets // up to some boundary does not understand the grammar of string literals. When // occurs within quotes, it is not the start of a comment, but the regex matches it anyway and removes legitimate content. The same risk exists for block comments if they are used, because /* and */ can also appear within quoted text.
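The block-comment risk can be illustrated the same way. The sketch below applies a block-comment pattern on its own to a made-up input whose string value happens to contain /* and */; the input and names are assumptions for illustration only.

```python
import json
import re

# Block-comment pattern used alone, with no protection for strings.
naive_block = re.compile(r'/\*[\s\S]*?\*/')

# Hypothetical input: the first /* ... */ is data, the second is a comment.
text = '{"pattern": "/* keep me */", "n": 1 /* drop me */}'

stripped = naive_block.sub('', text)
print(stripped)               # {"pattern": "", "n": 1 }
print(json.loads(stripped))   # {'pattern': '', 'n': 1}
```

The output still parses as JSON, which makes this failure mode easy to miss: the value is silently emptied rather than producing a syntax error.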

Regex that preserves strings and removes comments

The workable strategy is to match both comments and quoted strings, capture the strings, and then re-insert the captured strings in the replacement. This keeps string literals intact while deleting line and block comments. The approach does not rely on indentation or specific line separators.

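import re

# Alternatives: line comment | block comment | quoted string (captured as group 1).
rx_cleanup = re.compile(
    r'//.*|/\*[\s\S]*?\*/|("(\\.|.)*?")'
)
# `data` holds the JSONC text; \1 restores strings and is empty for comments (Python 3.5+).
sanitized = rx_cleanup.sub(r'\1', data)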

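The pattern alternates between three parts. The first, //.*, matches a line comment up to the end of the line. The second, /\*[\s\S]*?\*/, matches a block comment; [\s\S] matches any character including newlines, and the lazy *? stops at the first closing */. The third, ("(\\.|.)*?"), matches a quoted string, including escaped characters such as \", and captures it as group 1. The order of scanning is what makes this work: the regex engine moves left to right, so whichever token starts first consumes its whole span. A // that sits inside a string is swallowed as part of the string match and never gets the chance to start a comment. In the substitution, \1 puts the captured string back unchanged; when a comment matched instead, group 1 did not participate and expands to an empty string (Python 3.5 and later; older versions raise an error for an unmatched group), so the comment disappears.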

Why this matters

Removing comments from JSONC with regex is attractive for quick preprocessing, but correctness hinges on not mangling string values. URLs and similar data frequently contain //, and any accidental removal leads to invalid JSON or broken semantics. Preserving quoted strings while stripping comments addresses this without depending on specific formatting.

Takeaways

If you must use regex for converting JSONC to JSON, protect quoted strings and re-insert them during substitution. The shown pattern targets both // and /* ... */ styles, even though only // may be needed, and it does not assume a particular indentation or line style. With the guarantee that removing comments yields valid JSON, this method provides a concise and practical solution.
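For reuse, the same idea can be wrapped in a small helper. The following is a sketch of one possible variant, not the pattern shown above: the name strip_jsonc_comments is hypothetical, the string alternative is written more strictly so a lone backslash can never be matched as an ordinary character, and a callable replacement is used so the code does not depend on how unmatched groups are expanded.

```python
import json
import re

# Variant of the article's pattern (an assumption, not the original):
# strings come first and are matched strictly; comments follow.
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\\n])*"'   # complete string: escapes or plain characters
    r'|//[^\n]*'             # line comment up to the end of the line
    r'|/\*[\s\S]*?\*/'       # block comment, possibly spanning lines
)


def strip_jsonc_comments(text):
    """Return text with // and /* */ comments removed, strings untouched."""
    def keep_strings(match):
        token = match.group(0)
        # Strings are returned as-is; comments are replaced with nothing.
        return token if token.startswith('"') else ''
    return _TOKEN.sub(keep_strings, text)


# Hypothetical usage: the second value ends in an escaped backslash.
jsonc = '{"u": "http://x", // c\n "p": "a\\\\" /* b */}'
print(json.loads(strip_jsonc_comments(jsonc)))
```

Like the original pattern, this only removes comments. It assumes the input becomes valid JSON once they are gone and does not repair anything else, such as trailing commas.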