2025, Sep 29 09:00

Distinguish empty strings from missing values in pandas read_csv using csv.QUOTE_NONE and normalization

Preserve empty strings vs NaN in pandas read_csv: use csv.QUOTE_NONE, then normalize columns so that "" stays an empty string, ,, becomes NaN, and numbers are still parsed accurately.

Distinguishing an intentionally empty string from a missing CSV value sounds trivial until pandas quietly normalizes both away. If your file mixes empty quoted fields like "" and truly missing entries like ,, you likely want the former to survive as an empty string and the latter to become NaN. Out of the box, read_csv() won’t preserve that distinction.

The setup: what gets lost on read

Consider a tiny CSV where both patterns appear side by side:

id,name,age
1,,"25"
2,"",3
3,John,

Loaded with default parsing, the quotes don’t carry meaning; the parser will infer numeric types and treat blanks as NaN:

import io
import pandas as pd
sample = """id,name,age
1,,"25"
2,"",3
3,John,"""
frame_a = pd.read_csv(io.StringIO(sample))
print(frame_a.dtypes)
# id        int64
# name     object
# age     float64
# dtype: object
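
A quick look at the parsed values makes the loss concrete: the unquoted blank and the quoted "" are indistinguishable in name, and the quoted "25" has already become a float in age:

print(frame_a['name'].tolist())
# [nan, nan, 'John']
print(frame_a['age'].tolist())
# [25.0, 3.0, nan]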

Why this happens

CSV quoting is purely a field delimiter and escape mechanism; it isn’t a typing hint. Quotes let commas or other special characters live inside a field; they do not imply any type. Because of that, "25" is normally parsed just like 25 and will be inferred as a number. Likewise, a completely empty field between separators is interpreted as missing.

Switching quotechar to another symbol can make "" survive as a literal string, but that breaks as soon as you actually need double quotes to protect text with commas inside. You would also end up having to scrub quotes from every relevant string later.
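
For illustration, here is a minimal sketch of that dead end (assuming a single quote never appears in the data, so it can stand in as the quotechar): the double quotes survive, but they become part of the values and still have to be scrubbed:

import io
import pandas as pd
sample = """id,name,age
1,,"25"
2,"",3
3,John,"""
alt = pd.read_csv(io.StringIO(sample), quotechar="'")
print(alt['name'].tolist())
# [nan, '""', 'John']
print(alt['age'].tolist())
# ['"25"', '3', nan]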

Workable approach: disable quote interpretation, then normalize

If your quoted fields never contain the separator, you can read the file with quoting=3 (csv.QUOTE_NONE) to treat quotes as ordinary characters, then strip leading/trailing quotes only in object columns. This preserves the distinction: ,, becomes NaN, while "" becomes an empty string after cleanup.

import io
import pandas as pd
payload = """id,name,age
1,,"25"
2,"",3
3,John,"""
grid = pd.read_csv(io.StringIO(payload), quoting=3)
#    id  name   age
# 0   1   NaN  "25"
# 1   2    ""     3
# 2   3  John   NaN
grid.update(
    grid.select_dtypes('O').apply(
        lambda col: col.str.replace(r'^"|"$', '', regex=True)
    )
)
print(grid)
#    id  name  age
# 0   1   NaN   25
# 1   2        3
# 2   3  John  NaN

At this point, all non-NaN values in age are strings, even if they print like numbers:

print(grid['age'].tolist())
# ['25', '3', nan]

Variant: keep numbers numeric, preserve quoted strings as strings

If you want "25" to stay a string while 3 remains numeric, coerce numeric values first and fall back to the de-quoted text for anything else:

import io
import pandas as pd
payload = """id,name,age
1,,"25"
2,"",3
3,John,"""
canvas = pd.read_csv(io.StringIO(payload), quoting=3)
canvas.update(
    canvas.select_dtypes('O').apply(
        lambda col: pd.to_numeric(col, errors='coerce').combine_first(
            col.str.replace(r'^"|"$', '', regex=True)
        )
    )
)
print(canvas)
#    id  name  age
# 0   1   NaN   25
# 1   2        3.0
# 2   3  John  NaN
print(canvas['name'].tolist())
# [nan, '', 'John']
print(canvas['age'].tolist())
# ['25', 3.0, nan]

Important limitation

The approach with quoting=3 relies on the assumption that quoted fields don’t contain the separator. It will break as soon as a quoted value actually needs quoting to include a comma:

id,name,age
1,,"25"
2,",",3
3,John,
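
A minimal sketch of that failure, assuming the default C parser: with quoting=3 the quoted comma is split into extra fields, the row no longer matches the header, and pandas raises a tokenization error.

import io
import pandas as pd
broken = """id,name,age
1,,"25"
2,",",3
3,John,"""
try:
    pd.read_csv(io.StringIO(broken), quoting=3)
except pd.errors.ParserError as exc:
    print(exc)
# e.g. "Error tokenizing data. C error: Expected 3 fields in line 3, saw 4"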

In such cases, pre-processing the file or writing a custom parser is the safe route if you must infer types based on quoting.
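
One possible pre-processing sketch, under the assumption that "" only ever appears as a whole field and never as an escaped quote inside longer text: swap the quoted-empty marker for a sentinel before parsing, let normal quoting protect the embedded commas, then restore the empty strings afterwards. The sample below mixes a quoted comma with a quoted empty field, and the sentinel value is arbitrary.

import io
import re
import pandas as pd
raw = """id,name,age
1,,"25"
2,",",3
3,"",
"""
SENTINEL = '__EMPTY__'
# Replace "" fields (framed by commas or line boundaries) with the sentinel.
prepared = re.sub(r'(^|,)""(?=,|$)', r'\g<1>' + SENTINEL, raw, flags=re.M)
cleaned = pd.read_csv(io.StringIO(prepared)).replace(SENTINEL, '')
print(cleaned['name'].tolist())
# [nan, ',', '']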

Why you should care

There is often a semantic difference between an empty string and a missing value. Analytics, joins, validations, and user-facing exports can all depend on this distinction. Preserving it early in the ingest pipeline prevents subtle bugs and avoids rework later.

Takeaways

CSV itself does not encode type information; quotes do not change that. If you need to treat ,, as NaN and "" as an empty string, read with quoting disabled and then normalize textual columns. If you also want to keep numeric fields numeric while retaining quoted text as strings, apply numeric coercion first and fall back to the de-quoted text. When your data truly requires quotes to protect delimiters, plan for pre-processing or a custom parser.

The article is based on a question from StackOverflow by Despicable me and an answer by mozway.