2025, Sep 27 19:00
Handling mixed types in pandas read_csv: read as strings, then convert with to_numeric (coerce, raise)
Learn why pandas on_bad_lines won't catch type errors in CSVs. Read all columns as strings, then convert with to_numeric and a clear errors policy for ETL.
CSV files have a habit of mixing types, especially when human-entered data slips a string into a column that’s supposed to be an integer. If you’re relying on on_bad_lines to catch this, it won’t help. That parameter only kicks in when the parser stumbles on a structural problem, like a delimiter or column-count mismatch, not when a value fails an integer cast. Here’s a clean way to ingest such data without blowing up your pipeline.
Problem setup
Imagine a CSV where numeric columns sometimes contain strings or empty values.
id,name,age
1,,"25"
2,"",30
jj,John,
A straightforward read with dtype hints looks reasonable, but it won’t handle mixed types as you expect.
import pandas as pd
col_types_map = {'id': 'int64', 'name': 'str', 'age': 'int64'}
frame_raw = pd.read_csv('a.csv', dtype=col_types_map, on_bad_lines='warn', engine='python')
Why this breaks
The on_bad_lines parameter deals with parsing anomalies such as too many or too few fields in a row. It does not intervene when a field value doesn’t match the target dtype. Casting happens after the row has been tokenized successfully, so a value like jj in an integer column won’t be routed through on_bad_lines. You either get an exception during dtype enforcement or end up with behavior you didn’t intend.
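To make the failure mode concrete, here is a minimal sketch (reusing a.csv from above; the exact exception text varies across pandas versions) showing that the error surfaces as a ValueError from dtype enforcement while on_bad_lines never fires:
import pandas as pd
try:
    # every row has exactly three fields, so on_bad_lines='warn' stays silent
    pd.read_csv('a.csv', dtype={'id': 'int64'}, on_bad_lines='warn', engine='python')
except ValueError as exc:
    # the ValueError comes from dtype enforcement, not from the tokenizer
    print(f'dtype enforcement failed: {exc}')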
Solution: read as strings, then convert explicitly
The reliable pattern is to read everything as strings first, then convert targeted columns with pandas.to_numeric. The errors parameter lets you choose how to handle incompatible values: coerce turns invalid values into NaN, raise throws an exception, and ignore returns the input unchanged. Since ignore is deprecated (as of pandas 2.2), you can implement equivalent behavior with a small wrapper.
import pandas as pd
rows_all_str = pd.read_csv('a.csv', dtype=str, on_bad_lines='warn', engine='python')
rows_all_str['id'] = pd.to_numeric(rows_all_str['id'], errors='coerce')
rows_all_str['age'] = pd.to_numeric(rows_all_str['age'], errors='coerce')
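When coercing, it often pays to audit which raw values were lost. A small sketch (the auditing mask is an addition for illustration, not part of the original answer): keep a copy of the raw strings, coerce, then compare.
import pandas as pd
frame = pd.read_csv('a.csv', dtype=str, on_bad_lines='warn', engine='python')
raw_id = frame['id'].copy()  # keep the raw strings for auditing
frame['id'] = pd.to_numeric(frame['id'], errors='coerce')
# values that were present as text but came back NaN failed the conversion
failed_mask = frame['id'].isna() & raw_id.notna()
print(raw_id[failed_mask])  # -> 'jj' at row index 2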
If you prefer to keep non-numeric values as-is rather than producing NaN, use a safe caster with try/except.
def cast_numeric_safely(x):
    """Convert x to a number when possible; otherwise return it unchanged."""
    try:
        return pd.to_numeric(x)
    except (ValueError, TypeError):
        return x
mixed_frame = pd.read_csv('a.csv', dtype=str, on_bad_lines='warn', engine='python')
mixed_frame['id'] = mixed_frame['id'].apply(cast_numeric_safely)
mixed_frame['age'] = mixed_frame['age'].apply(cast_numeric_safely)
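If several columns need the same treatment, a short loop keeps it tidy; the column list below is illustrative.
numeric_cols = ['id', 'age']  # hypothetical list of columns that should be numeric
for col in numeric_cols:
    mixed_frame[col] = mixed_frame[col].apply(cast_numeric_safely)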
You can choose the behavior by swapping the conversion strategy. The documentation for pandas.to_numeric explains the options for errors: coerce converts invalid values to NaN, raise (the default) throws, and the deprecated ignore returns the input unchanged. See: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html
What the outcomes look like
Using the ignore logic (built-in or custom), non-numeric values remain strings. Note that the Series-level ignore returns a column unchanged if any value in it fails, which is why id stays as strings here while age converts:
   id  name   age
0   1   NaN  25.0
1   2   NaN  30.0
2  jj  John   NaN
Using errors='coerce' turns invalid numerics into NaN:
    id  name   age
0  1.0   NaN  25.0
1  2.0   NaN  30.0
2  NaN  John   NaN
Using errors='raise' stops on the first bad value:
ValueError: Unable to parse string "jj" at position 2
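As an aside, the .0 suffixes in the outputs above appear because NaN cannot be stored in a plain int64 column, so coerced columns come back as float64. If you want integer columns that tolerate missing values, converting to pandas' nullable Int64 dtype is one option; a minimal sketch:
# coerce to numeric, then move into the nullable integer dtype so NaN becomes <NA>
rows_all_str['age'] = pd.to_numeric(rows_all_str['age'], errors='coerce').astype('Int64')
print(rows_all_str['age'])
# 0      25
# 1      30
# 2    <NA>
# Name: age, dtype: Int64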
Why this matters
Data ingestion paths should fail only when the data is structurally broken. Value-level quirks are common and should be handled deterministically. Separating parsing from typing gives you control: you can decide whether to preserve raw values, null them out, or halt the load. It keeps your pipeline predictable and your downstream logic honest.
Takeaways
Don’t rely on on_bad_lines for type issues; it only addresses structural parse errors like unexpected field counts. Read ambiguous columns as strings first, then convert with pandas.to_numeric and an explicit errors policy. If you need to keep non-numeric values, implement a small safe caster rather than leaning on deprecated behavior. This approach produces clearer intent, better error handling, and cleaner data flows.
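To tie the takeaways together, here is a minimal end-to-end sketch; the helper name and the per-column policy map are illustrative, not from the original answer.
import pandas as pd

def load_csv_typed(path, numeric_policies):
    """Read all columns as strings, then convert each listed column
    with an explicit pandas.to_numeric errors policy ('coerce' or 'raise')."""
    frame = pd.read_csv(path, dtype=str, on_bad_lines='warn', engine='python')
    for col, policy in numeric_policies.items():
        frame[col] = pd.to_numeric(frame[col], errors=policy)
    return frame

typed = load_csv_typed('a.csv', {'id': 'coerce', 'age': 'coerce'})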
The article is based on a question from StackOverflow by Despicable me and an answer by Aadvik.