2025, Sep 27 19:00
Handling mixed types in pandas read_csv: read as strings, then convert with to_numeric (coerce, raise)
Learn why pandas on_bad_lines won't catch type errors in CSVs. Read all columns as strings, then convert with to_numeric and a clear errors policy for ETL.
CSV files have a habit of mixing types, especially when human-entered data slips a string into a column that’s supposed to be an integer. If you’re relying on on_bad_lines to catch this, it won’t help. That parameter only kicks in when the parser stumbles on a structural problem, like a delimiter or column-count mismatch, not when a value fails an integer cast. Here’s a clean way to ingest such data without blowing up your pipeline.
Problem setup
Imagine a CSV where numeric columns sometimes contain strings or empty values.
id,name,age
1,,"25"
2,"",30
jj,John,
A straightforward read with dtype hints looks reasonable, but it won’t handle mixed types as you expect.
import pandas as pd
col_types_map = {'id': 'int64', 'name': 'str', 'age': 'int64'}
frame_raw = pd.read_csv('a.csv', dtype=col_types_map, on_bad_lines='warn', engine='python')
Why this breaks
The on_bad_lines parameter deals with parsing anomalies such as too many or too few fields in a row. It does not intervene when a field value doesn’t match the target dtype. Casting happens after the row has been tokenized successfully, so a value like jj in an integer column won’t be routed through on_bad_lines. You either get an exception during dtype enforcement or end up with behavior you didn’t intend.
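To make the failure mode concrete, here is a minimal sketch (reusing a.csv from above; the exact exception text varies across pandas versions) showing that the error surfaces as a ValueError from dtype enforcement while on_bad_lines never fires:
import pandas as pd
try:
    # every row has exactly three fields, so on_bad_lines='warn' stays silent
    pd.read_csv('a.csv', dtype={'id': 'int64'}, on_bad_lines='warn', engine='python')
except ValueError as exc:
    # the ValueError comes from dtype enforcement, not from the tokenizer
    print(f'dtype enforcement failed: {exc}')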
Solution: read as strings, then convert explicitly
The reliable pattern is to read everything as strings first, then convert targeted columns with pandas.to_numeric. The errors parameter lets you choose how to handle incompatible values: coerce turns invalid values into NaN, raise throws an exception, and ignore returns the input unchanged. Since ignore is deprecated (as of pandas 2.2), you can implement equivalent behavior with a small wrapper.
import pandas as pd
rows_all_str = pd.read_csv('a.csv', dtype=str, on_bad_lines='warn', engine='python')
rows_all_str['id'] = pd.to_numeric(rows_all_str['id'], errors='coerce')
rows_all_str['age'] = pd.to_numeric(rows_all_str['age'], errors='coerce')
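When coercing, it often pays to audit which raw values were lost. A small sketch (the auditing mask is an addition for illustration, not part of the original answer): keep a copy of the raw strings, coerce, then compare.
import pandas as pd
frame = pd.read_csv('a.csv', dtype=str, on_bad_lines='warn', engine='python')
raw_id = frame['id'].copy()  # keep the raw strings for auditing
frame['id'] = pd.to_numeric(frame['id'], errors='coerce')
# values that were present as text but came back NaN failed the conversion
failed_mask = frame['id'].isna() & raw_id.notna()
print(raw_id[failed_mask])  # -> 'jj' at row index 2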
If you prefer to keep non-numeric values as-is rather than producing NaN, use a safe caster with try/except.
def cast_numeric_safely(x):
    """Convert x to a number when possible; otherwise return it unchanged."""
    try:
        return pd.to_numeric(x)
    except (ValueError, TypeError):
        return x
mixed_frame = pd.read_csv('a.csv', dtype=str, on_bad_lines='warn', engine='python')
mixed_frame['id'] = mixed_frame['id'].apply(cast_numeric_safely)
mixed_frame['age'] = mixed_frame['age'].apply(cast_numeric_safely)
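If several columns need the same treatment, a short loop keeps it tidy; the column list below is illustrative.
numeric_cols = ['id', 'age']  # hypothetical list of columns that should be numeric
for col in numeric_cols:
    mixed_frame[col] = mixed_frame[col].apply(cast_numeric_safely)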
You can choose the behavior by swapping the conversion strategy. The documentation for pandas.to_numeric explains the options for errors: coerce converts invalid values to NaN, raise (the default) throws, and the deprecated ignore returns the input unchanged. See: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html
What the outcomes look like
Using the ignore logic (built-in or custom), non-numeric values remain strings. Note that the Series-level ignore returns a column unchanged if any value in it fails, which is why id stays as strings here while age converts:
   id  name   age
0   1   NaN  25.0
1   2   NaN  30.0
2  jj  John   NaN
Using errors='coerce' turns invalid numerics into NaN:
    id  name   age
0  1.0   NaN  25.0
1  2.0   NaN  30.0
2  NaN  John   NaN
Using errors='raise' stops on the first bad value:
ValueError: Unable to parse string "jj" at position 2
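As an aside, the .0 suffixes in the outputs above appear because NaN cannot be stored in a plain int64 column, so coerced columns come back as float64. If you want integer columns that tolerate missing values, converting to pandas' nullable Int64 dtype is one option; a minimal sketch:
# coerce to numeric, then move into the nullable integer dtype so NaN becomes <NA>
rows_all_str['age'] = pd.to_numeric(rows_all_str['age'], errors='coerce').astype('Int64')
print(rows_all_str['age'])
# 0      25
# 1      30
# 2    <NA>
# Name: age, dtype: Int64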
Why this matters
Data ingestion paths should fail only when the data is structurally broken. Value-level quirks are common and should be handled deterministically. Separating parsing from typing gives you control: you can decide whether to preserve raw values, null them out, or halt the load. It keeps your pipeline predictable and your downstream logic honest.
Takeaways
Don’t rely on on_bad_lines for type issues; it only addresses structural parse errors like unexpected field counts. Read ambiguous columns as strings first, then convert with pandas.to_numeric and an explicit errors policy. If you need to keep non-numeric values, implement a small safe caster rather than leaning on deprecated behavior. This approach produces clearer intent, better error handling, and cleaner data flows.
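To tie the takeaways together, here is a minimal end-to-end sketch; the helper name and the per-column policy map are illustrative, not from the original answer.
import pandas as pd

def load_csv_typed(path, numeric_policies):
    """Read all columns as strings, then convert each listed column
    with an explicit pandas.to_numeric errors policy ('coerce' or 'raise')."""
    frame = pd.read_csv(path, dtype=str, on_bad_lines='warn', engine='python')
    for col, policy in numeric_policies.items():
        frame[col] = pd.to_numeric(frame[col], errors=policy)
    return frame

typed = load_csv_typed('a.csv', {'id': 'coerce', 'age': 'coerce'})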
The article is based on a question from StackOverflow by Despicable me and an answer by Aadvik.