2025, Oct 04 07:00

Resolve pandas ValueError: Unable to parse string "Reality-TV" in runtimeMinutes from IMDB title.basics.tsv

Fix pandas read_csv failures on IMDB title.basics.tsv by declaring non-numeric runtimeMinutes tokens as NA via na_values, so the column loads as Int64 without raising ValueError.

Parsing IMDB's title.basics.tsv with pandas looks straightforward until a simple dtype cast explodes with ValueError: Unable to parse string "Reality-TV". The confusing part is that the reported position may not match what you see around that line in the raw file. Instead of chasing line numbers, it pays off to validate what actually lives in runtimeMinutes and treat non-numeric tokens explicitly.

Reproducing the failure

The typical ingestion step forces a nullable integer via pandas' Int64 and treats "\N" as missing. That is exactly where the error surfaces.

import pandas as pd

movie_data = pd.read_csv("title.basics.tsv",
                         sep="\t",
                         dtype={
                             "runtimeMinutes": "Int64",
                         },
                         na_values={
                             "runtimeMinutes": ["\\N"],
                         })

The exception ValueError: Unable to parse string "Reality-TV" indicates that the column contains values that are not numbers and not covered by the current na_values mapping.
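The failure can be reproduced without downloading the full dataset. The following is a minimal sketch, assuming a tiny in-memory stand-in for the file: the rows, the tconst values, and the names sample_tsv and error_message are made up for illustration, and only the runtimeMinutes column name and the tokens "\N" and "Reality-TV" come from the case described here.

```python
import io

import pandas as pd

# Hypothetical four-row stand-in for title.basics.tsv (not real IMDB rows).
sample_tsv = (
    "tconst\truntimeMinutes\n"
    "tt0000001\t45\n"
    "tt0000002\t\\N\n"
    "tt0000003\tReality-TV\n"
    "tt0000004\t90\n"
)

error_message = None
try:
    # Same options as the failing call: Int64 dtype, only "\N" marked as NA.
    pd.read_csv(
        io.StringIO(sample_tsv),
        sep="\t",
        dtype={"runtimeMinutes": "Int64"},
        na_values={"runtimeMinutes": ["\\N"]},
    )
except ValueError as exc:
    # Keep the message so it can be inspected instead of crashing the script.
    error_message = str(exc)

print(error_message)
```

The position mentioned in the message appears to count parsed values rather than lines of the file, which may be one reason it is hard to match against the raw text.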

What actually causes the error

The runtimeMinutes field is expected to be numeric but, in practice, it also contains text values such as "Reality-TV", which reads more like a genre label than a duration. Those tokens cannot be cast to Int64 during read_csv, hence the parse failure. The practical way forward is to enumerate which unique values block the cast and treat them as missing on load.

Finding the non-numeric tokens

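The snippet below reads the file without forcing dtype on runtimeMinutes, scans the unique non-missing values, and collects anything that fails int(). Missing values are skipped because "\N" is already handled by na_values. It also prints the offending values to make the data issues explicit.

import pandas as pd

# No dtype is forced here, so the load succeeds even with text in the column.
raw_frame = pd.read_csv("title.basics.tsv",
                        sep="\t",
                        na_values={
                            "runtimeMinutes": ["\\N"],
                        })

def extract_bad_markers(tbl, field_name):
    """Return the unique non-missing values of field_name that int() rejects."""
    anomalies = []

    print(f"{'Type':20} | {'Value'}")
    print('-'*53)
    # dropna() skips NaN, which would otherwise fail int() and be reported too.
    for item in tbl[field_name].dropna().unique():
        try:
            int(item)
        except (ValueError, TypeError):
            print(f"{str(type(item)):20} | {item}")
            anomalies.append(item)

    print("\nIncorrect values:", anomalies)
    return anomalies

invalid_values = extract_bad_markers(raw_frame, "runtimeMinutes")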

The presence of strings such as "Reality-TV" in runtimeMinutes is what triggers the parsing error.
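The same discovery step can also be written without a Python-level loop. This is a sketch of a vectorized variant, not the method from the original answer: it reads the column as text and uses pd.to_numeric with errors="coerce" to find values that are present but not numeric. The sample data and the names sample_tsv, runtime_text, and bad_tokens are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for title.basics.tsv (not real IMDB rows).
sample_tsv = (
    "tconst\truntimeMinutes\n"
    "tt0000001\t45\n"
    "tt0000002\t\\N\n"
    "tt0000003\tReality-TV\n"
    "tt0000004\t90\n"
    "tt0000005\tReality-TV\n"
)

# Read runtimeMinutes as text so every value has one consistent type.
frame = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    dtype={"runtimeMinutes": str},
    na_values={"runtimeMinutes": ["\\N"]},
)

runtime_text = frame["runtimeMinutes"]
# Values that cannot be parsed as numbers become NaN instead of raising.
as_number = pd.to_numeric(runtime_text, errors="coerce")
# A bad token is present in the text column but missing after conversion.
bad_mask = runtime_text.notna() & as_number.isna()
bad_tokens = sorted(runtime_text[bad_mask].unique())

print(bad_tokens)
```

Note that pd.to_numeric accepts some values that int() rejects, such as decimals, so the two checks are not strictly equivalent.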

Loading cleanly by marking non-numeric tokens as NA

Once you know the set of invalid markers, instruct read_csv to treat them as missing alongside "\N". Then pandas can safely load the column as Int64.

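# Continues from the previous snippet: invalid_values holds the discovered tokens.
# Build a new list rather than mutating invalid_values in place.
na_tokens = ["\\N", *invalid_values]

clean_titles = pd.read_csv("title.basics.tsv",
                           sep="\t",
                           dtype={
                               "runtimeMinutes": "Int64",
                           },
                           na_values={
                               "runtimeMinutes": na_tokens,
                           })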
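To check the effect on a small scale, the load can be tried on an in-memory sample. This is a minimal sketch under stated assumptions: the rows and the names sample_tsv, tokens, and clean_sample are hypothetical, and the token list is written out by hand instead of being discovered.

```python
import io

import pandas as pd

# Hypothetical stand-in for title.basics.tsv (not real IMDB rows).
sample_tsv = (
    "tconst\truntimeMinutes\n"
    "tt0000001\t45\n"
    "tt0000002\t\\N\n"
    "tt0000003\tReality-TV\n"
    "tt0000004\t90\n"
)

# The regular missing marker plus the token found during discovery.
tokens = ["\\N", "Reality-TV"]

clean_sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    dtype={"runtimeMinutes": "Int64"},
    na_values={"runtimeMinutes": tokens},
)

# Inspect the resulting dtype and how many rows ended up missing.
print(clean_sample["runtimeMinutes"].dtype)
print(clean_sample["runtimeMinutes"].isna().sum())
```

Comparing the missing count before and after adding the extra tokens shows how many rows were affected by the non-numeric values.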

This approach reads the file twice on the first run: once to discover the invalid tokens and once for the final load. The payoff is a reliable ingestion step you can reuse; after the initial pass, you can save the properly processed DataFrame and consume it directly.
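One way to store the processed result is sketched below. It assumes pickle as the storage format, chosen here only because it needs no extra dependencies and keeps the Int64 dtype; the small DataFrame, the temporary directory, and the file name title_basics_clean.pkl are placeholders for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame produced by the final load.
clean_titles = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "runtimeMinutes": pd.array([45, None], dtype="Int64"),
})

# Placeholder location; a real pipeline would use a stable path.
cache_path = os.path.join(tempfile.mkdtemp(), "title_basics_clean.pkl")

# Save once after cleaning, then load the cached copy on later runs.
clean_titles.to_pickle(cache_path)
restored = pd.read_pickle(cache_path)

print(restored["runtimeMinutes"].dtype)
```

Pickle files should only be loaded from sources you trust, so a format such as Parquet may be a better fit when the file is shared, provided a Parquet engine is installed.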

Why you want this in your data pipeline

Schema assumptions are brittle when real-world datasets mix types in a single field. By explicitly discovering and declaring all non-numeric tokens as NA, you make the load predictable, keep the nullable integer semantics intact, and avoid chasing misleading positions in error messages. The result is a repeatable load step that fails less often and documents the data quirks you must handle downstream.

Takeaways

When dtype casting blows up in pandas, verify the actual domain of the column instead of relying on expectations. Read without the dtype first, enumerate unique values, and capture everything that int() cannot convert. Feed that set into na_values and reload with the target dtype. For the IMDB title.basics.tsv case, this turns runtimeMinutes into a proper Int64 column by treating both the "\N" placeholder and the unexpected strings as missing. Store the cleaned dataset to skip the discovery step next time and keep your pipeline fast and predictable.

The article is based on a question from StackOverflow by red_trumpet and an answer by Sindik.
