2025, Oct 04 07:00
Resolve pandas ValueError: Unable to parse string "Reality-TV" in runtimeMinutes from IMDB title.basics.tsv
Fix pandas read_csv failures on IMDB title.basics.tsv by marking non-numeric runtimeMinutes tokens as NA via na_values and Int64, avoiding ValueError safely.
Parsing IMDB's title.basics.tsv with pandas looks straightforward until a simple dtype cast explodes with ValueError: Unable to parse string "Reality-TV". The confusing part is that the reported position may not match what you see around that line in the raw file. Instead of chasing line numbers, it pays off to validate what actually lives in runtimeMinutes and treat non-numeric tokens explicitly.
Reproducing the failure
The typical ingestion step forces a nullable integer via pandas' Int64 and treats "\N" as missing. That is exactly where the error surfaces.
import pandas as pd
movie_data = pd.read_csv("title.basics.tsv",
                         sep="\t",
                         dtype={
                             "runtimeMinutes": "Int64",
                         },
                         na_values={
                             "runtimeMinutes": ["\\N"],
                         })
The exception ValueError: Unable to parse string "Reality-TV" indicates that the column contains values that are not numbers and not covered by the current na_values mapping.
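The failure can be reproduced without the full download. The sketch below is a minimal illustration: SAMPLE_TSV holds invented rows that only mimic the layout of title.basics.tsv, and try_strict_load is a hypothetical helper name, not part of pandas.

```python
import io

import pandas as pd

# Invented rows that mimic the layout of title.basics.tsv (not real IMDB data).
SAMPLE_TSV = (
    "tconst\ttitleType\tprimaryTitle\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tFirst Title\t1\tDocumentary,Short\n"
    "tt0000002\tshort\tSecond Title\t\\N\tAnimation,Short\n"
    "tt0000003\ttvSeries\tThird Title\tReality-TV\t\\N\n"
)


def try_strict_load(tsv_text):
    """Return the ValueError raised by the strict load, or None if it succeeds."""
    try:
        pd.read_csv(
            io.StringIO(tsv_text),
            sep="\t",
            dtype={"runtimeMinutes": "Int64"},
            na_values={"runtimeMinutes": ["\\N"]},
        )
    except ValueError as exc:
        return exc
    return None


error = try_strict_load(SAMPLE_TSV)
print(type(error).__name__, error)
```

Replacing the stray token with a number makes the same call succeed, which suggests that the text value alone is what blocks the cast.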
What actually causes the error
The runtimeMinutes field is expected to hold integers or the "\N" placeholder, but in practice some rows also carry text such as "Reality-TV" in that column. Those text tokens cannot be cast to Int64 during read_csv, hence the parse failure. The practical way forward is to enumerate which unique values block the cast and treat them as missing on load.
Finding the non-numeric tokens
The snippet below reads the file without forcing dtype on runtimeMinutes, scans unique values, and collects anything that fails int(). It also prints the offending values to make the data issues explicit.
import pandas as pd
raw_frame = pd.read_csv("title.basics.tsv",
                        sep="\t",
                        na_values={
                            "runtimeMinutes": ["\\N"],
                        })
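def extract_bad_markers(tbl, field_name):
    """Print and return the unique values of a column that int() cannot convert."""
    anomalies = []
    print(f"{'Type':20} | {'Value'}")
    print('-'*53)
    for item in tbl[field_name].unique():
        # "\N" was already read as NaN via na_values; it is missing, not malformed.
        if pd.isna(item):
            continue
        try:
            int(item)
        except (ValueError, TypeError):
            # Catch only conversion failures; a bare except would also hide KeyboardInterrupt.
            print(f"{str(type(item)):20} | {item}")
            anomalies.append(item)
    print("\nIncorrect values:", anomalies)
    return anomalies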
invalid_values = extract_bad_markers(raw_frame, "runtimeMinutes")
The presence of strings such as "Reality-TV" in runtimeMinutes is what triggers the parsing error.
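As an alternative to the int() loop, the same discovery can be sketched in vectorized form with pd.to_numeric(errors="coerce"). This is a swapped-in technique, not the method used above: find_non_numeric_tokens is a hypothetical helper, SAMPLE_TSV contains invented rows, and reading only the one column as raw text is an assumption made to keep the scan cheap.

```python
import io

import pandas as pd

# Invented rows that mimic the layout of title.basics.tsv (not real IMDB data).
SAMPLE_TSV = (
    "tconst\ttitleType\tprimaryTitle\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tFirst Title\t1\tDocumentary,Short\n"
    "tt0000002\tshort\tSecond Title\t\\N\tAnimation,Short\n"
    "tt0000003\ttvSeries\tThird Title\tReality-TV\t\\N\n"
)


def find_non_numeric_tokens(source, column):
    """Return the sorted distinct raw strings in `column` that are not numbers."""
    raw = pd.read_csv(
        source,
        sep="\t",
        usecols=[column],       # load only the column under inspection
        dtype=str,              # keep every token as the raw text from the file
        keep_default_na=False,  # do not let pandas convert tokens to NaN on its own
    )[column]
    tokens = pd.Series(raw.unique())
    # errors="coerce" maps anything unparseable to NaN instead of raising.
    parsed = pd.to_numeric(tokens, errors="coerce")
    return sorted(tokens[parsed.isna()])


bad_tokens = find_non_numeric_tokens(io.StringIO(SAMPLE_TSV), "runtimeMinutes")
print(bad_tokens)
```

Because the column is read as plain text, the "\N" placeholder shows up in the result as well, so the returned list can be passed to na_values as it is. Note that pd.to_numeric is more permissive than int(): it also accepts decimals such as "1.5", so those would not be flagged here.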
Loading cleanly by marking non-numeric tokens as NA
Once you know the set of invalid markers, instruct read_csv to treat them as missing alongside "\N". Then pandas can safely load the column as Int64.
invalid_values.append("\\N")
clean_titles = pd.read_csv("title.basics.tsv",
                           sep="\t",
                           dtype={
                               "runtimeMinutes": "Int64",
                           },
                           na_values={
                               "runtimeMinutes": invalid_values,
                           })
This approach takes longer on the first run because the file is read twice: once to scan the unique values and once for the final load. The payoff is a reliable ingestion step you can reuse; after the initial pass, you can save the properly processed DataFrame and consume it directly.
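Saving the processed frame could look like the sketch below. It is an illustration under stated assumptions: the rows in SAMPLE_TSV are invented, the cache file name is hypothetical, and pickle is chosen only because it needs no extra dependency and keeps the Int64 dtype; other formats may suit a real pipeline better.

```python
import io
import os
import tempfile

import pandas as pd

# Invented rows that mimic the layout of title.basics.tsv (not real IMDB data).
SAMPLE_TSV = (
    "tconst\ttitleType\tprimaryTitle\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tFirst Title\t1\tDocumentary,Short\n"
    "tt0000002\tshort\tSecond Title\t\\N\tAnimation,Short\n"
    "tt0000003\ttvSeries\tThird Title\tReality-TV\t\\N\n"
)

# Final load: every known non-numeric token is declared as missing.
clean = pd.read_csv(
    io.StringIO(SAMPLE_TSV),
    sep="\t",
    dtype={"runtimeMinutes": "Int64"},
    na_values={"runtimeMinutes": ["Reality-TV", "\\N"]},
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical cache location; a real pipeline would use a persistent path.
    cache_path = os.path.join(tmp_dir, "title_basics_clean.pkl")
    clean.to_pickle(cache_path)
    restored = pd.read_pickle(cache_path)

print(restored["runtimeMinutes"].dtype)
print(restored["runtimeMinutes"].isna().sum())
```

Pickle files should only be read back when you produced them yourself, since unpickling data from an untrusted source can execute arbitrary code.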
Why you want this in your data pipeline
Schema assumptions are brittle when real-world datasets mix types in a single field. By explicitly discovering and declaring all non-numeric tokens as NA, you make the parser deterministic, keep the nullable integer semantics intact, and avoid chasing misleading positions in error messages. The result is a repeatable load step that fails less and documents the data quirks you must handle downstream.
Takeaways
When dtype casting blows up in pandas, verify the actual domain of the column instead of relying on expectations. Read without the dtype first, enumerate unique values, and capture everything that cannot be int()-ed. Feed that set into na_values and reload with the target dtype. For the IMDB title.basics.tsv case, this turns runtimeMinutes into a proper Int64 column by treating unexpected strings, along with the "\N" placeholder, as missing. Store the cleaned dataset to skip the discovery step next time and keep your pipeline fast and predictable.
The article is based on a question from StackOverflow by red_trumpet and an answer by Sindik.