2025, Dec 26 19:00

Avoid nulls when converting string dates in PySpark: cast(DateType) vs to_date with explicit format strings

Learn why PySpark cast(DateType) turns non-ISO string dates into nulls and how to fix it using to_date with explicit formats. Prevent silent data loss in mixed-format date columns.

When string columns actually store dates in multiple human-readable formats, a naive cast to DateType in PySpark looks tempting. It often works for ISO-like strings, but falls apart on other layouts. A typical symptom is a column that clearly contains dates, yet turns into nulls after casting.

Reproducing the issue

The example below inspects a DataFrame, tries to infer types from min and max values via Python’s datetime parsing, and then casts columns accordingly. Despite correct identification of date-like strings on the Python side, the Date2 and Date3 columns end up as nulls after casting.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, min as spark_min, max as spark_max
from pyspark.sql.types import IntegerType, FloatType, TimestampType, StringType, DateType
from datetime import datetime, date
# Parse a single value into int, float, timestamp, date, or leave as-is
def coerce_atom(x):
    try:
        return int(x)
    except ValueError:
        pass
    try:
        return float(x)
    except ValueError:
        pass
    ts_layouts = [
        '%m/%d/%Y %H:%M:%S', '%Y-%m-%d %H:%M:%S',
        '%Y-%m-%dT%H:%M:%S', '%Y-%m-%d %H:%M:%S.%f', '%Y-%m-%dT%H:%M:%S.%f'
    ]
    for p in ts_layouts:
        try:
            return datetime.strptime(x, p)
        except ValueError:
            pass
    d_layouts = [
        '%Y-%m-%d', '%d-%m-%Y', '%m/%d/%Y', '%d/%m/%Y', '%Y/%m/%d',
        '%b %d, %Y', '%d %b %Y'
    ]
    for p in d_layouts:
        try:
            return datetime.strptime(x, p).date()
        except ValueError:
            pass
    return x
# Infer the Spark SQL type for a column based on sample extrema
def sniff_dtype(frame, field):
    lo_val = frame.select(spark_min(col(field))).collect()[0][0]
    hi_val = frame.select(spark_max(col(field))).collect()[0][0]
    for token in [lo_val, hi_val]:
        if token is not None:
            parsed = coerce_atom(token)
            print(f"Field: {field}, Raw: {token}, Parsed: {parsed}")
            if isinstance(parsed, int):
                return IntegerType()
            elif isinstance(parsed, float):
                return FloatType()
            elif isinstance(parsed, datetime):
                return TimestampType()
            elif isinstance(parsed, date):
                return DateType()
    return StringType()
# Sample rows with different date formats per column
sample_rows = [
    ('1', '2021-01-01', '01-02-2021', '1/2/2021', '2021-01-01T12:34:56', '1.1', 1),
    ('2', '2021-02-01', '02-03-2021', '2/3/2021', '2021-02-01T13:45:56', '2.2', 2),
    ('3', '2021-03-01', '03-04-2021', '3/4/2021', '2021-03-01T14:56:56', '3.3', 3)
]
spark_app = SparkSession.builder.appName("example").getOrCreate()
headers = ['A', 'Date1', 'Date2', 'Date3', 'Date4', 'C', 'D']
frame = spark_app.createDataFrame(sample_rows, headers)
# Cast each column to the inferred Spark SQL type
for field in frame.columns:
    guess_t = sniff_dtype(frame, field)
    frame = frame.withColumn(field, frame[field].cast(guess_t))
frame.show()
print(frame.dtypes)
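Run under Spark 3.x, the cast step raises no error; the dtypes end up roughly as below, and show() reveals that every value in Date2 and Date3 has become null even though both columns did receive the DateType:

[('A', 'int'), ('Date1', 'date'), ('Date2', 'date'), ('Date3', 'date'), ('Date4', 'timestamp'), ('C', 'float'), ('D', 'int')]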

What’s really happening

The Python step successfully parses strings like 01-02-2021 and 1/2/2021 into date objects. The failure comes later, during the Spark cast. cast(DateType()) accepts only ISO-like patterns such as yyyy-MM-dd for Date and yyyy-MM-dd HH:mm:ss[.SSS] for Timestamp. Strings like 01-02-2021 or 1/2/2021 don’t match that pattern, so Spark turns them into null during the cast.
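To see the two behaviors side by side, here is a minimal probe that reuses the spark_app session from the script above: the plain cast returns null for the non-ISO string, while to_date with a matching pattern parses it.

from pyspark.sql.functions import col, to_date
from pyspark.sql.types import DateType
probe = spark_app.createDataFrame([('01-02-2021',)], ['raw'])
probe.select(
    col('raw').cast(DateType()).alias('via_cast'),          # null: '01-02-2021' is not yyyy-MM-dd
    to_date(col('raw'), 'dd-MM-yyyy').alias('via_to_date')  # 2021-02-01
).show()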

The fix

Instead of relying on cast(DateType()) for non-ISO strings, convert the original string columns to dates with to_date and an explicit format per column. Note that this replaces the cast step for those columns rather than running after it; once the cast has nulled the values, there is nothing left to parse.

from pyspark.sql.functions import to_date
frame = frame.withColumn("Date1", to_date("Date1", "yyyy-MM-dd"))
frame = frame.withColumn("Date2", to_date("Date2", "dd-MM-yyyy"))
frame = frame.withColumn("Date3", to_date("Date3", "M/d/yyyy"))

This keeps Date2 and Date3 from collapsing into null values and yields proper Date columns by telling Spark exactly how to parse each string.
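The same idea extends to timestamp strings. Date4 happens to be ISO-like, so the plain cast succeeds, but a layout that cast does not accept would need to_timestamp with an explicit pattern, again applied to the original string column. The pattern below is a sketch matching the Date4 layout used in this example.

from pyspark.sql.functions import to_timestamp
frame = frame.withColumn("Date4", to_timestamp("Date4", "yyyy-MM-dd'T'HH:mm:ss"))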

Why this is worth remembering

Python’s datetime parsing and Spark’s casting are different stages with different expectations. It’s easy to assume that if Python can parse the value, a subsequent cast will “just work”. It won’t unless the string already conforms to Spark’s accepted patterns. When input data comes in varied date layouts, directing Spark with to_date and the correct pattern avoids silent data loss.
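When a single column mixes several layouts, one workable approach (a sketch, not part of the original example; MixedDate and the format list are hypothetical, and it assumes ANSI mode is off, the Spark 3.x default, so a non-matching pattern yields null rather than an error) is to try each candidate format and keep the first successful parse with coalesce:

from pyspark.sql.functions import coalesce, col, to_date
candidate_formats = ['yyyy-MM-dd', 'dd-MM-yyyy', 'M/d/yyyy']  # hypothetical list of expected layouts
frame = frame.withColumn(
    'MixedDate',  # hypothetical column containing strings in several layouts
    coalesce(*[to_date(col('MixedDate'), fmt) for fmt in candidate_formats])
)

Order the candidate formats from most to least specific, since an ambiguous string is claimed by the first pattern that parses it.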

Conclusion

When converting string dates in PySpark, reserve cast(DateType()) for ISO-like text. For columns like 01-02-2021 or 1/2/2021, switch to to_date with an explicit format string. This small change eliminates unexpected nulls and keeps your date columns reliable end to end.