2025, Dec 12 23:00

Make DataFrame creation portable between Databricks Connect and PySpark: avoid inference errors with tuples and exact types

Learn why DataFrame creation breaks between Databricks Connect and PySpark (CANNOT_INFER_SCHEMA_FOR_TYPE) and fix it with tuple rows and explicit casting.

Running the same unit tests for an ETL pipeline across databricks-connect in VS Code and plain pyspark inside a Docker container can surface subtle incompatibilities. DataFrame creation is one of those cases: code that works fine with databricks-connect and even inside a Databricks notebook may fail under pyspark with errors like CANNOT_INFER_SCHEMA_FOR_TYPE or CANNOT_ACCEPT_OBJECT_IN_TYPE. The root cause lies in differences in type inference and how strictly schema types are enforced.

Reproducing the issue

The Spark session is initialized to work in both environments. In VS Code it uses databricks-connect; in Docker it falls back to a local pyspark session.

from pyspark.sql import SparkSession

def init_session() -> SparkSession:
    try:
        from databricks.connect import DatabricksSession
        cfg = load_profile()  # loads host, token, and cluster_id from a Databricks profile
        return DatabricksSession.builder.remote(
            host=cfg.host,
            token=cfg.token,
            cluster_id=cfg.cluster_id,
        ).getOrCreate()
    except ImportError:  # databricks-connect not installed (e.g. inside Docker): fall back to local pyspark
        return (
            SparkSession.builder
            .master("local[4]")
            .config("spark.databricks.clusterUsageTags.clusterAllTags", "{}")
            .getOrCreate()
        )

The first failing case is creating a single-column DataFrame directly from a list of values. This works with databricks-connect, but fails in pyspark with a schema inference error.

spark_ctx = init_session()

# Fails in plain pyspark with CANNOT_INFER_SCHEMA_FOR_TYPE:
# each element is a bare scalar rather than a row
frame_values = spark_ctx.createDataFrame(
    [
        "123",
        123456,
        12345678912345,
    ],
    ["my_str_column"],
)

The second failing case is providing an integer for a FloatType field. In databricks-connect this passes, while pyspark rejects the int against FloatType with a strict type error.

from pyspark.sql.types import StructType, StructField, FloatType

rows_raw = [
    (102),  # parentheses around a single value do not make a tuple: this is just the int 102
]

schema_spec = StructType([
    StructField("my_float_column", FloatType()),
])

# Fails in plain pyspark: FloatType does not accept a Python int (CANNOT_ACCEPT_OBJECT_IN_TYPE)
frame_strict = spark_ctx.createDataFrame(rows_raw, schema_spec)

What’s really going on

The difference stems from two behaviors. For single-column inputs, databricks-connect accepts a bare list of values and infers a one-column schema, while pyspark expects each row to be a tuple, even if there is only one column. And for declared schemas, databricks-connect is more permissive with implicit conversions, whereas pyspark enforces the declared types strictly and requires inputs to match those types exactly.
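
To make the expectation concrete, a small helper (hypothetical, not part of the original pipeline) can normalize input into the row shape plain pyspark expects, producing exactly the tuples used in the fix below:

def to_rows(values):
    # Leave real tuples untouched; wrap bare scalars in a one-element tuple
    return [v if isinstance(v, tuple) else (v,) for v in values]

to_rows(["123", 123456, 12345678912345])
# -> [('123',), (123456,), (12345678912345,)]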

Fix that works in both runtimes

The portable approach is straightforward: always pass rows as tuples, even for a single column, and pre-cast values to the declared types before creating the DataFrame.

# Single-column DataFrame: use tuples for each row
stable_one_col = spark_ctx.createDataFrame(
    [
        ("123",),
        (123456,),
        (12345678912345,),
    ],
    ["my_str_column"],
)

stable_one_col.show()
+--------------+
| my_str_column|
+--------------+
|           123|
|        123456|
|12345678912345|
+--------------+

# Schema enforcement: cast to the exact type expected
from pyspark.sql.types import StructType, StructField, FloatType

rows_casted = [
    (102.0,),
]

schema_spec = StructType([
    StructField("my_float_column", FloatType()),
])

stable_with_schema = spark_ctx.createDataFrame(rows_casted, schema_spec)

stable_with_schema.show()
+---------------+
|my_float_column|
+---------------+
|          102.0|
+---------------+
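
When the values come from an upstream source rather than being written as literals, the same pre-cast can be applied programmatically before handing the rows to Spark (a sketch; raw_ints is a hypothetical input, and Python's float() yields the type FloatType expects):

raw_ints = [102, 205]  # hypothetical upstream values
rows_from_source = [(float(v),) for v in raw_ints]  # one tuple per row, value cast to float

frame_from_source = spark_ctx.createDataFrame(rows_from_source, schema_spec)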

Why this matters

If unit tests run in GitLab CI on pyspark while local development uses databricks-connect, inconsistent DataFrame construction quickly leads to flaky tests and failures that have nothing to do with the pipeline logic. Normalizing input shapes and being explicit about types makes tests deterministic and lets the same code paths run cleanly in both VS Code and Docker-based CI. It also keeps behavior aligned with how the same code executes in Databricks notebooks.
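
For example, a minimal pytest setup (a sketch, assuming pytest is the test runner and the init_session helper shown earlier is importable) gives every test the same entry point in both environments:

import pytest

@pytest.fixture(scope="session")
def spark_session():
    # Resolves to a DatabricksSession in VS Code and to a local pyspark session in Docker-based CI
    return init_session()

def test_single_column_rows(spark_session):
    frame = spark_session.createDataFrame([("123",), ("456",)], ["my_str_column"])
    assert frame.count() == 2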

Conclusion

When targeting both databricks-connect and pyspark, treat DataFrame input consistently. For one-column data, feed tuples rather than bare values. When supplying an explicit schema, cast inputs to the exact target types ahead of time. These two practices eliminate the observed discrepancies, keep your unit tests portable across environments, and reduce surprises in CI pipelines.