
[PySpark] fix: support createDataFrame with list of dicts#388

Open
mariotaddeucci wants to merge 2 commits into duckdb:main from mariotaddeucci:fix/spark-create-dataframe-list-of-dict-clean

Conversation

@mariotaddeucci
Contributor

Fixes #183.

Problem

When calling spark.createDataFrame([{"col": value}, ...]), DuckDB's Spark API failed to infer the schema from the dict keys, unlike PySpark, which handles this natively.
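For illustration, the first step of the inference PySpark performs here (and that this PR ports) is taking the union of dict keys, in first-seen order, as the column names. A minimal pure-Python sketch of that naming step (not the PR's actual code):

```python
# Column names from dict rows: union of keys in first-seen order,
# so later rows may add columns that earlier rows lacked.
rows = [{"id": 1}, {"name": "a", "id": 2}]

columns: list = []
for row in rows:
    for key in row:
        if key not in columns:
            columns.append(key)

print(columns)  # ['id', 'name']
```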

Solution

Port schema inference logic from duckdb/duckdb#18051.

Changes

duckdb/experimental/spark/sql/types.py

  • Added _type_mappings and _array_type_mappings — dicts mapping Python types to Spark SQL DataTypes
  • Added _has_nulltype() — checks for NullType anywhere in a schema tree
  • Added _merge_type() — merges two DataTypes (used when inferring schema across multiple rows)
  • Added _infer_type() — infers a DataType from a Python object
  • Added _infer_schema() — infers a StructType schema from a dict/namedtuple/Row/object
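As a rough illustration of the `_type_mappings` plus `_infer_type` idea, the sketch below maps exact Python types to Spark SQL type names (hypothetical names and a simplified type set, not the PR's actual tables):

```python
from datetime import date, datetime
from decimal import Decimal

# Hypothetical mirror of the _type_mappings idea: exact Python type -> Spark SQL type name.
TYPE_MAPPINGS = {
    bool: "boolean",
    int: "bigint",
    float: "double",
    str: "string",
    bytes: "binary",
    Decimal: "decimal",
    date: "date",
    datetime: "timestamp",
}

def infer_type_name(obj) -> str:
    if obj is None:
        return "null"  # NullType until a later row refines it via merging
    try:
        # Exact type lookup, so bool maps to "boolean" despite subclassing int.
        return TYPE_MAPPINGS[type(obj)]
    except KeyError:
        raise TypeError(f"cannot infer type for {type(obj).__name__}")

print([infer_type_name(v) for v in [1, "x", 2.5, None]])
# ['bigint', 'string', 'double', 'null']
```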

duckdb/experimental/spark/sql/session.py

  • Updated _combine_data_and_schema() to handle dict rows (extract values in schema field order)
  • Added schema inference branch in createDataFrame() for list-of-dict input without explicit schema
  • Added _inferSchemaFromList() method to SparkSession
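The dict-row handling can be pictured with a small sketch: values are pulled out in the schema's field order, with missing keys becoming None (hypothetical `align_row` helper; the real `_combine_data_and_schema` works against a `StructType`):

```python
# Align dict rows to a fixed schema field order; absent keys become None.
def align_row(row: dict, field_names: list) -> tuple:
    return tuple(row.get(name) for name in field_names)

fields = ["id", "name"]
rows = [{"name": "a", "id": 1}, {"id": 2}]   # different key order; one missing key
print([align_row(r, fields) for r in rows])  # [(1, 'a'), (2, None)]
```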

tests/fast/spark/test_spark_column.py

  • Removed USE_ACTUAL_SPARK branching in test_struct_column — Row field names are now correctly inferred

tests/fast/spark/test_spark_dataframe.py

  • Added test_dataframe_from_list_dicts covering dicts with different key orders and missing/extra keys

Port schema inference from duckdb/duckdb#18051 to fix duckdb#183.

When calling spark.createDataFrame([{"col": value}, ...]), the Spark API
now infers the schema from dict keys, matching PySpark behavior.

Changes:
- Add _type_mappings, _array_type_mappings, _has_nulltype, _merge_type,
  _infer_type, and _infer_schema functions to types.py
- Update session.py to handle dict rows in _combine_data_and_schema and
  add schema inference branch in createDataFrame for list[dict]
- Add _inferSchemaFromList method to SparkSession
- Fix test_struct_column to use inferred field names instead of col0/col1
- Add test_dataframe_from_list_dicts test case
Copilot AI review requested due to automatic review settings March 19, 2026 01:42
Contributor

Copilot AI left a comment


Pull request overview

Adds Spark-like schema inference so SparkSession.createDataFrame() can accept a list of dict rows and infer column names/types consistently with PySpark behavior.

Changes:

  • Implemented schema/type inference helpers in sql/types.py (infer + merge + NullType detection).
  • Updated SparkSession.createDataFrame() and row conversion to support dict rows and infer schema when none is provided.
  • Extended/adjusted fast Spark test coverage for struct columns and list-of-dicts DataFrame creation.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `duckdb/experimental/spark/sql/types.py` | Adds type mappings plus infer/merge helpers used to infer schemas across Python objects and rows. |
| `duckdb/experimental/spark/sql/session.py` | Infers schema for list inputs without explicit schema and aligns dict-row values to schema order. |
| `tests/fast/spark/test_spark_column.py` | Simplifies struct column test now that Row field names are inferred correctly. |
| `tests/fast/spark/test_spark_dataframe.py` | Adds coverage for createDataFrame from list of dicts (key order differences, missing/extra keys). |


Comment on lines +158 to +160:

```python
elif isinstance(data, list) and data:
    schema = self._inferSchemaFromList(data)
    types, names = schema.extract_types_and_names()
```

Comment on lines +1266 to +1284:

```python
    return ArrayType(
        _merge_type(
            a.elementType,
            cast(ArrayType, b).elementType,
            name="element in array %s" % name,
        ),
        True,
    )
elif isinstance(a, MapType):
    return MapType(
        _merge_type(
            a.keyType, cast(MapType, b).keyType, name="key of map %s" % name
        ),
        _merge_type(
            a.valueType, cast(MapType, b).valueType, name="value of map %s" % name
        ),
        True,
    )
```

Comment on lines +199 to +201:

```python
def _inferSchemaFromList(
    self, data: Iterable[Any], names: Optional[List[str]] = None
) -> StructType:
```

Comment on lines +215 to +219:

```python
if not data:
    raise PySparkValueError(
        error_class="CANNOT_INFER_EMPTY_SCHEMA",
        message_parameters={},
    )
```

Comment on lines +226 to +238:

```python
schema = reduce(
    _merge_type,
    (
        _infer_schema(
            row,
            names,
            infer_dict_as_struct=infer_dict_as_struct,
            infer_array_from_first_element=infer_array_from_first_element,
            prefer_timestamp_ntz=prefer_timestamp_ntz,
        )
        for row in data
    ),
)
```
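The `reduce(_merge_type, ...)` pattern above can be illustrated standalone. In this simplified sketch a "schema" is just a dict mapping column name to Python type, with `None` standing in for `NullType` (names and behavior simplified; these are not the PR's helpers):

```python
from functools import reduce

# Simplified stand-in for _infer_schema: a "schema" is a dict of
# column name -> Python type, with None playing the role of NullType.
def infer_schema(row: dict) -> dict:
    return {key: (type(value) if value is not None else None) for key, value in row.items()}

def merge_type(a, b):
    if a is None:            # NullType on one side: take the other side's type
        return b
    if b is None or a is b:  # same type (or NullType on the right): keep it
        return a
    raise TypeError(f"cannot merge {a} and {b}")

def merge_schema(left: dict, right: dict) -> dict:
    merged = dict(left)
    for key, tp in right.items():
        merged[key] = merge_type(merged.get(key), tp)
    return merged

rows = [{"id": 1, "name": None}, {"name": "a", "id": 2}]
schema = reduce(merge_schema, (infer_schema(row) for row in rows))
print(schema)  # {'id': <class 'int'>, 'name': <class 'str'>}
```

The design point mirrored here is that no single row needs a complete picture: a None in one row leaves a NullType hole that a later row's concrete type fills in during the merge.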
Comment on lines +1317 to +1325 (excerpt):

```python
if key is not None and value is not None:
    struct.add(
        key,
        _infer_type(
            value,
            infer_dict_as_struct,
            infer_array_from_first_element,
            prefer_timestamp_ntz,
        ),
```


Development

Successfully merging this pull request may close these issues.

DuckDB Spark API is incompatible with the PySpark API's spark.createDataFrame(list of dict)
