[PySpark] fix: support createDataFrame with list of dicts #388
Open
mariotaddeucci wants to merge 2 commits into duckdb:main from
Conversation
Port schema inference from duckdb/duckdb#18051 to fix duckdb#183. When calling spark.createDataFrame([{"col": value}, ...]), the Spark API now infers the schema from dict keys, matching PySpark behavior.
Changes:
- Add _type_mappings, _array_type_mappings, _has_nulltype, _merge_type, _infer_type, and _infer_schema functions to types.py
- Update session.py to handle dict rows in _combine_data_and_schema and add a schema inference branch in createDataFrame for list[dict]
- Add _inferSchemaFromList method to SparkSession
- Fix test_struct_column to use inferred field names instead of col0/col1
- Add test_dataframe_from_list_dicts test case
Contributor
There was a problem hiding this comment.
Pull request overview
Adds Spark-like schema inference so SparkSession.createDataFrame() can accept a list of dict rows and infer column names/types consistently with PySpark behavior.
Changes:
- Implemented schema/type inference helpers in sql/types.py (infer + merge + NullType detection).
- Updated SparkSession.createDataFrame() and row conversion to support dict rows and infer a schema when none is provided.
- Extended/adjusted fast Spark test coverage for struct columns and list-of-dicts DataFrame creation.
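The dict-row handling described above can be sketched in plain Python. This is an illustrative simulation, not the PR's session.py code: the helper name `combine_dict_rows` is hypothetical, but it mirrors the described behavior of collecting column names across rows and aligning values per row, with missing keys becoming None.

```python
# Hypothetical sketch of aligning list-of-dict rows to a common column order.
# Not the actual session.py implementation; names are illustrative.
def combine_dict_rows(rows):
    # Collect column names in first-seen key order across all rows.
    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)
    # Align each row's values to the column order; missing keys become None
    # (a None infers as NullType until merged with a concrete type).
    aligned = [tuple(row.get(name) for name in names) for row in rows]
    return names, aligned

names, data = combine_dict_rows([{"a": 1, "b": "x"}, {"b": "y", "c": 2.5}])
# names -> ["a", "b", "c"]; data -> [(1, "x", None), (None, "y", 2.5)]
```

This is why the tests below cover dicts with different key orders and missing/extra keys: every row is padded out to the union of all keys.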
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| duckdb/experimental/spark/sql/types.py | Adds type mappings plus infer/merge helpers used to infer schemas across Python objects and rows. |
| duckdb/experimental/spark/sql/session.py | Infers schema for list inputs without explicit schema and aligns dict-row values to schema order. |
| tests/fast/spark/test_spark_column.py | Simplifies struct column test now that Row field names are inferred correctly. |
| tests/fast/spark/test_spark_dataframe.py | Adds coverage for createDataFrame from list of dicts (key order differences, missing/extra keys). |
Comment on lines +158 to +160:

```python
elif isinstance(data, list) and data:
    schema = self._inferSchemaFromList(data)
    types, names = schema.extract_types_and_names()
```
Comment on lines +1266 to +1284:

```python
    return ArrayType(
        _merge_type(
            a.elementType,
            cast(ArrayType, b).elementType,
            name="element in array %s" % name,
        ),
        True,
    )
elif isinstance(a, MapType):
    return MapType(
        _merge_type(
            a.keyType, cast(MapType, b).keyType, name="key of map %s" % name
        ),
        _merge_type(
            a.valueType, cast(MapType, b).valueType, name="value of map %s" % name
        ),
        True,
    )
```
Comment on lines +199 to +201:

```python
def _inferSchemaFromList(
    self, data: Iterable[Any], names: Optional[List[str]] = None
) -> StructType:
```
Comment on lines +215 to +219:

```python
if not data:
    raise PySparkValueError(
        error_class="CANNOT_INFER_EMPTY_SCHEMA",
        message_parameters={},
    )
```
Comment on lines +226 to +238:

```python
schema = reduce(
    _merge_type,
    (
        _infer_schema(
            row,
            infer_dict_as_struct=infer_dict_as_struct,
            infer_array_from_first_element=infer_array_from_first_element,
            prefer_timestamp_ntz=prefer_timestamp_ntz,
        )
        for row in data
    ),
)
```
Comment on lines +1317 to +1325:

```python
if key is not None and value is not None:
    struct.add(
        key,
        _infer_type(
            value,
            infer_dict_as_struct,
            infer_array_from_first_element,
            prefer_timestamp_ntz,
        ),
```
Fixes #183.
Problem
When calling spark.createDataFrame([{"col": value}, ...]), the Spark API failed to infer the schema from dict keys, unlike PySpark, which handles this natively.
Solution
Port schema inference logic from duckdb/duckdb#18051.
Changes
duckdb/experimental/spark/sql/types.py
- _type_mappings and _array_type_mappings — dicts mapping Python types to Spark SQL DataTypes
- _has_nulltype() — checks for NullType anywhere in a schema tree
- _merge_type() — merges two DataTypes (used when inferring schema across multiple rows)
- _infer_type() — infers a DataType from a Python object
- _infer_schema() — infers a StructType schema from a dict/namedtuple/Row/object

duckdb/experimental/spark/sql/session.py
- _combine_data_and_schema() updated to handle dict rows (extract values in schema field order)
- Schema inference added to createDataFrame() for list-of-dict input without an explicit schema
- _inferSchemaFromList() method added to SparkSession

tests/fast/spark/test_spark_column.py
- Simplified USE_ACTUAL_SPARK branching in test_struct_column — Row field names are now correctly inferred

tests/fast/spark/test_spark_dataframe.py
- Added test_dataframe_from_list_dicts covering dicts with different key orders and missing/extra keys