[PySpark] fix: support createDataFrame with list of dicts in Spark API#387
Closed
mariotaddeucci wants to merge 6 commits into duckdb:main from
Conversation
Port schema inference from duckdb/duckdb#18051 to fix duckdb#183. When calling `spark.createDataFrame([{"col": value}, ...])`, the Spark API now infers the schema from dict keys, matching PySpark behavior.

Changes:
- Add `_type_mappings`, `_array_type_mappings`, `_has_nulltype`, `_merge_type`, `_infer_type`, and `_infer_schema` functions to types.py
- Update session.py to handle dict rows in `_combine_data_and_schema` and add a schema inference branch in `createDataFrame` for `list[dict]`
- Add an `_inferSchemaFromList` method to `SparkSession`
- Fix `test_struct_column` to use inferred field names instead of col0/col1
- Add a `test_dataframe_from_list_dicts` test case
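The core of the inference step is unioning dict keys across all rows, so rows with missing keys still fit the final schema. A minimal pure-Python sketch of that key-union step (the `infer_fields` helper name is illustrative, not the PR's actual function):

```python
def infer_fields(rows):
    # Union keys across all dict rows, preserving first-seen order,
    # so a key appearing only in a later row still becomes a field.
    fields = []
    for row in rows:
        for key in row:
            if key not in fields:
                fields.append(key)
    return fields

print(infer_fields([{"a": 1, "b": 2}, {"b": 3, "c": 4}]))  # ['a', 'b', 'c']
```

Rows missing a key then simply yield `None` (SQL NULL) for that field.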
Contributor
Author
Closing in favor of a cleaner branch with only the relevant commit. New PR coming.
Contributor
Pull request overview
This PR improves DuckDB’s experimental Spark API compatibility by adding schema inference for createDataFrame() when passed list[dict], and it also introduces initial WindowSpec/window-function support used by new tests.
Changes:
- Infer a `StructType` schema from `list[dict]` input (unioning keys across rows) and align dict row values by schema field order.
- Add schema inference utilities in `sql/types.py` (`_infer_schema`, `_infer_type`, `_merge_type`, `_has_nulltype`).
- Introduce basic window specification & window functions (`WindowSpec`, `Window`, `Column.over`, and functions like `row_number`, `lag`, etc.) plus associated tests and namespace plumbing.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| tests/spark_namespace/sql/window.py | Adds namespace shim to import Window from PySpark vs DuckDB Spark API. |
| tests/fast/spark/test_spark_functions_window.py | Adds coverage for window specs/functions behavior (orderBy/partitionBy/rowsBetween/rangeBetween/lag/lead/etc.). |
| tests/fast/spark/test_spark_dataframe.py | Adds test for createDataFrame(list_of_dicts) schema inference and missing keys. |
| tests/fast/spark/test_spark_column.py | Simplifies struct column test now that field names are inferred correctly. |
| external/duckdb | Bumps DuckDB submodule revision (dependency update). |
| duckdb/experimental/spark/sql/window.py | Implements WindowSpec and Window API. |
| duckdb/experimental/spark/sql/types.py | Adds type mapping + schema/type inference & merge helpers. |
| duckdb/experimental/spark/sql/session.py | Wires schema inference for list input and dict-row alignment in _combine_data_and_schema. |
| duckdb/experimental/spark/sql/functions.py | Adds window functions wrappers (row_number, rank, lag/lead, etc.). |
| duckdb/experimental/spark/sql/column.py | Adds Column.over(WindowSpec) to render ... OVER (...) SQL. |
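As a rough illustration of what `Column.over(WindowSpec)` has to produce, a window spec ultimately boils down to assembling an `OVER (...)` clause around an expression. This sketch is hypothetical (names and signature are not the PR's actual code), but shows the shape of the rendered SQL:

```python
def over_sql(expr, partition_by=(), order_by=()):
    # Illustrative only: render "<expr> OVER (PARTITION BY ... ORDER BY ...)"
    # the way a Column.over(WindowSpec) implementation might.
    parts = []
    if partition_by:
        parts.append("PARTITION BY " + ", ".join(partition_by))
    if order_by:
        parts.append("ORDER BY " + ", ".join(order_by))
    return f"{expr} OVER ({' '.join(parts)})"

print(over_sql("row_number()", partition_by=["dept"], order_by=["salary DESC"]))
# row_number() OVER (PARTITION BY dept ORDER BY salary DESC)
```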
Comment on lines +1 to +6
```python
from collections.abc import Sequence

from ..errors import PySparkTypeError
from ..exception import ContributionsAcceptedError
from ._typing import ColumnOrName
from .column import Column
```
```python
        new_window._range_between = self._range_between
        return new_window

    def partitionBy(self, *cols: ColumnOrName | Sequence[ColumnOrName]) -> "WindowSpec":
```
Comment on lines +44 to +47
```python
        all_cols: list[ColumnOrName] | list[list[ColumnOrName]] = list(cols)  # type: ignore[assignment]

        if isinstance(all_cols[0], list):
            all_cols = all_cols[0]
```
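The quoted snippet normalizes the two call styles `partitionBy("a", "b")` and `partitionBy(["a", "b"])`, but indexes `all_cols[0]` without guarding against an empty call, which would raise `IndexError`. A standalone sketch with that guard added (the helper name is hypothetical):

```python
def normalize_cols(*cols):
    # Accept either varargs of columns or a single list of columns.
    all_cols = list(cols)
    # Guard the empty case before peeking at all_cols[0];
    # the quoted snippet above omits this check.
    if all_cols and isinstance(all_cols[0], list):
        all_cols = all_cols[0]
    return all_cols

print(normalize_cols("a", "b"))    # ['a', 'b']
print(normalize_cols(["a", "b"]))  # ['a', 'b']
```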
Comment on lines +164 to +165
```python
        return f"{start} PRECEDING AND {end} FOLLOWING"
```

```python
    def test_moving_average_last_3_points(self, spark):
        data = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
        df = spark.createDataFrame(data=data, schema=["idx", "value"])
        w = Window.orderBy("idx").rowsBetween(2, Window.currentRow)
```
```python
        # rows within a value distance of 2 up to the current row.
        data = [(1, 10), (2, 20), (3, 30), (4, 40), (6, 60)]
        df = spark.createDataFrame(data=data, schema=["idx", "value"])
        w = Window.orderBy("idx").rangeBetween(2, Window.currentRow)
```
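In PySpark, frame boundaries passed to `rowsBetween(start, end)` are offsets relative to the current row, and negative values mean *preceding* rows, so "the last 3 points" is `rowsBetween(-2, Window.currentRow)`; a start of `2` would instead mean 2 rows *following*. The intended frame, computed in plain Python for comparison:

```python
def moving_avg(values, start=-2, end=0):
    # ROWS BETWEEN 2 PRECEDING AND CURRENT ROW, as plain Python.
    # Negative offsets look backwards, matching rowsBetween(-2, 0).
    out = []
    for i in range(len(values)):
        lo = max(0, i + start)
        hi = min(len(values) - 1, i + end)
        window = values[lo:hi + 1]
        out.append(sum(window) / len(window))
    return out

print(moving_avg([10, 20, 30, 40, 50]))  # [10.0, 15.0, 20.0, 30.0, 40.0]
```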
Comment on lines 154 to 157
```python
    if isinstance(schema, StructType):
        types, names = schema.extract_types_and_names()
    else:
        names = schema
```
Comment on lines 40 to 48
```diff
     new_data = []
     for row in data:
-        new_row = [Value(x, dtype.duckdb_type) for x, dtype in zip(row, [y.dataType for y in schema], strict=False)]
+        if isinstance(row, dict):
+            row_values = list(map(row.get, schema.fieldNames()))
+        else:
+            row_values = list(row)
+        new_row = [Value(x, dtype.duckdb_type) for x, dtype in zip(row_values, [y.dataType for y in schema], strict=False)]
         new_data.append(new_row)
     return new_data
```
```python
import duckdb
from duckdb.sqltypes import DuckDBPyType

from ..errors.exceptions.base import PySparkTypeError
```
```python
        Parameters
        ----------
        window : :class:`WindowSpec`
```
Summary

When calling `spark.createDataFrame([{"col": value}, ...])`, the Spark API now infers the schema from dict keys, matching PySpark behavior.

Changes

- `duckdb/experimental/spark/sql/types.py`
  - Add `_type_mappings` and `_array_type_mappings` dicts mapping Python types to Spark SQL DataTypes
  - Add `_has_nulltype()` to check for NullType in a schema tree
  - Add `_merge_type()` to merge two DataTypes (used when inferring schema across multiple rows)
  - Add `_infer_type()` to infer a DataType from a Python object
  - Add `_infer_schema()` to infer a StructType schema from a dict/namedtuple/Row/object
- `duckdb/experimental/spark/sql/session.py`
  - Update `_combine_data_and_schema()` to handle dict rows by extracting values in schema field order
  - Add a schema inference branch to `createDataFrame()` for `list[dict]` input without an explicit schema
  - Add an `_inferSchemaFromList()` method to `SparkSession`
- `tests/fast/spark/test_spark_column.py`
  - `test_struct_column`: removed `USE_ACTUAL_SPARK` branching since Row field names are now correctly inferred
- `tests/fast/spark/test_spark_dataframe.py`
  - Add `test_dataframe_from_list_dicts` covering dicts with different key orders and missing keys
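When inferring across multiple rows, `_merge_type()` must reconcile per-row inferred types: a NullType (from a `None` value or missing key) should yield to any concrete type, while conflicting concrete types are an error. A minimal sketch of that core rule, with stand-in type classes (not the PR's actual implementation):

```python
# Stand-in type markers; the real code uses the Spark API's DataType classes.
class NullType:
    pass

class LongType:
    pass

class StringType:
    pass

def merge_type(a, b):
    # Sketch of _merge_type's core rule: NULL yields to any concrete
    # type; two different concrete types cannot be merged.
    if isinstance(a, NullType):
        return b
    if isinstance(b, NullType):
        return a
    if type(a) is not type(b):
        raise TypeError(f"Cannot merge {type(a).__name__} and {type(b).__name__}")
    return a

assert isinstance(merge_type(NullType(), LongType()), LongType)
```

This is why a row like `{"a": None}` followed by `{"a": 1}` can still infer `a` as an integer column.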