[PySpark] fix: support createDataFrame with list of dicts #388
Open
mariotaddeucci wants to merge 2 commits into duckdb:main from
Conversation
Port schema inference from duckdb/duckdb#18051 to fix duckdb#183. When calling spark.createDataFrame([{"col": value}, ...]), the Spark API now infers the schema from dict keys, matching PySpark behavior.
Changes:
- Add _type_mappings, _array_type_mappings, _has_nulltype, _merge_type, _infer_type, and _infer_schema functions to types.py
- Update session.py to handle dict rows in _combine_data_and_schema and add a schema inference branch in createDataFrame for list[dict]
- Add _inferSchemaFromList method to SparkSession
- Fix test_struct_column to use inferred field names instead of col0/col1
- Add test_dataframe_from_list_dicts test case
Contributor
There was a problem hiding this comment.
Pull request overview
Adds Spark-like schema inference so SparkSession.createDataFrame() can accept a list of dict rows and infer column names/types consistently with PySpark behavior.
Changes:
- Implemented schema/type inference helpers in sql/types.py (infer + merge + NullType detection).
- Updated SparkSession.createDataFrame() and row conversion to support dict rows and infer a schema when none is provided.
- Extended/adjusted fast Spark test coverage for struct columns and list-of-dicts DataFrame creation.
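The dict-row handling described above can be sketched in plain Python. This is an illustrative simulation, not the PR's session.py code: the helper name `combine_dict_rows` is hypothetical, but it mirrors the described behavior of collecting column names across rows and aligning values per row, with missing keys becoming None.

```python
# Hypothetical sketch of aligning list-of-dict rows to a common column order.
# Not the actual session.py implementation; names are illustrative.
def combine_dict_rows(rows):
    # Collect column names in first-seen key order across all rows.
    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)
    # Align each row's values to the column order; missing keys become None
    # (a None infers as NullType until merged with a concrete type).
    aligned = [tuple(row.get(name) for name in names) for row in rows]
    return names, aligned

names, data = combine_dict_rows([{"a": 1, "b": "x"}, {"b": "y", "c": 2.5}])
# names -> ["a", "b", "c"]; data -> [(1, "x", None), (None, "y", 2.5)]
```

This is why the tests below cover dicts with different key orders and missing/extra keys: every row is padded out to the union of all keys.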
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| duckdb/experimental/spark/sql/types.py | Adds type mappings plus infer/merge helpers used to infer schemas across Python objects and rows. |
| duckdb/experimental/spark/sql/session.py | Infers schema for list inputs without explicit schema and aligns dict-row values to schema order. |
| tests/fast/spark/test_spark_column.py | Simplifies struct column test now that Row field names are inferred correctly. |
| tests/fast/spark/test_spark_dataframe.py | Adds coverage for createDataFrame from list of dicts (key order differences, missing/extra keys). |
Comment on lines +158 to +160:

```python
elif isinstance(data, list) and data:
    schema = self._inferSchemaFromList(data)
    types, names = schema.extract_types_and_names()
```
Comment on lines +1266 to +1284:

```python
    return ArrayType(
        _merge_type(
            a.elementType,
            cast(ArrayType, b).elementType,
            name="element in array %s" % name,
        ),
        True,
    )
elif isinstance(a, MapType):
    return MapType(
        _merge_type(
            a.keyType, cast(MapType, b).keyType, name="key of map %s" % name
        ),
        _merge_type(
            a.valueType, cast(MapType, b).valueType, name="value of map %s" % name
        ),
        True,
    )
```
Comment on lines +199 to +201:

```python
def _inferSchemaFromList(
    self, data: Iterable[Any], names: Optional[List[str]] = None
) -> StructType:
```
Comment on lines +215 to +219:

```python
if not data:
    raise PySparkValueError(
        error_class="CANNOT_INFER_EMPTY_SCHEMA",
        message_parameters={},
    )
```
Comment on lines +226 to +238:

```python
schema = reduce(
    _merge_type,
    (
        _infer_schema(
            row,
            infer_dict_as_struct=infer_dict_as_struct,
            infer_array_from_first_element=infer_array_from_first_element,
            prefer_timestamp_ntz=prefer_timestamp_ntz,
        )
        for row in data
    ),
)
```
Comment on lines +1317 to +1325:

```python
if key is not None and value is not None:
    struct.add(
        key,
        _infer_type(
            value,
            infer_dict_as_struct,
            infer_array_from_first_element,
            prefer_timestamp_ntz,
        ),
```
Fixes #183.
Problem
When calling spark.createDataFrame([{"col": value}, ...]), the Spark API failed to infer the schema from dict keys, unlike PySpark, which handles this natively.
Solution
Port schema inference logic from duckdb/duckdb#18051.
Changes
duckdb/experimental/spark/sql/types.py
- _type_mappings and _array_type_mappings — dicts mapping Python types to Spark SQL DataTypes
- _has_nulltype() — checks for NullType anywhere in a schema tree
- _merge_type() — merges two DataTypes (used when inferring schema across multiple rows)
- _infer_type() — infers a DataType from a Python object
- _infer_schema() — infers a StructType schema from a dict/namedtuple/Row/object

duckdb/experimental/spark/sql/session.py
- _combine_data_and_schema() updated to handle dict rows (extract values in schema field order)
- Schema inference added to createDataFrame() for list-of-dict input without an explicit schema
- _inferSchemaFromList() method added to SparkSession

tests/fast/spark/test_spark_column.py
- Simplified USE_ACTUAL_SPARK branching in test_struct_column — Row field names are now correctly inferred

tests/fast/spark/test_spark_dataframe.py
- Added test_dataframe_from_list_dicts covering dicts with different key orders and missing/extra keys