
SNOW-3192256: Support XML infer schema #4123

Open
sfc-gh-mayliu wants to merge 6 commits into main from mayliu-SNOW-3192256-xml-inferSchema

Conversation


@sfc-gh-mayliu sfc-gh-mayliu commented Mar 14, 2026

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-3192256

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  3. Please describe how your code solves the related issue.

    Please write a short description of how your code change solves the related issue.

Published the PR to trigger the merge gate for test results, but this PR is still a WIP. Will add a Google Doc link to the design doc.

… fix attribute and value structType across children elements
return True


def _validate_row_for_type_mismatch(
Collaborator:

Is this function being called per row after each call of element_to_dict_or_str? I am a little concerned that this is going to impact performance. Do you think it is possible to put this logic into element_to_dict_or_str so that we don't have to traverse each element of the row again?

Collaborator Author:

Discussed offline: _validate_row_for_type_mismatch is O(n) where n is the number of schema fields/columns, whereas element_to_dict_or_str traverses every element recursively. Thus, _validate_row_for_type_mismatch's performance impact is minimal compared to it.

From a contextual point of view, _validate_row_for_type_mismatch needs to validate against the resulting transformed dict, so it'd be better to keep these functions semantically and sequentially separate.
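For readers following the complexity argument above, here is a toy sketch of a per-row, per-schema-field validation pass; the real function's signature and checks differ (isinstance stands in for the actual cast checks):

```python
from typing import Any, Dict


def validate_row_for_type_mismatch(
    row: Dict[str, Any], schema_fields: Dict[str, type]
) -> bool:
    """O(n) in the number of schema fields: inspects the already-transformed
    dict produced by element_to_dict_or_str, without re-traversing the XML tree."""
    for name, expected_type in schema_fields.items():
        value = row.get(name)
        if value is not None and not isinstance(value, expected_type):
            return False
    return True
```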

@sfc-gh-mayliu sfc-gh-mayliu added the NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md label Mar 17, 2026
@sfc-gh-mayliu sfc-gh-mayliu requested a review from a team March 17, 2026 00:16
if corrupt_col_name in df_columns:
corrupt_ref = df[single_quote(corrupt_col_name)]
cols.append(
corrupt_ref.cast(StringType()).alias(
Contributor:

What's the type of the _corrupt_record column? I'm wondering 1) if we need the cast, and 2) if the cast is always needed, can the string conversion logic be moved into the UDTF?

Collaborator Author:

_corrupt_record will always be StringType, and the xml_reader.py UDTF always returns VARIANT for all columns, including _corrupt_record. If the user wants XML to return typed columns (guarded by if effective_schema is not None), the cast has to happen in dataframe_reader.py; it can't be moved into the UDTF.
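A toy sketch of the projection decision described here; the SQL strings and function name are illustrative (the real code builds Snowpark Column expressions, not strings):

```python
def build_projection(df_columns, effective_schema, corrupt_col_name="_corrupt_record"):
    """All UDTF output columns are VARIANT. Only when a schema is supplied or
    inferred (effective_schema is not None) does the reader cast each column,
    including _corrupt_record, which is cast to STRING."""
    if effective_schema is None:
        return list(df_columns)  # default behavior: VARIANT columns pass through
    cols = [f"CAST({name} AS {type_name})" for name, type_name in effective_schema.items()]
    if corrupt_col_name in df_columns:
        cols.append(f"CAST({corrupt_col_name} AS STRING)")
    return cols
```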

def _resolve_xml_file_for_udtf(self, local_file_path: str) -> str:
"""Return the UDTF file path, uploading to a temp stage in stored procedures."""
if is_in_stored_procedure(): # pragma: no cover
temp_stage = random_name_for_temp_object(TempObjectType.STAGE)
Contributor:

Why do we need an upload here? Is it because the server doesn't have the latest implementation?

Collaborator Author:

The upload is for sproc execution, which needs the UDTF file to be on a stage for the later register_from_file call to work. Local execution does not need a file upload.

sql_create_temp_stage = (
f"create temp stage if not exists {temp_stage} {XML_READER_SQL_COMMENT}"
)
self._session.sql(sql_create_temp_stage, _emit_ast=False).collect(
Contributor:

The Snowpark session has a get_session_stage method which creates a temp stage for the session; are we able to reuse that?

Collaborator Author:

Yes, we can; I'll reuse it in the next commit.

except IndexError:
raise ValueError(f"{path} does not exist")

num_workers = min(16, file_size // DEFAULT_CHUNK_SIZE + 1)
Contributor:

Is the num_workers config inherited from the ingestion logic?

Collaborator Author:

No, it's XML-specific. A previous perf benchmark showed UDTF parallelism plateaus at 16 workers, so this is inspired by the original Snowpark UDTF design here.
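The worker-count formula from the diff, pulled out as a sketch; the real DEFAULT_CHUNK_SIZE is defined elsewhere in the reader, and the 16 MiB value below is an assumption for illustration:

```python
DEFAULT_CHUNK_SIZE = 16 * 1024 * 1024  # assumed value, not the PR's actual constant


def compute_num_workers(file_size: int, chunk_size: int = DEFAULT_CHUNK_SIZE) -> int:
    # One worker per chunk of the file, capped at 16, where the benchmark
    # showed UDTF parallelism plateauing.
    return min(16, file_size // chunk_size + 1)
```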


partial_schema = type_string_to_type_object(schema_str)
except Exception:
continue
if not isinstance(partial_schema, StructType):
Contributor:

In what case is a partial_schema returned?

Collaborator Author:

results will contain up to 16 rows, each mapping to the schema inferred from a uniformly split byte range in the file. partial_schema captures the schema from each of the 16 workers and is individually merged/widened into the previously merged_schema.

If you were wondering when the exception happens: realistically it should never fire, since xml_schema_inference.py handles types in a tight roundtrip (_case_preserving_simple_string and type_string_to_type_object). This is a defensive guard to ensure one bad worker result doesn't crash the entire schema inference flow.
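A toy sketch of that merge loop with the defensive guard; parse_schema and the widening lattice below are stand-ins for type_string_to_type_object and the real merge/widen logic:

```python
def parse_schema(schema_str):
    # Toy stand-in for type_string_to_type_object: "a:int,b:str" -> {"a": "int", ...}
    return dict(pair.split(":") for pair in schema_str.split(","))


_WIDEN_ORDER = ["int", "float", "str"]  # toy widening lattice


def widen(existing, new):
    if existing is None:
        return new
    return max(existing, new, key=_WIDEN_ORDER.index)


def merge_schemas(results):
    merged = {}
    for schema_str in results:
        try:
            partial = parse_schema(schema_str)
        except Exception:
            continue  # defensive: one bad worker result must not crash inference
        for field, ftype in partial.items():
            merged[field] = widen(merged.get(field), ftype)
    return merged
```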

def _can_cast_to_type(value: str, target_type: DataType) -> bool:
if isinstance(target_type, StringType):
return True
if isinstance(target_type, LongType):
Contributor:

Does this cover all types? I don't see the decimal type here, so does that mean we don't need to handle it?

Collaborator Author:

Right, Spark by default doesn't infer decimal for XML. Spark's DecimalType inference is gated behind options.prefersDecimal, which defaults to false.
case v if options.prefersDecimal && decimalTry.isDefined => decimalTry.get

column_name_of_corrupt_record,
)

if row is not None:
Contributor:

I want to confirm:

  1. Is the behavior in Spark that a mismatched row is skipped instead of being presented as a null row?
  2. Does it change the current Snowpark behavior?

Collaborator Author:

  1. Only when mode=DROPMALFORMED is the row skipped. When mode=PERMISSIVE, the bad fields are nullified and added to _corrupt_record. When mode=FAILFAST, it raises an error. _validate_row_for_type_mismatch handles these behaviors based on the mode.
  2. The original behavior without inferSchema or a custom schema is preserved, since no VARIANT casting was needed. This does not change custom-schema behavior, but it prevents a custom schema from raising SQL casting errors where it shouldn't.
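The three-mode behavior described in point 1 can be sketched like this; the function name and signature are hypothetical (the real logic lives in and around _validate_row_for_type_mismatch):

```python
def handle_mismatched_row(row, bad_fields, mode, corrupt_col="_corrupt_record"):
    if not bad_fields:
        return row
    if mode == "DROPMALFORMED":
        return None  # the row is skipped entirely
    if mode == "FAILFAST":
        raise ValueError(f"Type mismatch in fields: {bad_fields}")
    # PERMISSIVE: keep the row, record the original, then nullify bad fields
    fixed = dict(row)
    fixed[corrupt_col] = str(row)
    for field in bad_fields:
        fixed[field] = None
    return fixed
```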

ignore_surrounding_whitespace,
row_validation_xsd_path=row_validation_xsd_path,
result_template=result_template,
schema_type=schema_type,
Contributor:

Does this change the existing behavior in Snowpark? Previously we didn't do _validate_row_for_type_mismatch; is it always doing _validate_row_for_type_mismatch now?

Collaborator Author:

_validate_row_for_type_mismatch is guarded by if schema_type is not None, which only applies to the custom-schema and infer-schema cases. It enhances custom-schema behavior and leaves the original default VARIANT-returning XML parser behavior unchanged.

This fixes an issue that previously threw an exception for custom schemas and would now also affect inferred schemas: if a schema that mismatches the input data type was provided, the original behavior raised a SQL cast-failure exception.

return True
except (ValueError, TypeError):
return False
if isinstance(target_type, TimestampType):
Collaborator:

Currently, this function returns True when the target type is not in the check (not StringType, LongType, DoubleType, ...). For example, if the target type is TimeType, this function always returns True; is this expected?
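One way to make that fallback explicit, per this comment: a hypothetical TimeType branch instead of the blanket return True. The accepted formats below are assumptions, not what the PR implements:

```python
from datetime import datetime


def can_cast_to_time(value: str) -> bool:
    # Hypothetical explicit check for TimeType values, so unknown types
    # no longer silently pass validation.
    for fmt in ("%H:%M:%S", "%H:%M:%S.%f", "%H:%M"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False
```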

