Description
Please answer these questions before submitting your issue. Thanks!
1. What version of Python are you using?
Python 3.10.19 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 16:41:31) [MSC v.1929 64 bit (AMD64)]
2. What are the Snowpark Python and pandas versions in the environment?
pandas==2.2.3
snowflake-snowpark-python==1.34.0
3. What did you do?
Applied substring() to a DataFrame that had previously been filtered with filter().
When rows are removed by a filter, the underlying pandas index becomes non-contiguous (e.g. [1, 2] instead of [0, 1, 2]).
The built-in local testing mock for substring returns a ColumnEmulator with a fresh 0-based index instead of preserving the original index from the input column.
When with_column (for example) merges the result back into the DataFrame, pandas performs an outer-join on the mismatched indices, producing extra rows filled with NaN.
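The alignment behavior described above can be seen in plain pandas, without Snowpark. The following is a minimal sketch: `pd.concat(..., axis=1)` is used only as a stand-in for whatever merge the mock performs internally, and the variable names are illustrative.

```python
import pandas as pd

# Stand-in for the filtered DataFrame: row 0 was removed, so the index is [1, 2].
filtered = pd.DataFrame({"string": ["world", "snowflake"]}, index=[1, 2])

# Stand-in for the mock's result: correct values, but a fresh 0-based index [0, 1].
fresh = pd.Series(["orl", "now"], name="substring")

# Column-wise concat aligns on index labels (outer join by default),
# so the result covers labels 0, 1 and 2 instead of just 1 and 2.
merged = pd.concat([filtered, fresh], axis=1)

print(len(merged))                   # three rows instead of two
print(merged.loc[1, "substring"])    # "world" is paired with the wrong value
```

Label 1 carries "world" on one side and "now" on the other, which matches the misaligned pairing reported in this issue.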
You can reproduce this error by executing the following test:
from snowflake.snowpark import Session, Row, DataFrame
from snowflake.snowpark.functions import col, lit, substring


def test_substring(session: Session):
    df: DataFrame = session.create_dataframe(
        [
            ["hello"],
            ["world"],
            ["snowflake"],
        ],
        schema=["string"],
    )
    # After filter, the internal pandas index becomes non-contiguous ([1, 2]).
    # mock_substring builds its result ColumnEmulator without index=base_expr.index,
    # so pandas introduces a spurious NaN row when with_column joins the result back.
    result = (
        df.filter(col("string") != lit("hello"))
        .with_column(
            "substring",
            substring(col("string"), lit(2), lit(3)),
        )
        .collect()
    )
    # Expected: [Row('world', 'orl'), Row('snowflake', 'now')]
    # Actual: 3 rows, with an extra NaN/None row due to the index mismatch in mock_substring
    assert len(result) == 2
    assert result[0]["SUBSTRING"] == "orl"
    assert result[1]["SUBSTRING"] == "now"
4. What did you expect to see?
The query should return exactly 2 rows with the substring correctly aligned to each row:
[
Row(STRING='world', SUBSTRING='orl'),
Row(STRING='snowflake', SUBSTRING='now')
]
Instead, the local testing framework returns 3 rows with misaligned data:
[
Row(STRING='world', SUBSTRING='now'),
Row(STRING='snowflake', SUBSTRING=nan),
Row(STRING=nan, SUBSTRING='orl'),
]
When the same code runs against a real Snowflake connection, SUBSTRING works fine because there is no pandas index involved. The bug is exclusively in the local testing mock.
This happens because the mock_substring function in snowflake.snowpark.mock._functions builds its result ColumnEmulator without preserving index=base_expr.index, so after any .filter() that makes the index non-contiguous, with_column produces None/NaN rows.
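The idea behind the fix can be sketched on a plain pandas Series. This is not the actual mock_substring implementation: `substring_keep_index` is a hypothetical helper, it handles only positive 1-based start positions, and it uses pd.Series where the real code uses ColumnEmulator. The point is only that the result is built with the input's index.

```python
import pandas as pd


def substring_keep_index(base: pd.Series, start: int, length: int) -> pd.Series:
    """Hypothetical helper: 1-based substring that reuses the input's index."""
    begin = start - 1  # convert the 1-based SQL position to a 0-based Python offset
    values = [
        s[begin : begin + length] if isinstance(s, str) else None
        for s in base
    ]
    # Passing index=base.index keeps the result aligned with the filtered rows.
    return pd.Series(values, index=base.index, dtype=object)


# Column as it looks after a filter removed row 0.
filtered_col = pd.Series(["world", "snowflake"], index=[1, 2])
out = substring_keep_index(filtered_col, 2, 3)
print(list(out.index), list(out))
```

Because the output shares the input's index, merging it back into the filtered frame adds no extra rows.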
I'm willing to implement this fix and submit a pull request. The implementation would include changes to:
- mock_substring() function in src/snowflake/snowpark/mock/_functions.py
- test_substring() test case in tests/mock/test_functions.py