
SNOW-3172585: [Local Testing] Bug in the mock_substring() function with local testing in Snowpark Python #4091

@andresccu


Please answer these questions before submitting your issue. Thanks!

1. What version of Python are you using?

Python 3.10.19 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 16:41:31) [MSC v.1929 64 bit (AMD64)]

2. What are the Snowpark Python and pandas versions in the environment?

pandas==2.2.3
snowflake-snowpark-python==1.34.0

3. What did you do?

Applied substring() to a DataFrame that had previously been filtered using filter().
When rows are removed by a filter, the underlying pandas index becomes non-contiguous (e.g. [1, 2] instead of [0, 1, 2]).
The built-in local testing mock for substring returns a ColumnEmulator with a fresh 0-based index instead of preserving the original index from the input column.
When with_column (for example) merges the result back into the DataFrame, pandas performs an outer-join on the mismatched indices, producing extra rows filled with NaN.
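The index mismatch described above can be reproduced with pandas alone. The following is a minimal sketch, not the code path the local testing framework actually runs: it assumes pd.concat with its default outer join as a stand-in for the merge that with_column performs, and the variable names are illustrative.

```python
import pandas as pd

# Stand-in for the filtered table: row 0 ("hello") was removed,
# so the remaining index is [1, 2] rather than [0, 1].
filtered = pd.DataFrame({"string": ["world", "snowflake"]}, index=[1, 2])

# Result column built with a fresh 0-based index (the reported behaviour).
fresh = pd.Series(["orl", "now"], name="substring")

# Result column that reuses the index of the input column.
aligned = pd.Series(["orl", "now"], index=filtered.index, name="substring")

# Assumption: an outer join on the index approximates the merge step.
misaligned_df = pd.concat([filtered, fresh], axis=1)
aligned_df = pd.concat([filtered, aligned], axis=1)

print(len(misaligned_df))  # union of indices [1, 2] and [0, 1]
print(len(aligned_df))     # indices match, no extra row
```

In this sketch the misaligned frame pairs 'world' with 'now' and leaves one NaN in each column, which is the same pattern as the rows reported in this issue.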

You can reproduce this error by running the following test:

from snowflake.snowpark import Session, Row, DataFrame
from snowflake.snowpark.functions import col, lit, substring


def test_substring(session: Session):
    df: DataFrame = session.create_dataframe(
        [
            ["hello"],
            ["world"],
            ["snowflake"],
        ],
        schema=["string"],
    )

    # After filter, the internal pandas index becomes non-contiguous ([1, 2]).
    # mock_substring builds its result ColumnEmulator without index=base_expr.index,
    # so pandas introduces a spurious NaN row when with_column joins the result back.
    result = (
        df.filter(col("string") != lit("hello"))
        .with_column(
            "substring",
            substring(col("string"), lit(2), lit(3)),
        )
        .collect()
    )

    # Expected: [Row('world', 'orl'), Row('snowflake', 'now')]
    # Actual: 3 rows, added extra NaN/None row due to index mismatch in mock_substring
    assert len(result) == 2
    assert result[0]["SUBSTRING"] == "orl"
    assert result[1]["SUBSTRING"] == "now"

4. What did you expect to see?

The query should return exactly 2 rows with the substring correctly aligned to each row:

[
    Row(STRING='world', SUBSTRING='orl'),
    Row(STRING='snowflake', SUBSTRING='now')
]

Instead, the local testing framework returns 3 rows with misaligned data:

[
    Row(STRING='world', SUBSTRING='now'), 
    Row(STRING='snowflake', SUBSTRING=nan),
    Row(STRING=nan, SUBSTRING='orl'),
]

When the same code runs against a real Snowflake connection, SUBSTRING works fine because there is no pandas index involved. The bug is exclusively in the local testing mock.

This happens because the mock_substring function in snowflake.snowpark.mock._functions builds its result ColumnEmulator without passing index=base_expr.index, so after any .filter() that makes the index non-contiguous, with_column produces extra None/NaN rows.
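The shape of the proposed fix can be sketched with plain pandas. This is not the real mock_substring: substring_keep_index is a hypothetical helper, a pd.Series stands in for ColumnEmulator, and the sketch assumes a 1-based start position of at least 1. The point it illustrates is only that the result is constructed with the input's index.

```python
import pandas as pd


def substring_keep_index(base: pd.Series, start: int, length: int) -> pd.Series:
    """Hypothetical stand-in for the mock: 1-based substring of each value."""

    def cut(value):
        # Propagate missing values instead of slicing them.
        if pd.isna(value):
            return None
        begin = start - 1  # convert the 1-based SQL position to a Python offset
        return value[begin:begin + length]

    values = [cut(v) for v in base]
    # Key step: reuse the input index rather than letting pandas create 0..n-1.
    return pd.Series(values, index=base.index, dtype=object)


# Column as it looks after the filter removed row 0.
filtered = pd.Series(["world", "snowflake"], index=[1, 2])
result = substring_keep_index(filtered, 2, 3)

# Both columns share the index [1, 2], so combining them adds no rows.
combined = pd.DataFrame({"string": filtered, "substring": result})
print(combined)
```

Because the result carries the same index as its input, any later index-based merge lines the values up row by row.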


I'm willing to implement this fix and submit a pull request. The change would include:

  • A fix to the mock_substring() function in src/snowflake/snowpark/mock/_functions.py
  • A test_substring() test case in tests/mock/test_functions.py

Metadata

Labels

bug: Something isn't working
local testing: Local Testing issues/PRs
status-triage_done: Initial triage done, will be further handled by the driver team
