SNOW-3172585: [Local Testing] Bug in the mock_substring() function with local testing in Snowpark Python

Please answer these questions before submitting your issue. Thanks!

**1. What version of Python are you using?**

Python 3.10.19 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 16:41:31) [MSC v.1929 64 bit (AMD64)]

**2. What are the Snowpark Python and pandas versions in the environment?**

pandas==2.2.3
snowflake-snowpark-python==1.34.0

**3. What did you do?**

Applied substring() to a DataFrame that had previously been filtered with filter().
When rows are removed by a filter, the underlying pandas index becomes non-contiguous (e.g. [1, 2] instead of [0, 1, 2]). 
The built-in local testing mock for substring returns a ColumnEmulator with a fresh 0-based index instead of preserving the original index from the input column. 
When with_column (for example) merges the result back into the DataFrame, pandas performs an outer-join on the mismatched indices, producing extra rows filled with NaN.
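The index mismatch described above can be shown with pandas alone. This is a minimal sketch, not Snowpark's internal code: `pd.concat` is assumed here as a stand-in for whatever alignment with_column performs, and the variable names are illustrative.

```python
import pandas as pd

# Stand-in for the filtered DataFrame: the "hello" row (index 0) is dropped,
# so the remaining index is [1, 2].
source = pd.DataFrame({"STRING": ["hello", "world", "snowflake"]})
filtered = source[source["STRING"] != "hello"]

# Stand-in for the mock's result: correct values, but a fresh 0-based index.
substrings = pd.Series(["orl", "now"], name="SUBSTRING")

# Column-wise concat aligns on the union of both indexes (labels 0, 1 and 2),
# so the two-row frame grows to three rows with NaN in the gaps.
combined = pd.concat([filtered, substrings], axis=1)
print(combined)
```

Label 1 pairs 'world' with 'now', label 2 has no substring, and label 0 has a substring but no source string, which matches the shape of the misaligned result reported below.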

You can reproduce the error by running the following test:

```python
from snowflake.snowpark import DataFrame, Session
from snowflake.snowpark.functions import col, lit, substring


def test_substring(session: Session):
    df: DataFrame = session.create_dataframe(
        [
            ["hello"],
            ["world"],
            ["snowflake"],
        ],
        schema=["string"],
    )

    # After filter, the internal pandas index becomes non-contiguous ([1, 2]).
    # mock_substring builds its result ColumnEmulator without index=base_expr.index,
    # so pandas introduces a spurious NaN row when with_column joins the result back.
    result = (
        df.filter(col("string") != lit("hello"))
        .with_column(
            "substring",
            substring(col("string"), lit(2), lit(3)),
        )
        .collect()
    )

    # Expected: [Row('world', 'orl'), Row('snowflake', 'now')]
    # Actual: 3 rows, added extra NaN/None row due to index mismatch in mock_substring
    assert len(result) == 2
    assert result[0]["SUBSTRING"] == "orl"
    assert result[1]["SUBSTRING"] == "now"
```

**4. What did you expect to see?**

The query should return exactly 2 rows with the substring correctly aligned to each row:
```
[
    Row(STRING='world', SUBSTRING='orl'),
    Row(STRING='snowflake', SUBSTRING='now')
]
```

Instead, the local testing framework returns 3 rows with misaligned data:
```
[
    Row(STRING='world', SUBSTRING='now'), 
    Row(STRING='snowflake', SUBSTRING=nan),
    Row(STRING=nan, SUBSTRING='orl'),
]
```

When the same code runs against a real Snowflake connection, SUBSTRING works fine because there is no pandas index involved. **The bug is exclusively in the local testing mock.**

This happens because the mock_substring function in snowflake.snowpark.mock._functions builds its result ColumnEmulator without preserving index=base_expr.index, so after any .filter() that makes the index non-contiguous, with_column produces None/NaN rows.
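The idea behind the fix can be sketched in plain pandas. The helper `substring_keep_index` below is hypothetical and is not the actual mock_substring implementation; it only illustrates that building the result with the input's index keeps the rows aligned.

```python
import pandas as pd


def substring_keep_index(column: pd.Series, start: int, length: int) -> pd.Series:
    """Hypothetical helper: 1-based substring that reuses the input's index."""
    values = [text[start - 1 : start - 1 + length] for text in column]
    # Passing index=column.index keeps each result on the same label as its input row.
    return pd.Series(values, index=column.index, name="SUBSTRING")


source = pd.DataFrame({"STRING": ["hello", "world", "snowflake"]})
filtered = source[source["STRING"] != "hello"]

result = substring_keep_index(filtered["STRING"], 2, 3)
combined = pd.concat([filtered, result], axis=1)
print(combined)
```

With the index preserved, the combined frame keeps two rows, and each substring sits next to the string it was computed from.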

---
I'm willing to implement this fix and submit a pull request. The change would include:

- [ ] Fix to the mock_substring() function in src/snowflake/snowpark/mock/_functions.py
- [ ] test_substring() test case in tests/mock/test_functions.py

