TableModel validator rejects pandas StringDtype for instance_key column #1062

## Problem

The `TableModel._validate_table_annotation_metadata()` method rejects `instance_key` columns that use pandas' `StringDtype` extension dtype, even though they are valid string columns. Users are forced to fall back to object dtype (`'O'`), although pandas has offered the dedicated `StringDtype` for text data since version 1.0.

## Current Behavior

The validator at `spatialdata/models/models.py:1016-1030` checks:

```python
if (dtype := data.obs[attr[self.INSTANCE_KEY]].dtype) not in [
    int,
    np.int16,
    np.int32,
    np.int64,
    np.uint16,
    np.uint32,
    np.uint64,
    "O",
] or (dtype == "O" and (val_dtype := type(data.obs[attr[self.INSTANCE_KEY]].iloc[0])) is not str):
    dtype = dtype if dtype != "O" else val_dtype
    raise TypeError(
        f"Only int, np.int16, np.int32, np.int64, uint equivalents or string allowed as dtype for "
        f"instance_key column in obs. Dtype found to be {dtype}"
    )
```

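This rejects `pd.StringDtype()`, which is what `.astype("string")` produces, and what `.astype(str)` produces when pandas' string inference is active (the `future.infer_string` option, which is the default from pandas 3.0). `StringDtype` is not in the accepted list and does not compare equal to `"O"`, so the first condition is already true and the `TypeError` is raised.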
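
The membership test can be reproduced without spatialdata. The sketch below copies the accepted-dtype list from the validator snippet into a constant named `ACCEPTED` (a name introduced here for illustration only) and checks three column dtypes against it. It assumes only pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Accepted dtypes copied from the validator snippet above.
# ACCEPTED is a hypothetical name used only in this sketch.
ACCEPTED = [int, np.int16, np.int32, np.int64, np.uint16, np.uint32, np.uint64, "O"]

object_col = pd.Series(["cell_1", "cell_2"], dtype=object)    # legacy object dtype
string_col = pd.Series(["cell_1", "cell_2"], dtype="string")  # StringDtype
int_col = pd.Series([1, 2], dtype=np.int64)

# `in` compares the column dtype against each entry with `==`.
object_ok = object_col.dtype in ACCEPTED
string_ok = string_col.dtype in ACCEPTED
int_ok = int_col.dtype in ACCEPTED

print(object_ok, string_ok, int_ok)
```

Only the `StringDtype` column fails the membership test, which matches the failure described in this issue.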

## Error Message

```text
TypeError: Only int, np.int16, np.int32, np.int64, uint equivalents or string allowed as
dtype for instance_key column in obs. Dtype found to be <class 'pandas.core.arrays.string_.StringDtype'>
```

## Expected Behavior

The validator should accept:
- `pd.StringDtype()` - modern pandas string type
- `pd.CategoricalDtype` with string categories (common use case)
- Current accepted types (object dtype with strings, integers)

## Reproduction

```python
import pandas as pd
import anndata as ad
from spatialdata.models import TableModel
import numpy as np

# Create simple AnnData with StringDtype
adata = ad.AnnData(X=np.array([[1, 2], [3, 4]]))
adata.obs["cell_id"] = pd.array(["cell_1", "cell_2"], dtype="string")  # StringDtype
adata.obs["region"] = pd.Categorical(["region_A", "region_A"])

# This fails with TypeError
table = TableModel.parse(
    adata,
    region="region_A",
    region_key="region",
    instance_key="cell_id",
)
```

## Proposed Solution

Update the validation to accept `StringDtype` and potentially categorical with string categories:

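```python
import numpy as np
import pandas as pd

# Fragment intended for TableModel._validate_table_annotation_metadata();
# `self`, `data` and `attr` come from the surrounding method.
dtype = data.obs[attr[self.INSTANCE_KEY]].dtype
is_valid = (
    # Integer and object dtypes accepted by the current check
    dtype in [int, np.int16, np.int32, np.int64, np.uint16, np.uint32, np.uint64, "O"]
    # pandas extension string dtype
    or isinstance(dtype, pd.StringDtype)
    # Categorical whose categories are stored as object or StringDtype
    or (
        isinstance(dtype, pd.CategoricalDtype)
        and (dtype.categories.dtype == "O" or isinstance(dtype.categories.dtype, pd.StringDtype))
    )
)

if dtype == "O":
    # Object columns must actually hold Python strings (checked on the first value)
    val_dtype = type(data.obs[attr[self.INSTANCE_KEY]].iloc[0])
    is_valid = is_valid and val_dtype is str

if not is_valid:
    raise TypeError(...)
```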
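
To make the proposal easier to test in isolation, here is a minimal standalone sketch of the same rule. The function name `is_valid_instance_key` and the constant `INT_DTYPES` are hypothetical and not part of spatialdata. Unlike the fragment above, it inspects every value of an object column (and every category) rather than only the first one, and it treats an empty object column as invalid; both are assumptions of this sketch, not decisions from the issue.

```python
import numpy as np
import pandas as pd

# Integer dtypes accepted by the current validator (copied from the issue).
INT_DTYPES = (int, np.int16, np.int32, np.int64, np.uint16, np.uint32, np.uint64)


def is_valid_instance_key(col: pd.Series) -> bool:
    """Hypothetical helper: True if `col` holds integer or string keys."""
    dtype = col.dtype
    if isinstance(dtype, pd.StringDtype):
        return True
    if isinstance(dtype, pd.CategoricalDtype):
        cats = dtype.categories
        if isinstance(cats.dtype, pd.StringDtype):
            return True
        # Object categories are only accepted if every category is a str.
        return cats.dtype == object and all(isinstance(c, str) for c in cats)
    if dtype == object:
        # Object columns must be non-empty and contain only Python strings.
        return len(col) > 0 and all(isinstance(v, str) for v in col)
    return any(dtype == t for t in INT_DTYPES)


string_col = pd.Series(["cell_1", "cell_2"], dtype="string")
mixed_col = pd.Series([1, "cell_2"], dtype=object)
print(is_valid_instance_key(string_col), is_valid_instance_key(mixed_col))
```

Checking all values costs a full pass over the column; if that is too slow for large tables, the first-value check from the fragment above is the cheaper alternative.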

## Impact

This affects:
- spatialdata-io readers that cast instance keys with `.astype(str)`, whenever that cast yields `StringDtype` rather than object dtype
- Users following pandas best practices for string handling
- Any downstream code that creates tables with string instance keys

## Workaround

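Currently, users must explicitly cast the `instance_key` column with `.astype("object")` instead of `.astype(str)` before calling `TableModel.parse()`. This ties user code to the legacy object dtype and is not ideal for long-term maintainability.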
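
A minimal sketch of the cast, using only pandas so it runs without spatialdata. The variable names are illustrative. It assumes the column has no missing values, since `pd.NA` entries would survive the cast and are not `str`.

```python
import pandas as pd

# Instance keys stored as StringDtype, as in the reproduction above.
keys = pd.Series(["cell_1", "cell_2"], dtype="string")

# Cast to object dtype so the current validator accepts the column.
keys_obj = keys.astype("object")

dtype_after = keys_obj.dtype               # object
first_value_type = type(keys_obj.iloc[0])  # str

# In real code, assign the result back before parsing, e.g.
# adata.obs["cell_id"] = adata.obs["cell_id"].astype("object")
```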

