TableModel validator rejects pandas StringDtype for instance_key column #1062

@jlr13

Problem

The TableModel._validate_table_annotation_metadata() method rejects an instance_key column that uses pandas' StringDtype, even though such a column holds valid strings. This forces users to fall back to the object dtype ('O'), although pandas has offered a dedicated string dtype since version 1.0 and recommends it for text data.

Current Behavior

The validator at spatialdata/models/models.py:1016-1030 checks:

if (dtype := data.obs[attr[self.INSTANCE_KEY]].dtype) not in [
    int,
    np.int16,
    np.int32,
    np.int64,
    np.uint16,
    np.uint32,
    np.uint64,
    "O",
] or (dtype == "O" and (val_dtype := type(data.obs[attr[self.INSTANCE_KEY]].iloc[0])) is not str):
    dtype = dtype if dtype != "O" else val_dtype
    raise TypeError(
        f"Only int, np.int16, np.int32, np.int64, uint equivalents or string allowed as dtype for "
        f"instance_key column in obs. Dtype found to be {dtype}"
    )

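This rejects pd.StringDtype(), which is what you get from .astype("string") and, on pandas versions that infer strings by default (pandas 3.0, or recent 2.x releases with the future.infer_string option enabled), also from .astype(str). On older default configurations, .astype(str) still returns object dtype and passes the check.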
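To see why the membership test fails, here is a minimal sketch that needs only pandas and numpy, not spatialdata. The list is copied from the excerpt above; the names ACCEPTED, object_col and string_col are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Accepted dtypes, copied from the validator excerpt in this issue.
ACCEPTED = [int, np.int16, np.int32, np.int64, np.uint16, np.uint32, np.uint64, "O"]

# Same values, stored once as object dtype and once as StringDtype.
object_col = pd.Series(["cell_1", "cell_2"], dtype="object")
string_col = pd.Series(["cell_1", "cell_2"], dtype="string")

# The object dtype compares equal to "O"; StringDtype matches no list entry.
object_ok = object_col.dtype in ACCEPTED
string_ok = string_col.dtype in ACCEPTED
print(object_ok, string_ok)
```

The StringDtype column never reaches the per-value str check, because the first half of the condition already fails for it.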

Error Message

TypeError: Only int, np.int16, np.int32, np.int64, uint equivalents or string allowed as
dtype for instance_key column in obs. Dtype found to be <class 'pandas.core.arrays.string_.StringDtype'>

Expected Behavior

The validator should accept:

  • pd.StringDtype() - modern pandas string type
  • pd.CategoricalDtype with string categories (common use case)
  • Current accepted types (object dtype with strings, integers)

Reproduction

import pandas as pd
import anndata as ad
from spatialdata.models import TableModel
import numpy as np

# Create simple AnnData with StringDtype
adata = ad.AnnData(X=np.array([[1, 2], [3, 4]]))
adata.obs["cell_id"] = pd.array(["cell_1", "cell_2"], dtype="string")  # StringDtype
adata.obs["region"] = pd.Categorical(["region_A", "region_A"])

# This fails with TypeError
table = TableModel.parse(
    adata,
    region="region_A",
    region_key="region",
    instance_key="cell_id",
)

Proposed Solution

Update the validation to accept StringDtype and potentially categorical with string categories:

import numpy as np
import pandas as pd

# Check if it's an integer or string-like dtype
dtype = data.obs[attr[self.INSTANCE_KEY]].dtype
is_valid = (
    dtype in [int, np.int16, np.int32, np.int64, np.uint16, np.uint32, np.uint64, "O"]
    or isinstance(dtype, pd.StringDtype)
    # Categories may be stored as object or as StringDtype, so accept both
    or (isinstance(dtype, pd.CategoricalDtype) and pd.api.types.is_string_dtype(dtype.categories.dtype))
)

if dtype == "O":
    # Additional check for object dtype
    val_dtype = type(data.obs[attr[self.INSTANCE_KEY]].iloc[0])
    is_valid = is_valid and val_dtype is str

if not is_valid:
    raise TypeError(...)
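For testing the idea outside of TableModel, the same check can be sketched as a free function. This is only an illustration under assumptions: is_valid_instance_key is a hypothetical name, not part of spatialdata, and the integer list is the one from the current validator. It relies on pd.api.types.is_string_dtype inspecting the category values, which is the behaviour of pandas 2.0 and later.

```python
import numpy as np
import pandas as pd

# Integer dtypes accepted by the current validator (from the excerpt in this issue).
_INT_DTYPES = [int, np.int16, np.int32, np.int64, np.uint16, np.uint32, np.uint64]


def is_valid_instance_key(col: pd.Series) -> bool:
    """Hypothetical helper: True if `col` looks like an integer or string column."""
    dtype = col.dtype
    if isinstance(dtype, pd.StringDtype):
        # Dedicated pandas string dtype
        return True
    if isinstance(dtype, pd.CategoricalDtype):
        # Categorical is accepted only when its categories are strings
        return bool(pd.api.types.is_string_dtype(dtype.categories))
    if dtype == "O":
        # Object dtype: mirror the current validator and inspect the first value
        return len(col) > 0 and isinstance(col.iloc[0], str)
    return dtype in _INT_DTYPES


candidates = {
    "string": pd.Series(["cell_1", "cell_2"], dtype="string"),
    "object": pd.Series(["cell_1", "cell_2"], dtype="object"),
    "categorical": pd.Series(pd.Categorical(["cell_1", "cell_2"])),
    "int64": pd.Series([1, 2], dtype=np.int64),
    "float64": pd.Series([1.5, 2.5]),
}
for name, col in candidates.items():
    print(name, is_valid_instance_key(col))
```

Like the current validator, the object branch only looks at the first value, so a mixed object column whose first entry is a string would still pass.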

Impact

This affects:

  • All spatialdata-io readers that call .astype(str), which produces a StringDtype column on pandas versions that infer strings by default
  • Users following pandas best practices for string handling
  • Any downstream code that creates tables with string instance keys

Workaround

Currently, users must explicitly use .astype("object") instead of .astype(str), which is not ideal for long-term maintainability.
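A minimal sketch of the workaround on a plain DataFrame standing in for adata.obs (the obs variable and the column name are illustrative): cast the column back to object dtype before calling TableModel.parse.

```python
import pandas as pd

# Stand-in for adata.obs with a StringDtype instance key column.
obs = pd.DataFrame({"cell_id": pd.array(["cell_1", "cell_2"], dtype="string")})

# Cast back to object dtype so the validator's "O" branch applies.
obs["cell_id"] = obs["cell_id"].astype("object")

dtype_after = obs["cell_id"].dtype
first_value_type = type(obs["cell_id"].iloc[0])
print(dtype_after, first_value_type)
```

The values stay Python str objects after the cast, so the per-value check in the current validator passes.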
