Problem
The TableModel._validate_table_annotation_metadata() method rejects instance_key columns that use pandas' StringDtype, even though these are valid string columns. This forces users to fall back to the older object dtype ('O'), although pandas has offered a dedicated string dtype since version 1.0.
Current Behavior
The validator at spatialdata/models/models.py:1016-1030 checks:
if (dtype := data.obs[attr[self.INSTANCE_KEY]].dtype) not in [
    int,
    np.int16,
    np.int32,
    np.int64,
    np.uint16,
    np.uint32,
    np.uint64,
    "O",
] or (dtype == "O" and (val_dtype := type(data.obs[attr[self.INSTANCE_KEY]].iloc[0])) is not str):
    dtype = dtype if dtype != "O" else val_dtype
    raise TypeError(
        f"Only int, np.int16, np.int32, np.int64, uint equivalents or string allowed as dtype for "
        f"instance_key column in obs. Dtype found to be {dtype}"
    )
This rejects pd.StringDtype(), which is the dtype you get from dtype="string" and, as of pandas 3.0, from .astype(str). On earlier pandas defaults, .astype(str) still yields object dtype.
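The rejection can be reproduced without spatialdata. The following is a minimal sketch, assuming only numpy and pandas are installed; ALLOWED copies the list from the validator above, and the variable names are illustrative rather than taken from the library.

```python
import numpy as np
import pandas as pd

# Same allow-list as in the validator quoted above.
ALLOWED = [int, np.int16, np.int32, np.int64, np.uint16, np.uint32, np.uint64, "O"]

object_col = pd.Series(["cell_1", "cell_2"], dtype=object)
string_col = pd.Series(["cell_1", "cell_2"], dtype="string")  # StringDtype

# Membership uses ==, so the object dtype matches the "O" entry,
# while StringDtype compares unequal to every entry in the list.
object_ok = object_col.dtype in ALLOWED
string_ok = string_col.dtype in ALLOWED

print(f"object dtype accepted: {object_ok}")
print(f"StringDtype accepted:  {string_ok}")
```

Both columns hold the same Python strings; only the dtype label differs, which is why the check fails before the element type is ever inspected.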
Error Message
TypeError: Only int, np.int16, np.int32, np.int64, uint equivalents or string allowed as
dtype for instance_key column in obs. Dtype found to be <class 'pandas.core.arrays.string_.StringDtype'>
Expected Behavior
The validator should accept:
- pd.StringDtype(): modern pandas string type
- pd.CategoricalDtype with string categories (common use case)
- Currently accepted types (object dtype with strings, integers)
Reproduction
import pandas as pd
import anndata as ad
from spatialdata.models import TableModel
import numpy as np
# Create simple AnnData with StringDtype
adata = ad.AnnData(X=np.array([[1, 2], [3, 4]]))
adata.obs["cell_id"] = pd.array(["cell_1", "cell_2"], dtype="string") # StringDtype
adata.obs["region"] = pd.Categorical(["region_A", "region_A"])
# This fails with TypeError
table = TableModel.parse(
    adata,
    region="region_A",
    region_key="region",
    instance_key="cell_id",
)
Proposed Solution
Update the validation to accept StringDtype and potentially categorical with string categories:
import pandas as pd

# Check if it's a string-like dtype
dtype = data.obs[attr[self.INSTANCE_KEY]].dtype
is_valid = (
    dtype in [int, np.int16, np.int32, np.int64, np.uint16, np.uint32, np.uint64, "O"]
    or isinstance(dtype, pd.StringDtype)
    or (isinstance(dtype, pd.CategoricalDtype) and dtype.categories.dtype == "O")
)
if dtype == "O":
    # Additional check for object dtype
    val_dtype = type(data.obs[attr[self.INSTANCE_KEY]].iloc[0])
    is_valid = is_valid and val_dtype is str
if not is_valid:
    raise TypeError(...)
Impact
This affects:
- Any spatialdata-io reader that builds instance keys with .astype(str), which yields a StringDtype column under modern pandas defaults
- Users following pandas best practices for string handling
- Any downstream code that creates tables with string instance keys
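A standalone version of the relaxed check proposed above might look like the sketch below. It is not the spatialdata implementation: is_valid_instance_key is a hypothetical helper name, and two choices are assumptions of this sketch rather than part of the proposal. It uses pd.api.types.infer_dtype so that every element of an object column is checked (the current validator only looks at the first), and it judges a categorical column by its categories so that both object and StringDtype categories pass.

```python
import numpy as np
import pandas as pd

# Integer types taken from the validator's current allow-list.
_INT_DTYPES = (np.int16, np.int32, np.int64, np.uint16, np.uint32, np.uint64)


def is_valid_instance_key(col: pd.Series) -> bool:
    """Hypothetical helper: True if `col` holds integers or strings."""
    dtype = col.dtype
    if isinstance(dtype, pd.CategoricalDtype):
        # Judge a categorical column by the values of its categories.
        return is_valid_instance_key(pd.Series(dtype.categories))
    if isinstance(dtype, pd.StringDtype):
        return True
    if dtype == object:
        # Stricter than checking iloc[0]: all non-missing values must be str.
        return pd.api.types.infer_dtype(col, skipna=True) == "string"
    return any(dtype == t for t in _INT_DTYPES)


strings = pd.Series(["cell_1", "cell_2"])
checks = {
    "object": is_valid_instance_key(strings.astype(object)),
    "string": is_valid_instance_key(strings.astype("string")),
    "category": is_valid_instance_key(strings.astype("category")),
    "int64": is_valid_instance_key(pd.Series([1, 2], dtype=np.int64)),
    "float64": is_valid_instance_key(pd.Series([1.0, 2.0])),
}
print(checks)
```

An empty object column returns False here instead of raising IndexError, which may or may not be the behavior the maintainers want.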
Workaround
Currently, users must explicitly use .astype("object") instead of .astype(str), which is not ideal for long-term maintainability.
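Until the validator changes, the workaround amounts to converting the column back to object dtype before parsing. A minimal sketch, assuming a plain DataFrame stands in for adata.obs and that the column has no missing values:

```python
import pandas as pd

# Stand-in for adata.obs with a StringDtype instance_key column.
obs = pd.DataFrame({"cell_id": pd.array(["cell_1", "cell_2"], dtype="string")})

# Convert to object dtype so the current validator accepts the column.
obs["cell_id"] = obs["cell_id"].astype(object)

first = obs["cell_id"].iloc[0]
print(obs["cell_id"].dtype, type(first))
```

After the conversion the column passes both parts of the current check: the dtype is object and the first element is a Python str.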