Skip to content

ENH: add sort() method to StringArray and ArrowStringArray#65052

Open
ssam18 wants to merge 11 commits into
pandas-dev:mainfrom
ssam18:BUG-GH64977-sort-method-for-StringArray
Open

ENH: add sort() method to StringArray and ArrowStringArray#65052
ssam18 wants to merge 11 commits into
pandas-dev:mainfrom
ssam18:BUG-GH64977-sort-method-for-StringArray

Conversation

@ssam18
Copy link
Copy Markdown

@ssam18 ssam18 commented Apr 3, 2026

Fixes #64977
In pandas 3.x, Series.unique() returns StringArray or ArrowStringArray instead of a numpy array, but neither of these had a .sort() method that numpy arrays have always supported. This broke code that called .sort() on the result of .unique().

The fix adds a sort() method to NDArrayBackedExtensionArray (which covers StringArray) and ArrowExtensionArray (which covers ArrowStringArray). For numpy-backed arrays, the underlying _ndarray is reordered in-place; for Arrow-backed arrays (which are immutable), the internal reference is swapped.

Tests cover ascending and descending sort, NA placement with na_position='first'/'last', and the exact repro from the issue.

@jbrockmendel
Copy link
Copy Markdown
Member

is the idea to add it for just these arrays or all EAs? I think @rhshadrach would want all EAs

@ssam18
Copy link
Copy Markdown
Author

ssam18 commented Apr 4, 2026

@jbrockmendel Thank you for the clarification. I have moved the implementation to the base ExtensionArray class so all EAs that support setitem get sort() for free. The numpy-backed and Arrow-backed overrides are kept for efficiency.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions Bot added the Stale label May 5, 2026
@ssam18
Copy link
Copy Markdown
Author

ssam18 commented May 5, 2026

can someone review this PR?

Copy link
Copy Markdown
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - just some requests on testing style. Also needs a whatsnew in 3.1.0rst under "Other Enhancements"

Comment thread pandas/core/arrays/base.py
Comment thread pandas/tests/arrays/string_/test_string.py Outdated
Comment thread pandas/tests/arrays/string_/test_string.py Outdated
Comment thread pandas/tests/arrays/string_/test_string.py Outdated
@rhshadrach rhshadrach changed the title BUG: add sort() method to StringArray and ArrowStringArray ENH: add sort() method to StringArray and ArrowStringArray May 5, 2026
ssam18 added 5 commits May 5, 2026 19:36
In pandas 3.x, Series.unique() started returning StringArray/ArrowStringArray
instead of numpy.ndarray. Since numpy arrays have an in-place sort() method
but extension arrays didn't, code that called .sort() on unique() results
broke. Added sort() to NDArrayBackedExtensionArray and ArrowExtensionArray,
and tests covering ascending, descending, NA placement, and the original repro.
1. Add whatsnew entry under Other enhancements in v3.1.0.rst
2. Add GH issue reference comments to new sort tests
3. Simplify NA handling in test_sort_with_na to use dtype.na_value directly
4. Parametrize test_sort_with_na on na_position to remove duplicated body
@ssam18 ssam18 force-pushed the BUG-GH64977-sort-method-for-StringArray branch from 2ebe0fd to 0a477cf Compare May 6, 2026 00:41
Copy link
Copy Markdown
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach rhshadrach requested a review from jbrockmendel May 7, 2026 20:30
@rhshadrach rhshadrach added ExtensionArray Extending pandas with custom dtypes or arrays. Sorting e.g. sort_index, sort_values labels May 7, 2026
@rhshadrach rhshadrach added this to the 3.1 milestone May 7, 2026
Comment thread pandas/tests/arrays/string_/test_string.py Outdated
@jbrockmendel
Copy link
Copy Markdown
Member

How necessary are the subclass overrides? For NDArrayBacked in particular I don't expect it to make much difference

@jbrockmendel
Copy link
Copy Markdown
Member

A few questions, no objections.

@Parth-Dholariya
Copy link
Copy Markdown

Hi, I noticed the discussion about moving the sort() tests to the extension array test suite. I’m new to contributing to pandas, but I’d be happy to help look into the existing extension test structure and check whether a common ExtensionArray.sort test can be added.

Please let me know if this would be useful, and I can start exploring it.

ssam18 added 4 commits May 15, 2026 15:31
Per review feedback on GH#65052, hoist the in-place sort tests from
the StringArray-specific test file into BaseMethodsTests so every
ExtensionArray subclass exercises the inherited sort() method through
data_for_sorting / data_missing_for_sorting fixtures.
SparseArray.__setitem__ raises TypeError, so the base
ExtensionArray.sort() implementation (which does self[:] = self.take(...))
failed with a confusing internal error. Override sort() to raise
NotImplementedError cleanly and assert that behavior in the extension
test suite.

Surfaced by the new shared sort tests in BaseMethodsTests.
JSONArray.__setitem__ in the extension test helper didn't iterate over
slice keys, so any self[:] = ... path (including the inherited
ExtensionArray.sort()) blew up with "slice object is not iterable".
Expand slice keys to a range before dispatching.

Three pre-existing xfail markers on test_setitem_slice* now XPASS, so
remove or scope them accordingly.
Comment thread pandas/tests/arrays/string_/test_string.py Outdated
@jbrockmendel
Copy link
Copy Markdown
Member

Do we need to add sort in docs/source/reference/extensions.rst?

Simplified test_sort_unique_result to construct the StringArray directly rather than going through a DataFrame, since the bug is just about unique followed by sort. Also listed ExtensionArray.sort in the extensions reference page next to argsort so the new public method is documented.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Sorting e.g. sort_index, sort_values

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: AttributeError: 'StringArray' object has no attribute 'sort'

4 participants