ENH: add sort() method to StringArray and ArrowStringArray#65052
ENH: add sort() method to StringArray and ArrowStringArray#65052ssam18 wants to merge 11 commits into
Conversation
|
is the idea to add it for just these arrays or all EAs? I think @rhshadrach would want all EAs |
|
@jbrockmendel Thank you for the clarification. I have moved the implementation to the base ExtensionArray class so all EAs that support setitem get sort() for free. The numpy-backed and Arrow-backed overrides are kept for efficiency. |
|
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
|
can someone review this PR? |
rhshadrach
left a comment
There was a problem hiding this comment.
Looks good - just some requests on testing style. Also needs a whatsnew in 3.1.0rst under "Other Enhancements"
In pandas 3.x, Series.unique() started returning StringArray/ArrowStringArray instead of numpy.ndarray. Since numpy arrays have an in-place sort() method but extension arrays didn't, code that called .sort() on unique() results broke. Added sort() to NDArrayBackedExtensionArray and ArrowExtensionArray, and tests covering ascending, descending, NA placement, and the original repro.
1. Add whatsnew entry under Other enhancements in v3.1.0.rst 2. Add GH issue reference comments to new sort tests 3. Simplify NA handling in test_sort_with_na to use dtype.na_value directly 4. Parametrize test_sort_with_na on na_position to remove duplicated body
2ebe0fd to
0a477cf
Compare
|
How necessary are the subclass overrides? For NDArrayBacked in particular I don't expect it to make much difference |
|
A few questions, no objections. |
|
Hi, I noticed the discussion about moving the sort() tests to the extension array test suite. I’m new to contributing to pandas, but I’d be happy to help look into the existing extension test structure and check whether a common ExtensionArray.sort test can be added. Please let me know if this would be useful, and I can start exploring it. |
Per review feedback on GH#65052, hoist the in-place sort tests from the StringArray-specific test file into BaseMethodsTests so every ExtensionArray subclass exercises the inherited sort() method through data_for_sorting / data_missing_for_sorting fixtures.
SparseArray.__setitem__ raises TypeError, so the base ExtensionArray.sort() implementation (which does self[:] = self.take(...)) failed with a confusing internal error. Override sort() to raise NotImplementedError cleanly and assert that behavior in the extension test suite. Surfaced by the new shared sort tests in BaseMethodsTests.
JSONArray.__setitem__ in the extension test helper didn't iterate over slice keys, so any self[:] = ... path (including the inherited ExtensionArray.sort()) blew up with "slice object is not iterable". Expand slice keys to a range before dispatching. Three pre-existing xfail markers on test_setitem_slice* now XPASS, so remove or scope them accordingly.
…od-for-StringArray
|
Do we need to add |
Simplified test_sort_unique_result to construct the StringArray directly rather than going through a DataFrame, since the bug is just about unique followed by sort. Also listed ExtensionArray.sort in the extensions reference page next to argsort so the new public method is documented.
Fixes #64977
In pandas 3.x,
Series.unique()returnsStringArrayorArrowStringArrayinstead of a numpy array, but neither of these had a.sort()method that numpy arrays have always supported. This broke code that called.sort()on the result of.unique().The fix adds a
sort()method toNDArrayBackedExtensionArray(which coversStringArray) andArrowExtensionArray(which coversArrowStringArray). For numpy-backed arrays, the underlying_ndarrayis reordered in-place; for Arrow-backed arrays (which are immutable), the internal reference is swapped.Tests cover ascending and descending sort, NA placement with
na_position='first'/'last', and the exact repro from the issue.