GH-42018: Add numpy.StringDType support #48391
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the openness of the Apache Arrow project. Then could you also rename the pull request title in the recommended format?

I have no idea whether this is the right way to implement NumPy compat in Arrow. I'll also add a better test case for the na_value parameter.

Did you review the code to ensure it was/looked correct?

@pitrou yes! I don't work with C++ code professionally, so I lack the knowledge to judge whether, e.g., the NumPy 2.0+ feature usage is sound or ridiculous. This was my first attempt and experiment with using AI for OSS development. The issue turned out to be more complex than I initially expected. The PR is still small enough, so I'd appreciate feedback on whether this is a good direction, and I completely understand if you say it is simply not reviewable and would waste the participants' time.

I might try to take a look, but I will first have to get acquainted with the StringDType.

@jorisvandenbossche might have insight.

cc @ngoldbaum
```cpp
inline npy_string_allocator* ArrowNpyString_acquire_allocator(
    const PyArray_StringDTypeObject* descr) {
  using Func = npy_string_allocator* (*)(const PyArray_StringDTypeObject*);
  auto func = reinterpret_cast<Func>(PyArray_API[316]);
```
Are these functions exposed or are they only accessible through the PyArray_API?
This is a reimplementation of the public API, since the NumPy API version used in PyArrow was below 2.0 and these functions were “hidden”. This way the code compiles with both old and new NumPy. I didn't find a better example in PyArrow of how to onboard new API. The alternative is dropping NumPy 1.26 support (which might be allowed if PyArrow follows SPEC 0).
The idea is unfortunately right, I think, since you can't put this in a different compilation unit with NPY_TARGET_VERSION=NPY_2_0_API_VERSION, and you can't switch NPY_TARGET_VERSION within the same compilation unit.
Maybe for these specific functions that was actually unnecessary (and if you/@ngoldbaum like, we could make them always defined in 2.5; it would just segfault if you ever used it, but we/I missed that it is impossible for there to be something to use it on).
That way, it might only be available if built with NumPy 2.5, but that is probably OK in practice, since that is what official builds will use (except maybe for some old Python version).
That said, NPY_ABI_VERSION >= 0x02000000 isn't good here. If anything it should be NPY_FEATURE_VERSION (which you can enforce via NPY_TARGET_VERSION to indicate a minimal version of NumPy you support at runtime).
While I think it's correct to hack it in this vein, I would suggest guarding it in a way that uses the custom version only when necessary and makes it obvious how to clean it up in the future.
(Even just copy-paste the C definitions with an #ifndef ...!?)
If NumPy offers the required defines (which it will, at least on newer versions or with NPY_TARGET_VERSION=NPY_2_0_API_VERSION), it is better to stop using this hack; there is no guarantee it will remain correct forever.
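The compile-time guards discussed above have a runtime analogue on the Python side of PyArrow: whether the running NumPy ships StringDType at all. A minimal, hypothetical sketch (the function name is mine, not part of the PR):

```python
import numpy as np

def string_dtype_available() -> bool:
    """Return True when the running NumPy ships StringDType (NumPy >= 2.0).

    np.dtypes exists since NumPy 1.25; the getattr fallback keeps this
    safe to call on even older versions.
    """
    return hasattr(getattr(np, "dtypes", None), "StringDType")
```

Such a check lets Python-level code branch without ever touching the versioned C API.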
I can take a look at this, but I'll keep in mind an AI wrote it...
```cpp
#if NPY_ABI_VERSION >= 0x02000000
bool IsStringDType(PyArray_Descr* descr) {
  // NumPy's variable-width StringDType exposes a dedicated dtype number.
```
total nit: this comment doesn't really add anything - all dtypes shipped by NumPy expose a character code.
```cpp
#include "arrow/python/type_traits.h"
#include "arrow/python/vendored/pythoncapi_compat.h"

#if NPY_ABI_VERSION >= 0x02000000
```
I'm not sure this is correct. Isn't what's important that NPY_TARGET_VERSION > NPY_2_0_API_VERSION? Ditto for similar macro usage below.
The question is whether Arrow cares to support --no-build-isolation builds in an env that has NumPy 1.x installed.
Quite a few projects do, even though any other (including official) build uses 2.x anyway (unless you are building for ancient Python versions, maybe).
On such builds, NPY_VSTRING isn't defined, so that is probably the only place where this (or a similar pattern) is necessary.
You could also just #define NPY_VSTRING 2056 if it isn't defined in an old numpy version. Technically we could change that value in NumPy but I seriously doubt we would.
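The same type-number check can be illustrated from Python, where `dtype.num` exposes the value the C code compares. A hedged sketch (the helper name is mine; the constant 2056 is taken from the review comment above):

```python
import numpy as np

# Value from the review discussion: the C enum NPY_VSTRING is 2056.
NPY_VSTRING = 2056

def is_string_dtype(dtype) -> bool:
    # dtype.num exposes the same type number the C-level check compares;
    # inputs without a .num attribute simply return False.
    return getattr(dtype, "num", -1) == NPY_VSTRING
```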
```cpp
      int (*)(npy_string_allocator*, const npy_packed_static_string*, npy_static_string*);
  auto func = reinterpret_cast<Func>(PyArray_API[313]);
  return func(allocator, packed, out);
}
```
Wow I hate this. Please don't do this.
Can't this code just be conditionally defined based on NPY_TARGET_API?
```cpp
      ConvertPySequence(reinterpret_cast<PyObject*>(arr_),
                        reinterpret_cast<PyObject*>(mask_), py_options, pool_));
  out_arrays_ = chunked_array->chunks();
  return Status::OK();
```
This is going to be very slow. Also, the only way to get a StringDType array from NumPy is if at runtime you're using NumPy 2.0 or newer. So the only way you enter this code is if you're building PyArrow with NumPy 1.x but then want to use it at runtime with NumPy 2.0. Seems kinda silly? Why not just build with NumPy 2.0? You don't need to build with the oldest supported NumPy anymore.
```cpp
Status NumPyConverter::AppendStringDTypeValues(Builder* builder) {
  auto* descr = reinterpret_cast<PyArray_StringDTypeObject*>(dtype_);

  npy_string_allocator* allocator = ArrowNpyString_acquire_allocator(descr);
```
FYI for other reviewers: this locks a mutex internally in NumPy.
```cpp
  npy_static_string value = {0, nullptr};

  auto append_value = [&](const npy_packed_static_string* packed) -> Status {
    int rc = ArrowNpyString_load(allocator, packed, &value);
```
I like referring to the return value of NpyString_load as is_null, since, when it's true, it indicates that you've loaded a missing string.
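To make that naming convention concrete, here is a small, hypothetical Python mirror of the C control flow (assuming NpyString_load's documented return convention: negative on error, nonzero when the packed string is missing):

```python
def handle_loaded(is_null: int, value: str, builder: list) -> list:
    """Mirror the C flow around NpyString_load's return value:
    negative means error, truthy means a missing string."""
    if is_null < 0:
        raise RuntimeError("NpyString_load failed")
    if is_null:
        builder.append(None)   # missing string -> null slot
    else:
        builder.append(value)
    return builder
```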
```cpp
  npy_static_string value = {0, nullptr};

  auto append_value = [&](const npy_packed_static_string* packed) -> Status {
```
personally I find this closure a little hard to follow but I'm not a big C++ guy. I'd just make it a helper function that's defined outside this function.
```python
dtype = StringDType()

arr = np.array(["some", "strings"], dtype=dtype)
```
I would probably include some strings that cross NumPy's internal "medium string" and "long string" size limits. See NEP 55 for more details on the specifics there. I'd also probably include some unicode characters too.
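Following that suggestion, hypothetical test vectors might look like the sketch below. The exact byte thresholds between NEP 55's storage classes are an implementation detail, so the sizes are simply chosen to be comfortably past them:

```python
import numpy as np

SAMPLES = [
    "short",
    "m" * 300,       # well past the inline small-string limit
    "l" * 70_000,    # large enough for the long-string representation
    "ünïcødé ✓",     # multi-byte UTF-8
]

def make_string_array():
    # Requires NumPy >= 2.0 at runtime.
    from numpy.dtypes import StringDType
    return np.array(SAMPLES, dtype=StringDType())
```

A round-trip through PyArrow would then assert that all four samples survive conversion unchanged.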
@ngoldbaum thanks for the review. I'll address the comments as soon as I figure out what to do with the NumPy versions. I also appreciate the open-minded perspective on the AI; I did my best to submit only something that works and doesn't have unnecessary code.
Rationale for this change
Implement #42018
What changes are included in this PR?
Conversion in the numpy-to-arrow direction for multiple string types
Are these changes tested?
Two basic conversion tests added
Are there any user-facing changes?
Yes, this adds support for numpy.StringDType as a source
I'm not sure what the AI policy is for apache/arrow, this PR was created using OpenAI Codex.
cc @jorisvandenbossche as he opened the original issue