BUG: DataFrame reductions losing EA dtypes#52261
BUG: DataFrame reductions losing EA dtypes#52261jbrockmendel wants to merge 4 commits intopandas-dev:mainfrom
Conversation
|
gentle ping @rhshadrach not looking to merge anytime soon but for thoughts on the approach |
There was a problem hiding this comment.
thanks for the ping - sorry this fell off my radar. I really like the approach here - It looks like this is close to impacting other ops like min, e.g.
df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}, dtype='Int64')
result = df.min()
print(result) # currently is int64, should be Int64 I think
However, currently after making the values 2D, we do values[~mask] which brings it back to 1D again. I think we can instead do np.where(mask, fill_value, values). This is similar to how nanops works.
any other particular cases need testing?
Because of the above - it's not exactly clear what ops this is impacting. Could maybe add a basic test across the reduction ops and xfail any that lose EA dtype?
| try: | ||
| return values._reduce(name, skipna=skipna, keepdims=True, **kwds) | ||
| except (TypeError, ValueError): | ||
| # no keepdims keyword yet; ValueError gets raised by | ||
| # util validator functions | ||
| return values._reduce(name, skipna=skipna, **kwds) |
There was a problem hiding this comment.
This will just be the try portion when fully implemented, yea?
There was a problem hiding this comment.
right. there would need to be a deprecation cycle to allow 3rd party EAs to catch up
| if isinstance(res, (np.ndarray, ExtensionArray)): | ||
| # keepdims worked! | ||
| result_arrays.append(res) | ||
| else: | ||
| # TODO NaT doesn't preserve dtype, so we need to ensure to create | ||
| # a timedelta result array if original was timedelta | ||
| # what if datetime results in timedelta? (eg std) | ||
| dtype = arr.dtype if res is NaT else None | ||
| result_arrays.append(sanitize_array([res], None, dtype=dtype)) |
There was a problem hiding this comment.
Similar - is the plan to be able to remove this else?
|
I've looked into this PR after you pointed it out. I had missed this PR unfortunately... I like the keepdims on the arrays approach, it's a natural way to way to keep the dtype information that is being lost in reductions to scalars, rather than doing it on the frame as in my approach. I'm not a super fan of picking I think adding a Also, because I did #52707 from the point of solving #40669, I had a focus on frames with I can take a closer look if the above approach works, if that's ok with you? |
sounds good |
|
Closing in favor of #52788 |
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.In the absence of 2D EAs, we need reductions to sometimes pretend the EA is 2D. Enter "keepdims" adapted from numpy reductions.
This is still pretty ugly. I'm open to ideas to clean it up.
cc @rhshadrach any other particular cases need testing?