Conversation
max-sixty
left a comment
There was a problem hiding this comment.
This seems extremely good!!
I left some refine-y comments, nothing preclusive.
What do others think? I don't use interpolate_na myself much so other may have views...
| | datetime.timedelta | ||
| ) = None, | ||
| limit_direction: LimitDirectionOptions = "both", | ||
| limit_area: LimitAreaOptions | None = None, |
There was a problem hiding this comment.
Refine-y but:
- are
interpolateandextrapolateclearer options? - is there a better name for the param? I'm not sure. "area" sounds 2D to me
There was a problem hiding this comment.
True. I adopted this nomenclature from pandas. What about limit_region, does this sound less 2D?
Personally, I like the 'inside'/'outside' options, 'interpolate' sounds close to linear interpolation to me...
|
|
||
| return bfill(self, dim, limit=limit) | ||
|
|
||
| def fill_gaps( |
There was a problem hiding this comment.
Any thoughts from others on the naming? Would .fill be insufficiently specific that it's filling na? Would fill_missing be clearer than fill_gaps?
There was a problem hiding this comment.
Open to any of those. .fill sounds very concise, but maybe this is easily confused with .ffill
There was a problem hiding this comment.
I think .fill could be quite nice, do others have a view?
There was a problem hiding this comment.
Maybe gap_filler instead, since this method does not actually fill the gaps.
I'm also wondering if its better to have a method that constructs the appropriate mask that can be used later
mask = ds.get_gap_mask(max_gap=...)
ds.ffill(...).where(~mask)
There was a problem hiding this comment.
Good points! Just to explain:
gap_filler emphasizes the returned object type nicely! However, I choose fill_gaps because it fits the naming scheme of other object-returning functions better (e.g. rolling and coarsen are not called roller and coarser in xarray, even though the operation is not perfomed immediately and an object is returned).
Ultimately (I am a non-native english speaker) I am happy for any recommendations regarding nomenclature.
If you prefer gap_filler, I will change accordingly.
The function API is also presented as an alternative in the initial proposal. I decided to go for the object way because it is shorter (one line) and less error prone (you might easily forget the ~). If the mask is required, you can easily get it from the object:
mask=ds.gap_filler(...).mask
headtr1ck
left a comment
There was a problem hiding this comment.
Basically all the new methods are missing return types, please consider adding them.
The new class should be generic, I have added a few review comments here and there but more adoptions are needed.
Thanks a lot for the review of the typing! I`ll work through them asap and try to make all adoptions (probabily still need some help afterwards, though ...) |
|
@headtr1ck I did my best to add type hints everywhere. Not sure how to resolve the last four remaining mypy errors... |
I'm currently on holiday and it's difficult to debug these errors on the phone. |
|
There are a couple of outstanding questions about the interfaces; in particular:
Re typing — are the interfaces typed correctly? I think they are. Assuming those are correct, I would vote that it's fine to add ignores internally (ofc the internal checks are useful tests for the external interfaces, so ideally we'd fix those) What's the best way of resolving those outstanding questions @pydata/xarray ? I know I've been terrible on attending calls now I'm on PT, but maybe we could do it on one of those? (would be great to minimize the delay on these sorts of PRs — I think they're really valuable and would love to engender more...) |
|
One more naming decision:
|
|
Did you manage to discuss this this morning? |
@max-sixty Yes, @Ockenfuss explained what decisions are to be made and encouraged to have a look here. |
|
Hi @max-sixty, not too much community feedback yet :) What do you think, shall we have another look on the three open naming issues and then push this over the finish line? |
Yes let's do it! I would bias for action; if folks disagree then dissent welcome. What's remaining? Personally I think we should simplify this: #9402 (comment) + resolve any names. The external types should be correct; internally some ignores are OK... |
b5025ff to
72c76db
Compare
|
Ok, I think from my side this is ready for merge. I stayed with the fill_gaps and limit_area names for now. Check out the doc build pages for fill_gaps and the GapMask object to get a feeling for it and let me know if it feels right :) This PR leaves all existing fill function API untouched. If we want to bring some of the new limit arguments to the existing functions (like DataArray.fillna(limit=...)), this could be done later in separate PRs. TBD:
|
|
Looks great, I think this is indeed very close to landing. The one thing I'm stuck on is the concept of (Totally fine if others get this, I'm not the gatekeeper here, but do think it's important the API makes sense...) |
|
See this comment for an explanation, or just try out what happens :) For example: da=xr.DataArray([1, np.nan, np.nan, 2], dims=['x'])
da.fill_gaps('x', limit=1, limit_direction='backward').ffill()I agree, this is maybe not obvious. Here, I explained why I left it as is. If you want, I can try to keep track of the limit_direction argument upon mask creation and raise an exception, if the user applies ffill later. But this requires some tweaks in the code. Edit: I added a sentence to the fill_gaps doc page to explain the behaviour better. |
|
Yes, I see what it does, but (very very respectfully, with much admiration for the overall PR!), it doesn't make semantic sense to me. When I think "backward", I mean "look behind you, is there a gap? OK now fill it". I don't see how it can mean "start one step back, and then decide which direction to continue". Why only one step? I think we should try hard to have xarray's interfaces be understandable without much technical knowledge. (Indeed, I think there's probably a more precise way of describing my objection around edges and continuous vs. discrete, but hopefully this suffices). Would you object to removing the |
|
Ha, thanks a lot for the drawing! I appreciate it! FWIW the step there that didn't make sense to me is "one step backward" for
|
|
I went for option 3: Default to 'forward' in case of ffill and 'backward' in case of bfill. If this is not the case, an error is raised. (This is similar to what pandas is doing, check out the docstring here.) |
|
I'm late to the discussion (thanks for sticking with us for so long, even though we've been really, really slow!), but to me Other than that, the only suggestion I have is to look at the naming of the API (I guess I'm bikeshedding?): ds.fill_gaps(...).interpolate(...) # no need to call it `interpolate_na` since it already is in a namespace?kinda works, but ds.fill_gaps(...).ffill(...)
ds.fill_gaps(...).bfill(...)repeats "fill" and feels a bit awkward (it does fit with the ds.gaps(...).ffill(...)
ds.gaps(...).interpolate(...)or ds.na(...).ffill(...)
ds.na(...).interpolate(...)The latter may be a bit short and missing the semantic hint of "we're working on gaps", though. (And either way, these are suggestions, not requests, so feel free to reject them if it doesn't make too much sense or if it has already been discussed in a previous version of this PR). |
|
Thanks you, no worries, all feedback is welcome everytime!! ffill+direction backward issueYour understanding is correct. I have no immediate use case for it, but I thought initially why forbidding something that is possible. However, Max also thought this case may be confusing, so we decided to raise an error if attempting it now. NamesI am open to anything! I am with you that having "gap" somehow in the name helps for the understanding. Just "gaps" is also fine for me. Anything elseYou might wonder: Why a new fill_gaps API function and not just equip the existing functions with new arguments (e.g. for ffill: limit, limit_area, max_gap, limit_use_coordinate)? |
|
Great! For the name, a verb is consistent with the existing interface ( ...unless ...so maybe that could work? If others agree then I would be in favor of Unless we do decide to change the name, is anything else remaining? Last call for feedback @pydata/xarray ! |
|
How about |
for more information, see https://pre-commit.ci
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
…rpolate_na on a GapMask object
Raise error if this is not the case. This implies that the mask creation needs to be delayed until the filling stage.
53ee83f to
2cf6659
Compare
for more information, see https://pre-commit.ci
- adapt jupyter/ipython command to new notation in user guide
|
Hi @max-sixty, since a colleague asked me recently about gap filling in xarray, I remembered this PR. :) I think we were basically done, just the naming was open (see last comments). I rebased the development branch to the new main branch to bring it up to date (however, the docs build is currently failing). What do you think, any chance we are going to push this over the finish line? |

Improve gap filling
fill_gaps()to control gap length for all filling functions (ffill/bfill/interpolate_na/fillna)This PR involves a full implementation, documentation and corresponding tests. As discussed in #7665, a new API function
fill_gapswas introduced, to avoid breaking changes of existing code and keep the argument inflation of the filling functions to a minimum.TBD if accepted:
whats-new.rstapi.rstDeprecations
limitin interpolate_na now requires numbagg or bottleneck. So far, limit worked without bottleneck, max_gap required it (not documented). Reason: max_gap and limit now share the same codebase using ffill. Since no one ever complained about the missing documentation and the error message is pretty clear and easy to resolve ("No module named 'bottleneck"), I hope this might be acceptable.