[c-api][python-package][R-package] expose feature num bin#5048
[c-api][python-package][R-package] expose feature num bin#5048StrikerRUS merged 12 commits intomicrosoft:masterfrom
Conversation
StrikerRUS
left a comment
There was a problem hiding this comment.
Thanks a lot for this PR!
LGTM for the Python part and just one minor suggestion for the test.
Also, could you please open a new issue or rename the old original one requesting to add the same functionality into the R-package?
|
I'll include the changes for the R package |
shiyu1994
left a comment
There was a problem hiding this comment.
Thank you! Very nice test cases.
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Yeah, please check #4829 as example. New function should be registered here: LightGBM/R-package/src/lightgbm_R.cpp Lines 925 to 926 in 3032b64 |
jameslamb
left a comment
There was a problem hiding this comment.
I'm travelling right now but would really like the chance to review this, now that changes to the R package's public API have been included.
Could you please either reserve the R changes for a separate PR, or otherwise wait 2 days to merge this? I'll be able to provide a review the day after tomorrow.
|
Thank you @jameslamb, I'd really appreciate your review. I don't think we're in a rush to merge this, should I change this to draft or something similar to avoid merging it? |
I was going to say that adding "WIP" to the title will trigger a blocking CI job, but I guess that job doesn't block merging (I can still see the merge button available). So changing this to draft would be fine. As I kind of implied in my comment, it would also have been fine to split the R changes into a separate PR if you wanted. Not a big deal either way though, I'm back and able to review this morning! |
jameslamb
left a comment
There was a problem hiding this comment.
Excellent work! The changes on the R side look good to me, and I appreciate you making the interface there 1-based to match R's conventions. LightGBM's R package has historically been inconsistent about that (see, for example, discussion in #4970).
I left a few very minor suggestions for the R package.
I also think there's one other thing we should talk about now, on this PR. I believe it'd be useful to have a similar interface, but where you could get the number of bins by feature name.
I think that exposing that would prevent the need for users to write code like this to achieve the same thing when using named data objects like data frames:
feat_indx = [
i, _
for feat_name in df.columns
if feat_name = "my_cool_feature"
][0]
ds.feature_num_bin(feat_indx)I see three options for that:
- Don't do this at all.
- Allow the input to
feature_num_bin()in the Python and R packages to acceptUnion[int, str]- assume that a string passed to this method should be matched exactly in
feature_names
- assume that a string passed to this method should be matched exactly in
- Keep
feature_num_bin()in the Python and R packages as-is and later add a newDatasetmethod likefeature_num_bin_by_feature_name()- this accepts a string, matches it to
feature_namesto get the index (iffeature_nameshave been defined), then callsfeature_num_bin()with that index.
- this accepts a string, matches it to
I only bring this up here in this review because if we want to eventually do Option 2, then I'd recommend changing the argument name to feature instead of feature_idx.
I have a mild preference for Option 3, as I think having explicit methods without union types makes it easier to reason about how the code is going to work. But this project already uses union types in so many other places, I'm not opposed to Option 2.
What do you think? also tagging @StrikerRUS @shiyu1994 @guolinke for other opinions.
|
I like option 2 and I think in the R package it's very straightforward to implement because the feature names get stored as column names. However for the Python package it would involve getting the feature names each time. Would it be in the scope of this PR to add a buffer or something so that they're only computed once? |
Sorry, I'm not proposing also accepting feature names as part of this PR. I don't want to delay this PR by adding on more things. Just thought we should make a decision about how we think we might support the use of feature names, since it could impact the choice of argument name for the new methods introduced in this PR. |
|
@jameslamb Thanks a lot for the suggestion of adding access by feature name! I'm for option 2 for the consistency with some other APIs, e.g. LightGBM/python-package/lightgbm/plotting.py Lines 184 to 187 in 5631366 LightGBM/python-package/lightgbm/basic.py Lines 3741 to 3744 in 5631366 |
|
Ok works for me @jmoralez and @StrikerRUS ! Let's go with option 2. @jmoralez for this PR, could you just change the keyword argument name to |
jameslamb
left a comment
There was a problem hiding this comment.
Looks great to me! Thanks very much for the changes and excellent tests.
@jameslamb Sorry for missing the message. I also agree with option 2. |
|
No problem! Thanks for taking the time to come back and comment! |
|
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |


This adds the
LGBM_DatasetGetFeatureNumBinfunction to the C API to be able to get the number of constructed bins for a feature. I also added this to the Python API as the Dataset methodfeature_num_binand added a test for it.Closes #3406.