[python-package] do not copy column-major numpy arrays when creating Dataset#6721
[python-package] do not copy column-major numpy arrays when creating Dataset#6721StrikerRUS merged 12 commits intomasterfrom
Conversation
|
This could also be done for the predict portion, i.e. LightGBM/python-package/lightgbm/basic.py Lines 1268 to 1307 in 5151fe8 Please let me know if it's ok to include it here or in a separate PR. |
StrikerRUS
left a comment
There was a problem hiding this comment.
LGTM! Thank you for very useful improvement!
I just left a couple of non-blocking comments below for your consideration.
| if mat.flags["F_CONTIGUOUS"]: | ||
| order = "F" | ||
| else: | ||
| order = "C" |
There was a problem hiding this comment.
Just want to highlight that arrays can be C-contiguous and F-contiguous simultaneously.
https://stackoverflow.com/q/60230562
Arrays can be both C-style and Fortran-style contiguous simultaneously. This is clear for 1-dimensional arrays, but can also be true for higher dimensional arrays.
https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flags.html
Seems it's correctly handled here.
I believe it would be better to have a follow-up PR for prediction after merging this one. |
jameslamb
left a comment
There was a problem hiding this comment.
This looks great to me! I agree with all of @StrikerRUS 's comments, so won't approve until those are addressed. But I don't have any additional comments.
Thanks for working on this, and really nice tests!
| assert ndcg_score_no_bagging_by_query >= ndcg_score - 0.1 | ||
|
|
||
|
|
||
| def test_equal_datasets_from_row_major_and_col_major_data(tmp_path): |
There was a problem hiding this comment.
As this test now doesn't include train(), I think it should go to test_basic.py file.
Would you like to take a look after the recent commits? |
|
Sure, I'll take a look. |
|
This pull request has been automatically locked since there has not been any recent activity since it was closed. |
Changes the logic to flatten the data array to consider its layout:
I ran some quick tests with an array of (100k, 50) and the timings to construct the dataset from a C-contiguous and an F-contiguous arrays are roughly the same, so this doesn't introduce extra latency, just avoids the copy.