Description
The API for constructing a Dataset from sampled data (LGBM_DatasetCreateFromSampledColumn) has parameters for the size of the sample set (num_sample_row) and the size of the full data (num_total_row). num_total_row represents the number of rows in that particular Dataset on that particular machine, not the global number of rows spread over a distributed data corpus, because the API uses num_total_row to allocate space for that Dataset on that machine.
However, num_total_row is used for 2 more purposes: 1) to scale min_data_in_leaf, and 2) to validate that the sample size is large enough. For both of these other uses (besides allocation), a standalone setup or a single distributed machine works fine, since there is no difference between the per-machine row count and the total distributed row count. For true distributed scenarios, however, it would be more appropriate to use the total data count over all machines.
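The two extra uses can be sketched as follows (a simplified model based on the description above; the actual thresholds and logic live in LightGBM's C++ Dataset construction code):

```python
def min_data_in_leaf_scale(num_sample_row, num_total_row):
    # min_data_in_leaf is scaled down to compensate for computing bin
    # statistics on a sample rather than on the full data.
    return num_sample_row / num_total_row

def sample_size_ok(num_sample_row, num_total_row):
    # The sample is accepted if it has at least 100K rows, or is at
    # least 20% of the total row count.
    return num_sample_row >= 100_000 or num_sample_row >= 0.2 * num_total_row
```

Because num_total_row is the per-machine row count, both of these computations see only local totals in a distributed run.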
These are somewhat corner cases, but they usually will not throw errors and will give no notice to the user that something was incorrect.
Reproducible example
The following scenarios demonstrate the issues. They use 2 distributed nodes for simplicity, but the issues get worse as the number of machines increases.
Scenario 1A (bad feature rejection):
Both machines have 100K num_total_row (200K total) and 20K num_sample_row, with min_data_in_leaf set to 10. The LGBM_DatasetCreateFromSampledColumn API on both machines will scale min_data_in_leaf by .2 (20K/100K), so that only 2 rows in a bin (.2 * 10) will be required to consider a bin relevant (compensating for the smaller sample size). However, from the user's perspective, setting min_data_in_leaf to 10 was for a 200K total data set, so the "true" scale factor should be .1 (20K/200K), and only 1 row should be required per bin (.1 * 10).
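The arithmetic for Scenario 1A, assuming min_data_in_leaf is scaled by num_sample_row / num_total_row as described above:

```python
min_data_in_leaf = 10
num_sample_row = 20_000

# Each machine computes the scale from its local row count.
local_scale = num_sample_row / 100_000      # 0.2
rows_required_local = local_scale * min_data_in_leaf

# But the user sized min_data_in_leaf for the 200K global data set.
global_scale = num_sample_row / 200_000     # 0.1
rows_required_global = global_scale * min_data_in_leaf
```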
Scenario 1B (bad feature rejection):
1 machine with 100K num_total_row and 1 machine with 200K num_total_row (300K total), both with 20K num_sample_row. Machine 1 will use a scale factor of .2 on min_data_in_leaf, and machine 2 will use .1. Therefore, they will use different criteria for deciding which features are relevant (since the feature determination is split between the 2 machines). This is bad enough, but in a scenario where rows are sent randomly to machine 1 or 2 (which might not be easily under user control), the bin calculations are no longer deterministic. The user would have to tightly control which rows go to each machine (since the feature set a machine calculates depends on its network rank).
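To illustrate the divergence (hypothetical bin count, same scaling assumption as above): a bin holding a single row clears machine 2's threshold but not machine 1's, so the two ranks disagree about whether the bin is relevant.

```python
min_data_in_leaf = 10
num_sample_row = 20_000

threshold_m1 = (num_sample_row / 100_000) * min_data_in_leaf  # 2 rows
threshold_m2 = (num_sample_row / 200_000) * min_data_in_leaf  # 1 row

rows_in_bin = 1  # hypothetical count for some bin of some feature
kept_on_m1 = rows_in_bin >= threshold_m1  # machine 1 drops the bin
kept_on_m2 = rows_in_bin >= threshold_m2  # machine 2 keeps it
```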
Scenario 2A (bad sample count rejection):
2 machines with 100K rows each and 20K num_sample_row. LGBM_DatasetCreateFromSampledColumn rejects the sample size if it is not large enough: it must be at least 100K rows or 20% of the total count. In this scenario, 20K samples is not 20% of the total data set (200K), but LGBM_DatasetCreateFromSampledColumn will not reject it because it uses only the local machine count: 20K is 20% of 100K, so it passes. So the hardwired 20% limit is not really applied correctly.
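Scenario 2A in terms of the acceptance rule above (accept if the sample has at least 100K rows or at least 20% of the total; a model of the described check, not the actual C++ code):

```python
def sample_size_ok(num_sample_row, num_total_row):
    return num_sample_row >= 100_000 or num_sample_row >= 0.2 * num_total_row

# Validated against the local count, the sample passes, ...
accepted_locally = sample_size_ok(20_000, 100_000)
# ... but against the true 200K distributed total it would be rejected.
accepted_globally = sample_size_ok(20_000, 200_000)
```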
Scenario 2B (bad sample count rejection):
1 machine with 100K rows and 1 machine with 200K rows, both with 20K num_sample_row. Machine 1 will accept the sample count (since it is >= 20% of its local count), but machine 2 will reject it (since it is < 20%). This is an inconsistent experience.
Environment info
LightGBM version or commit hash:
Command(s) you used to install LightGBM
Additional Comments
Proposed solution
Add a separate total_dist_count parameter to LGBM_DatasetCreateFromSampledColumn and use it for the 2 purposes above (sample count validation and min_data_in_leaf scaling). Make it a separate API entry point for backward compatibility.
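A minimal sketch of the proposed behavior, using a hypothetical total_dist_count parameter (the actual C API name and signature would be settled in the PR):

```python
def effective_min_data_in_leaf(min_data_in_leaf, num_sample_row,
                               num_total_row, total_dist_count=None):
    # total_dist_count is the hypothetical new parameter: the row count
    # summed over all machines. When absent, fall back to the local
    # num_total_row, preserving the existing single-machine behavior.
    total = total_dist_count if total_dist_count is not None else num_total_row
    return (num_sample_row / total) * min_data_in_leaf
```

With every machine passing the same total_dist_count, all ranks compute the same scale factor regardless of how the rows were split across them.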
I will submit a suggested PR fix in a bit and edit the link here.