Description
The API for constructing a Dataset from sampled data (LGBM_DatasetCreateFromSampledColumn) has parameters for the size of the sample set (num_sample_row) and the size of the full data (num_total_row). num_total_row represents the number of rows in that particular Dataset on that particular machine, not the global number of rows spread over a distributed data corpus, because the API uses num_total_row to allocate space for that Dataset on that machine.
However, num_total_row is used for 2 more purposes: 1) to scale min_data_in_leaf, and 2) to validate that the sample size is large enough. For both of these other uses (besides allocation), a standalone setup or a single distributed machine works fine, since there is no difference between the per-machine row count and the total distributed row count. For true distributed scenarios, however, it would be more appropriate to use the total data count over all machines.
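The two extra uses can be sketched as follows (a simplified model based on the description above; the actual thresholds and logic live in LightGBM's C++ Dataset construction code):

```python
def min_data_in_leaf_scale(num_sample_row, num_total_row):
    # min_data_in_leaf is scaled down to compensate for computing bin
    # statistics on a sample rather than on the full data.
    return num_sample_row / num_total_row

def sample_size_ok(num_sample_row, num_total_row):
    # The sample is accepted if it has at least 100K rows, or is at
    # least 20% of the total row count.
    return num_sample_row >= 100_000 or num_sample_row >= 0.2 * num_total_row
```

Because num_total_row is the per-machine row count, both of these computations see only local totals in a distributed run.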
These are somewhat corner cases, but they usually will not throw errors and will give no notice to the user that something was incorrect.
Reproducible example
The following scenarios demonstrate the issues. They use 2 distributed nodes for simplicity, but the issues get worse as the number of machines increases.
Scenario 1A (bad feature rejection):
Both machines have 100K num_total_row (200K total) and 20K num_sample_row, with min_data_in_leaf set to 10. The LGBM_DatasetCreateFromSampledColumn API on both machines will scale min_data_in_leaf by .2 (20K/100K), so that only 2 rows in a bin (.2 * 10) will be required to consider a bin relevant (compensating for the smaller sample size). However, from the user's perspective, setting min_data_in_leaf to 10 was for a 200K total data set, so the "true" scale factor should be .1 (20K/200K), and only 1 row should be required per bin (.1 * 10).
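The arithmetic for Scenario 1A, assuming min_data_in_leaf is scaled by num_sample_row / num_total_row as described above:

```python
min_data_in_leaf = 10
num_sample_row = 20_000

# Each machine computes the scale from its local row count.
local_scale = num_sample_row / 100_000      # 0.2
rows_required_local = local_scale * min_data_in_leaf

# But the user sized min_data_in_leaf for the 200K global data set.
global_scale = num_sample_row / 200_000     # 0.1
rows_required_global = global_scale * min_data_in_leaf
```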
Scenario 1B (bad feature rejection):
1 machine with 100K num_total_row and 1 machine with 200K num_total_row (300K total), both with 20K num_sample_row. Machine 1 will use a scale factor of .2 on min_data_in_leaf, and machine 2 will use .1. Therefore, they will use different criteria for deciding which features are relevant (since the feature determination is split between the 2 machines). This is bad enough, but in a scenario where rows are sent randomly to machine 1 or 2 (which might not be easily under user control), the bin calculations are no longer deterministic. The user would have to tightly control which rows go to each machine (since the feature set a machine calculates depends on its network rank).
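To illustrate the divergence (hypothetical bin count, same scaling assumption as above): a bin holding a single row clears machine 2's threshold but not machine 1's, so the two ranks disagree about whether the bin is relevant.

```python
min_data_in_leaf = 10
num_sample_row = 20_000

threshold_m1 = (num_sample_row / 100_000) * min_data_in_leaf  # 2 rows
threshold_m2 = (num_sample_row / 200_000) * min_data_in_leaf  # 1 row

rows_in_bin = 1  # hypothetical count for some bin of some feature
kept_on_m1 = rows_in_bin >= threshold_m1  # machine 1 drops the bin
kept_on_m2 = rows_in_bin >= threshold_m2  # machine 2 keeps it
```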
Scenario 2A (bad sample count rejection):
2 machines with 100K rows each and 20K num_sample_row. LGBM_DatasetCreateFromSampledColumn rejects the sample size if it is not large enough: it must be at least 100K rows or 20% of the total count. In this scenario, 20K samples is not 20% of the total data set (200K), but LGBM_DatasetCreateFromSampledColumn will not reject it because it uses only the local machine count: 20K is 20% of 100K, so it passes. So the hardwired 20% limit is not really applied correctly.
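Scenario 2A in terms of the acceptance rule above (accept if the sample has at least 100K rows or at least 20% of the total; a model of the described check, not the actual C++ code):

```python
def sample_size_ok(num_sample_row, num_total_row):
    return num_sample_row >= 100_000 or num_sample_row >= 0.2 * num_total_row

# Validated against the local count, the sample passes, ...
accepted_locally = sample_size_ok(20_000, 100_000)
# ... but against the true 200K distributed total it would be rejected.
accepted_globally = sample_size_ok(20_000, 200_000)
```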
Scenario 2B (bad sample count rejection):
1 machine with 100K rows and 1 machine with 200K rows, both with 20K num_sample_row. Machine 1 will accept the sample count (since it is >= 20% of its local count), but machine 2 will reject it (since it is < 20%). This is an inconsistent experience.
Environment info
LightGBM version or commit hash:
Command(s) you used to install LightGBM
Additional Comments
Proposed solution
Add a separate total_dist_count parameter to LGBM_DatasetCreateFromSampledColumn and use it for the 2 purposes above (sample count validation and min_data_in_leaf scaling). Make it a separate API entry point for backward compatibility.
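A minimal sketch of the proposed behavior, using a hypothetical total_dist_count parameter (the actual C API name and signature would be settled in the PR):

```python
def effective_min_data_in_leaf(min_data_in_leaf, num_sample_row,
                               num_total_row, total_dist_count=None):
    # total_dist_count is the hypothetical new parameter: the row count
    # summed over all machines. When absent, fall back to the local
    # num_total_row, preserving the existing single-machine behavior.
    total = total_dist_count if total_dist_count is not None else num_total_row
    return (num_sample_row / total) * min_data_in_leaf
```

With every machine passing the same total_dist_count, all ranks compute the same scale factor regardless of how the rows were split across them.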
I will submit a suggested PR fix in a bit and edit the link here.