[python-package] Booster.get_split_value_histogram() is not implemented in the expected way #7155

@cafferychen

Description

The get_split_value_histogram method does not work the way it should: it does not return the right bin_edges for the specified feature.

Reproducible example

I used two ways to get the bin_edges of a specific feature:

# Way 1: bin edges reported by get_split_value_histogram
hist, bin_edges = lgb_model.get_split_value_histogram(feat_name)

# Way 2: unique split thresholds for the same feature, read from the tree structure
df = lgb_model.trees_to_dataframe()
df_split_nodes = df[df['split_gain'].notna()]
bin_edges_2 = df_split_nodes[df_split_nodes['split_feature'] == feat_name]['threshold'].unique()
bin_edges_2 = sorted(set(bin_edges_2))

The two bin_edges are not the same. I think the second one is right, and that the problem is in the source code of the get_split_value_histogram method.

The variable bins in the right frame of the picture below should be set to sorted(list(set(values))) rather than to an integer. An integer makes np.histogram cut the value range into equal-width intervals, which produces bin_edges that were never used as thresholds when splitting the nodes.

In fact, the variable values already contains all the thresholds for the feature, so sorted(list(set(values))) is enough to get the right bin_edges.
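To illustrate the difference without a trained model, here is a minimal NumPy-only sketch. The values array is made-up data standing in for the thresholds collected for one feature, and bins=3 is an arbitrary choice. It is not taken from the LightGBM source. It only shows how np.histogram behaves with an integer bins compared with explicit edges.

```python
import numpy as np

# Hypothetical thresholds for one feature (made-up numbers, for illustration only).
values = np.array([0.5, 0.5, 1.5, 4.0, 4.0, 4.0, 10.0])

# Integer bins: NumPy cuts [min, max] into equal-width intervals,
# so the interior edges need not coincide with any real threshold.
hist_equal, edges_equal = np.histogram(values, bins=3)

# Explicit edges: pass the sorted unique thresholds themselves.
unique_thresholds = np.unique(values)  # sorted and de-duplicated
hist_exact, edges_exact = np.histogram(values, bins=unique_thresholds)

# Alternative: count how often each distinct threshold occurs.
thresholds, counts = np.unique(values, return_counts=True)

print(edges_equal)   # equal-width edges between 0.5 and 10.0
print(edges_exact)   # the thresholds themselves
print(dict(zip(thresholds.tolist(), counts.tolist())))
```

One caveat with explicit edges: np.histogram treats the last bin as closed on both sides, so the count for the largest threshold is merged into the preceding bin. If a count per distinct threshold is what is wanted, np.unique with return_counts=True may be the closer fit.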

Image

Environment info

LightGBM version or commit hash: 4.6.0

Command(s) you used to install LightGBM

pip3 install lightgbm
