Description
The get_split_value_histogram method does not work the way it should: it does not return the right bin_edges for the specified feature.
Reproducible example
I used two ways to get the bin_edges of a specific feature:
hist, bin_edges= lgb_model.get_split_value_histogram(feat_name)
df = lgb_model.trees_to_dataframe()
df_split_nodes = df[df['split_gain'].notna()]
bin_edges_2 = df_split_nodes[df_split_nodes['split_feature'] == feat_name]['threshold'].unique()
bin_edges_2 = sorted(list(set(bin_edges_2)))
These two bin_edges are not the same. I personally think the second one is right. Something in the source code of the get_split_value_histogram method is not right.
The variable bins in the right frame of the picture below should be set to sorted(list(set(values))) rather than to an integer. An integer makes np.histogram cut the range into equal-width bins, which produces bin_edges that were never used as thresholds when splitting the nodes.
The variable values already contains all the thresholds of the feature, so applying sorted(list(set(values))) to it should give the right bin_edges.
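To illustrate the difference outside of LightGBM, here is a minimal NumPy-only sketch. The values array is made-up data standing in for the thresholds collected for one feature; it is not taken from a real model, and the snippet does not reproduce the LightGBM source.

```python
import numpy as np

# Hypothetical split thresholds for one feature (illustrative values only).
values = np.array([1.0, 1.0, 2.5, 7.0])

# Integer bins: np.histogram divides [min, max] into equal-width intervals,
# so the edges 3.0 and 5.0 appear even though no node splits on them.
counts_equal, equal_width_edges = np.histogram(values, bins=3)

# Explicit edges: passing the sorted unique thresholds keeps every edge
# equal to a threshold that actually occurs in the data.
threshold_edges_in = sorted(set(values.tolist()))
counts_thr, threshold_edges = np.histogram(values, bins=threshold_edges_in)

print(equal_width_edges, counts_equal)
print(threshold_edges, counts_thr)
```

Note that with explicit edges np.histogram returns one fewer bin than there are thresholds, and the last bin is closed on both sides. If a count per distinct threshold is wanted, np.unique(values, return_counts=True) may be a closer fit.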
Environment info
LightGBM version or commit hash: 4.6.0
Command(s) you used to install LightGBM
pip3 install lightgbm