fix: suppress SMOTE for multiclass targets (issue #36) #115
Merged
The SMOTE guard in template_based_adaptation.py only blocked multi-target cases (len(target_columns) > 1) but allowed SMOTE through for multiclass tasks (task.is_multiclass=True). SMOTE is only valid for binary classification, so extend the guard to also skip it when the task is multiclass.

Also add test_smote_not_recommended_for_multiclass to cover the edge case where a binary-imbalanced training split would trigger SMOTE but the full dataset has more than 2 classes (is_multiclass=True).

Co-authored-by: openhands <openhands@all-hands.dev>
Signed-off-by: openhands <openhands@all-hands.dev>
Summary
Fixes #36 — SMOTE was being recommended for multiclass classification targets.
Root Cause
_get_target_imbalance_score() in meta_features.py correctly returns 0 when a target has more than 2 classes, but the imbalance score is computed only on the training split. When a rare third class is absent from that split, the method sees only 2 classes and returns a high score, causing SMOTE to be emitted even though the overall task is multiclass (task.is_multiclass = True).
The guard in template_based_adaptation.py (line 307) only blocked multi-target columns (len(task.target_columns) > 1) and did not check task.is_multiclass.
Fix
sapientml_core/adaptation/generation/template_based_adaptation.py
Extend the SMOTE guard to also skip SMOTE when task.is_multiclass is True.
Tests
Added test_smote_not_recommended_for_multiclass to tests/sapientml/test_generatedcode_additional_patterns.py.
The test uses the target_category_binary_imbalance column (imbalance score ≈ 0.913, which would normally trigger SMOTE) but forces task.is_multiclass = True to simulate the edge case where the full dataset has a 3rd rare class absent from the training split. It asserts that "SMOTE" does not appear in the generated code.
The existing binary SMOTE test (test_additional_classifier_works_with_preprocess) continues to pass, confirming no regression for the binary case.
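
For reviewers who want to see the failure mode in isolation, here is a minimal, self-contained sketch of the root cause and the extended guard. It does not use sapientml code: the Task dataclass, imbalance_score(), should_apply_smote(), the majority-share formula, and the 0.8 threshold are all hypothetical stand-ins chosen for illustration. Only the attribute names target_columns and is_multiclass come from this PR.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Task:
    # Hypothetical stand-in for the project's task object; only the two
    # attributes mentioned in this PR are modeled.
    target_columns: List[str]
    is_multiclass: bool


def imbalance_score(labels: Sequence[str]) -> float:
    """Toy score: 0.0 unless exactly two classes are seen, else majority share.

    This is NOT the formula in _get_target_imbalance_score(); it only mimics
    the described behaviour of returning 0 for more than 2 classes.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        return 0.0
    return max(counts.values()) / len(labels)


def should_apply_smote(task: Task, score: float, threshold: float = 0.8) -> bool:
    """Guard sketch: skip SMOTE for multi-target OR multiclass tasks."""
    if len(task.target_columns) > 1 or task.is_multiclass:
        return False
    # The threshold value is an assumption made for this example.
    return score > threshold


# Three classes overall, but the rare class "c" is absent from the split.
full_labels = ["a"] * 90 + ["b"] * 9 + ["c"]
train_labels = full_labels[:95]  # 90 x "a", 5 x "b"

task = Task(target_columns=["y"], is_multiclass=len(set(full_labels)) > 2)
train_score = imbalance_score(train_labels)

# The old guard checked only the number of target columns.
old_guard_allows = len(task.target_columns) <= 1 and train_score > 0.8

print("score on full data:  ", imbalance_score(full_labels))
print("score on train split:", round(train_score, 3))
print("old guard emits SMOTE:", old_guard_allows)
print("new guard emits SMOTE:", should_apply_smote(task, train_score))
```

In this sketch the score on the full data is 0 while the score on the split is high, so a guard that looks only at the number of target columns lets SMOTE through. Checking is_multiclass, which reflects the full dataset rather than the split, closes that gap without changing the binary path.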