This repository is a learning project for practicing end-to-end machine-learning classification workflows on the Adult Income dataset.
It shows how to go from data profiling → feature engineering → model training → evaluation with multiple algorithms.
- Practice core ML classification skills.
- Compare classic vs modern algorithms.
- Understand evaluation beyond just accuracy.
- Build a reusable template for other tabular classification problems.
- Source: UCI Adult dataset (“Census Income”)
- Goal: Predict whether an individual earns >50K per year based on demographic and work attributes
- Size: ~48,000 rows, 14 features (mix of numeric and categorical)
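Assuming the raw `adult.data` CSV format from UCI, loading could look roughly like this. The two inline rows are illustrative stand-ins for the downloaded file; in the raw data, missing values are encoded as `?` and fields are padded with a leading space:

```python
import io
import pandas as pd

# Column names from the UCI "adult.names" description.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income",
]

# Two illustrative rows in the raw adult.data format; in practice you
# would point read_csv at the downloaded adult.data file instead.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical,"
    " Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "52, Self-emp-inc, 287927, HS-grad, 9, Married-civ-spouse, Exec-managerial,"
    " Wife, White, Female, 15024, 0, 40, United-States, >50K\n"
)

df = pd.read_csv(
    sample,
    names=COLUMNS,
    na_values="?",          # missing values are encoded as "?"
    skipinitialspace=True,  # the raw file pads fields with a space
)

# Binary target: 1 if income is ">50K", else 0.
df["target"] = (df["income"] == ">50K").astype(int)
print(df[["age", "workclass", "income", "target"]])
```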
**Logistic Regression**
- Baseline: simple and interpretable.
- Provides coefficients (interpretable as odds ratios).
- Evaluation: ROC-AUC, PR-AUC, calibration.
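The baseline could be set up roughly like this (scikit-learn, with synthetic data standing in for the engineered Adult features). Exponentiating the fitted coefficients gives the odds ratios mentioned above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered Adult features.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)

# Scale, then fit: coefficients become comparable across features.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# Coefficients are log-odds; exponentiating gives odds ratios per feature.
odds_ratios = np.exp(clf.named_steps["logisticregression"].coef_[0])
print(dict(enumerate(odds_ratios.round(2))))
```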
**Decision Tree**
- Simple “if–then” splits, easy to explain.
- Weak performance when used alone.
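A shallow tree's if–then splits can be printed directly, for example with scikit-learn's `export_text` (synthetic stand-in data; a depth limit keeps the rule set readable):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# A shallow tree keeps the learned rule set small enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the learned splits as plain if-then rules.
rules = export_text(tree, feature_names=[f"f{i}" for i in range(4)])
print(rules)
```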
**Random Forest**
- Ensemble of decision trees.
- More robust and accurate than a single tree.
- Evaluation: ROC-AUC, feature importance.
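A minimal sketch of fitting a forest and ranking features by impurity-based importance (synthetic stand-in data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, random_state=0
)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; rank features by them.
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking, rf.feature_importances_.round(3))
```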
**Gradient Boosting (XGBoost / LightGBM / CatBoost)**
- State-of-the-art for tabular data.
- Evaluation: ROC-AUC, PR-AUC, Precision@k / Lift@k.
**Explainable Boosting Machine (EBM)**
- Logistic regression plus a learned shape curve for each feature.
- Transparent and clinician-friendly.
- Useful when interpretability is as important as accuracy.
**Support Vector Machines (SVM)**
- Strong for high-dimensional, smaller datasets.
- Less common for large tabular healthcare-style data.
**Neural Networks**
- Rarely the best for tabular data unless the dataset is very large.
- Included for awareness only.
For imbalanced data, ROC-AUC is not enough. This project focuses on:
- ROC-AUC: overall discrimination.
- PR-AUC (Average Precision): better reflection of performance when positives are rare.
- Confusion Matrix: errors at a chosen threshold (e.g., 0.5).
- Precision / Recall / F1: trade-off between false alarms and missed positives.
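All of the metrics above can be computed in a few lines with scikit-learn (imbalanced synthetic data as a stand-in; the 0.5 threshold matches the example above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (~10% positives) as a stand-in.
X, y = make_classification(n_samples=4000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)  # chosen threshold

print("ROC-AUC:", round(roc_auc_score(y_te, proba), 3))
print("PR-AUC :", round(average_precision_score(y_te, proba), 3))
print("F1     :", round(f1_score(y_te, pred), 3))
print(confusion_matrix(y_te, pred))
```

Note that ROC-AUC and PR-AUC are computed from the scores (threshold-free), while the confusion matrix and F1 depend on the chosen threshold.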
Very small (<1,000 rows):
- High risk of overfitting with complex models.
- Prefer Logistic Regression, small Decision Trees, Random Forest (shallow), or EBM.
- Logistic Regression works even with hundreds of rows, but needs more rows than features.
Medium (thousands–hundreds of thousands):
- XGBoost / LightGBM / CatBoost perform strongly.
- Random Forest also works well.
- Neural nets usually not needed here.
Very large (millions+):
- Gradient boosting still works but requires compute resources.
- Logistic Regression scales easily.
- Neural nets become more relevant.
Low dimensional (tens of features):
- Logistic Regression, Random Forest, Gradient Boosting are fine.
High dimensional (hundreds/thousands of features):
- Common in text, genomics, image data.
- SVM (linear kernel) performs well for small/medium data.
- Logistic Regression with regularisation is strong.
- Trees/boosting may struggle if most features are noise.
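To illustrate the high-dimensional case, a small sketch where only 10 of 500 synthetic features are informative: L1 regularisation zeroes out most of the noise features, and a linear SVM also fits comfortably:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Many mostly-noise features: 500 dims, only 10 informative.
X, y = make_classification(
    n_samples=500, n_features=500, n_informative=10, random_state=0
)

# L1 regularisation drives most noise-feature coefficients to zero.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
n_kept = int(np.count_nonzero(l1.coef_))

# A linear-kernel SVM, the variant suggested above for this regime.
svm = LinearSVC(C=0.1, max_iter=5000).fit(X, y)

print(f"L1 logistic keeps {n_kept}/500 features; "
      f"linear SVM train acc: {svm.score(X, y):.2f}")
```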
- Data profiling → check class balance, missing values, outliers.
- Feature engineering → handle categoricals, scale numerics, create flags.
- Model selection and training → start with baselines, then ensembles/boosting.
- Cross-validation → compare ROC-AUC.
- Final evaluation → on held-out test set with multiple metrics.
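The model-selection and cross-validation steps above might be sketched as follows (synthetic data; `cross_val_score` with `scoring="roc_auc"` compares candidates before the held-out test set is touched):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, weights=[0.75], random_state=0)

models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Compare candidates by cross-validated ROC-AUC.
results = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}
for name, auc in results.items():
    print(f"{name}: mean ROC-AUC = {auc:.3f}")
```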
This repo has two notebooks, run in sequence: `01_eda_feature_eng.ipynb` and `02_model_selection_evaluation.ipynb`.