This repository is a learning project for practicing end-to-end machine-learning classification workflows on the Adult Income dataset.
It shows how to go from data profiling → feature engineering → model training → evaluation with multiple algorithms.
- Practice core ML classification skills.
- Compare classic vs modern algorithms.
- Understand evaluation beyond just accuracy.
- Build a reusable template for other tabular classification problems.
- Source: UCI Adult dataset (“Census Income”)
- Goal: Predict whether an individual earns >50K per year based on demographic and work attributes
- Size: ~48,000 rows, 14 features (mix of numeric and categorical)
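Assuming the raw `adult.data` CSV format from UCI, loading could look roughly like this. The two inline rows are illustrative stand-ins for the downloaded file; in the raw data, missing values are encoded as `?` and fields are padded with a leading space:

```python
import io
import pandas as pd

# Column names from the UCI "adult.names" description.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income",
]

# Two illustrative rows in the raw adult.data format; in practice you
# would point read_csv at the downloaded adult.data file instead.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical,"
    " Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "52, Self-emp-inc, 287927, HS-grad, 9, Married-civ-spouse, Exec-managerial,"
    " Wife, White, Female, 15024, 0, 40, United-States, >50K\n"
)

df = pd.read_csv(
    sample,
    names=COLUMNS,
    na_values="?",          # missing values are encoded as "?"
    skipinitialspace=True,  # the raw file pads fields with a space
)

# Binary target: 1 if income is ">50K", else 0.
df["target"] = (df["income"] == ">50K").astype(int)
print(df[["age", "workclass", "income", "target"]])
```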
**Logistic Regression**
- Baseline: simple and interpretable.
- Provides coefficients (interpretable as odds ratios).
- Evaluation: ROC-AUC, PR-AUC, calibration.
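The baseline could be set up roughly like this (scikit-learn, with synthetic data standing in for the engineered Adult features). Exponentiating the fitted coefficients gives the odds ratios mentioned above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered Adult features.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)

# Scale, then fit: coefficients become comparable across features.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# Coefficients are log-odds; exponentiating gives odds ratios per feature.
odds_ratios = np.exp(clf.named_steps["logisticregression"].coef_[0])
print(dict(enumerate(odds_ratios.round(2))))
```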
**Decision Tree**
- Simple “if–then” splits, easy to explain.
- Weak performance when used alone.
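A shallow tree's if–then splits can be printed directly, for example with scikit-learn's `export_text` (synthetic stand-in data; a depth limit keeps the rule set readable):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# A shallow tree keeps the learned rule set small enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the learned splits as plain if-then rules.
rules = export_text(tree, feature_names=[f"f{i}" for i in range(4)])
print(rules)
```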
**Random Forest**
- Ensemble of decision trees.
- More robust and accurate than a single tree.
- Evaluation: ROC-AUC, feature importance.
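A minimal sketch of fitting a forest and ranking features by impurity-based importance (synthetic stand-in data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, random_state=0
)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; rank features by them.
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking, rf.feature_importances_.round(3))
```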
**Gradient Boosting (XGBoost / LightGBM / CatBoost)**
- State-of-the-art for tabular data.
- Evaluation: ROC-AUC, PR-AUC, Precision@k / Lift@k.
**Explainable Boosting Machine (EBM)**
- Logistic regression plus a learned shape curve for each feature.
- Transparent and clinician-friendly.
- Useful when interpretability is as important as accuracy.
**Support Vector Machines (SVM)**
- Strong for high-dimensional, smaller datasets.
- Less common for large tabular healthcare-style data.
**Neural Networks**
- Rarely the best for tabular data unless the dataset is very large.
- Included for awareness only.
For imbalanced data, ROC-AUC is not enough. This project focuses on:
- ROC-AUC: overall discrimination.
- PR-AUC (Average Precision): better reflection of performance when positives are rare.
- Confusion Matrix: errors at a chosen threshold (e.g., 0.5).
- Precision / Recall / F1: trade-off between false alarms and missed positives.
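All of the metrics above can be computed in a few lines with scikit-learn (imbalanced synthetic data as a stand-in; the 0.5 threshold matches the example above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (~10% positives) as a stand-in.
X, y = make_classification(n_samples=4000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)  # chosen threshold

print("ROC-AUC:", round(roc_auc_score(y_te, proba), 3))
print("PR-AUC :", round(average_precision_score(y_te, proba), 3))
print("F1     :", round(f1_score(y_te, pred), 3))
print(confusion_matrix(y_te, pred))
```

Note that ROC-AUC and PR-AUC are computed from the scores (threshold-free), while the confusion matrix and F1 depend on the chosen threshold.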
Very small (<1,000 rows):
- High risk of overfitting with complex models.
- Prefer Logistic Regression, small Decision Trees, Random Forest (shallow), or EBM.
- Logistic Regression works even with hundreds of rows, but needs more rows than features.
Medium (thousands–hundreds of thousands):
- XGBoost / LightGBM / CatBoost perform strongly.
- Random Forest also works well.
- Neural nets usually not needed here.
Very large (millions+):
- Gradient boosting still works but requires compute resources.
- Logistic Regression scales easily.
- Neural nets become more relevant.
Low dimensional (tens of features):
- Logistic Regression, Random Forest, Gradient Boosting are fine.
High dimensional (hundreds/thousands of features):
- Common in text, genomics, image data.
- SVM (linear kernel) performs well for small/medium data.
- Logistic Regression with regularisation is strong.
- Trees/boosting may struggle if most features are noise.
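To illustrate the high-dimensional case, a small sketch where only 10 of 500 synthetic features are informative: L1 regularisation zeroes out most of the noise features, and a linear SVM also fits comfortably:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Many mostly-noise features: 500 dims, only 10 informative.
X, y = make_classification(
    n_samples=500, n_features=500, n_informative=10, random_state=0
)

# L1 regularisation drives most noise-feature coefficients to zero.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
n_kept = int(np.count_nonzero(l1.coef_))

# A linear-kernel SVM, the variant suggested above for this regime.
svm = LinearSVC(C=0.1, max_iter=5000).fit(X, y)

print(f"L1 logistic keeps {n_kept}/500 features; "
      f"linear SVM train acc: {svm.score(X, y):.2f}")
```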
- Data profiling → check class balance, missing values, outliers.
- Feature engineering → handle categoricals, scale numerics, create flags.
- Model selection and training → start with baselines, then ensembles/boosting.
- Cross-validation → compare ROC-AUC.
- Final evaluation → on held-out test set with multiple metrics.
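The model-selection and cross-validation steps above might be sketched as follows (synthetic data; `cross_val_score` with `scoring="roc_auc"` compares candidates before the held-out test set is touched):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, weights=[0.75], random_state=0)

models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Compare candidates by cross-validated ROC-AUC.
results = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}
for name, auc in results.items():
    print(f"{name}: mean ROC-AUC = {auc:.3f}")
```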
This repo has two notebooks, run in sequence: `01_eda_feature_eng.ipynb` and `02_model_selection_evaluation.ipynb`.