Learn ML - Classification

This repository is designed as a learning project to practice end-to-end machine learning classification workflows using the Adult Income dataset.
It shows how to go from data profiling → feature engineering → model training → evaluation with multiple algorithms.


🎯 Purpose of This Repo

  • Practice core ML classification skills.
  • Compare classic vs modern algorithms.
  • Understand evaluation beyond just accuracy.
  • Build a reusable template for other tabular classification problems.

📊 Dataset: Adult Income

  • Source: UCI Adult dataset (“Census Income”)
  • Goal: Predict whether an individual earns >50K per year based on demographic and work attributes
  • Size: ~48,000 rows, 14 features (mix of numeric and categorical)
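A minimal sketch of reading the raw data and binarising the `>50K` target. The column names follow the UCI documentation; a tiny inline sample stands in for the real `adult.data` file, so the parsing call is an illustration of the layout rather than the repo's actual loading code.

```python
import io

import pandas as pd

# Tiny inline stand-in for the real adult.data file (same column layout as UCI).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, "
    "Husband, White, Male, 0, 0, 45, United-States, >50K\n"
)
columns = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income",
]
df = pd.read_csv(sample, header=None, names=columns, skipinitialspace=True)

# Binarise the target: 1 if the person earns >50K, else 0.
df["income"] = (df["income"] == ">50K").astype(int)
```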

⚙️ Algorithms Explored

1. Logistic Regression

  • Baseline, simple and interpretable.
  • Provides coefficients (odds ratios).
  • Evaluation: ROC-AUC, PR-AUC, calibration.
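A baseline along these lines might look as follows — a sketch on synthetic data standing in for the engineered Adult features, showing the coefficient-to-odds-ratio step the bullets mention:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in for the engineered Adult features.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

print("ROC-AUC:", roc_auc_score(y_test, proba))
print("PR-AUC :", average_precision_score(y_test, proba))

# Coefficients are log-odds; exponentiating gives an odds ratio per feature.
odds_ratios = np.exp(clf.coef_[0])
```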

2. Decision Tree

  • Simple “if–then” splits, easy to explain.
  • Weak performance when used alone.

3. Random Forest

  • Ensemble of decision trees.
  • More robust and accurate than a single tree.
  • Evaluation: ROC-AUC, feature importance.
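The feature-importance step could be sketched like this (synthetic data; impurity-based importances are scikit-learn's default and one of several importance measures you might use):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; higher = more useful to the trees.
importances = rf.feature_importances_
ranked = sorted(enumerate(importances), key=lambda kv: kv[1], reverse=True)
```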

4. Gradient Boosting Machines

  • XGBoost / LightGBM / CatBoost.
  • State-of-the-art for tabular data.
  • Evaluation: ROC-AUC, PR-AUC, Precision@k / Lift@k.

5. Explainable Boosting Machine (EBM / GAMs)

  • A generalized additive model: like logistic regression, but each feature contributes a learned shape curve instead of a single coefficient.
  • Transparent and clinician-friendly.
  • Useful when interpretability is as important as accuracy.

6. Support Vector Machine (SVM) → not used

  • Strong for high-dimensional, smaller datasets.
  • Less common for large tabular healthcare-style data.

7. Neural Networks (MLP) → not used

  • Rarely the best for tabular data unless dataset is very large.
  • Included for awareness only.

📈 Evaluation Metrics

For imbalanced data, ROC-AUC is not enough. This project focuses on:

  • ROC-AUC: overall discrimination.
  • PR-AUC (Average Precision): better reflection of performance when positives are rare.
  • Confusion Matrix: errors at a chosen threshold (e.g., 0.5).
  • Precision / Recall / F1: the trade-off between false alarms and missed positives.
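These metrics can all be computed with scikit-learn; a toy sketch with hand-made scores, just to show where the 0.5 threshold enters (ROC-AUC and PR-AUC use the raw scores, the rest use thresholded labels):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)

# Toy labels and model scores; positives are rare, as in the income data.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.45, 0.6, 0.55, 0.8, 0.35, 0.05])

print("ROC-AUC:", roc_auc_score(y_true, scores))          # uses raw scores
print("PR-AUC :", average_precision_score(y_true, scores))

# A threshold turns scores into hard labels; everything below needs it.
y_pred = (scores >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```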

📏 Choosing a Model Based on Data Size

Rows

  • Very small (<1,000 rows):

    • High risk of overfitting with complex models.
    • Prefer Logistic Regression, small Decision Trees, Random Forest (shallow), or EBM.
    • Logistic Regression works even with hundreds of rows, but needs more rows than features.
  • Medium (thousands–hundreds of thousands):

    • XGBoost / LightGBM / CatBoost perform strongly.
    • Random Forest also works well.
    • Neural nets usually not needed here.
  • Very large (millions+):

    • Gradient boosting still works but requires compute resources.
    • Logistic Regression scales easily.
    • Neural nets become more relevant.

Columns

  • Low dimensional (tens of features):

    • Logistic Regression, Random Forest, Gradient Boosting are fine.
  • High dimensional (hundreds/thousands of features):

    • Common in text, genomics, image data.
    • SVM (linear kernel) performs well for small/medium data.
    • Logistic Regression with regularisation is strong.
    • Trees/boosting may struggle if most features are noise.
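The regularised-logistic-regression point can be illustrated on deliberately wide synthetic data: with an L1 penalty, most noise-feature coefficients are driven to exactly zero (the penalty strength `C=0.1` here is an arbitrary illustration, not a tuned value):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Wide data: far more features than rows, most of them pure noise.
X, y = make_classification(n_samples=300, n_features=1000,
                           n_informative=10, random_state=0)

# L1 regularisation zeroes out most of the noise features.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
n_used = (clf.coef_ != 0).sum()
print(f"non-zero coefficients: {n_used} of 1000")
```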

🚀 Project Workflow

  1. Data profiling → check class balance, missing values, outliers.
  2. Feature engineering → handle categoricals, scale numerics, create flags.
  3. Model selection and training → start with baselines, then ensembles/boosting.
  4. Cross-validation → compare ROC-AUC.
  5. Final evaluation → on held-out test set with multiple metrics.
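Step 4 might be sketched as follows — a small cross-validated comparison on synthetic data, using stratified folds so each fold keeps the class balance:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1500, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Compare a baseline against an ensemble on the same folds.
for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("random_forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean ROC-AUC = {scores.mean():.3f}")
```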

This repo has two notebooks, run in sequence: 01_eda_feature_eng.ipynb and 02_model_selection_evaluation.ipynb.

