A production-ready SAS credit risk scoring system using real-world data from the Kaggle "Give Me Some Credit" competition. This model predicts the probability of serious delinquency (90+ days past due) within the next 2 years.
This repository contains a complete SAS-based credit risk modeling pipeline that:
- Loads and preprocesses real credit data from Kaggle
- Engineers predictive features from financial and demographic data
- Trains a logistic regression model with stepwise variable selection
- Validates model performance with comprehensive metrics
- Scores new applications with risk grades and recommendations
- Provides a baseline for SAS-to-Python migration efforts
Kaggle Competition: Give Me Some Credit
Dataset Location: data/cs-training.csv (150,000 records), data/cs-test.csv (101,503 records)
Target Variable: SeriousDlqin2yrs - Binary indicator of serious delinquency in next 2 years
| Feature | Description |
|---|---|
| RevolvingUtilizationOfUnsecuredLines | Credit utilization ratio |
| age | Age of borrower |
| NumberOfTime30-59DaysPastDueNotWorse | Number of 30-59 day late payments |
| DebtRatio | Monthly debt payments / monthly income |
| MonthlyIncome | Monthly income |
| NumberOfOpenCreditLinesAndLoans | Number of open credit lines |
| NumberOfTimes90DaysLate | Number of serious delinquencies (90+ days) |
| NumberRealEstateLoansOrLines | Number of real estate loans |
| NumberOfTime60-89DaysPastDueNotWorse | Number of 60-89 day late payments |
| NumberOfDependents | Number of dependents |
├── data/
│ ├── cs-training.csv # Kaggle training data (150K records)
│ └── cs-test.csv # Kaggle test data (101K records)
├── feature_engineering.sas # Data loading, cleaning, and feature engineering
├── train.sas # Model training and validation scoring
├── metrics_calculation.sas # Comprehensive model validation
├── predict.sas # Score test set predictions
├── MODEL_CARD.md # Detailed model documentation
└── README.md # This file
- SAS 9.4 or later
- Write access to: /home/u64352077/output/
- Data files: cs-training.csv and cs-test.csv in the data/ directory
Run the scripts in order:
/* Step 1: Feature Engineering */
%include "feature_engineering.sas";
/* Step 2: Train Model */
%include "train.sas";
/* Step 3: Calculate Validation Metrics */
%include "metrics_calculation.sas";
/* Step 4: Score Test Set (optional) */
%include "predict.sas";Purpose: Loads raw Kaggle data, handles missing values, and creates predictive features
Key Operations:
- Imports cs-training.csv from the data/ directory
- Handles missing values (see the sketch below):
- MonthlyIncome: Imputed with median by age group
- NumberOfDependents: Imputed with 0
- Creates 40+ engineered features:
- Late payment aggregations
- Delinquency indicators
- Risk flags (high utilization, high debt, etc.)
- Financial health score
- Log transformations
- Interaction features
- Splits data: 70% training / 30% validation
- Exports to output directory
Inputs:
data/cs-training.csv
Outputs:
- output/model_features_train.csv (105,000 records)
- output/model_features_validation.csv (45,000 records)
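The cleaning and feature steps above might look roughly like the following (a minimal sketch, not the full 40+ features; it assumes PROC IMPORT converts the hyphenated Kaggle column names to underscores, and the derived variable names are illustrative):

```sas
/* Minimal sketch of the imputation and feature-engineering steps */
proc import datafile="&data_path./data/cs-training.csv"
    out=work.raw dbms=csv replace;
    guessingrows=max;
run;

/* Impute MonthlyIncome with the median of its 10-year age group,
   and NumberOfDependents with 0 */
proc sql;
    create table work.clean as
    select a.*,
           coalesce(a.MonthlyIncome, b.median_income) as MonthlyIncome_imp,
           coalesce(a.NumberOfDependents, 0)          as Dependents_imp
    from work.raw a
    left join (select floor(age/10) as age_grp,
                      median(MonthlyIncome) as median_income
               from work.raw
               group by floor(age/10)) b
        on floor(a.age/10) = b.age_grp;
quit;

/* A few of the engineered features */
data work.features;
    set work.clean;
    total_late  = NumberOfTime30_59DaysPastDueNotWorse
                + NumberOfTime60_89DaysPastDueNotWorse
                + NumberOfTimes90DaysLate;                          /* late-payment aggregation */
    high_util   = (RevolvingUtilizationOfUnsecuredLines > 0.75);    /* risk flag */
    log_income  = log(MonthlyIncome_imp + 1);                       /* log transformation */
    util_x_debt = RevolvingUtilizationOfUnsecuredLines * DebtRatio; /* interaction feature */
run;
```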
Purpose (train.sas): Trains the logistic regression model and generates risk scores
Key Operations:
- Trains logistic regression with stepwise selection (entry and stay significance levels of 0.05; see the sketch below)
- Scores both training and validation sets
- Generates confusion matrices
- Creates risk scores (300-850 scale, FICO-like)
- Assigns risk grades (A-F)
- Provides lending recommendations (Approve/Review/Decline)
- Calculates risk-based interest rates
- Computes ROC curves and AUC
Inputs:
- output/model_features_train.csv
- output/model_features_validation.csv
Outputs:
- output/risk_scores_train.csv - Scored training set with risk grades
- output/risk_scores_validation.csv - Scored validation set
- output/final_model_output.csv - Final predictions summary
- output/model_summary.csv - AUC and model metadata
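A minimal sketch of the stepwise fit and scoring (the actual candidate variable list lives in train.sas; the feature names here are illustrative):

```sas
/* Minimal sketch of the stepwise logistic regression and scoring */
proc logistic data=work.train_features outmodel=work.logit_model;
    model SeriousDlqin2yrs(event='1') =
          age log_income DebtRatio RevolvingUtilizationOfUnsecuredLines
          total_late high_util
          / selection=stepwise slentry=0.05 slstay=0.05;
    output out=work.scored_train p=pred_prob;              /* training-set probabilities */
    score data=work.valid_features out=work.scored_valid;  /* validation-set scores */
run;
```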
Purpose (metrics_calculation.sas): Comprehensive model validation and performance analysis
Key Metrics:
- ROC Curve and AUC: Discrimination ability
- Gini Coefficient: Model power (2 × AUC - 1)
- KS Statistic: Maximum separation between good/bad distributions
- Confusion Matrix: At multiple thresholds (0.3, 0.4, 0.5, 0.6, 0.7)
- Accuracy, Precision, Recall, F1-Score, Specificity
- Decile Analysis: Lift and capture rates by risk decile
- Population Stability Index (PSI): Distribution stability (train vs validation)
- Calibration Plot: Predicted vs actual default rates
Inputs:
- output/risk_scores_train.csv
- output/risk_scores_validation.csv
Outputs:
- output/validation_summary.csv - Overall model performance
- output/decile_analysis.csv - Lift by decile
- output/threshold_analysis.csv - Metrics at different thresholds
- output/ks_statistic.csv - KS statistic
- output/calibration_plot.csv - Calibration data
- output/model_performance_metrics.csv - Comprehensive metrics
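Two of these checks, the KS statistic and the decile analysis, can be sketched as follows (the data set and variable names, such as work.scored_valid and pred_prob, are illustrative):

```sas
/* Minimal sketch: two-sample KS statistic on the predicted probabilities */
proc npar1way data=work.scored_valid edf noprint;
    class SeriousDlqin2yrs;
    var pred_prob;
    output out=work.ks_stat edf;   /* contains the two-sample KS statistics */
run;

/* Minimal sketch: decile analysis - observed default rate by predicted-risk decile */
proc rank data=work.scored_valid groups=10 descending out=work.deciled;
    var pred_prob;
    ranks risk_decile;             /* 0 = highest predicted risk */
run;

proc means data=work.deciled noprint;
    class risk_decile;
    var SeriousDlqin2yrs;
    output out=work.decile_summary mean=default_rate sum=defaults n=accounts;
run;
```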
Purpose (predict.sas): Scores the Kaggle test set for competition submission
Key Operations:
- Loads cs-test.csv
- Applies same feature engineering as training
- Handles missing values consistently
- Scores using trained model
- Generates risk scores and grades
- Creates Kaggle submission file
Inputs:
- data/cs-test.csv
- work.logit_model (from train.sas)
Outputs:
- output/kaggle_submission.csv - Competition submission format (row_id, probability)
- output/test_predictions.csv - Full predictions with risk grades
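Scoring with the stored model might look like this (a minimal sketch; the submission id column name depends on how cs-test.csv is imported):

```sas
/* Minimal sketch: score the engineered test features with the stored model */
proc logistic inmodel=work.logit_model;
    score data=work.test_features out=work.test_scored;
run;

/* Kaggle submission: row identifier plus P_1, the predicted probability
   that SeriousDlqin2yrs = 1 (the Id variable name depends on the import) */
data work.submission;
    set work.test_scored;
    keep Id P_1;
run;
```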
The model converts default probabilities to credit risk scores and grades:
| Risk Grade | Score Range | Default Probability | Interest Rate | Recommendation |
|---|---|---|---|---|
| A | 750+ | < 10% | 5.0% | Approve |
| B | 700-749 | 10-15% | 7.0% | Approve |
| C | 650-699 | 15-25% | 9.0% | Review |
| D | 600-649 | 25-35% | 12.0% | Decline |
| E | 550-599 | 35-50% | 15.0% | Decline |
| F | < 550 | > 50% | 20.0% | Decline |
Score Formula: 600 + 250 × (1 - predicted_probability)
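A minimal sketch of this mapping, using the formula and grade cutoffs documented above (pred_prob is an illustrative variable name):

```sas
/* Minimal sketch: convert predicted probability to risk score and grade */
data work.graded;
    set work.scored_valid;
    length risk_grade $1;
    risk_score = 600 + 250 * (1 - pred_prob);
    if      risk_score >= 750 then risk_grade = 'A';
    else if risk_score >= 700 then risk_grade = 'B';
    else if risk_score >= 650 then risk_grade = 'C';
    else if risk_score >= 600 then risk_grade = 'D';
    else if risk_score >= 550 then risk_grade = 'E';
    else                           risk_grade = 'F';
run;
```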
Key model features:
- Credit utilization (capped at 200%)
- Debt-to-income ratio (capped at 500%)
- Monthly income (log-transformed)
- Age
- 30-59 day late payments
- 60-89 day late payments
- 90+ day serious delinquencies
- Total late payments
- Delinquency flags
- High credit utilization (> 75%)
- Very high utilization (> 100%)
- High debt ratio (> 43%)
- Serious delinquency (90+ days late)
- Multiple delinquencies (≥ 3)
- Low income (< $2,500/month)
- Many dependents (≥ 3)
- Few credit lines (≤ 2)
- Financial health score (0-100)
- Total risk flags count
- Age × log(income)
- Credit utilization × debt ratio
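The risk flags above are simple indicator variables; a minimal sketch of how they might be built and totaled (variable names are illustrative, and the financial health score formula lives in feature_engineering.sas, so it is omitted here):

```sas
/* Minimal sketch: risk flags and their total count */
data work.flagged;
    set work.features;
    high_util_flag   = (RevolvingUtilizationOfUnsecuredLines > 0.75);
    very_high_util   = (RevolvingUtilizationOfUnsecuredLines > 1.00);
    high_debt_flag   = (DebtRatio > 0.43);
    serious_dlq_flag = (NumberOfTimes90DaysLate > 0);
    multi_dlq_flag   = (total_late >= 3);
    low_income_flag  = (MonthlyIncome_imp < 2500);
    many_deps_flag   = (Dependents_imp >= 3);
    few_lines_flag   = (NumberOfOpenCreditLinesAndLoans <= 2);
    total_risk_flags = sum(high_util_flag, very_high_util, high_debt_flag,
                           serious_dlq_flag, multi_dlq_flag, low_income_flag,
                           many_deps_flag, few_lines_flag);
run;
```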
All output files are written to the output/ directory:

From feature_engineering.sas:
- model_features_train.csv - Training features with all engineered variables
- model_features_validation.csv - Validation features

From train.sas:
- risk_scores_train.csv - Training set with predictions, scores, grades
- risk_scores_validation.csv - Validation set with predictions, scores, grades
- final_model_output.csv - Summary of key predictions
- model_summary.csv - Model metadata and AUC

From metrics_calculation.sas:
- validation_summary.csv - AUC, Gini, KS, PSI, model status
- decile_analysis.csv - Performance by risk decile
- threshold_analysis.csv - Metrics at 0.3, 0.4, 0.5, 0.6, 0.7 thresholds
- ks_statistic.csv - KS statistic and cutoff
- calibration_plot.csv - Predicted vs actual by bin
- model_performance_metrics.csv - Comprehensive metrics file

From predict.sas:
- kaggle_submission.csv - Kaggle format (row_id, probability)
- test_predictions.csv - Full predictions with risk grades
All scripts use these path variables (update if needed):
%let data_path = /home/u64352077;
%let output_path = /home/u64352077/outputs;

Troubleshooting:
- Input data not found: Create the data/ directory and place the cs-training.csv and cs-test.csv files in it
- Output files cannot be written: Update %let output_path in all scripts to a valid directory with write permissions
- Missing values in the raw data: feature_engineering.sas imputes them automatically, using the age-group median for MonthlyIncome and 0 for NumberOfDependents
- predict.sas cannot find work.logit_model: Run train.sas and predict.sas in the same SAS session, or store the model in a permanent library (see the sketch after this list)
- Results differ between runs: The random seed is fixed (seed=42), so the same seed always produces the same 70/30 train/validation split (see the sketch after this list)
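For the last two items, a minimal sketch (the path and data set names are illustrative, and the actual scripts may implement these steps differently):

```sas
/* Permanent library for model storage: point OUTMODEL= in train.sas and
   INMODEL= in predict.sas at models.logit_model instead of work.logit_model
   (illustrative path) */
libname models "/home/u64352077/output";

/* Reproducible 70/30 split: the same seed always selects the same rows */
proc surveyselect data=work.features out=work.split samprate=0.7
                  seed=42 outall method=srs;
run;

data work.train_features work.valid_features;
    set work.split;
    if Selected then output work.train_features;
    else             output work.valid_features;
run;
```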
See MODEL_CARD.md for comprehensive model documentation including:
- Model architecture and training details
- Performance metrics and benchmarks
- Feature descriptions
- Risk scoring methodology
Last Updated: 2025
Version: 1.0