Note: The manuscript has been submitted and is under review.
Citation (preprint available):
R. K. Barman, S. R. Dhruba, D. T. Hoang, E. D. Shulman, E. M. Campagnolo, A. T. Wang, S. A. Harmon, T. C. Hu, A. Papanicolau-Sengos, M. P. Nasrallah, K. D. Aldape, E. Ruppin.
"Pathologist-interpretable breast cancer subtyping and stratification from AI-inferred nuclear features", 2025.
EXPAND is an open-source, interpretable AI pipeline designed to predict breast cancer (BC) subtypes and patient survival risk directly from H&E-stained whole-slide images (WSIs).
While many deep learning models achieve strong accuracy, they often lack interpretability and do not reflect how pathologists evaluate morphology. EXPAND bridges this gap by focusing on a compact set of biologically meaningful nuclear features, making the pipeline intuitive, reproducible, and clinically relevant.
Figure: The full pipeline for EXPAND
- Transparent: Uses 12 Nuclear Pathologist-Interpretable Features (NPIFs) derived from nuclei segmented with open-source tools.
- Robust: Using cross-validated logistic regression, achieves predictive performance comparable to or better than black-box DL models.
- Generalizable: Validated on CPTAC-BRCA and POST-NAT-BRCA cohorts in addition to TCGA-BRCA.
- Scalable: Requires only standard H&E slides and Hover-Net segmentation, making it deployable across cancer types and settings.
- Prognostic: NPIFs independently predict survival outcomes (OS, PFS), enabling clinically interpretable risk stratification.
- 12 NPIFs – compact, biologically interpretable nuclear features (area, perimeter, eccentricity, etc.) aligned with pathologist workflows.
- Subtype prediction – HER2+, HR+, and TNBC classifiers trained with logistic regression.
- External validation – tested on CPTAC-BRCA and POST-NAT-BRCA datasets.
- Survival modeling – multivariate Cox regression models per subtype with Kaplan–Meier analysis for OS and PFS.
- Workflow example – from WSI tiling through NPIF computation.
- All source code is included in this repository.
- A short user guide is provided below for quick setup. A full, step-by-step pipeline walkthrough is available in the detailed User Guide (PDF).
- The ML predictors were developed on macOS (Python) and tested on Linux (HPC environment). Scripts can be run interactively in a Python IDE or from the command line:
python script_name.py
- Please make sure to update the working directory and adjust all file/folder paths in each script to match your environment before running.
Developed with Python ≥ 3.10. Core dependencies:
numpy >= 1.24.4
pandas >= 2.0.3
matplotlib >= 3.7.2
seaborn >= 0.13.2
scikit-learn >= 1.3.0
joblib >= 1.3.0
opencv-python >= 4.10.0
torch >= 1.12.1
torchvision >= 0.13
Pillow >= 9.2.0
openslide-python >= 1.3.1
tqdm >= 4.65.0
pickle (Python standard library; not installed via pip)
lifelines >= 0.28.0
To install requirements:
pip install -r requirements.txt
This repository contains the complete EXPAND pipeline for tile generation, nuclear segmentation, NPIF computation, subtype prediction, external validation, and survival analysis. The steps are organized sequentially so users can reproduce the workflow end-to-end.
- Folder: Slide_preprocessing_codes
- Scripts:
  - 1_01_get_tiles_from_slide.py
  - 1_11_jobs_to_get_tiles.py
- Task: Generate 512×512 tiles from H&E WSIs at 20× magnification.
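As a rough illustration of the tiling step, the sketch below enumerates the top-left coordinates of non-overlapping 512×512 tiles for a slide of given pixel dimensions. The function name `tile_origins` and the decision to skip partial edge tiles are assumptions made for this example, not taken from `1_01_get_tiles_from_slide.py`. Reading pixels from the WSI and resampling to 20× are deliberately left out.

```python
from typing import Iterator, Tuple

TILE_SIZE = 512  # tile edge length in pixels, as stated in the task above


def tile_origins(width: int, height: int, tile: int = TILE_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (x, y) top-left corners of non-overlapping tiles, row by row.

    Assumption: partial tiles at the right and bottom edges are skipped.
    """
    for y in range(0, height - tile + 1, tile):
        for x in range(0, width - tile + 1, tile):
            yield (x, y)


if __name__ == "__main__":
    # Hypothetical slide dimensions, chosen only to show the grid layout.
    for origin in tile_origins(1300, 1100):
        print(origin)
```

Each coordinate pair could then be handed to a slide reader to extract the corresponding region.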
- Folder: NPIFs_generation_codes/TCGA_BRCA/Segmentation
- Scripts:
  - 2_01_22_ExtractMorphologicalFeaturesFromHnE.py/.ipynb
  - 2_01_100_01_JobSubmissionCode.py/.ipynb
- Task: Run Hover-Net to segment and classify nuclei per tile.
- Folder: NPIFs_generation_codes/TCGA_BRCA/Morphology_features_calculation
- Scripts:
  - 2_02_03_MorphologyCalculation_All_Slides.py/.ipynb
  - 2_02_13_Job_Submission_MorphologyCalculation_All_Slides.py/.ipynb
- Task: Compute per-nucleus morphology (area, perimeter, axis length, eccentricity, circularity).
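Two of the listed shape descriptors have standard closed-form definitions, sketched below: circularity as 4πA/P² and eccentricity as √(1 − (b/a)²) for an ellipse with major axis a and minor axis b. Whether the repository scripts use exactly these definitions is an assumption. The function names are hypothetical.

```python
import math


def circularity(area: float, perimeter: float) -> float:
    """4*pi*A / P^2: 1.0 for a perfect circle, smaller for irregular outlines."""
    if perimeter <= 0:
        raise ValueError("perimeter must be positive")
    return 4.0 * math.pi * area / (perimeter ** 2)


def eccentricity(major_axis: float, minor_axis: float) -> float:
    """sqrt(1 - (b/a)^2): 0.0 for a circle, approaching 1.0 as the shape elongates."""
    if major_axis <= 0 or minor_axis < 0 or minor_axis > major_axis:
        raise ValueError("expected 0 <= minor_axis <= major_axis and major_axis > 0")
    ratio = minor_axis / major_axis
    return math.sqrt(1.0 - ratio * ratio)


if __name__ == "__main__":
    r = 10.0  # radius of an idealized circular nucleus, in pixels
    print(circularity(math.pi * r ** 2, 2 * math.pi * r))
    print(eccentricity(2 * r, 2 * r))
```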
- Folder: NPIFs_generation_codes/TCGA_BRCA/NPIFs_Generation
- Scripts:
  - 2_03_01_01_NPIFs_Calculation_HoverNet_V0.py/.ipynb (all tiles)
  - 2_03_01_01_NPIFs_Calculation_HoverNet_V1.py/.ipynb (top 25% cancer-enriched tiles)
- Task: Compute 12 NPIFs per slide from Hover-Net outputs.
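One way to turn per-nucleus measurements into slide-level features is to summarize each measurement across all nuclei on the slide. The sketch below computes a mean and a standard deviation per feature. This README does not spell out the exact statistics behind the 12 NPIFs, so the aggregation scheme, the function name `summarize_slide`, and the feature keys are assumptions for illustration.

```python
import statistics
from typing import Dict, List


def summarize_slide(nuclei: List[Dict[str, float]]) -> Dict[str, float]:
    """Collapse per-nucleus measurements into mean/std summaries per feature."""
    if not nuclei:
        raise ValueError("no nuclei to summarize")
    summary: Dict[str, float] = {}
    for feature in nuclei[0]:
        values = [nucleus[feature] for nucleus in nuclei]
        summary[f"mean_{feature}"] = statistics.fmean(values)
        # Population std, so a slide with a single nucleus yields 0.0 rather than an error.
        summary[f"std_{feature}"] = statistics.pstdev(values)
    return summary


if __name__ == "__main__":
    # Hypothetical measurements for two nuclei.
    nuclei = [
        {"area": 10.0, "perimeter": 12.0},
        {"area": 14.0, "perimeter": 16.0},
    ]
    print(summarize_slide(nuclei))
```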
- Folder: NPIFs_generation_codes/TCGA_BRCA/NPIFs_Generation
- Scripts:
  - 3_01_01_02_Mapped_Original_Value_Hovernet_NPIFs_to_BRCA_Subtypes.py/.ipynb (all tiles)
  - 3_01_01_06_...Top25Q.py/.ipynb (top 25% tiles)
- Task: Merge NPIFs with HER2, ER, PR metadata.
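The merge step can be pictured as an inner join of slide-level features with receptor status on a shared slide identifier. The sketch below uses plain dictionaries. The labeling convention (HER2+ takes precedence, HR+ means ER+ or PR+ without HER2, TNBC means all three negative) is a common clinical shorthand and is assumed here, not confirmed by the repository. All names and keys are hypothetical.

```python
from typing import Dict


def assign_subtype(her2: bool, er: bool, pr: bool) -> str:
    """Assumed convention: HER2+ first, then HR+ (ER+ or PR+), otherwise TNBC."""
    if her2:
        return "HER2+"
    if er or pr:
        return "HR+"
    return "TNBC"


def merge_npifs_with_status(
    npifs: Dict[str, Dict[str, float]],
    status: Dict[str, Dict[str, bool]],
) -> Dict[str, dict]:
    """Inner-join slide-level NPIFs with receptor status on slide ID."""
    merged: Dict[str, dict] = {}
    for slide_id, features in npifs.items():
        if slide_id not in status:
            continue  # slides without metadata are dropped
        receptors = status[slide_id]
        row = dict(features)
        row["subtype"] = assign_subtype(receptors["HER2"], receptors["ER"], receptors["PR"])
        merged[slide_id] = row
    return merged


if __name__ == "__main__":
    npifs = {"S1": {"mean_area": 1.0}, "S2": {"mean_area": 2.0}}
    status = {"S1": {"HER2": False, "ER": True, "PR": False}}
    print(merge_npifs_with_status(npifs, status))
```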
- Folder: Subtypes_prediction_codes/TCGA_BRCA
- Scripts:
  - 4_01_04_103_04_101_...All_Tiles_Using_Lasso.py/.ipynb
  - 4_01_04_103_04_103_...Top25Q.py/.ipynb
- Task: Train logistic regression classifiers (L1 penalty) for HER2+, HR+, TNBC.
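A minimal sketch of an L1-penalized logistic regression evaluated with stratified cross-validation is shown below, using scikit-learn (listed among the dependencies). The feature scaling, the `liblinear` solver, the value of `C`, the number of folds, and the synthetic data are all assumptions for illustration. The repository scripts may be configured differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_l1_classifier(C: float = 1.0):
    """Standardize features, then fit a sparse (L1) logistic regression."""
    return make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=C),
    )


def cross_validated_auc(X, y, n_splits: int = 5, seed: int = 0):
    """Return the per-fold ROC AUC from stratified k-fold cross-validation."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(build_l1_classifier(), X, y, cv=cv, scoring="roc_auc")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 12))  # synthetic stand-in for 12 NPIFs per slide
    y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # one-vs-rest label
    print(cross_validated_auc(X, y).mean())
```

One such binary classifier would be trained per subtype (for example HER2+ versus the rest).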
- Folder: NPIFs_generation_codes/CPTAC_BRCA/Segmentation
- Segmentation:
  - 2_01_22_02_Test_CPTAC_Dataset_ExtractMorphologicalFeaturesFromHnE.py/.ipynb
  - 2_01_100_02_01_JobSubmissionCode.py/.ipynb
- Folder: NPIFs_generation_codes/CPTAC_BRCA/Morphology_features_calculation
- Morphology:
  - 2_02_03_02_CPTAC_MorphologyCalculation_All_Slides.py/.ipynb
  - 2_02_13_02_CPTAC_Job_Submission_MorphologyCalculation_All_Slides.py/.ipynb
- Folder: NPIFs_generation_codes/CPTAC_BRCA/NPIFs_Generation
- NPIFs: 2_03_02_05_CPTAC_BRCA_NPIFs_Calculation_HoverNetPrediction_Filtered_Tiles_Top25Q.py/.ipynb
- Folder: NPIFs_generation_codes/CPTAC_BRCA/NPIFs_Generation
- Mapping: 3_01_01_07_CPTAC_Mapped_Original_Value...Top25Q.py/.ipynb
- Folder: Subtypes_prediction_codes/CPTAC_BRCA
- External Prediction: 6_01_04_103_04_103_CPTAC_Prediction_Using_...Top25Q.py/.ipynb
- Folder: NPIFs_generation_codes/POST_NAT_BRCA/Segmentation
- Segmentation:
  - 2_01_22_02_Test_POST_NAT_Dataset_ExtractMorphologicalFeaturesFromHnE.py/.ipynb
  - 2_01_100_02_POST_NAT_JobSubmissionCode.py/.ipynb
- Folder: NPIFs_generation_codes/POST_NAT_BRCA/Morphology_features_calculation
- Morphology:
  - 2_02_03_02_POST_NAT_MorphologyCalculation_All_Slides.py
  - 2_02_13_02_POST_NAT_Job_Submission_MorphologyCalculation_All_Slides.py/.ipynb
- Folder: NPIFs_generation_codes/POST_NAT_BRCA/NPIFs_Generation
- NPIFs: 2_03_02_05_POST_NAT_BRCA_NPIFs_Calculation_HoverNetPrediction_Filtered_Tiles_Top25Q.py/.ipynb
- Folder: NPIFs_generation_codes/POST_NAT_BRCA/NPIFs_Generation
- Mapping: 3_01_01_07_POST_NAT_Mapped_Original_Value...Top25Q.py/.ipynb
- Folder: Subtypes_prediction_codes/POST_NAT_BRCA
- Subtype Prediction: 6_01_04_103_04_103_Lasso_POST_NAT_Prediction...Top25Q.py/.ipynb
- Folder: Survival_codes
- Mapping scripts:
  - 5_01_01_mapped_hovernet_npifs_to_tcga_survival.py/.ipynb
  - 5_01_02_mapped_pathai_hifs_to_tcga_survival.py/.ipynb
  - 5_01_03_mapped_pathai_nuhifs_to_tcga_survival.py/.ipynb
  - 5_01_04_mapped_pathai_pifs_to_tcga_survival.py/.ipynb
- Folder: Survival_codes
- Model scripts:
  - 6_01_01_all_npifs_OS_analysis_with_age_cv.py/.ipynb
  - 6_01_02_01_all_hifs_OS_analysis...py/.ipynb
  - 6_01_03_01_all_nuhifs_OS_analysis...py/.ipynb
  - 6_01_04_01_all_pifs_OS_analysis...py/.ipynb
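For readers unfamiliar with the Kaplan–Meier analysis mentioned for OS and PFS, the sketch below implements the product-limit estimator in pure Python. It is an illustration only: lifelines is listed among the dependencies and is the more likely tool for real analyses, and the function name `kaplan_meier` and the toy data are assumptions.

```python
from typing import List, Tuple


def kaplan_meier(durations: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return [(time, S(time))] at each distinct time with an observed event.

    events[i] is 1 if the event (e.g. death for OS) was observed, 0 if censored.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")
    at_risk = len(durations)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    for t in sorted(set(durations)):
        observed = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        leaving = sum(1 for d in durations if d == t)
        if observed:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored subjects both leave the risk set
    return curve


if __name__ == "__main__":
    # Toy follow-up times (arbitrary units); the second subject is censored.
    print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

Splitting patients into high- and low-risk groups by a model's risk score and estimating one curve per group gives the usual stratified survival plot.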
- Folder: PathAI_codes
- Scripts:
  - 1_01_01_mapped_tcga_biomarker_status_to_original_hifs_with_comments.py/.ipynb
  - 2_01_01_PathAI_Metadata_Original_nuHIFs_And_TCGA_BiomarkerStatus.py/.ipynb
  - 3_01_01_PathAI_Metadata_Original_PIFs_And_TCGA_BiomarkerStatus.py/.ipynb
  - 1_01_04_103_04_103_BRCA_Clinical_Subtype_..._All_PathAI_HIFs_...Classification.py/.ipynb
  - 2_01_04_103_04_103_BRCA_Clinical_Subtype_..._All_PathAI_nuHIFs_..._Classification.py/.ipynb
  - 3_01_04_103_04_103_BRCA_Clinical_Subtype_..._All_PathAI_PIFs_..._Classification.py/.ipynb
  - 3_01_04_103_04_103_01_BRCA_Clinical_Subtype_..._All_PathAI_NPIFs_..._Classification.py/.ipynb
- Folder: Direct_codes
- Scripts:
  - 1_01_get_tiles_from_slide.py
  - 1_02_get_features_from_tiles2.py
  - 1_03_collect_all_features_masks.py
  - 1_11_jobs_to_get_tiles.py
  - 1_12_jobs_to_get_features2.py
  - 1_13_jobs_to_collect_features2.py
  - 3_01_01_02_TCGA_BRCASubtypes_to_DirectHnE_Features_Resnet50.py/.ipynb
  - 3_01_04_103_04_103_02_BRCA_Clinical_Subtype_Prediction_Using_All_Direct_Features.py/.ipynb
- Task: Extract slide-level embeddings with ResNet50 and train subtype classifiers.
This module filters slides to keep only the top X% of cancer-enriched tiles (ranked by cancer nuclei counts).
NPIFs are then computed from these tiles, mapped to BRCA subtype status, and used for subtype prediction.
By focusing on the most informative tiles, this approach improves signal-to-noise ratio and enhances subtype-specific predictions.
- Folder: Top_tiles_selection_codes
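The selection rule described above (keep the top X% of tiles ranked by cancer nuclei count) can be sketched as follows. Rounding the cutoff up, breaking ties by tile ID, and always keeping at least one tile are assumptions for this example, as are the function name `top_tiles` and the tile IDs.

```python
import math
from typing import Dict, List


def top_tiles(cancer_counts: Dict[str, int], top_percent: float = 25.0) -> List[str]:
    """Return IDs of the top `top_percent`% of tiles ranked by cancer nuclei count."""
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be in (0, 100]")
    if not cancer_counts:
        return []
    # Sort by descending count; ties are broken by tile ID for reproducibility.
    ranked = sorted(cancer_counts, key=lambda tile: (-cancer_counts[tile], tile))
    keep = max(1, math.ceil(len(ranked) * top_percent / 100.0))
    return ranked[:keep]


if __name__ == "__main__":
    # Hypothetical per-tile cancer nuclei counts for one slide.
    counts = {"t1": 5, "t2": 40, "t3": 12, "t4": 0, "t5": 33, "t6": 7, "t7": 21, "t8": 2}
    print(top_tiles(counts))
```

NPIFs would then be computed only from nuclei in the returned tiles.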
In addition to cancer nuclei, the EXPAND pipeline also computes NPIFs from immune nuclei.
These immune-derived NPIFs are mapped to BRCA subtypes (HER2+, HR+, TNBC) and compared against cancer nuclei–based EXPAND models, to test whether immune morphology provides complementary predictive value.
- Immune morphology computation:
  - NPIFs_generation_codes/TCGA_BRCA/Immune_morphology_calculation
  - NPIFs_generation_codes/CPTAC_BRCA/Immune_morphology_calculation
  - NPIFs_generation_codes/POST_NAT_BRCA/Immune_morphology_calculation
- Immune NPIF generation:
  - NPIFs_generation_codes/TCGA_BRCA/Immune_NPIFs_Generation
  - NPIFs_generation_codes/CPTAC_BRCA/Immune_NPIFs_Generation
  - NPIFs_generation_codes/POST_NAT_BRCA/Immune_NPIFs_Generation
- Subtype prediction:
  - Subtypes_prediction_codes/Immune_subtypes_prediction
All results described in the manuscript can be reproduced using the scripts provided in this repository.
- Follow the step-by-step workflow in the User Guide (PDF) to replicate subtype classification, external validation, and survival analyses.
- All manuscript-related figures are available here: Figures/.
- All TCGA-BRCA subtype-specific models are available here: Models/.
- Ranjan Kumar Barman – [email protected]
- Saugato Rahman Dhruba – [email protected]
Cancer Data Science Lab, NCI, NIH
