Deep learning-based vascular biometric identification system using metric learning and OpenSet recognition. The system can identify known subjects while rejecting unknown subjects using dorsal hand vein or finger vein patterns.
Key features:

- OpenSet Recognition: Distinguish between known and unknown subjects
- Metric Learning: Triplet/Contrastive loss for discriminative embeddings
- Subject-Disjoint Splits: Proper evaluation with non-overlapping subjects
- Session-Based Protocol: Separate enrollment and testing samples
- Comprehensive Metrics: CMC curves, OSCR, EER, AUROC, TPR@FPR
- k-NN Support: Configurable k-nearest neighbor decision making
- Automatic Threshold Optimization: Find optimal threshold for OpenSet detection
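The open-set decision behind the features above (k-NN over enrolled prototypes plus a rejection threshold) can be sketched as follows; the function and argument names here are illustrative, not the project's actual API:

```python
import numpy as np

def openset_predict(query, prototypes, labels, threshold=0.9, k=1):
    """Assign a known identity to a query embedding, or reject it as unknown.

    query: (D,) L2-normalized embedding; prototypes: (N, D) L2-normalized
    enrolled embeddings; labels: (N,) identity per prototype row.
    """
    sims = prototypes @ query            # cosine similarity to each prototype
    top = np.argsort(-sims)[:k]          # k nearest prototypes
    if sims[top[0]] < threshold:
        return None                      # best match below threshold: unknown subject
    # Majority vote among the k nearest neighbors
    vals, counts = np.unique(labels[top], return_counts=True)
    return vals[np.argmax(counts)]
```

With `k=1` this reduces to thresholded nearest-neighbor matching; larger `k` trades some rank-1 sharpness for robustness to outlier prototypes.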
Requirements:

- Python 3.11+
- CUDA-capable GPU (recommended)
- UV package manager
Installation:

```bash
# Clone the repository
git clone https://github.com/Solvro/ml-vascular-identification.git
cd ml-vascular-identification

# Install dependencies using UV
uv sync

# Or using pip
pip install -e .
```

Project structure:

```text
ml-vascular-identification/
├── src/
│   ├── train.py              # Main training script
│   ├── utils.py              # Utilities (plotting, protocol saving)
│   ├── data/                 # Data loading and preprocessing
│   │   ├── mmcbnu.py         # MMCBNU dataset
│   │   ├── dorsal.py         # Dorsal hand vein dataset
│   │   ├── data_loaders.py   # OpenSet data loaders
│   │   ├── splits.py         # Subject-disjoint splitting
│   │   └── transforms.py     # Image augmentations
│   └── models/               # Model architectures and losses
│       ├── base.py           # Base classes and factories
│       ├── metrics.py        # Evaluation metrics
│       ├── losses.py         # Triplet/Contrastive losses
│       └── basic/
│           └── basic_cnn.py  # CNN architecture
├── config/                   # Hydra configuration files
│   ├── train.yaml            # Main config
│   ├── data/                 # Dataset configs
│   ├── model/                # Model configs
│   └── trainer/              # Training configs
├── tests/                    # Unit tests
├── examples/                 # Usage examples
└── data/                     # Dataset storage (not in repo)
    └── mmcbnu/               # MMCBNU dataset
```
Train a model on the MMCBNU dataset with default settings:

```bash
uv run src/train.py data=mmcbnu
```

Train with custom parameters:

```bash
uv run src/train.py \
    data=mmcbnu \
    model.embedding_dim=512 \
    model.loss.margin=0.5 \
    loader.sampler.P=16 \
    loader.sampler.K=4 \
    trainer.epochs=100 \
    k_neighbors=1 \
    threshold_metric=oscr
```

Skip training and evaluate an existing model:

```bash
uv run src/train.py data=mmcbnu trainer.epochs=0
```

The project uses Hydra for configuration management. Key parameters:
```yaml
# config/data/mmcbnu.yaml
name: mmcbnu
mode: openset
known_ratio: 0.7        # 70% of patients as known classes
val_ratio: 0.15         # 15% of known for validation
subject_disjoint: true  # Enforce subject-disjoint splits
enrollment_samples: 7   # Samples for building prototypes
test_samples: 3         # Samples for testing
```

```yaml
# config/model/basic.yaml
name: simple_cnn
embedding_dim: 384    # Embedding dimensionality
dropout: 0.2          # Dropout rate
use_attention: false  # Use attention mechanism
loss:
  name: triplet
  margin: 0.5         # Triplet loss margin
optimizer:
  lr: 0.0001            # Learning rate
  weight_decay: 0.0005  # L2 regularization
```

```yaml
# config/trainer/default.yaml
epochs: 100
log_interval: 10
early_stopping:
  enabled: true
  patience: 15       # Stop after 15 epochs without improvement
  min_delta: 0.0001  # Minimum change to qualify as improvement
```

```yaml
# config/loader/default.yaml
batch_size: 32
num_workers: 4
sampler:
  P: 16  # Classes per batch
  K: 4   # Samples per class
# Effective batch size = P × K = 64
```

Create OpenSet data loaders and train programmatically:

```python
from data import create_openset_data_loaders
from models import create_model, create_loss
import torch

# Create OpenSet data loaders
loaders, info = create_openset_data_loaders(
    dataset_name='mmcbnu',
    img_size=224,
    known_ratio=0.7,
    val_ratio=0.15,
    P=16, K=4,
    seed=42
)

# Access loaders
train_loader = loaders['train']
val_loader = loaders['val_known']
enrollment_loader = loaders['test_known_enrollment']  # For prototypes
query_loader = loaders['test_known_query']            # For testing
unknown_loader = loaders['test_unknown']

# Create model
model = create_model('simple_cnn', embedding_dim=256)

# Create loss
criterion = create_loss('triplet', margin=0.3)

# Training loop
for epoch in range(epochs):
    for images, labels, metadata in train_loader:
        embeddings = model(images)
        loss = criterion(embeddings, labels)
        # ... backward pass
```

Evaluate a trained model:

```python
from train import compute_prototypes, evaluate_openset

# 1. Compute prototypes from enrollment samples
prototypes, class_to_idx = compute_prototypes(
    model,
    enrollment_loader,
    num_classes=420,
    embedding_dim=256,
    device='cuda'
)

# 2. Evaluate on known and unknown samples
metrics = evaluate_openset(
    model,
    query_loader,    # Known class queries
    unknown_loader,  # Unknown class samples
    prototypes,
    device='cuda',
    class_to_idx=class_to_idx,
    threshold=0.9,   # OpenSet threshold
    k=1              # k-NN (1 for nearest neighbor)
)

# 3. Check results
print(f"CMC Rank-1: {metrics['cmc_rank1']:.2%}")
print(f"OSCR: {metrics['oscr']:.2%}")
print(f"EER: {metrics['eer']:.2f}%")
print(f"AUROC: {metrics['auroc']:.4f}")
```

Reported metrics:

- CMC Rank-1/5/10: Cumulative Match Characteristic, the probability that the correct match appears in the top-k candidates
- Known Accuracy: Percentage of known samples correctly identified
- OSCR: Open-Set Classification Rate, correct classification rate versus false positive rate
- AUROC: Area Under the ROC curve, discrimination between known and unknown
- EER: Equal Error Rate, the operating point where FAR equals FRR
- TPR@FPR: True Positive Rate at a fixed False Positive Rate (0.1%, 1%, 10%)
- Unknown Rejection: Percentage of unknown samples correctly rejected
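EER and AUROC as defined above can be computed from similarity scores; here is a minimal sketch assuming NumPy and scikit-learn are available (this is not necessarily how the project's own `metrics.py` implements them):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def eer_and_auroc(known_scores, unknown_scores):
    """Known-vs-unknown separation metrics from similarity scores.

    Higher score = more confident the sample belongs to a known subject.
    """
    y_true = np.concatenate([np.ones(len(known_scores)), np.zeros(len(unknown_scores))])
    y_score = np.concatenate([known_scores, unknown_scores])
    auroc = roc_auc_score(y_true, y_score)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    # EER: the operating point where FPR equals FNR (1 - TPR)
    i = np.nanargmin(np.abs(fpr - (1 - tpr)))
    eer = (fpr[i] + (1 - tpr[i])) / 2
    return eer, auroc
```

Since the ROC curve is sampled at discrete thresholds, the EER is taken as the average of FPR and FNR at the closest crossing point.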
Expected MMCBNU dataset layout:

```text
data/mmcbnu/
├── Captured images/
│   ├── 001/                # Patient 001
│   │   ├── L_index_01.bmp  # Left index finger, sample 1
│   │   ├── L_index_02.bmp
│   │   └── ...
│   └── 100/                # Patient 100
└── ROIs/                   # Region of Interest crops (alternative)
    └── ...
```
Each patient has 6 fingers × 10 samples = 60 images.
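The file naming above encodes hand, finger, and sample index; a small parsing helper (hypothetical, not part of the repo) could look like:

```python
from pathlib import Path

def parse_mmcbnu_filename(path):
    """Split e.g. 'L_index_01.bmp' into hand, finger, and sample index."""
    hand, finger, sample = Path(path).stem.split("_")
    return {"hand": hand, "finger": finger, "sample": int(sample)}
```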
Expected dorsal hand vein dataset layout:

```text
data/dorsal/
├── patient_001/
│   ├── sample_01.png
│   ├── sample_02.png
│   └── ...
└── patient_100/
```
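For both datasets, subjects (not individual samples) are split into known and unknown identities; the idea can be sketched as follows (illustrative only, the actual logic lives in `src/data/splits.py`):

```python
import random

def subject_disjoint_split(subject_ids, known_ratio=0.7, seed=42):
    """Split subjects, not samples, into known and unknown identity sets."""
    rng = random.Random(seed)
    subjects = sorted(set(subject_ids))
    rng.shuffle(subjects)
    n_known = int(len(subjects) * known_ratio)
    # Every sample of a given subject ends up on exactly one side
    return set(subjects[:n_known]), set(subjects[n_known:])
```

Splitting at the subject level is what makes the open-set evaluation honest: no identity seen during training ever appears among the "unknown" test subjects.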
After training, the following files are generated:
```text
best_model_mmcbnu.pt             # Best model checkpoint
best_model_mmcbnu_prototypes.pt  # Prototypes and evaluation metrics
evaluation_protocol_mmcbnu.json  # Detailed evaluation results
det_curve_mmcbnu.png             # Detection Error Tradeoff curve
roc_curve_mmcbnu.png             # ROC curve
```

Example `evaluation_protocol_mmcbnu.json`:

```json
{
  "timestamp": "2025-11-03T12:00:00",
  "dataset": {
    "name": "mmcbnu",
    "known_classes": 420,
    "unknown_classes": 180
  },
  "metrics": {
    "cmc_rank1": 0.9929,
    "cmc_rank5": 0.9976,
    "oscr": 0.9853,
    "auroc": 0.9902,
    "eer": 3.40,
    "known_accuracy": 0.9579,
    "unknown_rejection": 0.9789
  }
}
```

Run the test suite:
```bash
# All tests
uv run pytest

# Specific test file
uv run pytest tests/test_train.py

# With coverage
uv run pytest --cov=src --cov-report=html
```

To add a new dataset:
- Create a dataset class in `src/data/your_dataset.py`:

  ```python
  from data.base import VascularDataset

  class YourDataset(VascularDataset):
      def scan_dataset(self):
          # Scan the data directory and return a list of sample dicts
          pass
  ```

- Add a configuration in `config/data/your_dataset.yaml`
- Register the dataset in `src/data/__init__.py`
To add a new model architecture:

- Create the model in `src/models/your_model/`:

  ```python
  from models.base import BaseEmbeddingModel

  class YourModel(BaseEmbeddingModel):
      def __init__(self, embedding_dim=256):
          super().__init__()
          # Define the architecture

      def forward(self, x):
          # Forward pass producing fixed-size embeddings
          return embeddings
  ```

- Register it in `src/models/__init__.py`:

  ```python
  MODEL_REGISTRY['your_model'] = YourModel
  ```

Training tips:

- Batch Size: Use P×K = 64 for good GPU utilization
- Learning Rate: Start with 1e-4, reduce if unstable
- Margin: 0.3-0.5 works well for triplet loss
- Embedding Dim: 256-512 is sufficient
- Early Stopping: Patience of 10-15 epochs prevents overfitting
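The P×K batching recommended above can be sketched with a toy sampler (a simplification; the project's configurable sampler lives under `loader.sampler`):

```python
import random
from collections import defaultdict

def pk_batches(labels, P=16, K=4, seed=42):
    """Yield batches of sample indices with P classes and K samples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Keep only classes with at least K samples available
    classes = [c for c, idxs in by_class.items() if len(idxs) >= K]
    rng.shuffle(classes)
    for start in range(0, len(classes) - P + 1, P):
        batch = []
        for c in classes[start:start + P]:
            batch.extend(rng.sample(by_class[c], K))
        yield batch
```

Guaranteeing K samples from each of P classes in every batch is what makes triplet mining possible: each anchor is certain to have both positives and negatives in its batch.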
Reference results on the MMCBNU dataset:

| Metric | Value |
|---|---|
| CMC Rank-1 | 99.29% |
| CMC Rank-5 | 99.76% |
| OSCR | 98.53% |
| AUROC | 99.02% |
| EER | 3.40% |
| Known Accuracy | 95.79% |
| Unknown Rejection | 97.89% |
Configuration: 384-dim embeddings, triplet loss (margin=0.5), P=16, K=4, 52 epochs
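The triplet objective used in this configuration can be illustrated with a NumPy batch-hard sketch (a simplification of the actual loss in `src/models/losses.py`):

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.5):
    """Batch-hard triplet loss: hardest positive vs. hardest negative per anchor."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                       # pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)    # farthest same-class sample
    hardest_neg = np.where(same, np.inf, dist).min(axis=1)     # closest other-class sample
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()
```

The loss is zero once every anchor's hardest negative is at least `margin` farther away than its hardest positive, which is why well-separated embeddings stop producing gradients.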
Troubleshooting:

Low recognition accuracy:

- Check that data augmentation is not too aggressive
- Verify prototypes are computed from enrollment samples only
- Ensure subject-disjoint splits are correct
- Try increasing the embedding dimension

Out-of-memory errors:

- Reduce batch size (decrease P or K)
- Use smaller images (e.g., 128×128 instead of 224×224)
- Enable gradient checkpointing
- Reduce the number of workers

Unstable training or NaN loss:

- Decrease the learning rate
- Add learning rate warmup
- Check for NaN values in the loss
- Verify data normalization