Multi-task regression for predicting polymer properties from SMILES strings using pretrained molecular transformers.
Predict 5 polymer properties from SMILES molecular representations:
- Tg: Glass Transition Temperature
- FFV: Fractional Free Volume
- Tc: Thermal Conductivity
- Density: Material density
- Rg: Radius of Gyration
neurips-polymer-prediction/
├── data/ # Symlink to competition data
├── notebooks/ # Jupyter notebooks for EDA and experiments
├── src/
│ ├── models/ # Model architectures
│ ├── features/ # Feature engineering & preprocessing
│ └── utils/ # Helper functions
├── configs/ # Configuration files
├── experiments/ # Experiment logs and checkpoints
├── submissions/ # Generated submission files
├── requirements.txt # Python dependencies
└── README.md # This file
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install requirements
pip install -r requirements.txt
# Create symlink to competition data
ln -s ../neurips-open-polymer-prediction-2025 data/raw

If pip fails to install RDKit, use conda:
conda install -c conda-forge rdkit

- Base Model: ChemBERTa-77M (pretrained on 77M SMILES from PubChem)
- Task: Multi-task regression with 5 property heads
- Loss: Weighted MSE, computed only over the targets available for each sample (missing labels are masked out)
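
The masked loss can be illustrated with a minimal sketch in plain Python. It is not the code in `src/`: the function name `masked_weighted_mse`, the weight values, and the choice to normalize by the summed weights of the observed entries are all assumptions made for illustration. In a PyTorch training loop the same masking would typically be applied to tensors instead of lists.

```python
import math

# Property order follows the list at the top of this README.
TARGETS = ["Tg", "FFV", "Tc", "Density", "Rg"]


def masked_weighted_mse(preds, targets, weights):
    """Weighted MSE over the labels that are actually present.

    preds, targets: lists of rows, one value per property in TARGETS order.
    A missing label is None or NaN and contributes nothing to the loss.
    weights: one weight per property (hypothetical values, not from this repo).
    """
    total = 0.0
    weight_sum = 0.0
    for pred_row, target_row in zip(preds, targets):
        for p, t, w in zip(pred_row, target_row, weights):
            if t is None or math.isnan(t):
                continue  # skip properties with no label for this sample
            total += w * (p - t) ** 2
            weight_sum += w
    # Assumption: normalize by the total weight of the observed entries.
    return total / weight_sum if weight_sum > 0 else 0.0


# Two samples: the first has only Tg labeled, the second has Tg and FFV.
preds = [[2.0, 9.0, 0.0, 0.0, 0.0], [3.0, 4.0, 0.0, 0.0, 0.0]]
targets = [[1.0, None, None, None, None], [3.0, 2.0, None, None, None]]
weights = [1.0, 2.0, 1.0, 1.0, 1.0]
loss = masked_weighted_mse(preds, targets, weights)
# (1*1 + 1*0 + 2*4) / (1 + 1 + 2) = 2.25
```

Predictions for unlabeled properties (such as the 9.0 for FFV in the first sample) have no effect on the result, which is the point of restricting the loss to available targets.
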
- SMILES tokenization with ChemBERTa tokenizer
- Multi-task learning with shared encoder
- 5-fold cross-validation
- Data augmentation via SMILES enumeration
- Ensemble with supplemental datasets
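
As a rough illustration of the 5-fold cross-validation item, the sketch below splits sample indices into five shuffled folds using only the standard library. The helper name `kfold_indices` and the seed are hypothetical and not taken from this repository; in practice a library splitter such as scikit-learn's `KFold` would likely be used instead.

```python
import random


def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by n_splits.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Each sample lands in exactly one validation fold.
folds = list(kfold_indices(12, n_splits=5))
val_sizes = [len(val) for _, val in folds]
```

Every sample is used for validation exactly once and for training in the other four folds, so the per-fold scores can be averaged into one estimate.
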
# Explore the data
jupyter notebook notebooks/01_eda.ipynb
# Train the baseline model
python src/train.py --config configs/chemberta_baseline.yaml
# Generate a submission
python src/predict.py --model experiments/best_model.pt --output submissions/