
[WIP] 410 refactor of model initialisation ie weight loading model freezing transfer learning#442

Draft
JesperDramsch wants to merge 13 commits into main from
410-refactor-of-model-initialisation-ie-weight-loading-model-freezing-transfer-learning

Conversation

@JesperDramsch
Member

@JesperDramsch JesperDramsch commented Jul 29, 2025

Description

Implements instantiable model modifiers that can, for example, load weights or freeze components.
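To make the idea concrete, here is a minimal sketch of what such an instantiable modifier could look like. The class names (`ModelModifier`, `FreezeModifier`) and the `apply` method are illustrative assumptions, not the actual Anemoi API:

```python
# Hypothetical modifier interface; names are assumptions for illustration.
import torch.nn as nn


class ModelModifier:
    """Base class: a modifier transforms a model after instantiation."""

    def apply(self, model: nn.Module) -> nn.Module:
        raise NotImplementedError


class FreezeModifier(ModelModifier):
    """Freeze the parameters of the named submodules."""

    def __init__(self, submodule_names: list[str]) -> None:
        self.submodule_names = submodule_names

    def apply(self, model: nn.Module) -> nn.Module:
        for name in self.submodule_names:
            # get_submodule resolves a dotted path like "encoder.layer1".
            for param in model.get_submodule(name).parameters():
                param.requires_grad = False
        return model
```

Because each modifier is a self-contained object, a list of them can be instantiated from config and applied in sequence, replacing the nested if statements.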

What problem does this change solve?

It implements a modular, extensible system for model initialisation that replaces a stack of nested if statements and enables extension in anticipation of #248.

What issue or task does this change relate to?

Closes #410
Prepares changes for #248

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.


📚 Documentation preview 📚: https://anemoi-training--442.org.readthedocs.build/en/442/


📚 Documentation preview 📚: https://anemoi-graphs--442.org.readthedocs.build/en/442/


📚 Documentation preview 📚: https://anemoi-models--442.org.readthedocs.build/en/442/

@JesperDramsch JesperDramsch added this to the Fine-Tuning milestone Jul 29, 2025
@JesperDramsch JesperDramsch self-assigned this Jul 29, 2025
@github-project-automation github-project-automation bot moved this to Now In Progress in Anemoi-dev Jul 29, 2025
@github-actions github-actions bot added training enhancement New feature or request labels Jul 29, 2025
@JesperDramsch
Member Author

I just saw that in the meantime, there was this change, which I will have to address in a future commit:

        model.data_indices = self.data_indices
        # check data indices in original checkpoint and current data indices are the same
        self.data_indices.compare_variables(model._ckpt_model_name_to_index, self.data_indices.name_to_index)

@mchantry mchantry added the ATS Approval Needed Approval needed by ATS label Jul 30, 2025
@JPXKQX
Member

JPXKQX commented Jul 31, 2025

Thanks Jesper, I think this refactoring makes a lot of sense. Would it make sense to have another modifier "ResumeRun..." to which we could bring all the logic from run_id, fork_run_id, load_only_weights and warm_start?

@JesperDramsch
Member Author

> Thanks Jesper, I think this refactoring makes a lot of sense. Would it make sense to have another modifier "ResumeRun..." to which we could bring all the logic from run_id, fork_run_id, load_only_weights and warm_start?

Possibly. I believe my original design around fork_run_id ended up confusing most people, so we could take a look at whether this design could be used to fix that.

Add extensible checkpoint loading system that separates checkpoint
source handling from model weight loading strategies.

Changes:
- Add CheckpointLoaderRegistry for extensible source loading
  (local, S3, HTTP, GCS, Azure support)
- Add ModelLoaderRegistry for weight loading strategies
  (standard, weights_only, transfer_learning)
- Implement registry pattern for future extensibility

This infrastructure enables:
- Remote checkpoint loading from cloud storage
- Modular loading strategies
- Clean separation of concerns
- Foundation for advanced features (quantization, PEFT)

Depends on: #458 branch infrastructure
Related: #422, #410
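The registry pattern described above can be sketched in a few lines. The class name `CheckpointLoaderRegistry` comes from the commit message, but the decorator-based registration API and the scheme keys shown here are assumptions about how such a registry might work, not the actual implementation:

```python
# Illustrative registry sketch; registration API and scheme keys are assumed.
from typing import Callable, Dict


class CheckpointLoaderRegistry:
    """Map a checkpoint source scheme (local, s3, http, ...) to a loader."""

    _loaders: Dict[str, Callable[[str], str]] = {}

    @classmethod
    def register(cls, scheme: str) -> Callable:
        def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
            cls._loaders[scheme] = fn
            return fn
        return decorator

    @classmethod
    def load(cls, uri: str) -> str:
        # Plain paths without a scheme fall back to the local loader.
        scheme = uri.split("://", 1)[0] if "://" in uri else "local"
        if scheme not in cls._loaders:
            raise KeyError(f"No checkpoint loader registered for '{scheme}'")
        return cls._loaders[scheme](uri)


@CheckpointLoaderRegistry.register("local")
def load_local(uri: str) -> str:
    # The real loader would return a local path or open file handle.
    return uri


@CheckpointLoaderRegistry.register("s3")
def load_s3(uri: str) -> str:
    # Placeholder: a real implementation would download to a temp file.
    return "/tmp/downloaded-" + uri.rsplit("/", 1)[-1]
```

New backends (GCS, Azure, ...) then only need to register a loader function; nothing in the training pipeline changes.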

Replace WeightsInitModelModifier with new checkpoint loading architecture
and integrate checkpoint loading into training pipeline.

Changes:
- Remove WeightsInitModelModifier (functionality moved to checkpoint_loading)
- Update TransferLearningModelModifier to use new model_loading system
- Add configurable strict and skip_mismatched parameters
- Integrate checkpoint loading in training pipeline before model modifiers
- Add _load_checkpoint_if_configured method to trainer

Benefits:
- Clean separation: checkpoint loading vs model transformation
- Better parameter control for transfer learning
- DRY principle: single checkpoint loading implementation
- Extensible: prepare for quantization, PEFT features

Related: #422, #410
Depends: checkpoint loading infrastructure
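The `strict` and `skip_mismatched` parameters mentioned above can be understood with a small sketch. The function name is hypothetical; only the parameter semantics (tolerate missing keys, drop shape mismatches for transfer learning) follow the commit message:

```python
# Hedged sketch of a configurable weight-loading strategy; not the real API.
import torch
import torch.nn as nn


def load_model_weights(
    model: nn.Module,
    ckpt_state: dict,
    strict: bool = True,
    skip_mismatched: bool = False,
) -> nn.Module:
    """Load a checkpoint state dict, optionally dropping shape mismatches."""
    if skip_mismatched:
        own = model.state_dict()
        ckpt_state = {
            k: v for k, v in ckpt_state.items()
            if k in own and own[k].shape == v.shape
        }
        strict = False  # dropped keys would otherwise fail strict loading
    model.load_state_dict(ckpt_state, strict=strict)
    return model
```

For transfer learning this lets a checkpoint trained with, say, a different input width still initialise every compatible tensor.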

Introduce new configuration schema and templates for the checkpoint
loading system to replace legacy WeightsInitModelModifier configs.

Changes:
- Add checkpoint_loading field to training schema with Pydantic validation
- Create checkpoint_loading config directory with templates:
  * weights_only.yml - Load only model weights
  * transfer_learning.yml - Load with size mismatch handling
  * standard.yml - Full Lightning checkpoint loading
- Update transfer_learning.yml with new parameters (strict, skip_mismatched)
- Add enhanced_fine_tuning.yml example combining transfer learning + freezing

Benefits:
- Schema validation for checkpoint loading configurations
- Pre-built templates for common use cases
- Flexible parameter configuration per loader type
- Clear separation from model modifier configs

Related: #422, #410
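The kind of validation the schema provides can be sketched as follows. The real schema uses Pydantic; this dataclass stand-in, the field names, and the loader-name set are assumptions chosen to mirror the templates listed above:

```python
# Dataclass stand-in for the Pydantic schema; field names are assumed.
from dataclasses import dataclass

VALID_LOADERS = {"standard", "weights_only", "transfer_learning"}


@dataclass
class CheckpointLoadingConfig:
    path: str
    loader: str = "standard"
    strict: bool = True
    skip_mismatched: bool = False

    def __post_init__(self) -> None:
        if self.loader not in VALID_LOADERS:
            raise ValueError(
                f"unknown loader '{self.loader}', "
                f"expected one of {sorted(VALID_LOADERS)}"
            )
        if self.loader == "transfer_learning":
            # Transfer learning typically tolerates missing/extra keys.
            self.strict = False
```

Validation at config-parse time surfaces a bad loader name immediately instead of failing deep inside checkpoint loading.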

Refactor existing tests for new architecture and add comprehensive
test coverage for the checkpoint loading system.

Changes:
- Remove WeightsInitModelModifier tests (functionality moved)
- Update TransferLearningModelModifier tests for new #458 integration
- Add checkpoint loading integration tests for training pipeline
- Add comprehensive test suites from #458 branch:
  * test_checkpoint_loaders.py - Source loading tests (S3, HTTP, etc.)
  * test_model_loading.py - Weight loading strategy tests
- Update integration tests for new model modifier workflow
- Add configuration validation and error handling tests

Coverage:
- All loader types (weights_only, transfer_learning, standard)
- Remote checkpoint sources (S3, HTTP, GCS, Azure)
- Training pipeline integration
- Configuration validation and error scenarios
- Model modifier compatibility with new system

Related: #422, #410

- Update documentation with comprehensive checkpoint loading sections
- Fix compatibility issues in forecaster and checkpoint utilities
- Update config templates to maintain backward compatibility
- Fix pre-commit hook issues (ruff RET504, docsig parameter mismatches)
- Add gradient validation during freezing to ensure parameters are truly frozen
- Implement optimized module lookup using PyTorch's get_submodule()
- Improve error messages with clear context for debugging
- Add comprehensive module-level documentation
- Enhance test coverage with improved documentation
- Add noqa comments for necessary complex methods

The gradient validation ensures frozen parameters don't accumulate gradients during
training, providing runtime verification of the freezing mechanism. The optimized
lookup improves performance for deep models by using O(1) access when possible.
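A minimal sketch of that gradient validation, assuming the submodule path `"0"` and a toy model (the real check lives inside the freezing mechanism and will differ in detail):

```python
# Illustrative gradient-validation check; not the actual Anemoi code.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 3), nn.Linear(3, 1))

# Freeze the first submodule, resolved via nn.Module.get_submodule().
for p in model.get_submodule("0").parameters():
    p.requires_grad = False

model(torch.randn(2, 3)).sum().backward()

# Runtime verification: frozen parameters must not accumulate gradients.
for name, p in model.named_parameters():
    is_frozen = name.startswith("0.")
    assert (p.grad is None) == is_frozen, f"{name}: unexpected gradient state"
```

The same assertion run during training catches modifiers that silently fail to freeze a submodule.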

Labels

ATS Approval Needed Approval needed by ATS enhancement New feature or request training

Projects

Status: Reviewers needed

Development

Successfully merging this pull request may close these issues.

Model Transformation Layer - Post-loading modifications (freezing, transfer learning, adapters)

3 participants