
feat(training): Deprecate tensorboard and Harmonise loggers interface#850

Merged
HCookie merged 50 commits intomainfrom
deprecate_tensorboard_and_clean_loggers
Feb 13, 2026
Conversation

@anaprietonem
Contributor

@anaprietonem anaprietonem commented Jan 30, 2026

Description

PR to clean up loggers:

Changes here include proposed changes in #530 (cc: @floriankrb )

What problem does this change solve?

What issue or task does this change relate to?


Additional notes

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.

@anaprietonem anaprietonem added the ATS Approval Not Needed No approval needed by ATS label Jan 30, 2026
@anaprietonem anaprietonem self-assigned this Jan 30, 2026
@anaprietonem anaprietonem changed the title Deprecate tensorboard and clean loggers Feat: Deprecate tensorboard and Harmonise loggers interface Jan 30, 2026
@anaprietonem anaprietonem changed the title Feat: Deprecate tensorboard and Harmonise loggers interface feat(training): Deprecate tensorboard and Harmonise loggers interface Jan 30, 2026
@anaprietonem anaprietonem requested a review from HCookie February 9, 2026 11:22
Member

@HCookie HCookie left a comment


Amazing, LGTM

@HCookie HCookie merged commit 3da01b7 into main Feb 13, 2026
16 of 17 checks passed
@HCookie HCookie deleted the deprecate_tensorboard_and_clean_loggers branch February 13, 2026 11:46
@github-project-automation github-project-automation bot moved this from For merging to Done in Anemoi-dev Feb 13, 2026
@DeployDuck DeployDuck mentioned this pull request Feb 12, 2026
@icedoom888
Contributor

@anaprietonem
This has config changes that throw Pydantic errors; can we flag this somehow?
How could I have known the configs changed without getting an error and digging through the code and PRs?

@icedoom888 icedoom888 added the config Affects the config files to training label Feb 17, 2026
@anaprietonem
Contributor Author

anaprietonem commented Feb 17, 2026

@anaprietonem This has config changes that throw Pydantic errors; can we flag this somehow? How could I have known the configs changed without getting an error and digging through the code and PRs?

Hey Alberto, we flag PRs with config-breaking changes by adding a `!` to the title. If I missed that here, it's because this should have provided backward compatibility. What error are you getting? If you send the error log, I can have a look.

Probably this should have been flagged as a breaking-change PR, Alberto; sorry for the oversight.

@icedoom888
Contributor

icedoom888 commented Feb 17, 2026

Hey @anaprietonem, no worries!
I could only see the "ATS Approval Not Needed" and "training" labels on this; I think it's important to flag it with the config label as well!

I would get:

```
Running training job on nid001008 at /users/apennino/projects/internal/devml/runs/balfrin_test
[2026-02-17 17:14:09,043][anemoi.utils.cli][ERROR] -
💣 2 validation errors for BaseSchema
```

One for the whole tensorboard entry (which is gone) and another for the missing _target_ key that was added in https://github.com/ecmwf/anemoi-core/blob/main/training/src/anemoi/training/config/diagnostics/log/wandb.yaml
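
For illustration, the shape of the change would have been roughly the following (a hedged sketch reconstructed from the error messages and the linked wandb.yaml; the `_target_` value shown is an assumption, not copied from the repo):

```yaml
# Before (hypothetical user config): a 'tensorboard' block, no '_target_'
diagnostics:
  log:
    tensorboard:
      enabled: False
    wandb:
      enabled: True

# After: the 'tensorboard' entry is gone and the logger declares a '_target_'
diagnostics:
  log:
    wandb:
      _target_: pytorch_lightning.loggers.WandbLogger  # assumed value
      enabled: True
```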

My question is how do we make this more transparent?
Usually what happens is:

  1. I pull main
  2. I get Pydantic errors
  3. I spend quite some time trying to figure out, in the super long configs we have, which parameter has been moved, removed or added.
  4. Update the config
  5. Iterate until Pydantic is happy :)
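
Step 3 above can be partly automated: diffing the flattened key sets of an old and a new config shows at a glance which entries were removed or added. A minimal stdlib-only sketch (the example keys mirror this PR but are illustrative, not the exact upstream configs):

```python
def flatten(cfg, prefix=""):
    """Flatten a nested dict into dotted key paths, e.g. 'diagnostics.log.wandb.enabled'."""
    out = {}
    for key, value in cfg.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, dotted + "."))
        else:
            out[dotted] = value
    return out

def diff_keys(old_cfg, new_cfg):
    """Return (keys removed, keys added) between two nested configs."""
    old, new = set(flatten(old_cfg)), set(flatten(new_cfg))
    return sorted(old - new), sorted(new - old)

# Illustrative configs mirroring the change discussed in this PR
old = {"diagnostics": {"log": {"tensorboard": {"enabled": False},
                               "wandb": {"enabled": True}}}}
new = {"diagnostics": {"log": {"wandb": {"enabled": True,
                                         "_target_": "pytorch_lightning.loggers.WandbLogger"}}}}

removed, added = diff_keys(old, new)
print("removed:", removed)  # → ['diagnostics.log.tensorboard.enabled']
print("added:", added)      # → ['diagnostics.log.wandb._target_']
```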

Ideally I would like to know right away how to fix a config, and which config entries have to be updated!
I think we should invest some time to implement config migration scripts (similar to what is done with checkpoints now).
And possibly add a pre-commit hook that runs a Pydantic validation script to avoid broken schemas.
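
Such a hook could look roughly like this (a sketch only; `scripts/validate_configs.py` is a hypothetical validation entry point, not an existing file):

```yaml
# .pre-commit-config.yaml (sketch)
repos:
  - repo: local
    hooks:
      - id: validate-configs
        name: Validate training configs against Pydantic schemas
        entry: python scripts/validate_configs.py
        language: system
        files: ^training/src/anemoi/training/config/.*\.yaml$
```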

Here is a rough example generated with Gemini to give a feeling for it:


```python
from pathlib import Path

from ruamel.yaml import YAML
from ruamel.yaml.comments import CommentedMap

# Initialize YAML with settings that preserve your file's look and feel
yaml = YAML()
yaml.preserve_quotes = True
yaml.indent(mapping=2, sequence=4, offset=2)


def migrate_configs(config_dir: str, migration_func):
    """Iterate through all YAMLs and apply the migration_func."""
    path = Path(config_dir)
    for yaml_file in path.rglob("*.yaml"):
        with open(yaml_file, 'r') as f:
            data = yaml.load(f)

        if data is None:  # Skip empty files
            continue

        # Apply the transformation
        updated_data, changed = migration_func(data)

        if changed:
            with open(yaml_file, 'w') as f:
                yaml.dump(updated_data, f)
            print(f"✅ Updated: {yaml_file}")


# --- YOUR PR-SPECIFIC CHANGES GO HERE ---

def pr_123_migration(data):
    """
    Example: you renamed 'learning_rate' to 'lr'
    and moved 'batch_size' into a 'training' block.
    """
    changed = False

    # 1. Simple rename
    if 'learning_rate' in data:
        data['lr'] = data.pop('learning_rate')
        changed = True

    # 2. Structural move (nesting)
    if 'batch_size' in data:
        if 'training' not in data:
            # A CommentedMap keeps ruamel's formatting intact
            data['training'] = CommentedMap()

        data['training']['batch_size'] = data.pop('batch_size')
        changed = True

    return data, changed


if __name__ == "__main__":
    # Point this to your Hydra conf directory
    migrate_configs("./conf", pr_123_migration)
```

@anaprietonem
Contributor Author

Yes, the ticket was initially discussed at ATS, and that's the one that had the label (#538). Still, it's true the PR could have used the ATS-approved label too, to be more consistent.

Regarding your question: it's an imperfect system at the moment, and very much dependent on how you have your configs set up. For example, if you use a top-level config and just override some parameters there, you likely don't get errors, because you get the defaults directly from the source folder (if you do get them, it should be because the configs were broken on purpose, so the user is aware something major has changed). If you don't use the repo defaults, or have a different setup, I appreciate this process is a painful one at the moment. A way to get a feeling for what has changed is usually to look at the changes in the AICON integration test.
[Screenshot: config changes in the AICON integration test]

The migration system is a good idea. But note that with checkpoints we don't migrate them on the fly either (it's still the user's responsibility to run and understand those migrations).

@icedoom888
Contributor

@anaprietonem FYI, no one (except ECMWF, probably) is running with configs built within the anemoi-core config folder.
And inheriting from the defaults only completely defeats the purpose of using Hydra...
Most users have their configs either completely unrolled, or inheriting from a structure of defaults built for the specific centre or experiment at hand.

Every time a config change is introduced, it triggers a one-to-two-hour effort to migrate or change all existing configs (without even mentioning the versioning issue: experiment X was trained using config Y and version Z; now config Y does not work with version Z.1, etc.).

I think we should really prioritize a migration function that the user can trigger:
anemoi-training update-config <config_path>
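
A minimal sketch of what such a command could do, assuming an ordered registry of migrations applied one after another (shown here on plain dicts; a real tool would load and dump YAML, preserving comments — the command name and migration are hypothetical):

```python
MIGRATIONS = []  # ordered list of (name, migration function)

def migration(name):
    """Decorator registering a named migration step in order."""
    def register(func):
        MIGRATIONS.append((name, func))
        return func
    return register

@migration("0850_drop_tensorboard")
def drop_tensorboard(cfg):
    # Illustrative step: the 'tensorboard' logger entry was removed
    # upstream, so drop it from the user config if present.
    cfg.get("diagnostics", {}).get("log", {}).pop("tensorboard", None)
    return cfg

def update_config(cfg):
    """Apply all registered migrations in order, reporting each step."""
    for name, func in MIGRATIONS:
        cfg = func(cfg)
        print(f"applied {name}")
    return cfg

# Example run on an out-of-date config
cfg = {"diagnostics": {"log": {"tensorboard": {"enabled": True},
                               "wandb": {"enabled": False}}}}
cfg = update_config(cfg)
print(cfg)
```

A CLI wrapper would then just parse the config path, run `update_config` on the loaded file, and write the result back.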

@icedoom888
Contributor

Looking at the AICON integration test is a nice trick, but you can agree this is not really sustainable and doesn't really help with scaling changes to 100 configs.
