Skip to content

Prototype ModelDriver class and esm1.6 driver#229

Open
blimlim wants to merge 14 commits intomainfrom
219-esm1.6-driver
Open

Prototype ModelDriver class and esm1.6 driver#229
blimlim wants to merge 14 commits intomainfrom
219-esm1.6-driver

Conversation

@blimlim
Copy link
Copy Markdown
Collaborator

@blimlim blimlim commented Mar 4, 2026

Closes #216. I've added this PR to start some discussion around the proposed implementation, and so will start it off as a draft.

Background

So far, ESM1.6 simulations have been using the ESM1.5 conversion driver. However, we now wish to change the output file names and locations when running the conversion for ESM1.6, and hence need to add a separate driver. Previous work on the code organisation (#214) and command line interface (#221) laid the groundwork for this change.

However, the way that the ESM1.5 driver is constructed as a large function containing many steps, does not lend itself to adding a similar ESM1.6 driver with a few small differences:

def convert_esm1p5_output_dir(esm1p5_output_dir, process_args):
"""
Driver function for converting ESM1.5 atmospheric outputs during a simulation.
Parameters
----------
esm1p5_output_dir: an "outputXYZ" directory produced by an ESM1.5 simulation.
Fields files in the "atmosphere" subdirectory will be
converted to netCDF.
process_args: argparse Namespace object carrying processing arguments.
Returns
-------
succeeded: list of tuples of (input, output) filepaths for successful
conversions.
failed: list of tuples of form (filepath, exception) for files which failed
to convert due to an allowed exception.
"""
esm1p5_output_dir = Path(esm1p5_output_dir)
current_atm_output_dir = esm1p5_output_dir / "atmosphere"
if not current_atm_output_dir.exists():
raise FileNotFoundError(
errno.ENOENT, os.strerror(errno.ENOENT), current_atm_output_dir
)
# Create a directory for writing netCDF files
nc_write_dir = current_atm_output_dir / "netCDF"
nc_write_dir.mkdir(exist_ok=True)
# Find fields file outputs to be converted
xhist_nml = f90nml.read(current_atm_output_dir / "xhist")
run_id = xhist_nml["nlchisto"]["run_id"]
fields_file_name_pattern = get_fields_file_pattern(run_id)
atm_dir_contents = current_atm_output_dir.glob("*")
atm_dir_fields_files = find_matching_files(
atm_dir_contents, fields_file_name_pattern
)
if len(atm_dir_fields_files) == 0:
warnings.warn(
f"No files matching pattern '{fields_file_name_pattern}' "
f"found in {current_atm_output_dir.resolve()}. No files will be "
"converted to netCDF."
)
return [], [] # Don't try to run the conversion
output_paths = [get_nc_write_path(path, nc_write_dir, get_ff_date(path)) for path in atm_dir_fields_files]
input_output_pairs = zip(atm_dir_fields_files, output_paths)
input_output_pairs = filter_name_collisions(input_output_pairs)
succeeded, failed = convert_fields_file_list(input_output_pairs, process_args)
if process_args.delete_ff:
# Remove files that appear only as successful conversions
for path in safe_removal(succeeded, failed):
os.remove(path)

The only approach I could think of which keeps the current structure would be to copy and paste the esm1.5 driver function, into a new function for esm1.6. This felt like it would be a bad idea, and following discussion with @atteggiani, we decided to investigate implementing a class based approach for the drivers.

Proposed implementation

Based on the ESM1.5 and (embedded in rose/cylc suites) CM2, and CM3 conversion drivers, model conversion drivers seem to have the following general steps in common:

  1. Find the directory containing model outputs to convert
  2. Find/setup the directory to write netCDF files to
  3. Find the fields files in the input directory to be converted
  4. Set output netCDF filenames/paths for each input file
  5. Run the conversion

Based on these common steps, I've added a generic ModelDriver class, with empty methods for steps 1-4. It then combines them into a conversion method for step 5.

class ModelDriver:
"""
Generic model conversion driver class. Defines a general sequence of steps
which are followed by the drivers.
"""

Specific model drivers are then created as subclasses of the ModelDriver class, and they can fill in the model-specific details for the methods in steps 1-4 (e.g. models may have different output directory structures, filenames etc).

This allows us to define a ESM1.5 driver, and an ESM1.6 driver as a subclass, with only the relevant details changed:

class Esm1p5Driver(ModelDriver):
# Output file suffix for each type of unit.
UNIT_SUFFIXES = {
"a": "mon",
"e": "dai",
"i": "3hr",
"j": "6hr",
}
def get_input_dir(self, parent_dir):
"""
Given a path to an experiment parent directory, return the atmosphere output directory

https://g.ithub.com/ACCESS-NRI/um2nc-standalone/blob/c3c68044ba78c2d4e3f33d26dce6d7a790800b90/src/um2nc/drivers/esm1p6.py#L11-L22

This structure is largely based on payu's Model class.

A potential limitation, is that I've assumed conversion drivers will loop through a sequence of input files, and will convert each into a separate netCDF. This wouldn't hold if in the future we want a driver to have more complicated behaviour such as combining monthly data into year length files, or splitting multi-variable files into single variable netCDFs.

@atteggiani any comments or ideas you have about the proposed changes are very welcome! @aidanheerdegen, I've pinged you just to check that the ESM1.6 frequency names which are added to the filenames are as required, however any other comments are welcome too.

@blimlim blimlim marked this pull request as draft March 4, 2026 01:02
@aidanheerdegen
Copy link
Copy Markdown
Member

This looks like a good design @blimlim.

As far as the need to do this goes, you could effectively deprecate the 1p5 driver, just rename it to 1p6 and put in the changes. Anyone who needs old behaviour has to use old versions. This only works if um2nc is separately deployed from the payu environment, which it isn't ... so is maybe not that useful a suggestion.

blimlim and others added 2 commits March 4, 2026 15:31
Co-authored-by: Aidan Heerdegen <aidan.heerdegen@anu.edu.au>
@atteggiani
Copy link
Copy Markdown
Collaborator

atteggiani commented Mar 5, 2026

Thank you @blimlim for starting to work on this!

I like the overall class-based design.

I quickly had a look at the code and added a few comments. Some are questions to try and better understand your design idea.

On a more general level, I also have the following questions:

  1. ... (embedded in rose/cylc suites) CM2, and CM3 conversion drivers, ...

    What do you mean by "embedded in rose/cylc suites"? There is no currently no um2nc driver cm2 command, so how is CM2 driver currently embedded in the rose/cylc suite?

  2. When you mention "inputs" and "outputs" files, directories, etc., are they referring to the inputs and outputs of um2nc, or the models?
    It was not super clear at times and I got a bit confused.
    I know that for model drivers um2nc conversion inputs are model outputs, but ultimately, we care about um2nc inputs and outputs, so for the implementation I don't think we care too much if they are model outputs, random files, or anything else. Therefore, I would suggest referring mostly to "inputs" and "outputs" files relative to um2nc, and using "model outputs" very rarely (Maybe only in a model driver general docstring) to avoid confusion.
    For example, instead of using output_files (referring to the model output files) we could use input_files or conversion_target_files to refer to the um2nc conversion target (input) files.

    1. Find the directory containing model outputs to convert
    2. Find/setup the directory to write netCDF files to
    3. Find the fields files in the input directory to be converted
    4. Set output netCDF filenames/paths for each input file
    5. Run the conversion

    A model driver command, ultimately, is a set of multiple um2nc conversions for multiple files, to which there might also be pre/post-processing associated. For this reason, I would consider each um2nc file conversion independent, and for this reason they could also be easily parallelised in the future.
    Therefore, I would prioritise the following steps:

    1. Get the list of targets (files) to be converted by um2nc coversion.
    2. Get the output directory where to write the converted files (this is basically your point 2)
    3. Perform the conversion (with any pre/post-processing steps)

    I think finding the directory containing the model outputs (which contain some of the um2nc conversion targets above) is a step included within my number 1 above. There might be drivers in which these targets are found in multiple directories, therefore I wouldn't list that as a "core" step for a driver conversion. Also related to this comment.
    Also, your number 4 is part of the post-processing steps, which can be included within the conversion step (it is also partly included in getting the output dir).

These comments are mostly to get a better understanding and have a clearer view of the design model.
And if you think it might be useful, we could also have a chat about this. It would also help me to better align with your vision.

Thank you! :)

blimlim and others added 2 commits March 6, 2026 13:30
Co-authored-by: Davide Marchegiani <davide.marchegiani@gmail.com>
@blimlim
Copy link
Copy Markdown
Collaborator Author

blimlim commented Mar 11, 2026

Thanks @atteggiani for your detailed feedback, those are all very useful points and questions. I'll try to respond to them here!

What do you mean by "embedded in rose/cylc suites"? There is no currently no um2nc driver cm2 command, so how is CM2 driver currently embedded in the rose/cylc suite?

That's correct that there's no drivers for CM2/CM3 in um2nc. Instead there a separate python scripts within the cylc suites which perform essentially the same functions as a driver, i.e. finding inputs, setting output paths, and running the conversion:

I used the common steps between these scripts and the ESM1.5 driver to help think of a general structure.

When you mention "inputs" and "outputs" files, directories, etc., are they referring to the inputs and outputs of um2nc, or the models?

Good point, I agree "inputs" and "outputs" aren't being used consistently right now which is a bit confusing. Your suggestion to consistently use "inputs" for inputs to um2nc, and outputs to refer to output netCDF files makes sense. I've tried to make this consistent in 6e7d5a6, and have also used "model history directory" in the docstrings to refer to the directories created by the model – model output files are often called "history files", and so this might be clearer than the previous use of "model output directories".

I think finding the directory containing the model outputs (which contain some of the um2nc conversion targets above) is a step included within my number 1 above

This makes sense to me, I've updated the proposed implementation in 6e7d5a6 so that finding the input directory is included within step 1.

number 4 is part of the post-processing steps, which can be included within the conversion step (it is also partly included in getting the output dir).

That makes sense to me, I agree it doesn't need to be considered as a core step. In the implementation I think it should be set up as a one of the general driver methods which can be redefined for specific drivers, which will allow each model to have a different naming scheme.

@atteggiani
Copy link
Copy Markdown
Collaborator

atteggiani commented Mar 11, 2026

Instead there a separate python scripts within the cylc suites which perform essentially the same functions as a driver

I see. I suppose that the next step will be to convert these scripts into actual um2nc driver ... comands then, right?

@blimlim
Copy link
Copy Markdown
Collaborator Author

blimlim commented Mar 11, 2026

I suppose that the next step will be to convert these scripts into actual um2nc driver ... comands then, right?

Yeah definitely! For CM3, I believe we'd like to stop using the old um2netcdf script and add a driver here. For CM2 it should be viable to add a driver, but I suspect it might not be a priority

Copy link
Copy Markdown
Collaborator

@atteggiani atteggiani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few more comments.

@blimlim blimlim marked this pull request as ready for review March 17, 2026 22:05
@blimlim
Copy link
Copy Markdown
Collaborator Author

blimlim commented Mar 18, 2026

Thanks for the ideas and discussion @atteggiani! I've marked this one ready for review, let me know if you have any other ideas or suggestions!

Copy link
Copy Markdown
Collaborator

@atteggiani atteggiani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @blimlim,

Thank you for the updates.

I went through the whole architectural implementation and I only made some comments on the ModelDriver class (because those would influence the rest of the code).

Please have a look at my suggestion, and feel free to ask for any clarification on it.
We might want to have a chat as I feel it would be easier to discuss these design choices.

blimlim and others added 2 commits March 23, 2026 14:33
Co-authored-by: Davide Marchegiani <davide.marchegiani@gmail.com>
@blimlim
Copy link
Copy Markdown
Collaborator Author

blimlim commented Mar 23, 2026

Thanks @atteggiani for the suggestions! I've had a go at implementing the suggested structure, and have added some comments around some of the choices made. Let me know if you have any further thoughts or suggestions!

@blimlim blimlim requested a review from atteggiani March 23, 2026 03:52
Copy link
Copy Markdown
Collaborator

@atteggiani atteggiani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @spencer for the updates.

I added a few comments mainly on the common parts (because they affect the rest).

Feel free to start a discussion below any comments :)

@blimlim
Copy link
Copy Markdown
Collaborator Author

blimlim commented Mar 29, 2026

Thanks again @atteggiani for the helpful suggestions and feedback :) I've had a go adding them in, and if you get a chance to look at the latest changes that would be great!

@blimlim blimlim requested a review from atteggiani March 29, 2026 22:57
Copy link
Copy Markdown
Collaborator

@atteggiani atteggiani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @blimlim for addressing my review!

The updates overall look pretty good! I like the tests you implemented.

I left a few more suggestions and I noticed the specific model drivers are missing tests.

Please refer to the specific comments for details

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no tests for the Esm1p5Driver

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The differences between the Esm1p5Driver and the base ModelDriver are essentially the setting of the atmosphere_dir, output_dir, and unit_suffixes properties, and the specific implementations of the get_input_paths and get_output_path methods.

The get_input_paths method only calls the find_esm1px_fields_files function, which is could probably do with an additional test added.

The get_output_path method almost directly calls the get_nc_filename function which has its own tests

I'm thinking one option would be to remove these functions and implement them directly in get_input_paths and get_output_path methods, and then test them through tests of the drivers. I don't expect these functions to be required outside of the drivers and so it could be a bit cleaner. Would this be preferable?

Copy link
Copy Markdown
Collaborator

@atteggiani atteggiani Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should still test each of those properties.

For the methods, if they call a function that needs is reused by multiple drivers (like in this case), I would still keep the function in the common.py and test it in test_common.py, but still test each specific method implemented. Even if the test would only be testing that the function gets actually called (depending on the specific method).
The same exact concept applies for Esm1p6Driver.

Check that netCDF file naming produces expected file paths for various
expected unit keys.
"""
nc_name = esm1p5_convert.get_nc_filename(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
nc_name = esm1p5_convert.get_nc_filename(
nc_name = esm1p6.get_nc_filename(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no tests for the Esm1p6Driver

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants