
Revamp internals and outputs to use xarray #442

Merged
gtrevisan merged 72 commits into dev from xarray on May 5, 2025

Conversation

@gtrevisan (Member) commented Apr 15, 2025

changes

executive summary:

  • revamped method runner to expect a dataset, or convert a dict into it,
  • revamped framework to handle a dict of datasets,
  • revamped output settings:
    • OutputSetting is still the abstract base class,
    • OutputSettingList is kept for testing purposes, might be deleted in the future,
    • DictOutputSetting = Dict[int, xr.Dataset] is the new under-the-hood format,
    • SingleOutputSetting is a new semi-abstract class for single-file output,
    • DatasetOutputSetting will be the new default in the future, concatenates on idx,
    • DataTreeOutputSetting groups by shot, rather than concatenating,
    • DataFrameOutputSetting is the usual dataframe, still used by testing.
  • simplified complexity where I could,
  • revamped temporary file/folder logging/output for tests,
  • dropped some unused methods and parameters.

tests:

  • tested individual formats for creation/equivalence,
  • revamped part of the other tests,
  • tested on full DB workflows for all three machines.

to do:

  • redo the readme flowchart with outputs as "Dataset/DataTree/DataFrame"
  • double check documentation implications:
    • README.md,
    • mkdocs,
  • evaluate test coverage -- are we missing anything?

supersedes:

closes:

index

our new index is a simple row-number-like variable named idx, while shot and time are available as coordinates.
I thought about making our idx a MultiIndex of both shot and time, but apparently it cannot be serialized to disk just yet:

NotImplementedError: variable 'idx' is a MultiIndex, which cannot yet be serialized.
Instead, either use reset_index() to convert MultiIndex levels into coordinate variables instead
or use https://cf-xarray.readthedocs.io/en/latest/coding.html.

it can be done in-memory, though:

reindexed = ds.set_index(idx=["shot", "time"])

to re-obtain the "native" dimensions, one can then unstack, but this will create a humongous dataset, which is the reason we had to close #407, so be mindful of your memory constraints.

import numpy as np
import psutil

# estimate the memory footprint of the fully unstacked (shot x time) dataset
s = len(np.unique(reindexed.shot))  # number of shots
t = len(np.unique(reindexed.time))  # number of time points
v = len(reindexed.data_vars)        # number of data variables
p = np.float64(1).nbytes            # bytes per float64 value
req = s * t * v * p / 1024**3       # memory required [GB]
tot = psutil.virtual_memory().free / 1024**3  # memory available [GB]
print(f"Memory required : {req:.3f} GB")
print(f"Memory available: {tot:.3f} GB")
assert req < tot, "not enough memory!"
reindexed.unstack("idx")
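if the goal is just to work with one shot at a time, a full unstack can often be avoided entirely: a boolean selection along idx stays within the flat layout and never allocates the shot-by-time grid. a minimal sketch with toy data (the shot numbers and values are illustrative, not retrieved data):

```python
import numpy as np
import xarray as xr

# toy stand-in for the concatenated dataset: two shots along a shared idx dimension
ds = xr.Dataset(
    data_vars={"kappa_area": ("idx", np.arange(6, dtype=float))},
    coords={
        "shot": ("idx", [1150805012] * 3 + [1150805013] * 3),
        "time": ("idx", [0.06, 0.08, 0.10] * 2),
    },
)

# boolean selection along idx: no unstacking, so no memory blow-up
single = ds.where(ds.shot == 1150805012, drop=True)
print(single.sizes["idx"])  # 3
```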

attributes

we are then definitely ready to archive attributes for each physics method!
the first two that pop up would be units of measure and IMAS path reference.
we could also store refined metadata in our datasets, e.g. full settings for reproducibility.
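as a sketch of what that could look like (the attribute keys and the IMAS path below are hypothetical, not an agreed-upon schema):

```python
import numpy as np
import xarray as xr

# toy dataset with one physics-method output
ds = xr.Dataset(
    data_vars={"kappa_area": ("idx", np.array([1.004, 1.135, 1.414]))},
    coords={"time": ("idx", np.array([0.06, 0.08, 0.10]))},
)

# per-variable attributes: units and a (hypothetical) IMAS path reference
ds["kappa_area"].attrs["units"] = "1"  # dimensionless elongation-like quantity
ds["kappa_area"].attrs["imas_path"] = "equilibrium/.../elongation"  # placeholder path

# dataset-level attributes: refined metadata for reproducibility
ds.attrs["settings"] = "efit_nickname_setting=analysis"

print(ds["kappa_area"].attrs["units"])  # 1
```

attributes survive round-trips to netCDF/zarr as long as the values are plain strings or numbers, which is exactly what units and path references are.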

dimensions

furthermore, we should be already prepared for multidimensional outputs! 🎉 this very simple example works:

#!/usr/bin/env python3
"""
example for multidimensional physics methods.
"""
import numpy as np
import xarray as xr
from disruption_py.core.physics_method.decorator import physics_method
from disruption_py.settings import RetrievalSettings
from disruption_py.workflow import get_shots_data

@physics_method(columns=["custom"])
def get_custom(params):
    data_vars = {
        "custom": (("idx", "rad"), np.outer(params.times**0, [4, 5, 6]).astype(int)),
    }
    coords = {
        "shot": ("idx", len(params.times) * [params.shot_id]),
        "time": ("idx", params.times),
        "rad": ("rad", [1, 2, 3]),
    }
    return xr.Dataset(data_vars=data_vars, coords=coords)

retrieval_settings = RetrievalSettings(
    efit_nickname_setting="analysis",
    run_methods=["get_kappa_area", "get_custom"],
    custom_physics_methods=[get_custom],
)
ds = get_shots_data(
    tokamak="cmod",
    shotlist_setting=[1150805012, 1150805013],
    retrieval_settings=retrieval_settings,
    output_setting="dataset",
)
print(ds)
<xarray.Dataset> Size: 7kB
Dimensions:     (idx: 150, rad: 3)
Coordinates:
    shot        (idx) int64 1kB 1150805012 1150805012 ... 1150805013 1150805013
    time        (idx) float64 1kB 0.06 0.08 0.1 0.12 0.14 ... 1.74 1.76 1.78 1.8
  * rad         (rad) int64 24B 1 2 3
Dimensions without coordinates: idx
Data variables:
    custom      (idx, rad) int64 4kB 4 5 6 4 5 6 4 5 6 4 ... 6 4 5 6 4 5 6 4 5 6
    kappa_area  (idx) float64 1kB 1.004 1.135 1.414 1.453 ... 1.371 1.277 0.9995

as a side note, we lose the option of having lower-dimensional columns, since both shot and time are now "reserved" coordinates for the idx dimension. oh, well.

output

Dict[int, xr.Dataset]

poetry run disruption-py {1150805012..1150805014} -o dict

{
   1150805012:
<xarray.Dataset> Size: 42kB
Dimensions:               (idx: 83)
Coordinates:
    shot                  (idx) int64 664B 1150805012 1150805012 ... 1150805012
    time                  (idx) float64 664B 0.06 0.08 0.1 ... 1.294 1.295 1.296
Dimensions without coordinates: idx
Data variables: (12/61)
    a_minor               (idx) float64 664B 0.221 0.2309 ... 0.1022 0.07641
---------- 8< ---------- 8< ----------
    zcur                  (idx) float64 664B 0.0008405 0.00401 ... -0.265,

   1150805013:
<xarray.Dataset> Size: 44kB
Dimensions:               (idx: 88)
Coordinates:
    shot                  (idx) int64 704B 1150805013 1150805013 ... 1150805013
    time                  (idx) float64 704B 0.06 0.08 0.1 ... 1.76 1.78 1.8
Dimensions without coordinates: idx
Data variables: (12/61)
    a_minor               (idx) float64 704B 0.2254 0.2325 0.2233 ... 0.1884 0.0
---------- 8< ---------- 8< ----------
    zcur                  (idx) float64 704B 0.001214 0.002722 ... -0.006667,

   1150805014:
<xarray.Dataset> Size: 43kB
Dimensions:               (idx: 85)
Coordinates:
    shot                  (idx) int64 680B 1150805014 1150805014 ... 1150805014
    time                  (idx) float64 680B 0.06 0.08 0.1 ... 1.7 1.72 1.74
Dimensions without coordinates: idx
Data variables: (12/61)
    a_minor               (idx) float64 680B 0.223 0.2322 ... 0.1996 0.1937
---------- 8< ---------- 8< ----------
    zcur                  (idx) float64 680B 0.001747 0.005514 ... -0.01287
}

xr.Dataset

poetry run disruption-py {1150805012..1150805014} -o dataset

<xarray.Dataset> Size: 129kB
Dimensions:               (idx: 256)
Coordinates:
    shot                  (idx) int64 2kB 1150805012 1150805012 ... 1150805014
    time                  (idx) float64 2kB 0.06 0.08 0.1 0.12 ... 1.7 1.72 1.74
Dimensions without coordinates: idx
Data variables: (12/61)
    a_minor               (idx) float64 2kB 0.221 0.2309 0.224 ... 0.1996 0.1937
---------- 8< ---------- 8< ----------
    zcur                  (idx) float64 2kB 0.0008405 0.00401 ... -0.01287

xr.DataTree

poetry run disruption-py {1150805012..1150805014} -o datatree

<xarray.DataTree>
Group: /
├── Group: /1150805012
│       Dimensions:               (idx: 83)
│       Coordinates:
│           shot                  (idx) int64 664B 1150805012 1150805012 ... 1150805012
│           time                  (idx) float64 664B 0.06 0.08 0.1 ... 1.294 1.295 1.296
│       Dimensions without coordinates: idx
│       Data variables: (12/61)
│           a_minor               (idx) float64 664B 0.221 0.2309 ... 0.1022 0.07641
│ ---------- 8< ---------- 8< ----------
│           zcur                  (idx) float64 664B 0.0008405 0.00401 ... -0.265
├── Group: /1150805013
│       Dimensions:               (idx: 88)
│       Coordinates:
│           shot                  (idx) int64 704B 1150805013 1150805013 ... 1150805013
│           time                  (idx) float64 704B 0.06 0.08 0.1 ... 1.76 1.78 1.8
│       Dimensions without coordinates: idx
│       Data variables: (12/61)
│           a_minor               (idx) float64 704B 0.2254 0.2325 0.2233 ... 0.1884 0.0
│  ---------- 8< ---------- 8< ----------
│           zcur                  (idx) float64 704B 0.001214 0.002722 ... -0.006667
└── Group: /1150805014
        Dimensions:               (idx: 85)
        Coordinates:
            shot                  (idx) int64 680B 1150805014 1150805014 ... 1150805014
            time                  (idx) float64 680B 0.06 0.08 0.1 ... 1.7 1.72 1.74
        Dimensions without coordinates: idx
        Data variables: (12/61)
            a_minor               (idx) float64 680B 0.223 0.2322 ... 0.1996 0.1937
  ---------- 8< ---------- 8< ----------
            zcur                  (idx) float64 680B 0.001747 0.005514 ... -0.01287

pd.DataFrame

poetry run disruption-py {1150805012..1150805014} -o dataframe

           shot  time   a_minor    beta_n  ...   z_error    z_prog  z_times_v_z      zcur
idx                                        ...                                           
0    1150805012  0.06  0.220951 -0.400186  ...  0.000840  0.000000     0.005999  0.000840
1    1150805012  0.08  0.230871 -0.149240  ...  0.004694 -0.000684    -0.053150  0.004010
2    1150805012  0.10  0.223997 -0.038895  ...  0.004983 -0.002052     0.027272  0.002932
3    1150805012  0.12  0.204051  0.030382  ... -0.003536 -0.006589     0.037418 -0.010125
4    1150805012  0.14  0.207518  0.149019  ... -0.001114 -0.007928    -0.054337 -0.009043
..          ...   ...       ...       ...  ...       ...       ...          ...       ...
251  1150805014  1.66  0.211198  0.399452  ...  0.001458 -0.008222     0.038104 -0.006764
252  1150805014  1.68  0.210494  0.411291  ... -0.000450 -0.008000     0.026411 -0.008450
253  1150805014  1.70  0.205653  0.477978  ...  0.003020 -0.007778    -0.033882 -0.004758
254  1150805014  1.72  0.199563  0.547751  ...  0.005765 -0.007556     0.013781 -0.001791
255  1150805014  1.74  0.193681  0.613406  ... -0.005532 -0.007333    -0.098800 -0.012865

[256 rows x 63 columns]
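incidentally, these views are easy to move between: an idx-indexed dataset converts to the usual flat table with plain xarray/pandas calls. a sketch with toy data (not the package's own conversion code):

```python
import xarray as xr

# toy dataset mimicking the idx layout above
ds = xr.Dataset(
    data_vars={"a_minor": ("idx", [0.221, 0.231, 0.224])},
    coords={
        "shot": ("idx", [1150805012] * 3),
        "time": ("idx", [0.06, 0.08, 0.10]),
    },
)

# idx becomes the row index; reset_index() turns it and the coords into columns
df = ds.to_dataframe().reset_index()
print(sorted(df.columns))  # ['a_minor', 'idx', 'shot', 'time']
```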

@gtrevisan gtrevisan marked this pull request as ready for review April 16, 2025 19:10
@AlexSaperstein (Contributor) left a comment

Looks good to me

@zapatace (Contributor)

I have run some A/B tests with the dev/xarray branches. So far everything looks good.

However, I noticed a new behavior: when you rerun get_shots_data() and the output file has not been removed in advance, you get the following error.

File ".../disruption_py/workflow.py", line 178, in get_shots_data
   output_setting.to_disk()
File ".../disruption_py/settings/output_setting.py", line 282, in to_disk
   raise FileExistsError(f"File already exists! {self.path}")

Is this to be expected? In dev the output file is always overwritten.

If that is the new behavior, you could hit this error only after retrieving a big amount of data. So it would be nice to test whether the file exists before running the queries; if not, the overwrite needs to be forced at the end.
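A minimal sketch of the suggested pre-flight check (the helper name, output path, and overwrite flag are hypothetical, not disruption-py API):

```python
from pathlib import Path


def check_output_path(path: str, overwrite: bool = False) -> Path:
    """Fail fast (or clear the way) before any expensive retrieval starts."""
    out = Path(path)
    if out.exists():
        if not overwrite:
            # same failure mode as output_setting.to_disk(), but before the queries
            raise FileExistsError(f"File already exists! {out}")
        out.unlink()  # force the overwrite up front
    return out


# usage: run this before the retrieval, not after
out = check_output_path("/tmp/disruption_py_example.nc", overwrite=True)
print(out.name)  # disruption_py_example.nc
```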

I can't wait to have datasets. Adding profiles or spectral data is going to be great! :-)

@gtrevisan (Member, Author)

that is kind of the intended behavior, so far, because every run should have a different temporary folder. are you running get_shots_data twice from the same script? you can also take control of the output, if you want a dataset, by specifying file1.nc and file2.nc ...

@zapatace (Contributor) left a comment

Let's carry on!

@gtrevisan gtrevisan merged commit 074a9e4 into dev May 5, 2025
12 checks passed
@gtrevisan gtrevisan deleted the xarray branch May 5, 2025 17:55

4 participants