
Revamp internals and outputs to use xarray #442

Merged
gtrevisan merged 72 commits into dev from xarray on May 5, 2025

Conversation

@gtrevisan (Member) commented Apr 15, 2025

changes

executive summary:

  • revamped method runner to expect a dataset, or convert a dict into it,
  • revamped framework to handle a dict of datasets,
  • revamped output settings:
    • OutputSetting is still the abstract base class,
    • OutputSettingList is kept for testing purposes, might be deleted in the future,
    • DictOutputSetting = Dict[int, xr.Dataset] is the new under-the-hood format,
    • SingleOutputSetting is a new semi-abstract class for single-file output,
    • DatasetOutputSetting will be the new default in the future, concatenates on idx,
    • DataTreeOutputSetting groups by shot, rather than concatenating,
    • DataFrameOutputSetting is the usual dataframe, still used by testing.
  • simplified complexity where I could,
  • revamped temporary file/folder logging/output for tests,
  • dropped some unused methods and parameters.

tests:

  • tested individual formats for creation/equivalence,
  • revamped part of the other tests,
  • tested on full DB workflows for all three machines.

to do:

  • redo the readme flowchart with outputs as "Dataset/DataTree/DataFrame"
  • double check documentation implications:
    • README.md,
    • mkdocs,
  • evaluate test coverage -- are we missing anything?

supersedes:

closes:

index

our new index is a simple row-number-like variable named idx, while shot and time are available as coordinates.
I thought about making our idx a MultiIndex of both shot and time, but apparently it cannot be serialized to disk just yet:

NotImplementedError: variable 'idx' is a MultiIndex, which cannot yet be serialized.
Instead, either use reset_index() to convert MultiIndex levels into coordinate variables instead
or use https://cf-xarray.readthedocs.io/en/latest/coding.html.

it can be done in-memory, though:

reindexed = ds.set_index(idx=["shot", "time"])

to re-obtain the "native" dimensions, one can then unstack, but this will create a humongous dataset, which is the reason we had to close #407, so be mindful of your memory constraints.

import numpy as np
import psutil

# estimate the memory footprint of the fully unstacked (shot x time) dataset
s = len(np.unique(reindexed.shot))  # number of shots
t = len(np.unique(reindexed.time))  # number of time points
v = len(reindexed.data_vars)        # number of data variables
p = np.float64(1).nbytes            # bytes per float64 value
req = s * t * v * p / 1024**3       # memory required [GB]
tot = psutil.virtual_memory().free / 1024**3  # memory available [GB]
print(f"Memory required : {req:.3f} GB")
print(f"Memory available: {tot:.3f} GB")
assert req < tot, "not enough memory!"
reindexed.unstack("idx")
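if the goal is just to work with one shot at a time, a full unstack can often be avoided entirely: a boolean selection along idx stays within the flat layout and never allocates the shot-by-time grid. a minimal sketch with toy data (the shot numbers and values are illustrative, not retrieved data):

```python
import numpy as np
import xarray as xr

# toy stand-in for the concatenated dataset: two shots along a shared idx dimension
ds = xr.Dataset(
    data_vars={"kappa_area": ("idx", np.arange(6, dtype=float))},
    coords={
        "shot": ("idx", [1150805012] * 3 + [1150805013] * 3),
        "time": ("idx", [0.06, 0.08, 0.10] * 2),
    },
)

# boolean selection along idx: no unstacking, so no memory blow-up
single = ds.where(ds.shot == 1150805012, drop=True)
print(single.sizes["idx"])  # 3
```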

attributes

we are then definitely ready to archive attributes for each physics method!
the first two that pop up would be units of measure and IMAS path reference.
we could also store refined metadata in our datasets, e.g. full settings for reproducibility.
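as a sketch of what that could look like (the attribute keys and the IMAS path below are hypothetical, not an agreed-upon schema):

```python
import numpy as np
import xarray as xr

# toy dataset with one physics-method output
ds = xr.Dataset(
    data_vars={"kappa_area": ("idx", np.array([1.004, 1.135, 1.414]))},
    coords={"time": ("idx", np.array([0.06, 0.08, 0.10]))},
)

# per-variable attributes: units and a (hypothetical) IMAS path reference
ds["kappa_area"].attrs["units"] = "1"  # dimensionless elongation-like quantity
ds["kappa_area"].attrs["imas_path"] = "equilibrium/.../elongation"  # placeholder path

# dataset-level attributes: refined metadata for reproducibility
ds.attrs["settings"] = "efit_nickname_setting=analysis"

print(ds["kappa_area"].attrs["units"])  # 1
```

attributes survive round-trips to netCDF/zarr as long as the values are plain strings or numbers, which is exactly what units and path references are.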

dimensions

furthermore, we should be already prepared for multidimensional outputs! 🎉 this very simple example works:

#!/usr/bin/env python3
"""
example for multidimensional physics methods.
"""
import numpy as np
import xarray as xr
from disruption_py.core.physics_method.decorator import physics_method
from disruption_py.settings import RetrievalSettings
from disruption_py.workflow import get_shots_data

@physics_method(columns=["custom"])
def get_custom(params):
    data_vars = {
        "custom": (("idx", "rad"), np.outer(params.times**0, [4, 5, 6]).astype(int)),
    }
    coords = {
        "shot": ("idx", len(params.times) * [params.shot_id]),
        "time": ("idx", params.times),
        "rad": ("rad", [1, 2, 3]),
    }
    return xr.Dataset(data_vars=data_vars, coords=coords)

retrieval_settings = RetrievalSettings(
    efit_nickname_setting="analysis",
    run_methods=["get_kappa_area", "get_custom"],
    custom_physics_methods=[get_custom],
)
ds = get_shots_data(
    tokamak="cmod",
    shotlist_setting=[1150805012, 1150805013],
    retrieval_settings=retrieval_settings,
    output_setting="dataset",
)
print(ds)
<xarray.Dataset> Size: 7kB
Dimensions:     (idx: 150, rad: 3)
Coordinates:
    shot        (idx) int64 1kB 1150805012 1150805012 ... 1150805013 1150805013
    time        (idx) float64 1kB 0.06 0.08 0.1 0.12 0.14 ... 1.74 1.76 1.78 1.8
  * rad         (rad) int64 24B 1 2 3
Dimensions without coordinates: idx
Data variables:
    custom      (idx, rad) int64 4kB 4 5 6 4 5 6 4 5 6 4 ... 6 4 5 6 4 5 6 4 5 6
    kappa_area  (idx) float64 1kB 1.004 1.135 1.414 1.453 ... 1.371 1.277 0.9995

as a side note, we lose the option of having lower-dimensional columns, since both shot and time are now "reserved" coordinates for the idx dimension. oh, well.

output

Dict[int, xr.Dataset]

poetry run disruption-py {1150805012..1150805014} -o dict

{
   1150805012:
<xarray.Dataset> Size: 42kB
Dimensions:               (idx: 83)
Coordinates:
    shot                  (idx) int64 664B 1150805012 1150805012 ... 1150805012
    time                  (idx) float64 664B 0.06 0.08 0.1 ... 1.294 1.295 1.296
Dimensions without coordinates: idx
Data variables: (12/61)
    a_minor               (idx) float64 664B 0.221 0.2309 ... 0.1022 0.07641
---------- 8< ---------- 8< ----------
    zcur                  (idx) float64 664B 0.0008405 0.00401 ... -0.265,

   1150805013:
<xarray.Dataset> Size: 44kB
Dimensions:               (idx: 88)
Coordinates:
    shot                  (idx) int64 704B 1150805013 1150805013 ... 1150805013
    time                  (idx) float64 704B 0.06 0.08 0.1 ... 1.76 1.78 1.8
Dimensions without coordinates: idx
Data variables: (12/61)
    a_minor               (idx) float64 704B 0.2254 0.2325 0.2233 ... 0.1884 0.0
---------- 8< ---------- 8< ----------
    zcur                  (idx) float64 704B 0.001214 0.002722 ... -0.006667,

   1150805014:
<xarray.Dataset> Size: 43kB
Dimensions:               (idx: 85)
Coordinates:
    shot                  (idx) int64 680B 1150805014 1150805014 ... 1150805014
    time                  (idx) float64 680B 0.06 0.08 0.1 ... 1.7 1.72 1.74
Dimensions without coordinates: idx
Data variables: (12/61)
    a_minor               (idx) float64 680B 0.223 0.2322 ... 0.1996 0.1937
---------- 8< ---------- 8< ----------
    zcur                  (idx) float64 680B 0.001747 0.005514 ... -0.01287
}

xr.Dataset

poetry run disruption-py {1150805012..1150805014} -o dataset

<xarray.Dataset> Size: 129kB
Dimensions:               (idx: 256)
Coordinates:
    shot                  (idx) int64 2kB 1150805012 1150805012 ... 1150805014
    time                  (idx) float64 2kB 0.06 0.08 0.1 0.12 ... 1.7 1.72 1.74
Dimensions without coordinates: idx
Data variables: (12/61)
    a_minor               (idx) float64 2kB 0.221 0.2309 0.224 ... 0.1996 0.1937
---------- 8< ---------- 8< ----------
    zcur                  (idx) float64 2kB 0.0008405 0.00401 ... -0.01287

xr.DataTree

poetry run disruption-py {1150805012..1150805014} -o datatree

<xarray.DataTree>
Group: /
├── Group: /1150805012
│       Dimensions:               (idx: 83)
│       Coordinates:
│           shot                  (idx) int64 664B 1150805012 1150805012 ... 1150805012
│           time                  (idx) float64 664B 0.06 0.08 0.1 ... 1.294 1.295 1.296
│       Dimensions without coordinates: idx
│       Data variables: (12/61)
│           a_minor               (idx) float64 664B 0.221 0.2309 ... 0.1022 0.07641
│ ---------- 8< ---------- 8< ----------
│           zcur                  (idx) float64 664B 0.0008405 0.00401 ... -0.265
├── Group: /1150805013
│       Dimensions:               (idx: 88)
│       Coordinates:
│           shot                  (idx) int64 704B 1150805013 1150805013 ... 1150805013
│           time                  (idx) float64 704B 0.06 0.08 0.1 ... 1.76 1.78 1.8
│       Dimensions without coordinates: idx
│       Data variables: (12/61)
│           a_minor               (idx) float64 704B 0.2254 0.2325 0.2233 ... 0.1884 0.0
│  ---------- 8< ---------- 8< ----------
│           zcur                  (idx) float64 704B 0.001214 0.002722 ... -0.006667
└── Group: /1150805014
        Dimensions:               (idx: 85)
        Coordinates:
            shot                  (idx) int64 680B 1150805014 1150805014 ... 1150805014
            time                  (idx) float64 680B 0.06 0.08 0.1 ... 1.7 1.72 1.74
        Dimensions without coordinates: idx
        Data variables: (12/61)
            a_minor               (idx) float64 680B 0.223 0.2322 ... 0.1996 0.1937
  ---------- 8< ---------- 8< ----------
            zcur                  (idx) float64 680B 0.001747 0.005514 ... -0.01287

pd.DataFrame

poetry run disruption-py {1150805012..1150805014} -o dataframe

           shot  time   a_minor    beta_n  ...   z_error    z_prog  z_times_v_z      zcur
idx                                        ...                                           
0    1150805012  0.06  0.220951 -0.400186  ...  0.000840  0.000000     0.005999  0.000840
1    1150805012  0.08  0.230871 -0.149240  ...  0.004694 -0.000684    -0.053150  0.004010
2    1150805012  0.10  0.223997 -0.038895  ...  0.004983 -0.002052     0.027272  0.002932
3    1150805012  0.12  0.204051  0.030382  ... -0.003536 -0.006589     0.037418 -0.010125
4    1150805012  0.14  0.207518  0.149019  ... -0.001114 -0.007928    -0.054337 -0.009043
..          ...   ...       ...       ...  ...       ...       ...          ...       ...
251  1150805014  1.66  0.211198  0.399452  ...  0.001458 -0.008222     0.038104 -0.006764
252  1150805014  1.68  0.210494  0.411291  ... -0.000450 -0.008000     0.026411 -0.008450
253  1150805014  1.70  0.205653  0.477978  ...  0.003020 -0.007778    -0.033882 -0.004758
254  1150805014  1.72  0.199563  0.547751  ...  0.005765 -0.007556     0.013781 -0.001791
255  1150805014  1.74  0.193681  0.613406  ... -0.005532 -0.007333    -0.098800 -0.012865

[256 rows x 63 columns]
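incidentally, these views are easy to move between: an idx-indexed dataset converts to the usual flat table with plain xarray/pandas calls. a sketch with toy data (not the package's own conversion code):

```python
import xarray as xr

# toy dataset mimicking the idx layout above
ds = xr.Dataset(
    data_vars={"a_minor": ("idx", [0.221, 0.231, 0.224])},
    coords={
        "shot": ("idx", [1150805012] * 3),
        "time": ("idx", [0.06, 0.08, 0.10]),
    },
)

# idx becomes the row index; reset_index() turns it and the coords into columns
df = ds.to_dataframe().reset_index()
print(sorted(df.columns))  # ['a_minor', 'idx', 'shot', 'time']
```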

@gtrevisan gtrevisan marked this pull request as ready for review April 16, 2025 19:10
@AlexSaperstein (Contributor) left a comment

Looks good to me

@zapatace (Contributor)

I have run some A/B tests with the dev/xarray branches. So far everything looks good.

However, I noticed a new behavior: when you rerun get_shots_data() and the output file has not been removed in advance, you get the following error.

File ".../disruption_py/workflow.py", line 178, in get_shots_data
   output_setting.to_disk()
File ".../disruption_py/settings/output_setting.py", line 282, in to_disk
   raise FileExistsError(f"File already exists! {self.path}")

Is this to be expected? In dev the output file is always overwritten.

If that is the new behavior, you could hit this error only after retrieving a big amount of data. So it would be nice to test whether the file exists before running the queries; if not, the overwrite needs to be forced at the end.
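A minimal sketch of the suggested pre-flight check (the helper name, output path, and overwrite flag are hypothetical, not disruption-py API):

```python
from pathlib import Path


def check_output_path(path: str, overwrite: bool = False) -> Path:
    """Fail fast (or clear the way) before any expensive retrieval starts."""
    out = Path(path)
    if out.exists():
        if not overwrite:
            # same failure mode as output_setting.to_disk(), but before the queries
            raise FileExistsError(f"File already exists! {out}")
        out.unlink()  # force the overwrite up front
    return out


# usage: run this before the retrieval, not after
out = check_output_path("/tmp/disruption_py_example.nc", overwrite=True)
print(out.name)  # disruption_py_example.nc
```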

I can't wait to have datasets. Adding profiles or spectral data is going to be great! :-)

@gtrevisan (Member, Author)

that is kind of the intended behavior, so far, because every run should have a different temporary folder. are you running get_shots_data twice from the same script? you can also take control of the output, if you want a dataset, by specifying file1.nc and file2.nc ...

@zapatace (Contributor) left a comment

Let's carry on!

@gtrevisan gtrevisan merged commit 074a9e4 into dev May 5, 2025
12 checks passed
@gtrevisan gtrevisan deleted the xarray branch May 5, 2025 17:55

4 participants