Use xarray internally #269

@amdecker

Xarray Notes

  • Update documentation to refer to xarray instead of Pandas
  • Add xarray formats for cache & output

Switching from pandas to xarray will require many small changes throughout the code. The table below lists the pandas methods we currently use, the files that use them, and the equivalent xarray method.

| Pandas method | Files used | xarray equivalent/notes |
| --- | --- | --- |
| `to_csv` | output_setting.py, pytest_helper.py | N/A, but can export to pandas first, e.g. `ds.to_dataframe().to_csv(...)` |
| `read_csv` | cmod physics.py, shotlist_setting.py, output_setting.py | N/A, but can read with pandas and then convert, e.g. `df.to_xarray()` |
| `read_sql_query` | sql.py | N/A, but we can use pyodbc (which we already have wrappers for in DisruptionPy) |
| `concat` | all over | https://docs.xarray.dev/en/stable/generated/xarray.concat.html |
| pandas has interpolation, but we are using scipy's interpolate | lots of physics methods use our custom interpolation | https://docs.xarray.dev/en/stable/generated/xarray.DataArray.interp.html |
| `merge_asof(on="time", direction="nearest", ...)` | retrieval_manager.py and eval_against_sql.py | Something like `y.reindex(time=x.time, method="nearest", ...)` (see the reindex docs) |
| `df.columns` | caching.py, and many other places | `ds.data_vars`, unless we need one of the coordinates, like time, in which case we would use `ds.coords` |
| `df["col"].values` | all over | Same! `ds["data_var"].values` |
| `np.isnan` | runner.py and more | `np.isnan(ds["data_var"].values)` or `ds["data_var"].isnull()` |
| Assignment of a column, e.g. `method_dict[parameter] = np.full(len(pre_filled_shot_data), np.nan)` | runner.py and more | Either set `ds[parameter].values`, or assign with a dimension, e.g. `ds[parameter] = ('time', new_values)` |
| `len(df)` | runner.py and more | pandas gives the number of rows, but `len(ds)` gives the number of data variables. Take the length of a specific DataArray instead, like `len(ds.shot_id)` or `len(ds.time)` |
| `df.drop(column)` | retrieval_manager.py and more | Nearly the same: `ds.drop_vars(column)` (`ds.drop(column)` is deprecated in recent xarray versions) |
| `df.to_string()` | data_difference.py | We could use `str(ds)`, or if we want to keep things in a nice table format, just convert to pandas |
| `df.iloc` | data_difference.py | `ds.parameter.loc` for label-based selection, or to select by positional index, e.g. `ds.isel(time=[0, 2, 4, 6])` |
| `pd.testing.assert_frame_equal` | data_difference.py | `xr.testing.assert_equal` |
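
To make a few of these rows concrete, here is a minimal sketch, assuming a Dataset with a single `time` dimension. The variable names (`ip`, `te`, `ne`) and all values are made up for illustration and are not taken from DisruptionPy; the real data layout may well differ (for example, with a `shot_id` dimension).

```python
import numpy as np
import xarray as xr

# Hypothetical shot data: two parameters sampled on a shared time base.
time = np.array([0.0, 0.1, 0.2, 0.3])
ds = xr.Dataset(
    data_vars={
        "ip": ("time", np.array([1.0, np.nan, 3.0, 4.0])),
        "te": ("time", np.array([10.0, 20.0, 30.0, 40.0])),
    },
    coords={"time": time},
)

# df.columns -> ds.data_vars (time lives in ds.coords instead)
data_var_names = list(ds.data_vars)
coord_names = list(ds.coords)

# len(df) -> length of one coordinate; len(ds) counts data variables
n_rows = len(ds.time)  # 4
n_vars = len(ds)       # 2

# np.isnan(df["col"]) -> isnull() on the DataArray
nan_mask = ds["ip"].isnull().values

# Column assignment -> (dimension name, values) tuple
ds["ne"] = ("time", np.full(len(ds.time), np.nan))

# df.drop(column) -> drop_vars
ds_small = ds.drop_vars("ne")

# df.iloc -> isel (positional selection along a dimension)
every_other = ds_small.isel(time=[0, 2])

# merge_asof(on="time", direction="nearest") -> reindex(method="nearest")
target_time = np.array([0.04, 0.26])
nearest = ds_small.reindex(time=target_time, method="nearest")

# to_csv / to_string -> go through pandas
df = ds_small.to_dataframe()
csv_text = df.to_csv()

# read_csv -> read with pandas, then convert back to xarray
round_trip = df.to_xarray()

# pd.testing.assert_frame_equal -> xr.testing.assert_equal
xr.testing.assert_equal(ds_small, round_trip)
```

Note that `reindex(..., method="nearest")` replaces the time coordinate with the target times, so the result is aligned to `target_time` rather than the original time base. Whether that matches what `merge_asof` currently does for us (tolerances, duplicate times) would still need checking.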

Labels: enhancement (improvements or proposed new features)