nc_put_vars_double (or float) fails in parallel #448

@gsjaardema

Environment Information

  • What platform are you using? (please provide specific distribution/version in summary)
    • Linux
    • Windows
    • OSX
    • Other
    • NA
  • 32 and/or 64 bit?
    • 32-bit
    • 64-bit
  • What build system are you using?
    • autotools (configure)
    • cmake
  • Can you provide a sample netCDF file or C code to recreate the issue?
    • Yes (please attach to this issue, thank you!)
    • No
    • Not at this time

Summary of Issue

NOTE: my dvarput.c is modified from 4.5.1-devel as described in #447 -- the early return if nels==0 has been removed.

If nc_put_vars_double is called in parallel with stride != 1, with netcdf-4 (hdf5-based) output in collective mode, and some processors have data to output while others do not, the code hangs: only the processors with data call down into the H5Dwrite function. That function assumes that all processors will call it whether they have data or not, and it uses a PMPI_Allreduce further down the call stack.

The issue arises in NCDEFAULT_put_vars. If the stride is 1, everything works because all processors call NC_put_vars at line 246 of dvarput.c (4.5.1-devel).

However, if the stride is not 1, the code falls through to the odometer code below that. All processors call odom_init, but the body of the while loop is executed only by the processors that have data (some lines deleted below):

  odom_init(&odom,rank,mystart,myedges,mystride);
  while(odom_more(&odom)) {
      int localstatus = NC_NOERR;
      localstatus = NC_put_vara(ncid,varid,odom.index,nc_sizevector1,memptr,memtype);
      memptr += memtypelen;
      odom_next(&odom);
  }

If netcdf-4 (hdf5-based) collective output is being done, the code then hangs below H5Dwrite because the hdf5 library calls PMPI_Allreduce.
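
To make the mismatch concrete, here is a minimal sketch that counts how many times the loop body (and therefore NC_put_vara) would run for a given start/edges/stride. It is a simplified model and not the actual netCDF source: the struct layout, MAX_DIMS, and count_put_calls are hypothetical names introduced for illustration, and the odom_* functions are stand-ins that only mimic the control flow quoted above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIMS 8  /* hypothetical limit, for this sketch only */

/* Simplified stand-in for the odometer used in dvarput.c (assumed layout). */
struct odometer {
    int    rank;
    size_t start[MAX_DIMS];
    size_t stop[MAX_DIMS];
    size_t stride[MAX_DIMS];
    size_t index[MAX_DIMS];
};

static void odom_init(struct odometer *o, int rank, const size_t *start,
                      const size_t *edges, const size_t *stride)
{
    assert(rank >= 1 && rank <= MAX_DIMS);  /* scalars are not modeled */
    o->rank = rank;
    for (int i = 0; i < rank; i++) {
        o->start[i]  = start[i];
        o->stride[i] = stride[i];
        o->stop[i]   = start[i] + edges[i] * stride[i];
        o->index[i]  = start[i];
    }
}

/* True while every index is still inside its range; a zero-length edge
 * in any dimension makes this false immediately after odom_init. */
static int odom_more(const struct odometer *o)
{
    for (int i = 0; i < o->rank; i++) {
        if (o->index[i] >= o->stop[i]) return 0;
    }
    return 1;
}

/* Advance the fastest-varying dimension, carrying into slower ones. */
static void odom_next(struct odometer *o)
{
    for (int i = o->rank - 1; i >= 0; i--) {
        o->index[i] += o->stride[i];
        if (o->index[i] < o->stop[i]) break;
        if (i == 0) break;  /* leave index[0] past stop so the loop ends */
        o->index[i] = o->start[i];
    }
}

/* Number of loop iterations, i.e. the number of NC_put_vara calls
 * (each a collective H5Dwrite in the scenario described above). */
size_t count_put_calls(int rank, const size_t *start,
                       const size_t *edges, const size_t *stride)
{
    struct odometer o;
    size_t calls = 0;
    odom_init(&o, rank, start, edges, stride);
    while (odom_more(&o)) {
        calls++;
        odom_next(&o);
    }
    return calls;
}
```

In this model, a processor with edges = {4} and stride = {2} makes 4 calls, while a processor with edges = {0} makes none. In collective mode every processor has to take part in each write, so any difference in these counts between processors leaves some of them waiting in the PMPI_Allreduce.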

I don't have a suggested fix for this issue. I tried rewriting my code to use nc_put_vara_double instead, but that is not easily done for this particular call.

This does work if I use pnetcdf non-collective output, and probably also with netcdf-4 non-collective output.
