
memory fixes for netCDF-4 code #3330

Merged: WardF merged 22 commits into Unidata:main from edhartnett:ejh_0315_2, Apr 7, 2026

Conversation

@edhartnett
Contributor

@edhartnett edhartnett commented Mar 17, 2026

Fixes #2626

In this PR I fix some reported memory leaks. As with other recent fixes, only a small amount of code is involved.

This is another of the areas where Dennis and I spent plenty of time, but we still missed a few frees.

This PR also includes PR #3329 so that these tests are run with HDF5-2.0.0. There are some memory leaks in HDF5 itself in earlier versions, so memory testing for these bugs is restricted to HDF5-2.0.0 and greater versions.

I also checked for memory leaks under macOS and Windows, and these fixes work there as well, but I do not test those platforms automatically; Ubuntu is easiest, and since these are classic memory leaks (i.e., a malloc without a matching free) they behave the same on every platform. However, measuring memory use is different on each platform, and a test that works on all platforms would contain a large amount of extra code while providing no additional benefit.

@edhartnett edhartnett marked this pull request as ready for review March 17, 2026 15:49
@edhartnett
Contributor Author

The CI is failing due to DAP4 timeouts. I can make that testing more robust in a future PR, but I want to stay focused on memory issues in this PR.

Some of the leaks were in HDF5, and only resolved in the 2.0.0 release, so in this PR I add HDF5-2.0.0 testing to the CI. I also have some leak tests that only run with HDF5 >= 2.0.0.

If there is another way you want to do HDF5-2.0.0 testing, let me know. If it involves Docker, I can take a look at that. HDF5-2.0.0 has been out a very long time, so it's not clear why there is no testing for it in the CI.

@WardF
Member

WardF commented Mar 31, 2026

I'll merge this with 2.0.0; bumping to the 2.1.x bugfix release can wait for a separate issue.

@edhartnett
Contributor Author

@WardF sounds good, let's get this merged and I can then add 2.1.0 testing as well.

@edhartnett edhartnett mentioned this pull request Apr 1, 2026
@WardF
Member

WardF commented Apr 1, 2026

There's a lot going on here, and on a whim, I asked Claude to review this PR and it generated a nice report that pointed out some issues with this PR. It illustrates the issue/challenge with using these tools, since their output needs to be verified. I have no idea which version of Claude (the one who generated these fixes? The one who generated the report I've attached here?) is wrong, but the amount of time it would take to validate this is more than I have at the moment. I think having PRs fix one issue at a time (unless the issues are clearly linked and are solved by a single fix) will help keep this manageable from a time-investment standpoint.

As it stands, I don't know if this PR actually fixes the issues it claims to fix (Claude says #2664, #2666, #2667 are not substantively fixed), and neither #2668 nor #2436 has any code fix. Given the size of what needs to be double-checked manually, this merely shifts my time commitment from 'code review' to 'AI fact checking'. Robbing Peter to pay OpenAI, one supposes XD.

@WardF
Member

WardF commented Apr 1, 2026

Update: It's Markdown so I just went ahead and pasted here.

PR #3330 Review: Memory Fixes for netCDF-4 Code

Prompt summary: Perform a high-quality review of upstream PR #3330 ("memory fixes for netCDF-4 code") and generate a report. The review covers code correctness, completeness of fixes, test quality, CI changes, and platform compatibility.

Intended audience: netCDF-C maintainers and contributors with familiarity with C memory management, HDF5 internals, and the netCDF-4 dispatch architecture.


Table of Contents

  1. PR Overview
  2. Issues Addressed
  3. Files Changed
  4. File-by-File Analysis
  5. Key Findings and Concerns
  6. Platform Compatibility
  7. AI-Generation Assessment
  8. Summary and Recommendation

PR Overview

| Field | Value |
| --- | --- |
| PR Number | #3330 |
| Title | memory fixes for netCDF-4 code |
| Author | edhartnett (Ed Hartnett) |
| State | Open |
| Target branch | main |
| Commits | 21 |
| Files changed | 12 (+617 / -34) |
| Requested reviewers | DennisHeimbigner, WardF |
| Includes | PR #3329 (HDF5 2.0.0 CI infrastructure) |

This PR targets a cluster of memory safety bugs in the netCDF-4 / HDF5 backend. Two source files receive actual defect fixes; the remaining ten files are new tests, CI workflow updates, and build system changes.


Issues Addressed

| Issue | Title | Category | Fix Location |
| --- | --- | --- | --- |
| #2626 | Memory leak with repeated open/close of NetCDF4 files | Memory growth | Identified as HDF5 ≥ 2.0.0 fix + new tests |
| #2664 | SEGV in get_attached_info (hdf5open.c) | SEGV / corrupt file | libhdf5/hdf5open.c |
| #2665 | Memory leak in NC_hashmapnew via nc4_nc4f_list_add | Leak on error path | libsrc4/nc4internal.c |
| #2666 | Heap-buffer-overflow in get_attached_info | Heap overflow | libhdf5/hdf5open.c |
| #2667 | memcpy-param-overlap in get_attached_info | Memory overlap | libhdf5/hdf5open.c |
| #2668 | Heap-buffer-overflow in NC4_get_vars (hdf5var.c) | Heap overflow | Not changed in this PR |
| #2436 | SegV with corrupt netCDF files | SEGV / corrupt file | Not changed in this PR |

Files Changed

| File | Status | +/- | Category |
| --- | --- | --- | --- |
| libhdf5/hdf5open.c | Modified | +9/-2 | Core fix |
| libsrc4/nc4internal.c | Modified | +10/-1 | Core fix |
| nc_test4/tst_mem_safety.c | Added | +318 | New test |
| h5_test/tst_h5_open_close.c | Added | +119 | New test |
| .github/workflows/run_tests_ubuntu.yml | Modified | +25/-9 | CI |
| .github/workflows/run_tests_osx.yml | Modified | +29/-17 | CI |
| .github/workflows/run_tests_win_mingw.yml | Modified | +20/-4 | CI |
| configure.ac | Modified | +19/-0 | Autotools build |
| h5_test/CMakeLists.txt | Modified | +16/-0 | CMake build |
| h5_test/Makefile.am | Modified | +19/-0 | Autotools build |
| nc_test4/CMakeLists.txt | Modified | +17/-0 | CMake build |
| nc_test4/Makefile.am | Modified | +16/-1 | Autotools build |

File-by-File Analysis


libhdf5/hdf5open.c

Context: get_attached_info() is a static helper that allocates per-dimension bookkeeping arrays (dimscale_attached, dimscale_hdf5_objids) sized by the ndims parameter, then iterates over each dimension to collect HDF5 dimension-scale associations.

Change introduced:

```diff
+        /* Guard: ndims parameter must agree with var->ndims. A mismatch
+         * indicates a corrupt file; the loop below would overrun the
+         * allocations if they differed (issues #2664, #2666, #2667). */
+        if (ndims != var->ndims)
+            return NC_EVARMETA;
+
         /* Allocate space to remember whether the dimscale has been … */
         if (!(hdf5_var->dimscale_attached = calloc((size_t)ndims, …)))
             return NC_ENOMEM;
         if (!(hdf5_var->dimscale_hdf5_objids = malloc((size_t)ndims * …)))
             return NC_ENOMEM;

-        for (unsigned int d = 0; d < var->ndims; d++)
+        for (unsigned int d = 0; d < ndims; d++)
```

Call graph analysis: get_attached_info has two call sites:

  1. get_scale_info() → line 1439: get_attached_info(var, hdf5_var, ndims, datasetid) — where ndims was supplied by read_var() and was also used to call nc4_var_list_add(grp, finalname, (int)ndims, &var) earlier in that same frame, making var->ndims == ndims by construction.
  2. nc4_get_var_meta() → line 1513: get_attached_info(var, hdf5_var, var->ndims, …) — explicitly passes var->ndims as the ndims argument.

⚠️ Finding: The guard if (ndims != var->ndims) is unreachable in the current call graph. Both call sites guarantee ndims == var->ndims. The guard is defensive documentation rather than an active fix.

⚠️ Finding: The loop bound change (var->ndims → ndims) is a no-op given the above. Since the two values are always equal, the behavioral change is zero. The change is logically consistent but does not fix any live bug in the current code.

⚠️ Finding: The actual crashes reported in #2664, #2666, #2667 are not fully addressed. All three issue backtraces show the crash occurring inside H5DSget_num_scales() — which is called before the allocation and the loop that the fix targets. A corrupt HDF5 file can cause H5DSget_num_scales to read from a malformed vlen attribute and crash entirely within the HDF5 library. The netCDF-C fix location is downstream of the crash site. These crashes are fundamentally HDF5-level bugs when fed corrupt files and cannot be fully prevented by netCDF-C alone without skipping the H5DSget_num_scales call.

✅ What the change does accomplish: If a future caller were to pass a different ndims value, or if var->ndims were ever modified between read_var and get_attached_info calls, the guard would prevent a buffer overrun. It correctly documents the contract. The loop bound change makes the intent self-consistent.

Pre-existing issue (not introduced by PR): If the second malloc (dimscale_hdf5_objids) fails and returns NC_ENOMEM, the first allocation (dimscale_attached) is leaked. This is present in the original code and not fixed by this PR.


libsrc4/nc4internal.c

Context: nc4_nc4f_list_add() initializes the NC_FILE_INFO_T structure including three NCindex lists (each internally containing an NC_hashmap). If nc4_grp_list_add() fails, the original code returned immediately without freeing any of the previously allocated resources.

Change introduced:

```diff
     if ((retval = nc4_grp_list_add(h5, NULL, NC_GROUP_NAME, &h5->root_grp)))
+    {
+        free(h5->hdr.name);
+        nclistfree(h5->alldims);
+        nclistfree(h5->alltypes);
+        nclistfree(h5->allgroups);
+        free(h5);
+        nc->dispatchdata = NULL;
         return retval;
+    }
```

✅ This fix is correct and complete. It frees all resources allocated in the function's preamble: the strdup'd path, the three NCindex lists (and their internal hashmaps), and the NC_FILE_INFO_T block itself. Setting nc->dispatchdata = NULL prevents a dangling pointer if the caller inspects the field after the error return.

Edge-case analysis:

  • h5->hdr.name could be NULL if strdup(path) failed; free(NULL) is safe per C standard.
  • nclistfree(NULL) is safe in netCDF-C's implementation.
  • All three list allocations happen unconditionally before the guard, so they are always non-NULL or NULL (if nclistnew() returns NULL on OOM), both of which are handled safely.

nc_test4/tst_mem_safety.c

A new 318-line test with three sub-tests:

Test 1: test_open_close_loop() — Regression for #2626

Creates a NetCDF4 file, performs a warmup of 20 open/close cycles, snapshots getrusage()/ru_maxrss, runs 500 more cycles, and asserts RSS growth is below 1 MB. Only active on Linux with HDF5 ≥ 2.0.0.

✅ Methodology is sound for catching monotonic memory growth. The warmup absorbs HDF5's internal one-time initialization allocations.

⚠️ Comment inaccuracy: Both test files state "Linux: getrusage()/ru_maxrss is the current RSS in KB." Per the Linux getrusage(2) man page, ru_maxrss is the maximum (peak, high-water mark) resident set size since process start — not the instantaneous current RSS. The test logic is still valid for leak detection (a leak causes the peak to grow), but the comment should read "peak RSS since process start" to avoid misleading future maintainers.

⚠️ Threshold choice: 1 MB over 500 iterations = 2 KB/cycle tolerance. This is generous for OS page-granularity noise, but whether this is tight enough to catch real-world leaks should be validated against baseline runs.

Test 2: test_coord_var_roundtrip() — Regression baseline for #2664/#2666/#2667

Creates a 2D variable with coordinate variables, reopens, and reads back — a valid-file scenario exercising get_attached_info() on the happy path.

⚠️ This test does not reproduce the reported bugs. Issues #2664, #2666, #2667 were triggered by specifically crafted corrupt POV files. The test confirms that valid files work correctly, but does not confirm that the corrupt-file crashes are fixed.

Test 3: test_get_vars_bounds() — Regression baseline for #2668

Tests nc_get_vars_float with zero count, out-of-bounds start, edge-overflow start+count, strided access, and zero-stride on a valid 2D file.

⚠️ Same limitation: Issue #2668 is a heap-buffer-overflow triggered by a corrupt POV file via NC4_get_vars in hdf5var.c. The test only covers valid-file API boundary conditions. There is no change to hdf5var.c in this PR. The strided verification logic was checked and is arithmetically correct.

⚠️ Copyright year: The file header reads "Copyright 2024" while tst_h5_open_close.c correctly reads "Copyright 2026". This should be corrected to 2026.


h5_test/tst_h5_open_close.c

A new 119-line pure-HDF5 (no netCDF) regression test that exercises H5Fopen/H5Fclose in a tight loop to isolate whether any RSS growth from issue #2626 originates inside HDF5 itself versus in netCDF-C's usage of HDF5.

✅ Good isolation strategy. Separating the HDF5-level test from the netCDF-level test makes the source of any leak immediately actionable.

✅ Correct use of H5F_CLOSE_WEAK — this matches the file-close behavior netCDF-4 uses, ensuring the test accurately mirrors production usage.

⚠️ Same ru_maxrss comment inaccuracy as in tst_mem_safety.c.


CI Workflow Files

.github/workflows/run_tests_ubuntu.yml and run_tests_osx.yml

HDF5 2.0.0 added to the test matrix using CMake build (the standard build system for HDF5 2.x, which dropped the Autotools configure script). Older HDF5 versions continue with the Autotools path.

✅ Correct detection and branching for the HDF5 2.0.0 tarball location (GitHub releases vs. the legacy NCSA FTP server).

✅ Cache key versioned (-v3 suffix) to force a clean build after the new CMake-based HDF5 2.0.0 was added.

✅ Matrix exclusion (exclude: hdf5: 2.0.0 / use_nc4: nc3) avoids pointless CI jobs: there is no reason to run NC3-only tests against HDF5 2.0.0.

ℹ️ Note on HDF5 2.0.0 vs 2.1.x: PR discussion thread acknowledges that HDF5 2.1.0 (a bugfix release) exists and a follow-up PR will add it. This is an acceptable staged approach.

.github/workflows/run_tests_win_mingw.yml

Adds retry logic (3 attempts, 15-second delay) around the pacman -U downgrade step that downloads HDF5 from the MSYS2 repository.

✅ Addresses intermittent network failures that would fail CI builds non-deterministically. The retry pattern is simple and correct.


Build System Files

configure.ac adds two new Autotools conditionals:

  • HDF5_VERSION_GE_200 — detected via AC_COMPILE_IFELSE checking H5_VERS_MAJOR.
  • HOST_IS_LINUX — detected by matching $host_os against linux*.

Both conditionals are needed to gate the RSS-based memory tests appropriately.

CMake equivalent in h5_test/CMakeLists.txt and nc_test4/CMakeLists.txt uses ${HDF5_VERSION} VERSION_GREATER_EQUAL "2.0.0" and CMAKE_SYSTEM_NAME STREQUAL "Linux".

✅ Both build systems are consistent in their gating logic. The binaries are always compiled; only test registration is conditional, so developers on macOS or with older HDF5 can still run the tests manually.


Key Findings and Concerns

🔴 Critical

| # | Finding |
| --- | --- |
| 1 | Issues #2664, #2666, #2667 not substantively fixed. The crashes occur inside H5DSget_num_scales(), before any code changed by this PR. The guard and loop-bound fix are unreachable in the current call graph and do not prevent the actual crash site. These are corrupt-file crashes inside HDF5 itself. |
| 2 | Issue #2668 has no code fix. hdf5var.c is not modified. The test covers valid-file edge cases only, not the corrupt POV file scenario. The PR claims to fix #2668 but the fix is absent. |
| 3 | Issue #2436 has no code fix. No change addresses the corrupt-file SegV reported in #2436. |

🟡 Moderate

| # | Finding |
| --- | --- |
| 4 | The if (ndims != var->ndims) guard is unreachable in the current call graph. It is good defensive code, but the PR description overstates its effectiveness as a fix for the reported crashes. |
| 5 | Test cases for #2664/#2666/#2667/#2668 use valid files and do not reproduce the original corrupt-file scenarios. These are regression baselines, not regression tests for the actual bugs. |

🟢 Minor

| # | Finding |
| --- | --- |
| 6 | The ru_maxrss comment in both new test files incorrectly states "current RSS"; it is the peak (high-water mark) RSS. The test logic remains valid for leak detection. |
| 7 | tst_mem_safety.c has copyright year "2024"; it should be "2026". |
| 8 | Pre-existing leak: a second malloc failure in get_attached_info leaks dimscale_attached. Not introduced by this PR, but not fixed either. |

Platform Compatibility

32-bit

No issues. All arithmetic uses size_t for allocation sizing. The ndims != var->ndims comparison is safe on 32-bit. ru_maxrss is a long, which is 32 bits on 32-bit platforms; RSS values will be correct for any realistic file.

64-bit

No issues. Standard behavior.

macOS

CI correctly expands to HDF5 2.0.0. The RSS tests are built but not registered for CI execution on macOS (HDF5 file-backed mmap pages cause ru_maxrss to fluctuate). Developers can still run them manually. No platform-specific concerns with the core fixes.

Linux

Primary target for memory testing. The RSS approach is sound for detecting heap leaks on Linux. No concerns.

Windows (MSVC / MinGW)

No HAVE_SYS_RESOURCE_H on Windows, so all getrusage-based checks are compiled out. The retry logic for pacman is a clear improvement for CI reliability. No concerns with the core fixes on Windows.

Big-Endian

No endianness-sensitive changes. The RSS measurement and the error-path cleanup are byte-order neutral. No concerns.


AI-Generation Assessment

Estimated likelihood of AI assistance: High for test code and CI changes; Moderate for the core fixes.

Supporting signals:

  • Highly uniform, comprehensive block comments in both new test files with detailed platform rationale paragraphs.
  • All CI changes follow a perfectly consistent structural pattern across three workflow files.
  • Test functions follow a formulaic structure (create → populate → close → reopen → verify → cleanup).
  • The PR fixes multiple loosely-related issues in a single branch, a common pattern in AI-assisted batch-fixing sessions.
  • The retry logic in run_tests_win_mingw.yml is a textbook CI-hardening pattern.

Mitigating signals:

  • Deep issue cross-referencing (matching POV crash reports to specific line numbers) suggests genuine human analysis.
  • The author's acknowledgment of HDF5 2.0.0 leak origins (requiring domain expertise) indicates human review.
  • The discussion thread shows active engagement and technical reasoning by the author.

Human-in-the-loop assessment: The domain knowledge, issue triage, and PR discussion strongly indicate an experienced human author directing the work. AI assistance appears likely for boilerplate test scaffolding and CI YAML, with human authorship for the core analysis and fixes.


Summary and Recommendation

What This PR Does Well

  • ✅ The nc4internal.c fix (issue #2665, memory leaks in NC_hashmapnew) is correct, clean, and complete. It properly cleans up all resources on the error path.
  • ✅ The HDF5 2.0.0 CI infrastructure is well-engineered with proper cache invalidation, matrix exclusions, and consistent changes across both build systems.
  • ✅ The tst_h5_open_close.c isolation test is a valuable diagnostic tool for identifying where open/close growth originates.
  • ✅ The retry logic for Windows CI resolves a real intermittent failure mode.
  • ✅ Both CMake and Autotools build systems are updated consistently.

What Needs Attention Before Merge

  1. Issues #2664 (SEGV in get_attached_info), #2666 (heap-buffer-overflow in get_attached_info), #2667 (memcpy-param-overlap in get_attached_info): The PR should either (a) document that these crashes are fundamentally inside HDF5 and cannot be fully fixed at the netCDF-C level, or (b) add a caller-side check before invoking H5DSget_num_scales to validate the dataset's dimension count. Simply claiming these are fixed overstates what the code change achieves.

  2. Issue #2668 (heap-buffer-overflow in NC4_get_vars, hdf5var.c): Either add the missing fix to hdf5var.c (the heap-buffer-overflow root cause) or remove the issue from the "Fixes" list. The test currently exercises only valid-file scenarios.

  3. Issue #2436 (SegV with corrupt netCDF files): Similarly, remove from "Fixes" if no code change addresses it, or add the corresponding fix.

  4. Comment correction: Change "current RSS" to "peak (high-water mark) RSS since process start" in both new test files.

  5. Copyright year: Update tst_mem_safety.c from "Copyright 2024" to "Copyright 2026".


Generated by GitHub Copilot (claude-sonnet-4.6) on 2026-04-01. Review performed against PR #3330 opened 2026-03-17, last updated 2026-04-01.

@edhartnett
Contributor Author

I have made the changes called for in the code review. I removed the issues about which there is some controversy; I will circle back to them after this is merged.

@edhartnett
Contributor Author

WRT review comments and their length: this was an ambitious PR which attempted to close many issues. Too many, in the opinion of Claude. Basically, all objections were to the "Fixes #NNN" lines in the PR description, not to the code changes in the PR.

Let's see how Claude does reviewing more focused PRs...

This PR is ready to merge.

@WardF
Member

WardF commented Apr 1, 2026

Thanks Ed, taking a look now!

@WardF
Member

WardF commented Apr 1, 2026

There are a couple of things highlighted that I'll take a look at, but nothing particularly alarming or worrying. More a question of using it as a starting point for things I might not have thought of. The report for reference:

Looks good overall, thanks!

@edhartnett
Contributor Author

OK I will take a look tomorrow morning. I welcome the feedback and the improvements it leads to.


@WardF WardF merged commit 2c644e9 into Unidata:main Apr 7, 2026
115 checks passed
@WardF
Member

WardF commented Apr 7, 2026

Thanks!



Development

Successfully merging this pull request may close these issues.

Memory leak with opening/closing of files?

2 participants