[CI] Widespread cudaErrorInvalidDevice failure and Segmentation Faults #7665
Closed
Labels: bug (Something isn't working), ci, dependency-break (Issue is related to an upstream breaking change)
Description
Summary
Widespread CI failures across multiple PRs, test types, CUDA versions, and Python versions. All failures share the same underlying CUDA error: cudaErrorInvalidDevice: invalid device ordinal.
Failing jobs: conda-cpp-tests, conda-notebook-tests, docs-build, conda-python-tests-singlegpu, etc.
Example failures:
- https://github.com/rapidsai/cuml/actions/runs/20931777189/job/60153661146?pr=7661
- https://github.com/rapidsai/cuml/actions/runs/20931777189/job/60153661356?pr=7661
- https://github.com/rapidsai/cuml/actions/runs/20934571592/job/60154568407?pr=7662
- and many others
Common Factors
- GPU: all GPUs
- Error: cudaErrorInvalidDevice: invalid device ordinal
Error Messages
RuntimeError: CUDA error encountered at: file=***: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidDevice:invalid device ordinal
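When triaging many failing job logs, it can help to pull the failing CUDA call and error code out of lines like the one above so failures can be grouped. A minimal sketch, assuming the log lines follow the exact format quoted in this issue; the helper name `parse_cuda_error` and the regular expression are hypothetical and not part of cuML or any RAPIDS tooling:

```python
import re

# Pattern modeled on the single error line quoted in this issue.
# Assumption: other jobs emit the same "call='...', Reason=<code>:<detail>" shape.
CUDA_ERROR_RE = re.compile(
    r"call='(?P<call>[^']+)',\s*Reason=(?P<code>cudaError\w+):\s*(?P<detail>.+)$"
)


def parse_cuda_error(line):
    """Return (call, code, detail) if the line matches, otherwise None."""
    match = CUDA_ERROR_RE.search(line.strip())
    if match is None:
        return None
    return match.group("call"), match.group("code"), match.group("detail")


# The error line from this issue, split only for readability.
sample = (
    "RuntimeError: CUDA error encountered at: file=***: "
    "call='cudaPeekAtLastError()', "
    "Reason=cudaErrorInvalidDevice:invalid device ordinal"
)
parsed = parse_cuda_error(sample)
print(parsed)
```

Running the helper over each line of a downloaded job log and counting the returned codes would give a quick check on whether every failing job reports the same error.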
Root Cause Analysis