[CI] Widespread cudaErrorInvalidDevice failure and Segmentation Faults #7665

@csadorf

Summary

CI jobs are failing across multiple PRs, test types, CUDA versions, and Python versions. All failures share the same underlying CUDA error, cudaErrorInvalidDevice (invalid device ordinal).

Failing jobs: conda-cpp-tests, conda-notebook-tests, docs-build, conda-python-tests-singlegpu, etc.

Example failures:

Common Factors

  • GPU: all GPUs
  • Error: cudaErrorInvalidDevice: invalid device ordinal

Error Messages

RuntimeError: CUDA error encountered at: file=***: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidDevice:invalid device ordinal
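
For triaging logs from the affected jobs, a small parser for the error line above may help confirm that a given failure carries the same signature. This is a minimal sketch, not part of the project: the regular expression is modeled only on the single message quoted above (the real format may vary), and `parse_cuda_error` and `count_invalid_device_errors` are hypothetical helper names.

```python
import re

# Pattern modeled on the one error line quoted in this issue (assumption:
# other jobs emit the same "file=...: call='...', Reason=NAME:message" shape).
CUDA_ERROR_RE = re.compile(
    r"CUDA error encountered at: file=(?P<file>.*?): "
    r"call='(?P<call>[^']*)', "
    r"Reason=(?P<name>\w+):\s*(?P<message>.*)"
)


def parse_cuda_error(line):
    """Return the fields of a CUDA error line as a dict, or None if no match."""
    match = CUDA_ERROR_RE.search(line)
    return match.groupdict() if match else None


def count_invalid_device_errors(log_lines):
    """Count log lines whose CUDA error name is cudaErrorInvalidDevice."""
    return sum(
        1
        for line in log_lines
        if (parsed := parse_cuda_error(line))
        and parsed["name"] == "cudaErrorInvalidDevice"
    )
```

Running `count_invalid_device_errors` over the lines of a downloaded job log gives a quick check of whether that job belongs to this issue or failed for an unrelated reason.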

Root Cause Analysis

Metadata

Labels

  • bug: Something isn't working
  • ci
  • dependency-break: Issue is related to an upstream breaking change.
