@charliermarsh charliermarsh commented Aug 22, 2025

Summary

This initially included NVIDIA_VISIBLE_DEVICES masking, though it's now omitted for simplicity.

Closes #14647.

@charliermarsh charliermarsh requested a review from geofft August 22, 2025 17:27
@charliermarsh charliermarsh added the bug Something isn't working label Aug 22, 2025
@charliermarsh charliermarsh marked this pull request as ready for review August 22, 2025 17:28
@geofft geofft left a comment

This seems fine, but some nitpicks:

  1. I don't actually think there's a point to us parsing NVIDIA_VISIBLE_DEVICES here. This variable is specifically used by nvidia-container-toolkit to determine which devices ought to be exposed inside the container. It doesn't seem to be something that's used in non-container tools at all. I think the reference to it in #14647 was just mentioning the lack of ability to use this as a workaround for being unable to parse multiple lines, but if we handle multiple lines I'm not sure we need the workaround. It doesn't really matter what we do here since the driver version ought to be the same for all lines of output, but if we extend this code to be about compute capabilities etc., I think we should put a tad more thought into whether we want this to be the interface, since I think we would be novel in using this environment variable in a non-container tool (e.g. I don't think that nvidia-variant-provider uses it).
  2. Somewhat weirdly, the parsing for the variable in nvidia-container-toolkit appears to allow all/none/void to be individual elements in the comma-separated list, rather than requiring one of them to be the entire string. For example, NVIDIA_VISIBLE_DEVICES=2,all is accepted and interpreted as all. See https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.17.8/internal/config/image/cuda_image.go#L123-L154, which splits on commas and passes a list to https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.17.8/internal/config/image/devices.go, which loops through the list looking for these special keywords.
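
To illustrate the behavior described in the second point, here is a minimal Python sketch of "split on commas, then scan for special keywords". It is not the toolkit's code: the function name `interpret_visible_devices` is hypothetical, and the rule that the first keyword encountered wins is an assumption. The only case confirmed in the comment above is that `2,all` is interpreted as `all`. Whitespace and empty elements are not modeled.

```python
from typing import List, Union

# Keywords named in the review comment above.
SPECIAL_KEYWORDS = ("all", "none", "void")


def interpret_visible_devices(value: str) -> Union[str, List[str]]:
    """Return a special keyword if any list element is one, else the device list.

    Assumption (not verified against nvidia-container-toolkit): when several
    keywords appear, the first one in the list is returned.
    """
    elements = value.split(",")
    for element in elements:
        if element in SPECIAL_KEYWORDS:
            # A keyword anywhere in the list overrides the explicit device IDs.
            return element
    return elements
```

Under these assumptions, `interpret_visible_devices("2,all")` returns `"all"`, while `interpret_visible_devices("0,1")` returns the plain list of device identifiers.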

@charliermarsh

Okay, sounds good. I removed the NVIDIA_VISIBLE_DEVICES parsing for now.
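
As a rough illustration of the multi-GPU case discussed in this review, the sketch below reads a driver version from multi-line output. It assumes one version string per line, as printed by a query such as `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, and relies on the reviewer's observation that the driver version ought to be the same on every line. The function name is hypothetical, and this is not uv's actual (Rust) implementation.

```python
from typing import Optional


def driver_version_from_smi(stdout: str) -> Optional[str]:
    """Return the first non-empty line of the output, or None if there is none.

    Assumption: each GPU contributes one line holding the same driver
    version, so the first non-empty line is representative.
    """
    for line in stdout.splitlines():
        version = line.strip()
        if version:
            return version
    return None
```

On a two-GPU system whose output repeats the same version on two lines, this returns that version once. Empty output yields `None` so the caller can fall back to another detection method.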

@charliermarsh charliermarsh enabled auto-merge (squash) November 2, 2025 21:07
@charliermarsh charliermarsh merged commit 6da135a into main Nov 2, 2025
99 checks passed
@charliermarsh charliermarsh deleted the charlie/multi branch November 2, 2025 21:21
tmeijn pushed a commit to tmeijn/dotfiles that referenced this pull request Nov 10, 2025
This MR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [astral-sh/uv](https://github.com/astral-sh/uv) | patch | `0.9.7` -> `0.9.8` |

MR created with the help of [el-capitano/tools/renovate-bot](https://gitlab.com/el-capitano/tools/renovate-bot).

**Proposed changes to behavior should be submitted there as MRs.**

---

### Release Notes

<details>
<summary>astral-sh/uv (astral-sh/uv)</summary>

### [`v0.9.8`](https://github.com/astral-sh/uv/blob/HEAD/CHANGELOG.md#098)

[Compare Source](astral-sh/uv@0.9.7...0.9.8)

Released on 2025-11-07.

##### Enhancements

- Accept multiple packages in `uv export` ([#16603](astral-sh/uv#16603))
- Accept multiple packages in `uv sync` ([#16543](astral-sh/uv#16543))
- Add a `uv cache size` command ([#16032](astral-sh/uv#16032))
- Add prerelease guidance for build-system resolution failures ([#16550](astral-sh/uv#16550))
- Allow Python requests to include `+gil` to require a GIL-enabled interpreter ([#16537](astral-sh/uv#16537))
- Avoid pluralizing 'retry' for single value ([#16535](astral-sh/uv#16535))
- Enable first-class dependency exclusions ([#16528](astral-sh/uv#16528))
- Fix inclusive constraints on available package versions in resolver errors ([#16629](astral-sh/uv#16629))
- Improve `uv init` error for invalid directory names ([#16554](astral-sh/uv#16554))
- Show help on `uv build -h` ([#16632](astral-sh/uv#16632))
- Include the Python variant suffix in "Using Python ..." messages ([#16536](astral-sh/uv#16536))
- Log most recently modified file for cache-keys ([#16338](astral-sh/uv#16338))
- Update Docker builds to use nightly Rust toolchain with musl v1.2.5 ([#16584](astral-sh/uv#16584))
- Add GitHub attestations for uv release artifacts ([#11357](astral-sh/uv#11357))

##### Configuration

- Expose `UV_NO_GROUP` as an environment variable ([#16529](astral-sh/uv#16529))
- Add `UV_NO_SOURCES` as an environment variable ([#15883](astral-sh/uv#15883))

##### Bug fixes

- Allow `--check` and `--locked` to be used together in `uv lock` ([#16538](astral-sh/uv#16538))
- Allow for unnormalized names in the METADATA file ([#16547](astral-sh/uv#16547)) ([#16548](astral-sh/uv#16548))
- Fix missing `value_type` for `default-groups` in schema ([#16575](astral-sh/uv#16575))
- Respect multi-GPU outputs in `nvidia-smi` ([#15460](astral-sh/uv#15460))
- Fix DNS lookup errors in Docker containers ([#8450](astral-sh/uv#8450))

##### Documentation

- Fix typo in uv tool list doc ([#16625](astral-sh/uv#16625))
- Note `uv pip list` name normalization in docs ([#13210](astral-sh/uv#13210))

##### Other changes

- Update Rust toolchain to 1.91 and MSRV to 1.89 ([#16531](astral-sh/uv#16531))

</details>

---

This MR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
Linked issue: `--torch-backend=auto` fails on systems with multiple GPUs and without `/proc/driver/nvidia/version`