Releases: GoogleCloudPlatform/cluster-toolkit
v1.84.0
What's Changed
Key New Features 🎉
- Validate disk type in zone by @saara-tyagi27 in #5232
Version Updates ⏫
- Update gke-versioning in gpu_direct.tf by @agrawalkhushi18 in #5284
Bug fixes 🐞
- Update nccl test script to fix enroot directory issue in A3H by @agrawalkhushi18 in #5324
Full Changelog: v1.83.0...v1.84.0
v1.83.0
What's Changed
Key New Features 🎉
- feat(validations): Add early conditional validation by @AdarshK15 in #5160
- A4X Max bare-metal Slurm support by @arpit974 in #5222
- Adding GKE TPU DWS Queued Provisioning support for v6e and 7x by @shubpal07 in #5218
- feat(validations): Add early required validation by @AdarshK15 in #5166
- Module deprecation warning system by @vikramvs-gg in #5229
- A4X-Max Bare Metal GKE toolkit blueprint by @vikramvs-gg in #5211
Breaking Changes 🚨
- Update and pin terraform version to 1.12.2 by @parulbajaj01 in #5216
- Update wait flag and resolve helm_release deadlock error on destruction by @agrawalkhushi18 in #5147
Module Improvements 🔨
- Migrate configure_kueue from gavinbunney to helm by @agrawalkhushi18 in #5129
- Migrate install_gib from kubectl to helm by @agrawalkhushi18 in #5256
Improvements 🛠
- Add reservation name check validator by @saara-tyagi27 in #5185
- Update go files to add timestamps to gcluster logs by @agrawalkhushi18 in #5198
- Pin DCGM version 4.5.1-1 by @saara-tyagi27 in #5197
- Add support for DualStack (IPv4/IPv6) networks by @DomiKoPL in #5206
Bug fixes 🐞
- Update slurm_cluster_name regex by @saara-tyagi27 in #5261
- Fix SELinux issue in hpc-build-slurm-image blueprint by @AdarshK15 in #5266
- Hotfix: update G4 NVIDIA drivers for kernel 6.17 compatibility by @SwarnaBharathiMantena in #5289
- Hardcode zone in a2high PR test to fix test failures by @kadupoornima in #5305
- Modifying prefix_length for PSA to accommodate sufficient IPs for peering by @vikramvs-gg in #5306
- fix: Update a3m and a3u script to resolve slurm nccl test failure by @agrawalkhushi18 in #5308
Full Changelog: v1.82.0...v1.83.0
v1.82.0
What's Changed
Key New Features 🎉
- A4X JBVM by @LAVEEN in #4950
- Introduced a binary ZIP archive to the release assets by @kvenkatachala333 in #5208
Improvements 🛠
- Fix the babysit files limitation with pagination logic by @SwarnaBharathiMantena in #5191
- Adding A4X Base Support to JBVM by @LAVEEN in #4834
Version Updates ⏫
- Update SLURM blueprints to point to the latest slurm-gcp release by @Neelabh94 in #5215
New Contributors
- @spaturi13 made their first contribution in #5184
Full Changelog: v1.81.0...v1.82.0
v1.81.0
What's Changed
Key New Features 🎉
- Switch to using gcsfuse profile feature in aiml gcs-bucket mounts in slurm cluster blueprints by @gargnitingoogle in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5047
- DWS Flex start support in TPU 7x and v6e by @shubpal07 in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5111
Improvements 🛠
- Improved validations enabling early enforcement of numeric boundaries and length constraints within metadata.yaml files across several core and community modules by @AdarshK15 in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5115
- Update Dockerfile and README.md instructions for a3mega nemo framework by @mufaqam-gcl in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5164
- TPU v6e DWS flex integration tests by @shubpal07 in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5135
- chore/allow hyphens in partition_name and slurm_cluster_name, increase max length to 20 for slurm_cluster_name by @rbekhtaoui in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/4316
New Contributors
- @gargnitingoogle made their first contribution in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5047
- @gokamesh made their first contribution in https://github.com/GoogleCloudPlatform/cluster-toolkit/pull/5169
Full Changelog: https://github.com/GoogleCloudPlatform/cluster-toolkit/compare/v1.80.0...v1.81.0
v1.80.0
What's Changed
Module Improvements 🔨
- Compress the H4D blueprint with multivpc and vpc module update by @SwarnaBharathiMantena in #5133
Improvements 🛠
- Adding IPv6 & IDPF support by @LAVEEN in #5066
- R&R Slurm integration by @sarthakag in #5003
Full Changelog: v1.79.0...v1.80.0
v1.79.0
v1.78.0
What's Changed
Breaking Changes 🚨
- Fix private address space for gke-a3-megagpu.yaml by @omartin2010 in #4478
Improvements 🛠
- Add precondition checks to disallow setting conflicting consumption options by @kadupoornima in #5062
Deprecations 💤
- Add deprecation notice for parallelstore module by @parulbajaj01 in #5083
- Deprecate a3u-gcs blueprint as it's no longer maintained by @bytetwin in #4871
Version Updates ⏫
- Add gIB versions v1.1.1 and v1.1.0 for arm64 by @duncanspani in #5090
New Contributors
- @AdarshK15 made their first contribution in #5095
- @duncanspani made their first contribution in #5090
- @siddhartha-quad made their first contribution in #4792
Full Changelog: v1.77.0...v1.78.0
v1.77.0
What's Changed
Key New Features 🎉
- Integrate Kueue support for GKE TPU v6 and v7x blueprints by @agrawalkhushi18 in #5007
- feat: Enable Block topology for A4X by @Neelabh94 in #5021
- Support shared reservations in gke-node-pool module by @SwarnaBharathiMantena in #5040
- Add automated GCP resource cleanup script and Cloud Build pipeline by @simrankaurb in #5039
- Add integration test for A3 high-GPU with spot VMs by @simrankaurb in #4984
- feat: Add community module for executing gcloud commands by @cboneti in #4923
Breaking Changes 🚨
- Graduate network/private-service-access to core modules by @SwarnaBharathiMantena in #5029
Improvements 🛠
- Refactor fio job template with best practices by @parulbajaj01 in #4977
- Enable h4d-vm test to run on Spot VMs by @simrankaurb in #5022
- Adding Robust destroy in cluster toolkit by @shubpal07 in #4866
Bug fixes 🐞
- Adding G4 configuration by @LAVEEN in #5024
- Use ternary operator for anywhere_cache precondition in main.tf by @Neelabh94 in #5033
Full Changelog: v1.76.0...v1.77.0
v1.76.0
What's Changed
Key New Features 🎉
- feat: Add support for Anywhere Cache in cloud-storage-bucket by @Neelabh94 in #4889
- Adding test for A3 UltraGPU JBVMs with Spot VMs by @simrankaurb in #4968
- On Spot A4 by @LAVEEN in #4953
- Enable Spot VM testing for GKE with A3 mega GPUs by @simrankaurb in #4951
- Enable Spot VM testing for a3-megagpu instances by @simrankaurb in #4901
- Add a post-deploy test specific to TPUs by @agrawalkhushi18 in #4969
Breaking Changes 🚨
- Move community/modules/project/service-account module to core modules directory by @SwarnaBharathiMantena in #4958
Module Improvements 🔨
- Make waiting for kueue installation configurable, and wait for kueue in the G4 GKE blueprint by @kadupoornima in #4973
Improvements 🛠
- Update GKE A4X Readme by @parulbajaj01 in #4955
- Add example nccl test script for slurm on gke by @ACW101 in #4960
Deprecations 💤
- Remove all references to ubuntu20.04 by @sarthakag in #4963
Full Changelog: v1.75.1...v1.76.0
v1.75.1
What's Changed
Module Improvements 🔨
- Add exclusion_end_time_behavior and update release channel maintenance window by @SwarnaBharathiMantena in #4990
Full Changelog: v1.75.0...v1.75.1