[Lustre TPU7X NPI Recipe] Create a public Lustre recipe for TPU7X vLLM GPT-OSS 120B inference workload #160
lepan-google wants to merge 11 commits into AI-Hypercomputer:main
Conversation
…PT-OSS 120B inference workload TESTED=local tests
| --enable-legacy-lustre-port
| ```
| Note:
Sorry, corrected the note here.
| gcloud container clusters update ${CLUSTER_NAME} \
| --location ${REGION} \
| --project ${PROJECT_ID} \
| --enable-legacy-lustre-port
In order to use this option, the Lustre instance needs to be created with a certain option selected. Let's make sure the Lustre creation instructions make a note of this.
Thanks @mkmg. According to the public documentation (https://docs.cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/lustre-csi-driver-new-volume#lustre_communication_ports), the --enable-legacy-lustre-port flag is required if an existing Managed Lustre instance was created with the gke-support-enabled flag, but not vice versa. Is this the option you are referring to?
We can also use the default communication port. But the Access existing Managed Lustre instances page is still relying on this --enable-legacy-lustre-port flag: https://docs.cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/lustre-csi-driver-existing-instance. This flag also seems more compatible across GKE versions.
Do we have any preference here? cc @miroslavln
We should not use the legacy Lustre port anymore, either when creating the Lustre instance or the cluster. This was an issue a few months ago but has since been fixed.
Thanks @miroslavln!
I updated the instructions to follow the published recipe. The legacy port is now used only for GKE clusters running a version earlier than 1.33.2-gke.4780000, or for an existing Managed Lustre instance that was created with the gke-support-enabled flag.
Could you please help me take a look again, thank you!
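The version gate described in this thread (legacy port only for GKE versions older than 1.33.2-gke.4780000) can be sketched as a small shell check. This is only an illustration: `needs_legacy_port` and the sample version strings are hypothetical, not part of the recipe, and rely on `sort -V` ordering GKE-style version strings.

```shell
# Hypothetical helper: does this GKE version still need --enable-legacy-lustre-port?
needs_legacy_port() {
  local gke_version="$1"                 # e.g. "1.32.4-gke.1000"
  local threshold="1.33.2-gke.4780000"   # first version that no longer needs the flag
  # sort -V orders version strings; if the cluster version sorts strictly
  # below the threshold, the legacy port flag is still required.
  [ "$(printf '%s\n%s\n' "$gke_version" "$threshold" | sort -V | head -n1)" != "$threshold" ]
}

FLAGS=""
if needs_legacy_port "1.32.4-gke.1000"; then
  FLAGS="--enable-legacy-lustre-port"
fi
echo "extra flags: ${FLAGS}"
```

An instance created with the gke-support-enabled flag would still need the legacy port regardless of cluster version, so the version check alone is not sufficient in that case.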
| For a cluster with
| [Workload Identity Federation](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled
| , please follow
This is rendering with a space between "enabled" and ","
Fixed the format!
| 1. Access into the mount point and create the model folder.
| 2. Under the mount point,
| [download](https://huggingface.co/docs/hub/en/models-downloading)
| the model using the hf command:
Does the model folder need to have a specific name? What path should users run the hf command from?
We will leave the naming to the users. Corrected the descriptions of steps 2 and 3.
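A minimal sketch of those two steps, under stated assumptions: the mount point and folder name below are placeholders (the recipe leaves naming to the user), and the `hf` CLI comes from the huggingface_hub package. The actual download is left as a comment since it pulls a very large model.

```shell
# Placeholder values; replace with your Lustre mount point and chosen folder name.
MOUNT_POINT="${MOUNT_POINT:-/tmp/lustre-demo}"
MODEL_FOLDER="gpt-oss-120b"

# Step 1: create the model folder under the mount point.
mkdir -p "${MOUNT_POINT}/${MODEL_FOLDER}"
cd "${MOUNT_POINT}/${MODEL_FOLDER}"

# Step 2: download the model into this folder with the hf CLI
# (requires `pip install -U huggingface_hub`):
# hf download openai/gpt-oss-120b --local-dir .
echo "model folder: ${MOUNT_POINT}/${MODEL_FOLDER}"
```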
| ## Deploy vLLM Workload on GKE
| The recipe utilizes 50 nodes, totaling 200 TPUs.
Add a note mentioning that this can be changed (min: single node; max: number of 2x2x1 nodepools in your cluster)
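The sizing in that note can be sketched as simple arithmetic, assuming each 2x2x1 node exposes 4 TPU chips (an assumption inferred from 50 nodes totaling 200 TPUs; the per-node chip count is not stated in the quoted diff):

```shell
# Assumed topology: each 2x2x1 node has 4 TPU chips.
NODES=50            # adjustable: min 1, max = number of 2x2x1 nodepools in the cluster
CHIPS_PER_NODE=4
TOTAL=$((NODES * CHIPS_PER_NODE))
echo "total TPUs: ${TOTAL}"   # 50 nodes * 4 chips = 200
```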
| | Variable | Description | Example |
| | --------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| | `LUSTRE_INSTANCE_NAME` | The name of your Lustre instance. | `my-lustre` |
| | `LUSTRE_MODEL_FOLDER_PATH` | The path to the model folder on the Lustre instance. | `my-model-folder` |
Updated the description here, please take a look!
| | --------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| | `LUSTRE_INSTANCE_NAME` | The name of your Lustre instance. | `my-lustre` |
| | `LUSTRE_MODEL_FOLDER_PATH` | The path to the model folder on the Lustre instance. | `my-model-folder` |
| | `LUSTRE_XLA_CACHE_PATH` | The path to the XLA compilation cache folder on the Lustre instance. Specify the folder where you want to store the XLA compilation cache during the first run; subsequent server startups will then read the cache from that location. | `my-model-folder` |
Same question, is this a path relative to "/"?
Corrected this path, example: https://paste.googleplex.com/4721596428320768#l=93.
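A hedged sketch of setting these workload variables, with illustrative values only (the variable names come from the table above; the values, and the choice of a cache folder separate from the model folder so the two don't mix, are assumptions, not the recipe's definitive settings):

```shell
# Illustrative values; substitute your own instance name and folder paths.
export LUSTRE_INSTANCE_NAME="my-lustre"
export LUSTRE_MODEL_FOLDER_PATH="my-model-folder"
export LUSTRE_XLA_CACHE_PATH="my-xla-cache"   # assumed: distinct from the model folder

echo "model: ${LUSTRE_MODEL_FOLDER_PATH}  xla cache: ${LUSTRE_XLA_CACHE_PATH}"
```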
| --release-channel=rapid \
| --num-nodes=1 \
| --addons LustreCsiDriver,HttpLoadBalancing \
| --enable-legacy-lustre-port
Enabling the legacy port is not recommended anymore
Thanks @miroslavln!
Removed the legacy port flag for newly created clusters.
| Check if the following features are enabled in the cluster, if not use the
| following steps to enable the required features.
| 1. **Enable Workload Identity:** The cluster and the nodepool needs to have
Why do we require Workload Identity? I don't think Lustre needs it.
It is for the Lustre access setup on the cluster.
Is the description in the Grant Storage Permission to Kubernetes Service Account section accurate?
+1 While GCSFuse CSI driver uses workload identity, I don't believe Lustre CSI driver does. Unless this is in the compute only recipe we can probably safely remove this.
Oh I see, I totally forgot only Lustre operations need IAM permission: https://docs.cloud.google.com/managed-lustre/docs/access-control
Removed the workload identity sections.
| name: vllm-pvc
| csi:
|   driver: lustre.csi.storage.gke.io
|   volumeHandle: {LUSTRE_PROJECT_ID}/{LUSTRE_LOCATION}/{LUSTRE_INSTANCE_NAME}. # Please replace this with your actual Lustre instance name, location and project ID.
Good catch, done!
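The corrected volumeHandle can be sketched as a shell string build with placeholder values (project, location, and instance name are illustrative; the fix is the project/location/instance form with no trailing period):

```shell
# Placeholder values; replace with your actual project ID, location, and instance name.
LUSTRE_PROJECT_ID="my-project"
LUSTRE_LOCATION="us-central1-a"
LUSTRE_INSTANCE_NAME="my-lustre"

# volumeHandle format: <project>/<location>/<instance>, no trailing period.
VOLUME_HANDLE="${LUSTRE_PROJECT_ID}/${LUSTRE_LOCATION}/${LUSTRE_INSTANCE_NAME}"
echo "volumeHandle: ${VOLUME_HANDLE}"
```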
| ### Create new cluster
| Note: If a cluster already exists, follow the steps in the next section