[Lustre TPU7X NPI Recipe] Create a public Lustre recipe for TPU7X vLLM GPT-OSS 120B inference workload#160

Open
lepan-google wants to merge 11 commits into AI-Hypercomputer:main from lepan-google:main

Conversation

@lepan-google
Contributor

TESTED=local tests

@lepan-google lepan-google changed the title [GCS TPU7X NPI Recipe] Create a public Lustre recipe for TPU7X vLLM GPT-OSS 120B inference workload [Lustre TPU7X NPI Recipe] Create a public Lustre recipe for TPU7X vLLM GPT-OSS 120B inference workload Feb 18, 2026
@lepan-google lepan-google marked this pull request as ready for review February 18, 2026 23:19
--enable-legacy-lustre-port
```

Note:

Empty note

Contributor Author


Sorry, corrected the note here.

gcloud container clusters update ${CLUSTER_NAME} \
--location ${REGION} \
--project ${PROJECT_ID} \
--enable-legacy-lustre-port

In order to use this option the Lustre instance needs to be created with a certain option selected. Let's make sure the Lustre creation instructions make a note of this.

Contributor Author


Thanks @mkmg. According to the public documentation (https://docs.cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/lustre-csi-driver-new-volume#lustre_communication_ports), the --enable-legacy-lustre-port flag is required if an existing Managed Lustre instance was created with the gke-support-enabled flag, but not vice versa. Is this the option you are referring to?

We can also use the default communication port. But the Access existing Managed Lustre instances page is still relying on this --enable-legacy-lustre-port flag: https://docs.cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/lustre-csi-driver-existing-instance. This flag also seems more compatible across GKE versions.

Do we have any preference here? cc @miroslavln


We should not use the legacy Lustre port anymore, either when creating the Lustre instance or when creating the cluster. This was an issue a few months ago but has since been fixed.

Contributor Author


Thanks @miroslavln!

I updated the instructions to follow the published recipe. The legacy port is now used only for GKE clusters running a version earlier than 1.33.2-gke.4780000, or for an existing Managed Lustre instance that was created with the gke-support-enabled flag.

Could you please take another look? Thank you!


For a cluster with
[Workload Identity Federation](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled
, please follow

This is rendering with a space between "enabled" and ","

Contributor Author


Fixed the format!

Comment on lines 146 to 151
1. Access into the mount point and create the model folder.

2. Under the mount point,
[download](https://huggingface.co/docs/hub/en/models-downloading)
the model using the hf command:
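A sketch of what these two steps might look like, assuming a hypothetical mount point `/mnt/lustre` and folder name `my-model-folder` (both placeholders, not from the recipe); the repo ID is the GPT-OSS 120B weights on Hugging Face:

```shell
# Hypothetical mount point and folder name -- adjust to your setup.
cd /mnt/lustre

# Step 1: create the model folder under the mount point.
mkdir -p my-model-folder
cd my-model-folder

# Step 2: download the model with the Hugging Face CLI into this folder.
hf download openai/gpt-oss-120b --local-dir .
```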


Does the model folder need to have a specific name? What path should users run the hf command from?

Contributor Author


We will leave the naming to the users. Corrected the descriptions of steps 2 and 3.


## Deploy vLLM Workload on GKE

The recipe utilizes 50 nodes, totaling 200 TPUs.

Add a note mentioning that this can be changed (min: single node; max: number of 2x2x1 nodepools in your cluster)

| Variable | Description | Example |
| --------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| `LUSTRE_INSTANCE_NAME` | The name of your Lustre instance. | `my-lustre` |
| `LUSTRE_MODEL_FOLDER_PATH` | The path to the model folder on the Lustre instance. | `my-model-folder` |
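The variables in the table would typically be exported before applying the manifests; a minimal sketch using the table's own example values:

```shell
# Example values copied from the table above; replace with your own.
export LUSTRE_INSTANCE_NAME=my-lustre
export LUSTRE_MODEL_FOLDER_PATH=my-model-folder

echo "Instance: ${LUSTRE_INSTANCE_NAME}, model folder: ${LUSTRE_MODEL_FOLDER_PATH}"
```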

Is this relative to "/"?

Contributor Author


Updated the description here, please take a look!

| Variable | Description | Example |
| --------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| `LUSTRE_INSTANCE_NAME` | The name of your Lustre instance. | `my-lustre` |
| `LUSTRE_MODEL_FOLDER_PATH` | The path to the model folder on the Lustre instance. | `my-model-folder` |
| `LUSTRE_XLA_CACHE_PATH` | The path to the XLA compilation cache folder on the Lustre instance. Specify the folder where you want to store the XLA compilation cache during the first run; subsequent server startups will then read the cache from that location. | `my-xla-cache-folder` |

Same question, is this a path relative to "/"?


@lepan-google lepan-google requested a review from mkmg February 26, 2026 20:35
--release-channel=rapid \
--num-nodes=1 \
--addons LustreCsiDriver,HttpLoadBalancing \
--enable-legacy-lustre-port


Enabling the legacy port is not recommended anymore

Contributor Author


Thanks @miroslavln!

Removed the legacy port flag for newly created clusters.
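A sketch of the resulting create command without the legacy flag (cluster name, region, and project are placeholders; the remaining flags are carried over from the snippet under review):

```shell
# Create the cluster; --enable-legacy-lustre-port is intentionally omitted.
gcloud container clusters create ${CLUSTER_NAME} \
  --location=${REGION} \
  --project=${PROJECT_ID} \
  --release-channel=rapid \
  --num-nodes=1 \
  --addons=LustreCsiDriver,HttpLoadBalancing
```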

Check if the following features are enabled in the cluster, if not use the
following steps to enable the required features.

1. **Enable Workload Identity:** The cluster and the nodepool needs to have


Why do we require workload identity? I don't think Lustre needs it.

Contributor Author


It is for setting up Lustre access on the cluster.

Is the description in the Grant Storage Permission to Kubernetes Service Account section accurate?


+1 While GCSFuse CSI driver uses workload identity, I don't believe Lustre CSI driver does. Unless this is in the compute only recipe we can probably safely remove this.

Contributor Author


Oh I see, I totally forgot that only Lustre operations need IAM permissions: https://docs.cloud.google.com/managed-lustre/docs/access-control

Removed the workload identity sections.
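Per the access-control page linked above, Managed Lustre management operations are gated by IAM rather than workload identity. A sketch of granting such a permission; the member and the role name here are assumptions for illustration, not values from the recipe:

```shell
# Grant a (hypothetical) user permission to manage Managed Lustre
# instances. The role name is an assumption based on the linked
# access-control documentation -- verify it against your project.
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="user:alice@example.com" \
  --role="roles/lustre.admin"
```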

name: vllm-pvc
csi:
driver: lustre.csi.storage.gke.io
volumeHandle: {LUSTRE_PROJECT_ID}/{LUSTRE_LOCATION}/{LUSTRE_INSTANCE_NAME}. # Please replace this with your actual Lustre instance name, location and project ID.

Remove "." after {LUSTRE_INSTANCE_NAME}

Contributor Author


Good catch, done!
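For reference, a minimal sketch of the corrected fragment (the brace-wrapped placeholders are carried over from the snippet under review):

```yaml
name: vllm-pvc
csi:
  driver: lustre.csi.storage.gke.io
  # No trailing period after the instance name.
  volumeHandle: {LUSTRE_PROJECT_ID}/{LUSTRE_LOCATION}/{LUSTRE_INSTANCE_NAME}
```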


### Create new cluster

Note: If a cluster already exists follows the steps in next section

"follows" -> "follow"

Contributor Author


Done!

@lepan-google lepan-google requested a review from mkmg February 27, 2026 19:15