[Lustre TPU7X NPI Recipe] Create a public Lustre recipe for TPU7X vLLM GPT-OSS 120B inference workload #160
lepan-google wants to merge 11 commits into AI-Hypercomputer:main
Conversation
…PT-OSS 120B inference workload TESTED=local tests
| --enable-legacy-lustre-port
| ```
| Note:
Sorry, corrected the note here.
| gcloud container clusters update ${CLUSTER_NAME} \
| --location ${REGION} \
| --project ${PROJECT_ID} \
| --enable-legacy-lustre-port
In order to use this option, the Lustre instance needs to be created with a certain option selected. Let's make sure the Lustre creation instructions make a note of this.
Thanks @mkmg. According to the public documentation (https://docs.cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/lustre-csi-driver-new-volume#lustre_communication_ports), the --enable-legacy-lustre-port flag is required if an existing Managed Lustre instance was created with the gke-support-enabled flag, but not vice versa. Is this the option you are referring to?
We can also use the default communication port. But the Access existing Managed Lustre instances page is still relying on this --enable-legacy-lustre-port flag: https://docs.cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/lustre-csi-driver-existing-instance. This flag also seems more compatible across GKE versions.
Do we have any preference here? cc @miroslavln
We should not use the legacy Lustre port anymore, either when creating the Lustre instance or the cluster. This was an issue a few months ago but has since been fixed.
Thanks @miroslavln!
I updated the instructions to follow the published recipe. The legacy port is now used only for GKE clusters running a version earlier than 1.33.2-gke.4780000, or for an existing Managed Lustre instance that was created with the gke-support-enabled flag.
Could you please help me take a look again, thank you!
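The version gate described in this thread (legacy port only for GKE versions older than 1.33.2-gke.4780000) can be sketched as a small shell check. This is only an illustration: `needs_legacy_port` and the sample version strings are hypothetical, not part of the recipe, and rely on `sort -V` ordering GKE-style version strings.

```shell
# Hypothetical helper: does this GKE version still need --enable-legacy-lustre-port?
needs_legacy_port() {
  local gke_version="$1"                 # e.g. "1.32.4-gke.1000"
  local threshold="1.33.2-gke.4780000"   # first version that no longer needs the flag
  # sort -V orders version strings; if the cluster version sorts strictly
  # below the threshold, the legacy port flag is still required.
  [ "$(printf '%s\n%s\n' "$gke_version" "$threshold" | sort -V | head -n1)" != "$threshold" ]
}

FLAGS=""
if needs_legacy_port "1.32.4-gke.1000"; then
  FLAGS="--enable-legacy-lustre-port"
fi
echo "extra flags: ${FLAGS}"
```

An instance created with the gke-support-enabled flag would still need the legacy port regardless of cluster version, so the version check alone is not sufficient in that case.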
| For a cluster with
| [Workload Identity Federation](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled
| , please follow
This is rendering with a space between "enabled" and ","
Fixed the format!
| 1. Access into the mount point and create the model folder.
| 2. Under the mount point,
| [download](https://huggingface.co/docs/hub/en/models-downloading)
| the model using the hf command:
Does the model folder need to have a specific name? What path should users run the hf command from?
We will leave the naming to the users. Corrected the descriptions of steps 2 and 3.
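A minimal sketch of those two steps, under stated assumptions: the mount point and folder name below are placeholders (the recipe leaves naming to the user), and the `hf` CLI comes from the huggingface_hub package. The actual download is left as a comment since it pulls a very large model.

```shell
# Placeholder values; replace with your Lustre mount point and chosen folder name.
MOUNT_POINT="${MOUNT_POINT:-/tmp/lustre-demo}"
MODEL_FOLDER="gpt-oss-120b"

# Step 1: create the model folder under the mount point.
mkdir -p "${MOUNT_POINT}/${MODEL_FOLDER}"
cd "${MOUNT_POINT}/${MODEL_FOLDER}"

# Step 2: download the model into this folder with the hf CLI
# (requires `pip install -U huggingface_hub`):
# hf download openai/gpt-oss-120b --local-dir .
echo "model folder: ${MOUNT_POINT}/${MODEL_FOLDER}"
```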
| ## Deploy vLLM Workload on GKE
| The recipe utilizes 50 nodes, totaling 200 TPUs.
Add a note mentioning that this can be changed (min: single node; max: number of 2x2x1 nodepools in your cluster)
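The sizing in that note can be sketched as simple arithmetic, assuming each 2x2x1 node exposes 4 TPU chips (an assumption inferred from 50 nodes totaling 200 TPUs; the per-node chip count is not stated in the quoted diff):

```shell
# Assumed topology: each 2x2x1 node has 4 TPU chips.
NODES=50            # adjustable: min 1, max = number of 2x2x1 nodepools in the cluster
CHIPS_PER_NODE=4
TOTAL=$((NODES * CHIPS_PER_NODE))
echo "total TPUs: ${TOTAL}"   # 50 nodes * 4 chips = 200
```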
| | Variable | Description | Example |
| | --------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| | `LUSTRE_INSTANCE_NAME` | The name of your Lustre instance. | `my-lustre` |
| | `LUSTRE_MODEL_FOLDER_PATH` | The path to the model folder on the Lustre instance. | `my-model-folder` |
Updated the description here, please take a look!
| | --------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| | `LUSTRE_INSTANCE_NAME` | The name of your Lustre instance. | `my-lustre` |
| | `LUSTRE_MODEL_FOLDER_PATH` | The path to the model folder on the Lustre instance. | `my-model-folder` |
| | `LUSTRE_XLA_CACHE_PATH` | The path to the XLA compilation cache folder on the Lustre instance. Specify the folder where you want to store the XLA compilation cache during the first run; subsequent server startups will then read the cache from that location. | `my-model-folder` |
Same question, is this a path relative to "/"?
Corrected this path, example: https://paste.googleplex.com/4721596428320768#l=93.
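A hedged sketch of setting these workload variables, with illustrative values only (the variable names come from the table above; the values, and the choice of a cache folder separate from the model folder so the two don't mix, are assumptions, not the recipe's definitive settings):

```shell
# Illustrative values; substitute your own instance name and folder paths.
export LUSTRE_INSTANCE_NAME="my-lustre"
export LUSTRE_MODEL_FOLDER_PATH="my-model-folder"
export LUSTRE_XLA_CACHE_PATH="my-xla-cache"   # assumed: distinct from the model folder

echo "model: ${LUSTRE_MODEL_FOLDER_PATH}  xla cache: ${LUSTRE_XLA_CACHE_PATH}"
```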
| --release-channel=rapid \
| --num-nodes=1 \
| --addons LustreCsiDriver,HttpLoadBalancing \
| --enable-legacy-lustre-port
Enabling the legacy port is not recommended anymore
Thanks @miroslavln!
Removed the legacy port flag for newly created clusters.
| Check if the following features are enabled in the cluster, if not use the
| following steps to enable the required features.
| 1. **Enable Workload Identity:** The cluster and the nodepool needs to have
Why do we require Workload Identity? I don't think Lustre needs it.
It is for the Lustre access setup on the cluster.
Is the description in the Grant Storage Permission to Kubernetes Service Account section accurate?
+1 While GCSFuse CSI driver uses workload identity, I don't believe Lustre CSI driver does. Unless this is in the compute only recipe we can probably safely remove this.
Oh I see, I totally forgot only Lustre operations need IAM permission: https://docs.cloud.google.com/managed-lustre/docs/access-control
Removed the workload identity sections.
| name: vllm-pvc
| csi:
|   driver: lustre.csi.storage.gke.io
|   volumeHandle: {LUSTRE_PROJECT_ID}/{LUSTRE_LOCATION}/{LUSTRE_INSTANCE_NAME}. # Please replace this with your actual Lustre instance name, location and project ID.
Good catch, done!
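The corrected volumeHandle can be sketched as a shell string build with placeholder values (project, location, and instance name are illustrative; the fix is the project/location/instance form with no trailing period):

```shell
# Placeholder values; replace with your actual project ID, location, and instance name.
LUSTRE_PROJECT_ID="my-project"
LUSTRE_LOCATION="us-central1-a"
LUSTRE_INSTANCE_NAME="my-lustre"

# volumeHandle format: <project>/<location>/<instance>, no trailing period.
VOLUME_HANDLE="${LUSTRE_PROJECT_ID}/${LUSTRE_LOCATION}/${LUSTRE_INSTANCE_NAME}"
echo "volumeHandle: ${VOLUME_HANDLE}"
```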
| ### Create new cluster
| Note: If a cluster already exists, follow the steps in the next section