diff --git a/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
index f594e71a..c4d276e5 100644
--- a/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
+++ b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
@@ -238,15 +238,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk
+```
+
 ### Run deepseek3-671b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the deepseek3-671b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -282,13 +292,19 @@ Please note that `fsdp_shard_on_exp=true` only works if num of experts is divisi
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., deepseek3-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8/xpk/README.md b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8/xpk/README.md
index 3780af64..acca9542 100644
--- a/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8/xpk/README.md
+++ b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8/xpk/README.md
@@ -238,15 +238,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x8x8/xpk
+```
+
 ### Run deepseek3-671b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the deepseek3-671b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -275,13 +285,19 @@ are expected to use the defaults within the specified `WORKLOAD_IMAGE`.
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., deepseek3-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/xpk/README.md b/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/xpk/README.md
index a45163ae..08f8f4ba 100644
--- a/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/xpk/README.md
+++ b/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/xpk/README.md
@@ -239,15 +239,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/xpk
+```
+
 ### Run deepseek-v3 Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the deepseek-v3 pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -298,13 +308,19 @@ To realize these gains, the recipe employs a w8a8g8 (8-bit weights, activations
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., deepseek3-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x8x8/xpk/README.md b/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x8x8/xpk/README.md
index ac4295db..dca1d418 100644
--- a/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x8x8/xpk/README.md
+++ b/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x8x8/xpk/README.md
@@ -239,15 +239,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x8x8/xpk
+```
+
 ### Run deepseek3-671b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the deepseek3-671b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -298,13 +308,19 @@ To realize these gains, the recipe employs a w8a8g8 (8-bit weights, activations
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., deepseek3-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/gpt-oss-120b/8k-bf16-tpu7x-4x4x4/xpk/README.md b/training/ironwood/gpt-oss-120b/8k-bf16-tpu7x-4x4x4/xpk/README.md
index 1ebf9262..91596954 100644
--- a/training/ironwood/gpt-oss-120b/8k-bf16-tpu7x-4x4x4/xpk/README.md
+++ b/training/ironwood/gpt-oss-120b/8k-bf16-tpu7x-4x4x4/xpk/README.md
@@ -238,15 +238,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/gpt-oss-120b/8k-bf16-tpu7x-4x4x4/xpk
+```
+
 ### Run gpt-oss-120b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the gpt-oss-120b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -275,13 +285,19 @@ are expected to use the defaults within the specified `WORKLOAD_IMAGE`.
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., ${WORKLOAD_NAME}-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/gpt-oss-120b/8k-bf16-tpu7x-4x8x8/xpk/README.md b/training/ironwood/gpt-oss-120b/8k-bf16-tpu7x-4x8x8/xpk/README.md
index eaeb44fb..f48ea929 100644
--- a/training/ironwood/gpt-oss-120b/8k-bf16-tpu7x-4x8x8/xpk/README.md
+++ b/training/ironwood/gpt-oss-120b/8k-bf16-tpu7x-4x8x8/xpk/README.md
@@ -238,15 +238,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/gpt-oss-120b/8k-bf16-tpu7x-4x8x8/xpk
+```
+
 ### Run gpt-oss-120b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the gpt-oss-120b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -275,13 +285,19 @@ are expected to use the defaults within the specified `WORKLOAD_IMAGE`.
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., ${WORKLOAD_NAME}-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/llama3.1-405b/8k-bf16-tpu7x-4x8x8/README.md b/training/ironwood/llama3.1-405b/8k-bf16-tpu7x-4x8x8/README.md
index 4f37e0f1..c46d86c1 100644
--- a/training/ironwood/llama3.1-405b/8k-bf16-tpu7x-4x8x8/README.md
+++ b/training/ironwood/llama3.1-405b/8k-bf16-tpu7x-4x8x8/README.md
@@ -238,15 +238,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/llama3.1-405b/8k-bf16-tpu7x-4x8x8
+```
+
 ### Run llama3.1-405b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the llama3.1-405b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -275,13 +285,19 @@ are expected to use the defaults within the specified `WORKLOAD_IMAGE`.
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., ${WORKLOAD_NAME}-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/llama3.1-405b/8k-fp8-tpu7x-4x8x8/README.md b/training/ironwood/llama3.1-405b/8k-fp8-tpu7x-4x8x8/README.md
index 669a3739..14d5215a 100644
--- a/training/ironwood/llama3.1-405b/8k-fp8-tpu7x-4x8x8/README.md
+++ b/training/ironwood/llama3.1-405b/8k-fp8-tpu7x-4x8x8/README.md
@@ -239,15 +239,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/llama3.1-405b/8k-fp8-tpu7x-4x8x8
+```
+
 ### Run llama3.1-405b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the llama3.1-405b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -276,13 +286,19 @@ are expected to use the defaults within the specified `WORKLOAD_IMAGE`.
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., ${WORKLOAD_NAME}-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/llama3.1-70b/128k-bf16-tpu7x-4x8x8/xpk/README.md b/training/ironwood/llama3.1-70b/128k-bf16-tpu7x-4x8x8/xpk/README.md
index 9e3edbc8..039e70ba 100644
--- a/training/ironwood/llama3.1-70b/128k-bf16-tpu7x-4x8x8/xpk/README.md
+++ b/training/ironwood/llama3.1-70b/128k-bf16-tpu7x-4x8x8/xpk/README.md
@@ -238,15 +238,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/llama3.1-70b/128k-bf16-tpu7x-4x8x8/xpk
+```
+
 ### Run llama3.1-70b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the llama3.1-70b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -275,13 +285,19 @@ are expected to use the defaults within the specified `WORKLOAD_IMAGE`.
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., ${WORKLOAD_NAME}-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/llama3.1-70b/128k-fp8-tpu7x-4x8x8/README.md b/training/ironwood/llama3.1-70b/128k-fp8-tpu7x-4x8x8/README.md
index 69afb4b6..4a51632c 100644
--- a/training/ironwood/llama3.1-70b/128k-fp8-tpu7x-4x8x8/README.md
+++ b/training/ironwood/llama3.1-70b/128k-fp8-tpu7x-4x8x8/README.md
@@ -237,16 +237,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/llama3.1-70b/128k-fp8-tpu7x-4x8x8
+```
+
 ### Run llama3.1-70b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the llama3.1-70b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
-
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -275,13 +284,19 @@ are expected to use the defaults within the specified `WORKLOAD_IMAGE`.
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., ${WORKLOAD_NAME}-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/llama3.1-70b/8k-bf16-tpu7x-4x4x4/xpk/README.md b/training/ironwood/llama3.1-70b/8k-bf16-tpu7x-4x4x4/xpk/README.md
index 0eb37bc4..5c732607 100644
--- a/training/ironwood/llama3.1-70b/8k-bf16-tpu7x-4x4x4/xpk/README.md
+++ b/training/ironwood/llama3.1-70b/8k-bf16-tpu7x-4x4x4/xpk/README.md
@@ -238,15 +238,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/llama3.1-70b/8k-bf16-tpu7x-4x4x4/xpk
+```
+
 ### Run llama3.1-70b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the llama3.1-70b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -275,13 +285,19 @@ are expected to use the defaults within the specified `WORKLOAD_IMAGE`.
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., ${WORKLOAD_NAME}-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/llama3.1-70b/8k-bf16-tpu7x-4x8x8/xpk/README.md b/training/ironwood/llama3.1-70b/8k-bf16-tpu7x-4x8x8/xpk/README.md
index 88d053a4..cd514c75 100644
--- a/training/ironwood/llama3.1-70b/8k-bf16-tpu7x-4x8x8/xpk/README.md
+++ b/training/ironwood/llama3.1-70b/8k-bf16-tpu7x-4x8x8/xpk/README.md
@@ -238,15 +238,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/llama3.1-70b/8k-bf16-tpu7x-4x8x8/xpk
+```
+
 ### Run llama3.1-70b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the llama3.1-70b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -275,13 +285,19 @@ are expected to use the defaults within the specified `WORKLOAD_IMAGE`.
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., ${WORKLOAD_NAME}-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/llama3.1-70b/8k-fp8-tpu7x-4x4x4/xpk/README.md b/training/ironwood/llama3.1-70b/8k-fp8-tpu7x-4x4x4/xpk/README.md
index a697224f..6039946c 100644
--- a/training/ironwood/llama3.1-70b/8k-fp8-tpu7x-4x4x4/xpk/README.md
+++ b/training/ironwood/llama3.1-70b/8k-fp8-tpu7x-4x4x4/xpk/README.md
@@ -238,15 +238,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/llama3.1-70b/8k-fp8-tpu7x-4x4x4/xpk
+```
+
 ### Run llama3.1-70b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the llama3.1-70b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -275,13 +285,19 @@ are expected to use the defaults within the specified `WORKLOAD_IMAGE`.
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., ${WORKLOAD_NAME}-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/llama3.1-70b/8k-fp8-tpu7x-4x8x8/README.md b/training/ironwood/llama3.1-70b/8k-fp8-tpu7x-4x8x8/README.md
index b2135377..4617adf1 100644
--- a/training/ironwood/llama3.1-70b/8k-fp8-tpu7x-4x8x8/README.md
+++ b/training/ironwood/llama3.1-70b/8k-fp8-tpu7x-4x8x8/README.md
@@ -238,15 +238,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/llama3.1-70b/8k-fp8-tpu7x-4x8x8
+```
+
 ### Run llama3.1-70b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the llama3.1-70b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -275,13 +285,19 @@ are expected to use the defaults within the specified `WORKLOAD_IMAGE`.
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., ${WORKLOAD_NAME}-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/qwen3-235b-a22b/4k-bf16-tpu7x-4x8x8/xpk/README.md b/training/ironwood/qwen3-235b-a22b/4k-bf16-tpu7x-4x8x8/xpk/README.md
index 480e60e9..f12e700b 100644
--- a/training/ironwood/qwen3-235b-a22b/4k-bf16-tpu7x-4x8x8/xpk/README.md
+++ b/training/ironwood/qwen3-235b-a22b/4k-bf16-tpu7x-4x8x8/xpk/README.md
@@ -238,15 +238,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/qwen3-235b-a22b/4k-bf16-tpu7x-4x8x8/xpk
+```
+
 ### Run qwen3-235b-a22b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the qwen3-235b-a22b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -275,13 +285,19 @@ are expected to use the defaults within the specified `WORKLOAD_IMAGE`.
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., ${WORKLOAD_NAME}-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/qwen3-235b-a22b/4k-fp8-tpu7x-4x8x8/README.md b/training/ironwood/qwen3-235b-a22b/4k-fp8-tpu7x-4x8x8/README.md
index dd2e8585..143ebdaf 100644
--- a/training/ironwood/qwen3-235b-a22b/4k-fp8-tpu7x-4x8x8/README.md
+++ b/training/ironwood/qwen3-235b-a22b/4k-fp8-tpu7x-4x8x8/README.md
@@ -240,15 +240,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/qwen3-235b-a22b/4k-fp8-tpu7x-4x8x8
+```
+
 ### Run qwen3-235b-a22b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the qwen3-235b-a22b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -277,13 +287,19 @@ are expected to use the defaults within the specified `WORKLOAD_IMAGE`.
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., ${WORKLOAD_NAME}-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
diff --git a/training/ironwood/wan2.1-14b/bf16-tpu7x-4x4x4/xpk/README.md b/training/ironwood/wan2.1-14b/bf16-tpu7x-4x4x4/xpk/README.md
index 013e0615..98559794 100644
--- a/training/ironwood/wan2.1-14b/bf16-tpu7x-4x4x4/xpk/README.md
+++ b/training/ironwood/wan2.1-14b/bf16-tpu7x-4x4x4/xpk/README.md
@@ -264,14 +264,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/wan2.1-14b/bf16-tpu7x-4x4x4/xpk
+```
+
 ### Run wan Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the wan pretraining workload.
 
-To run the benchmark, simply execute the script:
+Before running it, open `run_recipe.sh` in an editor (for example, `nano ./run_recipe.sh`) and set the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
+chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -291,13 +302,19 @@ Note that any MaxDiffusion configurations not explicitly overridden in
 
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific pod name (e.g., ${WORKLOAD_NAME}-0-0-xxxx)
+kubectl get pods -n default | grep ${WORKLOAD_NAME}
 ```
+Then, stream the logs from the running pod (replace `<pod-name>` with the name you found):
+```bash
+kubectl logs -f -n default <pod-name>
+```
 
 You can also monitor your cluster and TPU usage through the Google Cloud Console.
 
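The pod-discovery step these patches add (list pods, filter by `${WORKLOAD_NAME}`, then stream logs) can also be scripted. A minimal sketch of the filtering logic, using a canned, hypothetical pod listing so it is self-contained; on a live cluster you would pipe `kubectl get pods -n default` instead:

```shell
#!/bin/sh
# Sketch of the pod-discovery step from the monitoring instructions.
# The listing below is canned sample output with hypothetical pod names;
# on a real cluster, replace it with: kubectl get pods -n default
WORKLOAD_NAME=deepseek3

sample_pods='NAME                  READY   STATUS    RESTARTS   AGE
deepseek3-0-0-abcde   1/1     Running   0          5m
other-job-0-0-fghij   1/1     Running   0          9m'

# Keep only rows belonging to this workload and take the first pod name.
POD_NAME=$(printf '%s\n' "$sample_pods" | grep "^${WORKLOAD_NAME}" | awk 'NR==1 {print $1}')
echo "$POD_NAME"

# With a live cluster, you would then stream its logs:
# kubectl logs -f -n default "$POD_NAME"
```

The same two-step pattern (discover, then stream) is what each README change above spells out manually.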