From 5e32768c66b7d0819fb0e3cece91e08e0d879a00 Mon Sep 17 00:00:00 2001
From: sivanishwanthm
Date: Wed, 28 Jan 2026 20:36:41 +0000
Subject: [PATCH 1/4] Updated

---
 .../deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
index f594e71a..74f86358 100644
--- a/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
+++ b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
@@ -238,6 +238,13 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk
+```
+
 ### Run deepseek3-671b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
@@ -247,6 +254,7 @@ To run the benchmark, first make the script executable and then run it:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -286,7 +294,7 @@ and logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+kubectl logs -f ${PDD_NAME}
 ```
 
 You can also monitor your cluster and TPU usage through the Google Cloud

From 74134c050a12e0dce770b3e63313c3f3a971ed01 Mon Sep 17 00:00:00 2001
From: sivanishwanthm
Date: Wed, 28 Jan 2026 22:55:37 +0000
Subject: [PATCH 2/4] Updated Readme

---
 .../4k-bf16-tpu7x-4x4x8/xpk/README.md | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
index 74f86358..71de7cf3 100644
--- a/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
+++ b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
@@ -250,7 +250,9 @@ cd tpu-recipes/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the deepseek3-671b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before execution, use nano to edit the script and configure the environment variables to match your specific environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
@@ -290,13 +292,19 @@ Please note that `fsdp_shard_on_exp=true` only works if num of experts is divisi
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f ${PDD_NAME}
+
+# List pods to find the specific name (e.g., deepseek3-0-0-xxxx)
+kubectl get pods | grep ${WORKLOAD_NAME}
 ```
 
+Then, stream the logs from the running pod (replace with the name you found):
+```bash
+kubectl logs -f
+```
 You can also monitor your cluster and TPU usage through the Google Cloud
 Console.
 

From 532914b314713959c8beafc6d107db059d2e6a21 Mon Sep 17 00:00:00 2001
From: sivanishwanthm
Date: Wed, 28 Jan 2026 23:01:40 +0000
Subject: [PATCH 3/4] Updated Readme

---
 .../ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
index 71de7cf3..3c1ad677 100644
--- a/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
+++ b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
@@ -250,7 +250,7 @@ cd tpu-recipes/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the deepseek3-671b pretraining workload.
 
-Before execution, use nano to edit the script and configure the environment variables to match your specific environment.
+Before execution, use 'nano' to edit the script and configure the environment variables to match your specific environment.
 
 To configure and run the benchmark:
 

From 53d062c6cb7db9b30a7da8629bf77670b39ef388 Mon Sep 17 00:00:00 2001
From: sivanishwanthm
Date: Wed, 28 Jan 2026 23:03:09 +0000
Subject: [PATCH 4/4] Updated Readme

---
 .../ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
index 3c1ad677..c4d276e5 100644
--- a/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
+++ b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
@@ -250,7 +250,7 @@ cd tpu-recipes/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the deepseek3-671b pretraining workload.
 
-Before execution, use nano to edit the script and configure the environment variables to match your specific environment.
+Before execution, use `nano ./run_recipe.sh` to edit the script and configure the environment variables to match your specific environment.
 
 To configure and run the benchmark:
 
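
Note on the monitoring step introduced in patch 2/4: the README change asks the reader to run `kubectl get pods | grep ${WORKLOAD_NAME}` and then copy a pod name into `kubectl logs -f` by hand. A minimal sketch of how that lookup could be scripted is shown below. The helper name `first_pod_for_workload` is hypothetical and not part of the recipe, and the sketch assumes the pods live in the `default` namespace (as the `kubectl get jobset -n default` line in the README suggests) and that the first matching pod is the one whose logs you want.

```shell
# Hypothetical helper (not part of the recipe): read `kubectl get pods --no-headers`
# output on stdin and print the first pod name that contains the workload name.
first_pod_for_workload() {
  # $1: workload name to look for inside the pod NAME column (first field)
  awk -v w="$1" 'index($1, w) { print $1; exit }'
}

# Example usage (kept as comments, since it needs access to a live cluster):
#   POD_NAME=$(kubectl get pods -n default --no-headers | first_pod_for_workload "${WORKLOAD_NAME}")
#   kubectl logs -f -n default "${POD_NAME}"
```

If the workload spans several pods, the first match may not be the worker you care about; in that case listing all matches and choosing one manually, as the README describes, is the safer route.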