diff --git a/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
index f594e71a..c4d276e5 100644
--- a/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
+++ b/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md
@@ -238,15 +238,25 @@ does this for you already):
 gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
 ```
 
+## Get the recipe
+```bash
+cd ~
+git clone https://github.com/ai-hypercomputer/tpu-recipes.git
+cd tpu-recipes/training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk
+```
+
 ### Run deepseek3-671b Pretraining Workload
 
 The `run_recipe.sh` script contains all the necessary environment variables and
 configurations to launch the deepseek3-671b pretraining workload.
 
-To run the benchmark, first make the script executable and then run it:
+Before execution, edit the script (e.g., with `nano ./run_recipe.sh`) and configure the environment variables to match your environment.
+
+To configure and run the benchmark:
 
 ```bash
 chmod +x run_recipe.sh
+nano ./run_recipe.sh
 ./run_recipe.sh
 ```
 
@@ -282,13 +292,20 @@ Please note that `fsdp_shard_on_exp=true` only works if num of experts is divisi
 ## Monitor the job
 
 To monitor your job's progress, you can use kubectl to check the Jobset status
-and logs:
+and stream logs:
 
 ```bash
 kubectl get jobset -n default ${WORKLOAD_NAME}
-kubectl logs -f -n default jobset/${WORKLOAD_NAME}-0-worker-0
+
+# List pods to find the specific name (e.g., deepseek3-0-0-xxxx)
+kubectl get pods | grep ${WORKLOAD_NAME}
 ```
 
+Then, stream the logs from the running pod (replace `<pod-name>` with the pod name you found):
+```bash
+# For example: kubectl logs -f deepseek3-0-0-xxxx
+kubectl logs -f <pod-name>
+```
 You can also monitor your cluster and TPU usage through the Google Cloud Console.