diff --git a/training/ironwood/README.md b/training/ironwood/README.md index d256022e..f857ab77 100644 --- a/training/ironwood/README.md +++ b/training/ironwood/README.md @@ -5,7 +5,7 @@ The training recipes contained in this folder are optimized for Ironwood TPU. He |
Model ID
| Number of chips | GBS | Sequence length | Precision | Step time (seconds) | TFLOPs/sec/chip | Tokens/sec/chip | |-----------------|--------------------|--------------|--------------------------|--------------------|-------------|--------------|-----------------------| | deepseek-v3 | 128 | 2048 | 4096 | bf16 | 27.02 | 607.53 | 2,425.75 | -| deepseek-v3 | 128 | 2048 | 4096 | fp8_full | 22.83 | 718.57 | 2,869.59 | +| deepseek-v3 | 128 | 2048 | 4096 | fp8_full | 22.47 | 730.60 | 2,917.15 | | deepseek-v3 | 256 | 4096 | 4096 | bf16 | 26.79 | 612.66 | 2,446.25 | | deepseek-v3 | 256 | 4096 | 4096 | fp8_full | 22.08 | 743.46 | 2,968.49 | | gpt-oss-120b | 64 | 1280 | 8192 | bf16 | 17.10 | 330.25 | 9,581.66 | diff --git a/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/k8s/README.md b/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/k8s/README.md new file mode 100644 index 00000000..41e96891 --- /dev/null +++ b/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/k8s/README.md @@ -0,0 +1,141 @@ +# Pretrain deepseek-v3 workload on Ironwood GKE clusters with Kubernetes JobSet + +This recipe outlines the steps for running a deepseek-v3 +[MaxText](https://github.com/AI-Hypercomputer/maxtext) pretraining workload on +[Ironwood GKE clusters](https://cloud.google.com/kubernetes-engine) +by applying a Kubernetes manifest to deploy a JobSet resource. + +## Workload Details + +This workload is configured with the following details: + +- Sequence Length: 4096 +- Precision: fp8 +- Chips: 128 (4x4x8 topology) + +## Prerequisites + +This recipe assumes the following prerequisites are met: + +- **GKE Cluster:** A GKE cluster with + [JobSet](https://jobset.sigs.k8s.io/docs/installation/) installed and + running. +- **Container Image:** A pre-built container image (such as + `gcr.io/my-project/my-maxtext-runner:latest`) containing the MaxText + workload, accessible by the GKE cluster. 
+- **Tools:** `gcloud`, `kubectl`, `gke-gcloud-auth-plugin`, and `envsubst` + installed on your workstation. If `envsubst` is missing, install it with + `sudo apt-get update && sudo apt-get install -y gettext-base`. +- **Permissions:** You have permissions to run `kubectl apply` on the target + cluster and the cluster has permissions to pull the container image. + +## Orchestration and deployment tools + +For this recipe, the following setup is used: + +- **Orchestration** - + [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) +- **Pretraining job configuration and deployment** - A Kubernetes manifest + (`k8s_manifest.yaml`) is used to define and deploy the + [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) + resource, which manages the execution of the MaxText pretraining workload. + +## Training dataset + +This recipe uses a mock pretraining dataset provided by the MaxText framework. + +## Run the recipe + +This recipe uses a Kubernetes manifest (`k8s_manifest.yaml`) to deploy the +workload. The following commands will set the required environment variables, +substitute them into `k8s_manifest.yaml`, and apply the resulting configuration +to your cluster. + +### 1. Configure Environment Variables + +Open a terminal and set the following environment variables to match your setup. + +**Note:** + +- `k8s_manifest.yaml` is in the same directory as this README. + +```bash +# Set variables for your environment +export PROJECT_ID="" # Your GCP project name +export CLUSTER_NAME="" # The name of your GKE cluster +export ZONE="" # The zone of your GKE cluster +export BASE_OUTPUT_DIR="" # e.g., "gs://your-bucket-name/my-base-output-dir" +export WORKLOAD_IMAGE="" # e.g., "gcr.io/my-project/my-maxtext-runner:latest" + +# Set workload name (or modify as needed; make sure it's unique in the cluster) +export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp")-$(date +%Y%m%d-%H%M)" +``` + +### 2. 
Run deepseek-v3 Pretraining Workload + +Once the environment variables are set, run the following commands to fetch +cluster credentials and deploy the JobSet: + +```bash +# Fetch cluster credentials +gcloud container clusters get-credentials ${CLUSTER_NAME} --zone ${ZONE} --project ${PROJECT_ID} + +# Apply the manifest +envsubst '${BASE_OUTPUT_DIR} ${WORKLOAD_NAME} ${WORKLOAD_IMAGE}' < k8s_manifest.yaml | kubectl apply -n default -f - +``` + +## Monitor the job + +To monitor your job's progress, you can use `kubectl` to check the JobSet status +and logs: + +```bash +# Check JobSet status +kubectl get jobset -n default ${WORKLOAD_NAME} + +# Get the name of the first pod in the JobSet +POD_NAME=$(kubectl get pods -l jobset.sigs.k8s.io/jobset-name=${WORKLOAD_NAME} -n default -o jsonpath='{.items[0].metadata.name}') + +# Follow the logs of that pod +kubectl logs -f -n default ${POD_NAME} +``` + +You can also monitor your cluster and TPU usage through the Google Cloud +Console: +`https://console.cloud.google.com/kubernetes/workload/overview?project={PROJECT_ID}` + +## Delete resources + +### Delete a specific workload + +To delete the JobSet created by this recipe, run: + +```bash +kubectl delete jobset ${WORKLOAD_NAME} -n default +``` + +## Check results + +After the job completes, you can check the results by: + +- Accessing output logs from your job using `kubectl logs`. +- Checking any data stored in the Google Cloud Storage bucket specified by the + `BASE_OUTPUT_DIR` environment variable you set in step 1. +- Reviewing metrics in Cloud Monitoring, if configured. + +## Next steps: deeper exploration and customization + +This recipe is designed to provide a simple, reproducible "0-to-1" experience +for running a MaxText pretraining workload. Its primary purpose is to help you +verify your environment and achieve a first success with TPUs quickly and +reliably. 
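The `WORKLOAD_NAME` scheme from step 1 can be sanity-checked locally before deploying. This sketch uses `some_user` as a stand-in for `${USER}`; the `printf "%.26s"` format caps the prefix at 26 characters so the final name, with its timestamp suffix, stays well under the 63-character limit for Kubernetes object names (underscores are replaced because they are not valid in those names):

```bash
# Stand-in value; the recipe uses ${USER} here.
user="some_user"

# Kubernetes object names may not contain underscores, so map them to hyphens.
safe_user=$(printf '%s' "$user" | tr '_' '-')

# Cap the prefix at 26 characters, exactly as step 1 does.
prefix=$(printf '%.26s' "${safe_user}-deepseekv3-671b-4096-fsdp")

# Append the timestamp suffix to form the full workload name.
name="${prefix}-$(date +%Y%m%d-%H%M)"

echo "prefix length: ${#prefix}"
echo "name: ${name}"
```

Running this prints a prefix length of at most 26, confirming the truncation behaves as intended regardless of how long the username is.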
+ +For deeper exploration, including customizing model configurations, tuning +performance with different XLA flags, and running custom experiments, we +recommend using the benchmark_runner.py script directly from the MaxText +repository. This script offers the full range of MaxText's flexibility and is +the ideal tool for power users and researchers who want to move beyond the +initial benchmark and tailor the workload to their specific needs. To learn +more, see the +[MaxText Benchmark Runner Guide](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/Getting_Started_Benchmarking.md) +on using benchmark_runner.py for advanced benchmarking. \ No newline at end of file diff --git a/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/k8s/k8s_manifest.yaml b/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/k8s/k8s_manifest.yaml new file mode 100644 index 00000000..8c805fe5 --- /dev/null +++ b/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/k8s/k8s_manifest.yaml @@ -0,0 +1,75 @@ +apiVersion: jobset.x-k8s.io/v1alpha2 +kind: JobSet +metadata: + name: ${WORKLOAD_NAME} + annotations: + alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool # 1:1 job replica to node pool assignment +spec: + ttlSecondsAfterFinished: 43200 + failurePolicy: + rules: + - action: FailJobSet + onJobFailureReasons: + - PodFailurePolicy + maxRestarts: 0 + replicatedJobs: + - name: slice-job + replicas: 1 + template: + spec: + parallelism: 32 # Equal to the number of VMs per slice (or sub-slice). + completions: 32 # Same as the above. 
+ backoffLimit: 0 # When any pod fails, the job is failed + podFailurePolicy: + rules: + - action: FailJob + onPodConditions: [] + onExitCodes: + containerName: jax-tpu + operator: NotIn + values: [42,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255] + template: + spec: + restartPolicy: Never + nodeSelector: + cloud.google.com/gke-tpu-accelerator: tpu7x + cloud.google.com/gke-tpu-topology: 4x4x8 + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + containers: + - name: jax-tpu + image: ${WORKLOAD_IMAGE} + ports: + - containerPort: 8471 + - containerPort: 8080 + securityContext: + privileged: true + command: + - bash + - -c + - | + echo XPK Start: $(date); + _sigterm() (kill -SIGTERM $! 
2>/dev/null;); + trap _sigterm SIGTERM; + (export TPU_STDERR_LOG_LEVEL=0 && export TPU_MIN_LOG_LEVEL=0 && export TF_CPP_MIN_LOG_LEVEL=0 && export TPU_VMODULE=real_program_continuator=1 && set -e && export ENABLE_PATHWAYS_PERSISTENCE='1' && export LIBTPU_INIT_ARGS='--xla_tpu_scoped_vmem_limit_kib=65536 --xla_tpu_bf16_emission_mode=NATIVE_EMISSION --xla_tpu_enable_sparse_core_reduce_scatter_v2=true --xla_tpu_enable_sparse_core_collective_offload_all_gather=true --xla_tpu_enable_sparse_core_collective_offload_2d_all_gather=true --xla_tpu_enable_all_gather_offload_tracing=true --xla_tpu_use_tc_device_shape_on_sc=True --xla_sc_disable_megacore_partitioning=True --xla_tpu_enable_async_collective_fusion_fuse_all_gather=false --xla_enable_async_all_gather=true --xla_tpu_prefer_async_allgather_to_allreduce=true --xla_tpu_enable_sparse_core_collective_offload_all_reduce=true --xla_tpu_enable_sparse_core_collective_offload_reduce_scatter=true --xla_tpu_enable_sparse_core_collective_offload_3d_all_gather=true --xla_tpu_use_single_sparse_core_for_all_gather_offload=true --xla_tpu_enable_concurrent_sparse_core_offloading=true --xla_tpu_aggressive_opt_barrier_removal=true --xla_tpu_enable_offloading_gather_to_sparsecore=true --xla_tpu_accumulate_into_mrb=true --xla_tpu_mosaic_fusion=false --xla_tpu_pcie_bandwidth_multiplier=0.03 --xla_tpu_enable_layer_scheduler_for_dependent_collectives=true --xla_tpu_enable_sparse_core_collective_aggregator=true --xla_tpu_enable_latency_hiding_layer_scheduler=true --xla_tpu_enable_multi_compute_overlap_in_layer_scheduler=false --xla_tpu_enable_sparse_core_offload_queuing_in_lhs=true --xla_tpu_sparse_core_all_reduce_offload_min_size_in_bytes=204800 --xla_tpu_enable_sparse_core_collective_offload_nd_reduce_scatter=true --xla_tpu_enable_3d_reduce_scatter_decomposer=false --xla_max_concurrent_async_all_gathers=1 --xla_tpu_scheduler_percent_shared_memory_limit=140 --xla_tpu_enable_collective_pipeliner=true 
--xla_tpu_enable_tree_use_collective_pipeliner=true --xla_latency_hiding_scheduler_rerun=0 --xla_tpu_host_transfer_overlap_limit=1 ' && export JAX_PLATFORMS='tpu,cpu' && export ENABLE_PJRT_COMPATIBILITY='true' && python3 -m MaxText.train MaxText/configs/base.yml model_name=deepseek3-671b per_device_batch_size=8.0 max_target_length=4096 dcn_pipeline_parallelism=1 dcn_data_parallelism=-1 ici_pipeline_parallelism=1 ici_fsdp_transpose_parallelism=1 ici_fsdp_parallelism=-1 allow_split_physical_axes=True use_iota_embed=True remat_policy=custom decoder_layer_input=offload opt_type=adamw mu_dtype=bfloat16 grad_dtype=bfloat16 megablox=True sparse_matmul=True use_custom_sort_vjp=True fsdp_shard_on_exp=True sa_use_fused_bwd_kernel=True sa_block_q=2048 sa_block_kv=2048 sa_block_q_dkv=2048 sa_block_kv_dkv=2048 sa_block_kv_dkv_compute=2048 sa_block_kv_dq=2048 sa_block_q_dq=2048 attention=flash use_tokamax_splash=True use_max_logit_estimate=-1 cost_estimate_flops_fwd=5000000000000 cost_estimate_flops_bwd=5000000000000 float32_weight_sum=False use_tokamax_gmm=True tokenizer_path=assets/tokenizer.mistral-v3 dataset_type=synthetic dataset_path=gs://max-datasets-rogue use_qwix_quantization=True quantization=fp8_full wi_tile_fwd_batch_seq=128 wi_tile_fwd_embed_dim=7168 wi_tile_fwd_mlp_dim=2048 wi_tile_dlhs_batch_seq=256 wi_tile_dlhs_embed_dim=2048 wi_tile_dlhs_mlp_dim=3584 wi_tile_drhs_batch_seq=256 wi_tile_drhs_embed_dim=1792 wi_tile_drhs_mlp_dim=2048 wo_tile_fwd_batch_seq=256 wo_tile_fwd_embed_dim=2048 wo_tile_fwd_mlp_dim=3584 wo_tile_dlhs_batch_seq=256 wo_tile_dlhs_embed_dim=7168 wo_tile_dlhs_mlp_dim=1024 wo_tile_drhs_batch_seq=256 wo_tile_drhs_embed_dim=2048 wo_tile_drhs_mlp_dim=1792 weight_quantization_calibration_method=fixed,-224,224 act_quantization_calibration_method=fixed,-224,224 enable_checkpointing=False steps=30 base_output_directory=${BASE_OUTPUT_DIR} run_name=${WORKLOAD_NAME}) & PID=$!; + while kill -0 $PID 2>/dev/null; + do sleep 5; + done; + wait $PID; + 
EXIT_CODE=$?; + echo XPK End: $(date); + echo EXIT_CODE=$EXIT_CODE; + exit $EXIT_CODE + resources: + limits: + google.com/tpu: 4 + volumeMounts: + - mountPath: /dev/shm + name: dshm-2 + tolerations: + - operator: "Exists" + key: google.com/tpu + volumes: + - emptyDir: + medium: Memory + name: dshm-2 diff --git a/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/README.md b/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/xpk/README.md similarity index 88% rename from training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/README.md rename to training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/xpk/README.md index 3881f11c..a45163ae 100644 --- a/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/README.md +++ b/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/xpk/README.md @@ -1,6 +1,6 @@ -# Pretrain deepseek3-671b workload on Ironwood GKE clusters with XPK +# Pretrain deepseek-v3 workload on Ironwood GKE clusters with XPK -This recipe outlines the steps for running a deepseek3-671b +This recipe outlines the steps for running a deepseek-v3 [MaxText](https://github.com/AI-Hypercomputer/maxtext) pretraining workload on [Ironwood GKE clusters](https://cloud.google.com/kubernetes-engine) by using [XPK](https://github.com/AI-Hypercomputer/xpk). @@ -35,13 +35,14 @@ To run this recipe, you need the following: in the [Install XPK and dependencies](#install-xpk-and-dependencies) section to install Docker. - **Python 3.11 Virtual Environment:** A Python - 3.11 virtual environment is required. Instructions for - setting this up are also in the + 3.11 virtual environment is required. Instructions + for setting this up are also in the [Install XPK and dependencies](#install-xpk-and-dependencies) section. - **XPK and Dependencies:** Follow the steps in the [Install XPK and dependencies](#install-xpk-and-dependencies) section to install XPK, `kubectl`, `kubectl-kueue`, and `kubectl-kjob`. 
+ ## Install XPK and dependencies ### XPK and Dependency Installation @@ -57,11 +58,11 @@ curl -LsSf https://astral.sh/uv/install.sh -o install-uv.sh chmod +x install-uv.sh ./install-uv.sh rm install-uv.sh -source ~/.local/bin/env +source ${HOME}/.local/bin/env # Set up and Activate Python 3.11 virtual environment -uv venv --seed ~/.local/bin/venv --python 3.11 --clear -source ~/.local/bin/venv/bin/activate +uv venv --seed ${HOME}/.local/bin/venv --python 3.11 --clear +source ${HOME}/.local/bin/venv/bin/activate pip install --upgrade pip ``` @@ -81,9 +82,9 @@ Install XPK and necessary tools: pip install xpk==0.16.1 # Install xpk pre-reqs kubectl-kueue and kjob (if you installed xpk via pip) -curl -LsSf https://raw.githubusercontent.com/AI-Hypercomputer/xpk/refs/tags/v0.16.0/tools/install-xpk.sh -o install-xpk.sh +curl -LsSf https://raw.githubusercontent.com/AI-Hypercomputer/xpk/refs/tags/v0.16.1/tools/install-xpk.sh -o install-xpk.sh chmod +x install-xpk.sh -./install-xpk.sh +sudo ./install-xpk.sh rm install-xpk.sh # Follow https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin to install gke-gcloud-auth-plugin @@ -110,7 +111,7 @@ For this recipe, the following setup is used: - **Pretraining job configuration and deployment** - XPK is used to configure and deploy the [Kubernetes Jobset](https://kubernetes.io/blog/2025/03/23/introducing-jobset) - resource, which manages the execution of the MaxText pretraining workload. + resource, which manages the execution of the deepseek-v3 workload. ## Test environment @@ -132,12 +133,13 @@ across all commands and configurations. - `PROJECT_ID`: Your GCP project name. - `CLUSTER_NAME`: The target cluster name. - `ZONE`: The zone for your cluster (e.g., `us-central1-c`). +- `CONTAINER_REGISTRY`: The container registry to use (e.g., `gcr.io`). - `BASE_OUTPUT_DIR`: Output directory for model training (e.g., `"gs://"`). -- `CONTAINER_REGISTRY`: The container registry to use (e.g., gcr.io). 
- `WORKLOAD_IMAGE`: The Docker image for the workload. This is set in - `run_recipe.sh` to `${CONTAINER_REGISTRY}/${PROJECT_ID}/${USER}-maxtext-runner` by default, - matching the image built in the + `run_recipe.sh` to + `${CONTAINER_REGISTRY}/${PROJECT_ID}/${USER}-deepseek-v3-runner` by + default, matching the image built in the [Docker container image](#docker-container-image) section. - `WORKLOAD_NAME`: A unique name for your workload. This is set in `run_recipe.sh` using the following command: @@ -149,7 +151,7 @@ across all commands and configurations. within the same project. For a shared project, use `"projects//reservations/"`. -If you don’t have a GCS bucket, create one with this command: +If you don't have a GCS bucket, create one with this command: ```bash # Make sure BASE_OUTPUT_DIR is set in run_recipe.sh before running this. @@ -178,11 +180,11 @@ XPK and its dependencies. Docker installation is part of this process. The following software versions are used: -- Libtpu version: 0.0.31.dev20251119+nightly -- Jax version: 0.8.1 -- Maxtext version: maxtext-tutorial-v1.5.0 -- Python 3.11 -- XPK 0.14.3 +- Libtpu version: 0.0.33.dev20260104+nightly +- Jax version: 0.8.3.dev20260104 +- Maxtext version: 98a3d4c +- Python: 3.11 +- XPK: 0.16.1 Docker Image Building Command: @@ -191,24 +193,24 @@ export CONTAINER_REGISTRY="" # Initialize with your registry export CLOUD_IMAGE_NAME="${USER}-maxtext-runner" export WORKLOAD_IMAGE="${CONTAINER_REGISTRY}/${PROJECT_ID}/${CLOUD_IMAGE_NAME}" -# Let's temporarily switch to a Python 3.12 virtual environment for Docker build -uv venv --seed ~/.local/bin/venv-docker --python 3.12 --clear -source ~/.local/bin/venv-docker/bin/activate +# Set up and Activate Python 3.12 virtual environment for Docker build +uv venv --seed ${HOME}/.local/bin/venv-docker --python 3.12 --clear +source ${HOME}/.local/bin/venv-docker/bin/activate pip install --upgrade pip # Make sure you're running on a Virtual Environment with python 3.12 -if [[ 
"$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null)" == "3.12" ]]; then { echo You have the correct Python version 3.12; } else { >&2 echo Error: Python version must be 3.12; } fi +if [[ "$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null)" == "3.12" ]]; then { echo "You have the correct Python version 3.12"; } else { >&2 echo "Error: Python version must be 3.12."; false; } fi # Clone MaxText Repository and Checkout Recipe Branch git clone https://github.com/AI-Hypercomputer/maxtext.git cd maxtext -git checkout maxtext-tutorial-v1.5.0 +git checkout 98a3d4c # Build and upload the docker image bash dependencies/scripts/docker_build_dependency_image.sh \ MODE=nightly \ - JAX_VERSION=0.8.1 \ - LIBTPU_VERSION=0.0.31.dev20251119+nightly + JAX_VERSION=0.8.3.dev20260104 \ + LIBTPU_VERSION=0.0.33.dev20260104+nightly bash dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=${CLOUD_IMAGE_NAME} # Deactivate the virtual environment @@ -237,16 +239,15 @@ does this for you already): gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE} ``` -### Run deepseek3-671b Pretraining Workload +### Run deepseek-v3 Pretraining Workload The `run_recipe.sh` script contains all the necessary environment variables and -configurations to launch the deepseek3-671b pretraining workload. +configurations to launch the deepseek-v3 pretraining workload. 
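Before launching, it can be worth failing fast on unset configuration. This sketch checks the variables this recipe uses throughout (the names match the ones described in the environment-variable section above; adjust the list if you customized `run_recipe.sh`):

```bash
# Fail fast if any required variable is empty or unset.
fail=0
for var in PROJECT_ID CLUSTER_NAME ZONE BASE_OUTPUT_DIR; do
  # Indirect expansion: read the value of the variable named in $var.
  eval "val=\${$var}"
  if [ -z "$val" ]; then
    echo "error: $var is not set" >&2
    fail=1
  fi
done
[ "$fail" -eq 0 ] && echo "environment looks complete"
```

If the check fails, revisit the environment-variable section before running the recipe script.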
To run the benchmark, first make the script executable and then run it: ```bash chmod +x run_recipe.sh - ./run_recipe.sh ``` @@ -321,7 +322,7 @@ xpk workload list --cluster ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZON For more in-depth debugging, use xpk inspector: (`xpk inspector`) ```bash -xpk inspector --cluster ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE} [--workload ] +xpk inspector --cluster ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE} [--workload ${WORKLOAD_NAME}] ``` ### Delete resources @@ -329,7 +330,7 @@ xpk inspector --cluster ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE} [ #### Delete a specific workload ```bash -xpk workload delete --workload --cluster ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE} +xpk workload delete --workload ${WORKLOAD_NAME} --cluster ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE} # Or filter and delete: xpk workload delete --cluster ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE} --filter-by-job=${USER} ``` @@ -349,6 +350,7 @@ After the job completes, you can check the results by: `${BASE_OUTPUT_DIR}` variable in your `run_recipe.sh`. - Reviewing metrics in Cloud Monitoring, if configured. + ## Next steps: deeper exploration and customization This recipe is designed to provide a simple, reproducible "0-to-1" experience diff --git a/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/run_recipe.sh b/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/xpk/run_recipe.sh similarity index 93% rename from training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/run_recipe.sh rename to training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/xpk/run_recipe.sh index 590e372a..cd5c1748 100755 --- a/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/run_recipe.sh +++ b/training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/xpk/run_recipe.sh @@ -13,7 +13,7 @@ source "${UV_VENV_PATH}/bin/activate" # Check if xpk is installed in the venv if ! 
pip show xpk &> /dev/null; then echo "xpk not found in the virtual environment. Please install it by running:" - echo "pip install xpk==0.16.0" + echo "pip install xpk==0.16.1" exit 1 fi # --- End Environment Setup --- @@ -57,13 +57,11 @@ XLA_FLAGS=" \ --xla_tpu_enable_layer_scheduler_for_dependent_collectives=true \ --xla_tpu_enable_sparse_core_collective_aggregator=true \ --xla_tpu_enable_latency_hiding_layer_scheduler=true \ - --xla_tpu_enable_multi_compute_overlap_in_layer_scheduler=true \ - --xla_lhs_threshold_for_applying_output_fusion_latency_multiplier=1e10 \ - --xla_lhs_output_fusion_latency_multiplier=1e-3 \ + --xla_tpu_enable_multi_compute_overlap_in_layer_scheduler=false \ --xla_tpu_enable_sparse_core_offload_queuing_in_lhs=true \ - --xla_tpu_sparse_core_all_gather_latency_multiplier=1.3 \ --xla_tpu_sparse_core_all_reduce_offload_min_size_in_bytes=204800 \ - --xla_tpu_sparse_core_reduce_scatter_latency_multiplier=3 \ + --xla_tpu_enable_sparse_core_collective_offload_nd_reduce_scatter=true \ + --xla_tpu_enable_3d_reduce_scatter_decomposer=false \ --xla_max_concurrent_async_all_gathers=1 \ --xla_tpu_scheduler_percent_shared_memory_limit=140 \ --xla_tpu_enable_collective_pipeliner=true \ @@ -151,4 +149,4 @@ xpk workload create \ --command="set -e && export ENABLE_PATHWAYS_PERSISTENCE='1' && \ export LIBTPU_INIT_ARGS='${XLA_FLAGS}' && \ export JAX_PLATFORMS='tpu,cpu' && export ENABLE_PJRT_COMPATIBILITY='true' && \ -python3 -m MaxText.train MaxText/configs/base.yml ${MAXTEXT_ARGS}" +python3 -m MaxText.train MaxText/configs/base.yml ${MAXTEXT_ARGS}" \ No newline at end of file
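The multi-line `XLA_FLAGS` block in `run_recipe.sh` relies on backslash line continuations inside double quotes collapsing into a single space-separated string, which is then exported as `LIBTPU_INIT_ARGS`. A minimal sketch with two representative flags (the real script sets many more) shows the mechanics:

```bash
# A backslash-continued block collapses into one string: inside double
# quotes, backslash-newline is removed by the shell.
XLA_FLAGS=" \
  --xla_tpu_enable_collective_pipeliner=true \
  --xla_max_concurrent_async_all_gathers=1 \
"

# Unquoted expansion word-splits the string back into individual flags.
set -- ${XLA_FLAGS}
echo "flag count: $#"
echo "first flag: $1"
```

Because the consumer (`LIBTPU_INIT_ARGS`) treats the value as a flat, space-separated flag list, the leading indentation and trailing spaces introduced by the continuation style are harmless.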