2 changes: 1 addition & 1 deletion training/ironwood/README.md
@@ -5,7 +5,7 @@ The training recipes contained in this folder are optimized for Ironwood TPU. He
| <div style="width:100px;">Model ID</div> | Number of chips | GBS | Sequence length | Precision | Step time (seconds) | TFLOPs/sec/chip | Tokens/sec/chip |
|-----------------|--------------------|--------------|--------------------------|--------------------|-------------|--------------|-----------------------|
| deepseek-v3 | 128 | 2048 | 4096 | bf16 | 27.02 | 607.53 | 2,425.75 |
- | deepseek-v3 | 128 | 2048 | 4096 | fp8_full | 22.83 | 718.57 | 2,869.59 |
+ | deepseek-v3 | 128 | 2048 | 4096 | fp8_full | 22.47 | 730.60 | 2,917.15 |
| deepseek-v3 | 256 | 4096 | 4096 | bf16 | 26.79 | 612.66 | 2,446.25 |
| deepseek-v3 | 256 | 4096 | 4096 | fp8_full | 22.08 | 743.46 | 2,968.49 |
| gpt-oss-120b | 64 | 1280 | 8192 | bf16 | 17.10 | 330.25 | 9,581.66 |
141 changes: 141 additions & 0 deletions training/ironwood/deepseek3-671b/4k-fp8-tpu7x-4x4x8/k8s/README.md
@@ -0,0 +1,141 @@
# Pretrain deepseek-v3 workload on Ironwood GKE clusters with Kubernetes JobSet

This recipe outlines the steps for running a deepseek-v3
[MaxText](https://github.com/AI-Hypercomputer/maxtext) pretraining workload on
[Ironwood GKE clusters](https://cloud.google.com/kubernetes-engine)
by applying a Kubernetes manifest to deploy a JobSet resource.

## Workload Details

This workload is configured with the following details:

- Sequence Length: 4096
- Precision: fp8 (`fp8_full` quantization)
- Chips: 128 (4x4x8 topology)
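As a quick sanity check, the chip count follows from the topology: a 4x4x8 slice contains 128 chips, and at 4 chips per host VM (matching the manifest's `google.com/tpu: 4` resource limit and `parallelism: 32`) that is 32 VMs:

```shell
# Derive chip and VM counts from the 4x4x8 topology.
# 4 chips per VM is taken from the manifest's google.com/tpu resource limit.
chips=$((4 * 4 * 8))
vms=$((chips / 4))
echo "chips=${chips} vms=${vms}"   # chips=128 vms=32
```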

## Prerequisites

This recipe assumes the following prerequisites are met:

- **GKE Cluster:** A GKE cluster with
[JobSet](https://jobset.sigs.k8s.io/docs/installation/) installed and
running.
- **Container Image:** A pre-built container image (such as
`gcr.io/my-project/my-maxtext-runner:latest`) containing the MaxText
workload, accessible by the GKE cluster.
- **Tools:** `gcloud`, `kubectl`, `gke-gcloud-auth-plugin`, and `envsubst`
installed on your workstation. If `envsubst` is missing, install it with
`sudo apt-get update && sudo apt-get install -y gettext-base`.
- **Permissions:** You have permissions to run `kubectl apply` on the target
cluster and the cluster has permissions to pull the container image.
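Before proceeding, you can verify that the required tools are on your `PATH` with a quick check (a convenience sketch, not part of the recipe itself):

```shell
# Report any of the required CLI tools that are missing from PATH.
missing=0
for tool in gcloud kubectl gke-gcloud-auth-plugin envsubst; do
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "missing: $tool" >&2
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "all prerequisite tools found"
```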

## Orchestration and deployment tools

For this recipe, the following setup is used:

- **Orchestration** -
[Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- **Pretraining job configuration and deployment** - A Kubernetes manifest
  (`k8s_manifest.yaml`) is used to define and deploy the
  [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
  resource, which manages the execution of the MaxText pretraining workload.

## Training dataset

This recipe uses a mock pretraining dataset provided by the MaxText framework.

## Run the recipe

This recipe uses a Kubernetes manifest (`k8s_manifest.yaml`) to deploy the
workload. The following commands will set the required environment variables,
substitute them into `k8s_manifest.yaml`, and apply the resulting configuration
to your cluster.

### 1. Configure Environment Variables

Open a terminal and set the following environment variables to match your setup.

**Note:** `k8s_manifest.yaml` is in the same directory as this README, so run
the commands below from that directory.

```bash
# Set variables for your environment
export PROJECT_ID="" # Your GCP project name
export CLUSTER_NAME="" # The name of your GKE cluster
export ZONE="" # The zone of your GKE cluster
export BASE_OUTPUT_DIR="" # e.g., "gs://your-bucket-name/my-base-output-dir"
export WORKLOAD_IMAGE="" # e.g., "gcr.io/my-project/my-maxtext-runner:latest"

# Set the workload name (modify as needed; it must be unique within the cluster)
export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp")-$(date +%Y%m%d-%H%M)"
```
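Because `envsubst` silently substitutes an empty string for any unset variable, you may want a quick guard before deploying (an optional sketch; the variable names match those set above):

```shell
# Fail fast if any required variable above was left empty.
for var in PROJECT_ID CLUSTER_NAME ZONE BASE_OUTPUT_DIR WORKLOAD_IMAGE WORKLOAD_NAME; do
  if [ -z "${!var}" ]; then
    echo "ERROR: ${var} is empty" >&2
  fi
done
```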

### 2. Run deepseek-v3 Pretraining Workload

Once the environment variables are set, run the following commands to fetch
cluster credentials and deploy the JobSet:

```bash
# Fetch cluster credentials
gcloud container clusters get-credentials ${CLUSTER_NAME} --zone ${ZONE} --project ${PROJECT_ID}

# Apply the manifest
envsubst '${BASE_OUTPUT_DIR} ${WORKLOAD_NAME} ${WORKLOAD_IMAGE}' < k8s_manifest.yaml | kubectl apply -n default -f -
```
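To preview the rendered manifest before creating anything, you can first run the same substitution with a client-side dry run (optional; note that `envsubst` is given an explicit variable list so that other `$`-expressions in the manifest are left untouched):

```shell
# Render the manifest and validate it locally; nothing is created on the cluster.
envsubst '${BASE_OUTPUT_DIR} ${WORKLOAD_NAME} ${WORKLOAD_IMAGE}' < k8s_manifest.yaml \
  | kubectl apply -n default --dry-run=client -f -
```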

## Monitor the job

To monitor your job's progress, use `kubectl` to check the JobSet status and
logs:

```bash
# Check JobSet status
kubectl get jobset -n default ${WORKLOAD_NAME}

# Get the name of the first pod in the JobSet
POD_NAME=$(kubectl get pods -l jobset.sigs.k8s.io/jobset-name=${WORKLOAD_NAME} -n default -o jsonpath='{.items[0].metadata.name}')

# Follow the logs of that pod
kubectl logs -f -n default ${POD_NAME}
```
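Once training is underway, MaxText prints per-step timing lines to stdout. A hypothetical filter for pulling these out of the pod logs (the exact log wording can differ between MaxText versions, so adjust the pattern to what you actually see):

```shell
# Show the most recent per-step timing lines from the lead pod's logs.
kubectl logs -n default ${POD_NAME} | grep -i "completed step" | tail -n 5
```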

You can also monitor your cluster and TPU usage in the Google Cloud Console
(substitute your project ID for `{PROJECT_ID}` in the URL):
`https://console.cloud.google.com/kubernetes/workload/overview?project={PROJECT_ID}`

## Delete resources

### Delete a specific workload

To delete the JobSet created by this recipe, run:

```bash
kubectl delete jobset ${WORKLOAD_NAME} -n default
```

## Check results

After the job completes, you can check the results by:

- Accessing output logs from your job using `kubectl logs`.
- Checking any data stored in the Google Cloud Storage bucket specified by the
  `${BASE_OUTPUT_DIR}` environment variable set in step 1.
- Reviewing metrics in Cloud Monitoring, if configured.
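For example, to list what the job wrote to Cloud Storage (assuming MaxText's usual layout of `${BASE_OUTPUT_DIR}/${WORKLOAD_NAME}`, since the manifest passes `run_name=${WORKLOAD_NAME}`; adjust if your layout differs):

```shell
# List the run's output directory in GCS.
gcloud storage ls "${BASE_OUTPUT_DIR}/${WORKLOAD_NAME}/"
```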

## Next steps: deeper exploration and customization

This recipe is designed to provide a simple, reproducible "0-to-1" experience
for running a MaxText pre-training workload. Its primary purpose is to help you
verify your environment and achieve a first success with TPUs quickly and
reliably.

For deeper exploration, including customizing model configurations, tuning
performance with different XLA flags, and running custom experiments, we
recommend using the `benchmark_runner.py` script directly from the MaxText
repository. It exposes the full range of MaxText's flexibility and is the
right tool for power users and researchers who want to move beyond the
initial benchmark and tailor the workload to their specific needs. To learn
more, see the
[MaxText Benchmark Runner Guide](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/Getting_Started_Benchmarking.md)
on using `benchmark_runner.py` for advanced benchmarking.
@@ -0,0 +1,75 @@
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: ${WORKLOAD_NAME}
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool # 1:1 job replica to node pool assignment
spec:
  ttlSecondsAfterFinished: 43200
  failurePolicy:
    rules:
      - action: FailJobSet
        onJobFailureReasons:
          - PodFailurePolicy
    maxRestarts: 0
  replicatedJobs:
    - name: slice-job
      replicas: 1
      template:
        spec:
          parallelism: 32 # Equal to the number of VMs per slice (or sub-slice).
          completions: 32 # Same as the above.
          backoffLimit: 0 # When any pod fails, the job is failed.
          podFailurePolicy:
            rules:
              - action: FailJob
                onPodConditions: []
                onExitCodes:
                  containerName: jax-tpu
                  operator: NotIn
                  values: [42,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
          template:
            spec:
              restartPolicy: Never
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu7x
                cloud.google.com/gke-tpu-topology: 4x4x8
              hostNetwork: true
              dnsPolicy: ClusterFirstWithHostNet
              containers:
                - name: jax-tpu
                  image: ${WORKLOAD_IMAGE}
                  ports:
                    - containerPort: 8471
                    - containerPort: 8080
                  securityContext:
                    privileged: true
                  command:
                    - bash
                    - -c
                    - |
                      echo XPK Start: $(date);
                      _sigterm() (kill -SIGTERM $! 2>/dev/null;);
                      trap _sigterm SIGTERM;
                      (export TPU_STDERR_LOG_LEVEL=0 && export TPU_MIN_LOG_LEVEL=0 && export TF_CPP_MIN_LOG_LEVEL=0 && export TPU_VMODULE=real_program_continuator=1 && set -e && export ENABLE_PATHWAYS_PERSISTENCE='1' && export LIBTPU_INIT_ARGS='--xla_tpu_scoped_vmem_limit_kib=65536 --xla_tpu_bf16_emission_mode=NATIVE_EMISSION --xla_tpu_enable_sparse_core_reduce_scatter_v2=true --xla_tpu_enable_sparse_core_collective_offload_all_gather=true --xla_tpu_enable_sparse_core_collective_offload_2d_all_gather=true --xla_tpu_enable_all_gather_offload_tracing=true --xla_tpu_use_tc_device_shape_on_sc=True --xla_sc_disable_megacore_partitioning=True --xla_tpu_enable_async_collective_fusion_fuse_all_gather=false --xla_enable_async_all_gather=true --xla_tpu_prefer_async_allgather_to_allreduce=true --xla_tpu_enable_sparse_core_collective_offload_all_reduce=true --xla_tpu_enable_sparse_core_collective_offload_reduce_scatter=true --xla_tpu_enable_sparse_core_collective_offload_3d_all_gather=true --xla_tpu_use_single_sparse_core_for_all_gather_offload=true --xla_tpu_enable_concurrent_sparse_core_offloading=true --xla_tpu_aggressive_opt_barrier_removal=true --xla_tpu_enable_offloading_gather_to_sparsecore=true --xla_tpu_accumulate_into_mrb=true --xla_tpu_mosaic_fusion=false --xla_tpu_pcie_bandwidth_multiplier=0.03 --xla_tpu_enable_layer_scheduler_for_dependent_collectives=true --xla_tpu_enable_sparse_core_collective_aggregator=true --xla_tpu_enable_latency_hiding_layer_scheduler=true --xla_tpu_enable_multi_compute_overlap_in_layer_scheduler=false --xla_tpu_enable_sparse_core_offload_queuing_in_lhs=true --xla_tpu_sparse_core_all_reduce_offload_min_size_in_bytes=204800 --xla_tpu_enable_sparse_core_collective_offload_nd_reduce_scatter=true --xla_tpu_enable_3d_reduce_scatter_decomposer=false --xla_max_concurrent_async_all_gathers=1 --xla_tpu_scheduler_percent_shared_memory_limit=140 --xla_tpu_enable_collective_pipeliner=true --xla_tpu_enable_tree_use_collective_pipeliner=true --xla_latency_hiding_scheduler_rerun=0 --xla_tpu_host_transfer_overlap_limit=1 ' && export JAX_PLATFORMS='tpu,cpu' && export ENABLE_PJRT_COMPATIBILITY='true' && python3 -m MaxText.train MaxText/configs/base.yml model_name=deepseek3-671b per_device_batch_size=8.0 max_target_length=4096 dcn_pipeline_parallelism=1 dcn_data_parallelism=-1 ici_pipeline_parallelism=1 ici_fsdp_transpose_parallelism=1 ici_fsdp_parallelism=-1 allow_split_physical_axes=True use_iota_embed=True remat_policy=custom decoder_layer_input=offload opt_type=adamw mu_dtype=bfloat16 grad_dtype=bfloat16 megablox=True sparse_matmul=True use_custom_sort_vjp=True fsdp_shard_on_exp=True sa_use_fused_bwd_kernel=True sa_block_q=2048 sa_block_kv=2048 sa_block_q_dkv=2048 sa_block_kv_dkv=2048 sa_block_kv_dkv_compute=2048 sa_block_kv_dq=2048 sa_block_q_dq=2048 attention=flash use_tokamax_splash=True use_max_logit_estimate=-1 cost_estimate_flops_fwd=5000000000000 cost_estimate_flops_bwd=5000000000000 float32_weight_sum=False use_tokamax_gmm=True tokenizer_path=assets/tokenizer.mistral-v3 dataset_type=synthetic dataset_path=gs://max-datasets-rogue use_qwix_quantization=True quantization=fp8_full wi_tile_fwd_batch_seq=128 wi_tile_fwd_embed_dim=7168 wi_tile_fwd_mlp_dim=2048 wi_tile_dlhs_batch_seq=256 wi_tile_dlhs_embed_dim=2048 wi_tile_dlhs_mlp_dim=3584 wi_tile_drhs_batch_seq=256 wi_tile_drhs_embed_dim=1792 wi_tile_drhs_mlp_dim=2048 wo_tile_fwd_batch_seq=256 wo_tile_fwd_embed_dim=2048 wo_tile_fwd_mlp_dim=3584 wo_tile_dlhs_batch_seq=256 wo_tile_dlhs_embed_dim=7168 wo_tile_dlhs_mlp_dim=1024 wo_tile_drhs_batch_seq=256 wo_tile_drhs_embed_dim=2048 wo_tile_drhs_mlp_dim=1792 weight_quantization_calibration_method=fixed,-224,224 act_quantization_calibration_method=fixed,-224,224 enable_checkpointing=False steps=30 base_output_directory=${BASE_OUTPUT_DIR} run_name=${WORKLOAD_NAME}) & PID=$!;
                      while kill -0 $PID 2>/dev/null;
                      do sleep 5;
                      done;
                      wait $PID;
                      EXIT_CODE=$?;
                      echo XPK End: $(date);
                      echo EXIT_CODE=$EXIT_CODE;
                      exit $EXIT_CODE
                  resources:
                    limits:
                      google.com/tpu: 4
                  volumeMounts:
                    - mountPath: /dev/shm
                      name: dshm-2
              tolerations:
                - operator: "Exists"
                  key: google.com/tpu
              volumes:
                - emptyDir:
                    medium: Memory
                  name: dshm-2