427 commits
8a9cba6
[moe training] add fp8 rowwise kernels for expert weights (#2696)
danielvegamyhre Aug 7, 2025
143c3a6
[moe training] add bench script for fp8 rowwise kernels and update au…
danielvegamyhre Aug 7, 2025
246b142
[moe training] integrate rowwise expert quant kernel (#2698)
danielvegamyhre Aug 7, 2025
0fd0cae
When replacing literals with placeholders lists are always converted …
kimishpatel Aug 7, 2025
1526dfe
Update KleidiAI (#2692)
metascroy Aug 7, 2025
1114ca0
Check numerical equivalence / closeness between different kernel pref…
jerryzh168 Aug 7, 2025
bfe34b5
Add all fbgemm kernel Tensors into Int4WeightOnlyConfig and Float8Dyn…
jerryzh168 Aug 7, 2025
0315628
Updating 4xH100 to only run with tags or workflow dispatch (#2715)
jerryzh168 Aug 8, 2025
7cb920b
Don't call erase if node is already erased in batch norm fusion.
abeakkas Aug 8, 2025
c086ade
Remove dep on protype MoEQuantConfig (#2717)
jerryzh168 Aug 8, 2025
6cfa477
Generalize FakeQuantizer beyond intx (#2714)
andrewor14 Aug 8, 2025
4fc4068
Add api for group wise low bit quantization, using codebook utils pro…
szyszyzys Aug 9, 2025
ec9961c
Deprecate old TORCH_VERSION variables (#2719)
andrewor14 Aug 11, 2025
948ade1
Fix internal tests after recent chagnes
jerryzh168 Aug 11, 2025
853f87d
torchao.float8: update with AMD MI300X benchmark results (#2736)
vkuzo Aug 11, 2025
510e1b4
Add __init__.py for group wise lut quantization package
szyszyzys Aug 11, 2025
fe0ddf1
Allow pattern replacement to ignore literals (#2519)
kimishpatel Aug 12, 2025
5bf05b6
Add meta function for linear operation (groupwise lut kernel).
szyszyzys Aug 12, 2025
d7f7bf2
Remove meta linear operation in cpp.
szyszyzys Aug 12, 2025
c88ebe8
fix float8 training benchmarks on AMD (#2737)
vkuzo Aug 12, 2025
0b88286
Align Int4Tensor implementation details with the design of Float8Tens…
jerryzh168 Aug 12, 2025
7c13cde
Support `optional_tensor_names` in TorchAOBaseTensor (#2710)
jerryzh168 Aug 12, 2025
4fe5ec6
Update Int4PreshuffledTensor to align with implementation details of …
jerryzh168 Aug 12, 2025
d08bbb0
don't learn zero points for symmetric quantization (#2739)
liangel-02 Aug 12, 2025
cd7975e
Rename `Float8ActivationInt4WeightConfig` to `Float8DynamicActivation…
jerryzh168 Aug 12, 2025
1dca638
Remove calls to contiguous in the implementation of Float8Tensor (#2747)
jerryzh168 Aug 12, 2025
10a0bdd
Update quantization overview and contributor guide doc (#2723)
jerryzh168 Aug 12, 2025
aec9a79
Fix test after removing contiguous() (#2751)
jerryzh168 Aug 12, 2025
317179e
Remove double baseline calculations for CI microbenchmarks (#2613)
jainapurva Aug 12, 2025
3bad6a2
Drop support for PyTorch 2.5 and before (#2720)
andrewor14 Aug 13, 2025
e79208c
Remove old `change_linear_weights_to_*` APIs (#2721)
andrewor14 Aug 13, 2025
615877d
Replace `export_for_training` with `torch.export.export` (#2724)
andrewor14 Aug 13, 2025
f01c956
Allow no quantization during QATConfig convert (#2694)
andrewor14 Aug 13, 2025
46ba24c
Fix ruff after https://github.com/pytorch/ao/pull/2724 (#2759)
andrewor14 Aug 13, 2025
a1a9632
[ROCm] fix build for newer hipblaslt BC-breaking change (#2510)
jeffdaily Aug 13, 2025
715ea9f
Add float8 FakeQuantizeConfig and FakeQuantizer (#2735)
andrewor14 Aug 13, 2025
21ceb8e
Track API usage (#2706)
andrewor14 Aug 13, 2025
ea3691e
Update Int4WeightOnlyConfig VERSION argument (#2754)
jerryzh168 Aug 13, 2025
6794ef5
Reference representation of dqlinear int4 for xnnpack (#2520)
kimishpatel Aug 13, 2025
d86ae25
Allow per-group quantizers in QuantOptimizer, fix state_dict (#2743)
lisjin Aug 13, 2025
6a2d975
Update autoquant.py (#2766)
jerryzh168 Aug 14, 2025
c232c55
Fix missing QuantOptimizer methods (#2770)
lisjin Aug 14, 2025
2db4c76
Add CPP version of bitpacking.
szyszyzys Aug 14, 2025
927cbfb
Update float8 README.md (#2774)
vkuzo Aug 15, 2025
e43a220
[moe training] update tests for torchtitan moe refactor (#2733)
danielvegamyhre Aug 15, 2025
c1223e1
[moe training] use custom ops instead of wrap_triton for fp8 rowwise …
danielvegamyhre Aug 15, 2025
478c5f2
[moe training] fix scaling type bug; refactor distributed tests (#2749)
danielvegamyhre Aug 15, 2025
f600b83
[moe training] use llama4 shapes for kernel benchmarks (#2756)
danielvegamyhre Aug 15, 2025
3d00e8f
[moe training] remove duplicate benchmark script (#2762)
danielvegamyhre Aug 15, 2025
d38e9b6
[moe training] update bench script to compare fp8 dynamic quant scale…
danielvegamyhre Aug 15, 2025
aed4f84
[moe training] refactor to share benchmarking and profiling utils (#2…
danielvegamyhre Aug 15, 2025
2eae09b
[moe training] add memory bandwidth calculations to kernel benchmarki…
danielvegamyhre Aug 15, 2025
9192799
mx_formats: make emulated tests pass on H100, and add to CI (#2773)
vkuzo Aug 15, 2025
49cb18a
make e2e training benchmark support mx (#2776)
vkuzo Aug 15, 2025
d8bb51f
Update mx_formats README.md (#2777)
vkuzo Aug 15, 2025
0347f35
Convert model inference test from pytest to unittest (#2644)
namgyu-youn Aug 15, 2025
9e3758d
Update mx README.md (#2778)
vkuzo Aug 15, 2025
69e71d9
Update benchmarking tool to run on local iPhones
navsud Aug 15, 2025
b40fd97
Int4 sparse marlin tensor (#2771)
liangel-02 Aug 15, 2025
758f744
[CI] fix 4xH100 tests by not installing vllm (#2780)
danielvegamyhre Aug 16, 2025
e6b38bb
Fix setup develop (#2748)
metascroy Aug 18, 2025
42ad7e3
[mx] fix build warning for mxfp8 dim1 cast CUDA kernel (#2782)
danielvegamyhre Aug 18, 2025
24f11f8
Remove group_size arg in Float8DynamicActivationInt4WeightConfig (#2779)
jerryzh168 Aug 18, 2025
751d7f6
fixing torchao rocm ci test (#2789)
liangel-02 Aug 18, 2025
4463b79
nvfp4 tensor: switch to using `qdata` (#2787)
vkuzo Aug 18, 2025
c120bb7
nvfp4 tensor: switch to TorchAOBaseTensor (#2788)
vkuzo Aug 18, 2025
5c0d6a3
nvfp4 tensor: refactor weight-only vs dynamic quant (#2790)
vkuzo Aug 18, 2025
72b35bf
Add IntxUnpackedTensor (#2732)
metascroy Aug 19, 2025
9473060
Fix batch norm folding in `prepare_pt2e` for multiple conv->BN chains…
subhankarpal Aug 19, 2025
083361b
turn float8 inference kernel check test back on (#2808)
vkuzo Aug 19, 2025
af2cf1e
Initial torchao model release script (#2810)
jerryzh168 Aug 20, 2025
249d95b
mxtensor: make data argument first and rename to `qdata` (#2804)
vkuzo Aug 20, 2025
1a20585
mxtensor: inherit from TorchAOBaseTensor (#2805)
vkuzo Aug 20, 2025
fee314b
mxtensor: refactor activation quant to use direct logic (#2806)
vkuzo Aug 20, 2025
fbe08c3
improve fp8 blockwise gemm perf (#2784)
danielvegamyhre Aug 20, 2025
43b4106
Add load and run tests for checkpoints that we want to have BC (#2792)
jerryzh168 Aug 20, 2025
481a8ab
Add model release CI job (#2813)
jerryzh168 Aug 20, 2025
8812365
Fix autoquant tests failed due to changes to benchmark_gpu (#2818)
jerryzh168 Aug 20, 2025
44f6fc2
float8tensor: small fixes for kernel_preference (#2817)
vkuzo Aug 21, 2025
b6435f9
add simple roofline for float inference with rowwise scaling (#2819)
vkuzo Aug 21, 2025
706937b
Move codebook (LUT) generation methods into common utils. Update func…
szyszyzys Aug 21, 2025
db29394
Fix internal CI after the adding load_and_run_checkpoint test (#2836)
jerryzh168 Aug 21, 2025
abffabb
Add test for lut based embedding quantization.
szyszyzys Aug 21, 2025
e72b22e
Bitpack add functions for Uint8
szyszyzys Aug 21, 2025
f8887fa
Add LUT based embedding quantization,
szyszyzys Aug 21, 2025
9e83024
Add test function for lut based embedding
szyszyzys Aug 21, 2025
df7bf37
Add the ops for groupwise lut quantization for embeding
szyszyzys Aug 21, 2025
1aabda0
mx: delete `triton_f4_to_bf16` kernel (#2830)
vkuzo Aug 22, 2025
d37dcb7
mx: delete `use_fp4_custom_triton_dequant_kernel` option (#2831)
vkuzo Aug 22, 2025
6bbf091
[mxfp8 moe] add compile test; add mxfp8 to bench script (#2835)
danielvegamyhre Aug 22, 2025
b663faf
[mxfp8 moe] replace per group scaling with conventional scaling (#2841)
danielvegamyhre Aug 22, 2025
e7251df
float8 kernel test: make more robust (#2847)
vkuzo Aug 22, 2025
a5e31e2
Revert "Add the ops for groupwise lut quantization for embeding" (#2850)
vkuzo Aug 22, 2025
a9ffa50
Refactor TorchAOBaseTensor for better BC support (#2793)
jerryzh168 Aug 22, 2025
07fbc89
fix incorrect torch version test (#2786)
namgyu-youn Aug 22, 2025
253d65a
Revert "Refactor TorchAOBaseTensor for better BC support" (#2854)
jerryzh168 Aug 22, 2025
0596713
Fix float8 + int4 QAT (#2851)
andrewor14 Aug 22, 2025
9978bca
Remove TORCH_VERSION_AT_LEAST* warnings when importing torch (#2852)
andrewor14 Aug 22, 2025
2fd06de
Fix NVFP4 to_copy (#2812)
andrewor14 Aug 22, 2025
8079abc
Fix test_nvfp4_tensor.py merge conflict (#2857)
andrewor14 Aug 22, 2025
27f4d75
Fix autoquant after version util changes (#2858)
jerryzh168 Aug 22, 2025
7f7f626
Add lut quantized embedding.
szyszyzys Aug 23, 2025
98e406d
Add test for lut based embedding quantization.
szyszyzys Aug 23, 2025
bc2c83e
[reland] Refactor TorchAOBaseTensor for better BC (#2793) (#2855)
jerryzh168 Aug 23, 2025
f3e549c
Add NVFP4 QAT (#2666)
andrewor14 Aug 25, 2025
f03a737
bump version to 0.14.0 (#2872)
vkuzo Aug 25, 2025
8537883
[mxfp8 moe training] Add mxfp8 to FSDP tests (#2849)
liangel-02 Aug 25, 2025
72222d1
Fix test tolerance (#2871)
metascroy Aug 25, 2025
c93bc7d
add mxfp8 to test_tp (#2870)
liangel-02 Aug 25, 2025
34eaaf0
Add OPAQUE packing format (#2878)
jerryzh168 Aug 26, 2025
ba111b0
Fix UT assertion error for int8 sdpa fusion (#2816)
Valentine233 Aug 26, 2025
9f1e32b
release notes script: keep not user facing rows (#2875)
vkuzo Aug 26, 2025
891bd21
Update IntxUnpackedTensor to support dynamic activation (#2861)
metascroy Aug 26, 2025
23f8a22
TorchAOBaseTensor `__tensor_flatten__` and `__tensor_unflatten__` use…
jerryzh168 Aug 26, 2025
6a6a672
fix ci import error (#2876)
liangel-02 Aug 26, 2025
d321a2c
Conditional ROCm kernel build (#2839)
petrex Aug 26, 2025
9056c46
Enable quantizing local checkpoints in model release script (#2859)
jerryzh168 Aug 26, 2025
6f035e8
[CPU] Introduce Int4OpaqueTensor to replace Int4CPULayout in AQT (#2798)
Xia-Weiwen Aug 27, 2025
8722c0c
[moe fp8 training] test and bench new faster method for per group row…
danielvegamyhre Aug 27, 2025
e2514dd
[moe fp8 training] use transpose method when quantizing to avoid unco…
danielvegamyhre Aug 27, 2025
8669213
Introduce IntxOpaqueTensor to replace PackedInt8DynamicActivationIntx…
metascroy Aug 27, 2025
15a6de6
[mxfp8 moe] add support for fbgemm 2d-3d mx8mx8bf16 grouped gemm (#2848)
danielvegamyhre Aug 27, 2025
a2f42cb
Update AWQ implementation to not use extra wrapper tensor subclass (#…
jerryzh168 Aug 27, 2025
8b2bc46
[moe fp8 training] fused reduction kernel along dim1 for 3d expert we…
danielvegamyhre Aug 27, 2025
615a374
integrate torch._scaled_mm into Float8BlockwiseLinear and add bench s…
danielvegamyhre Aug 27, 2025
a8db3a6
use shared bench + profile utils in blockwise fwd bwd bench script (#…
danielvegamyhre Aug 27, 2025
3bf21d0
[fp8 blockwise] load 2d chunks for groupwise quant to enable coalesce…
danielvegamyhre Aug 27, 2025
6e9bf26
Support QAT int4 v1 path for BC (#2888)
andrewor14 Aug 28, 2025
2a53216
[CPU][float8] Add scaled_embedding_bag kernel (#2686)
LevelDownRefine Aug 28, 2025
3de18af
exclude libcudart.so.13 from auditwheel repair to fix CUDA 13.0 wheel…
danielvegamyhre Aug 28, 2025
364ad47
[fp8 blockwise] wrap triton quantization kernels in custom ops for to…
danielvegamyhre Aug 28, 2025
f940738
[mxfp8 moe training] refactor all var names with suffix _mx to _data …
danielvegamyhre Aug 28, 2025
f0cca99
[mxfp8 moe training] add grouped gemm benchmark script (#2882)
danielvegamyhre Aug 28, 2025
2f78cfe
Rename `to_float8` to `from_hp` (#2893)
jerryzh168 Aug 28, 2025
4ecc89e
[mxfp8 moe training] add per group blocked scale kernels (#2886)
danielvegamyhre Aug 28, 2025
4236656
safetensors support (#2881)
liangel-02 Aug 29, 2025
f02354d
Add tracking for new tensors, AQT and layouts (#2895)
jerryzh168 Aug 29, 2025
3a9b8d1
Port metadata from the linear node onto the reference custom op for i…
kimishpatel Aug 29, 2025
6176322
Add Int4TilePackedTo4dTensor (#2791)
jerryzh168 Aug 29, 2025
83a20c7
[mxfp8 moe training] add triton kernel for blocked swizzled 3d weight…
danielvegamyhre Aug 29, 2025
08b1591
Add AWQ-INT4 option to release script (#2906)
jerryzh168 Aug 29, 2025
7ea5410
torchao init: do not load .so files for known incompatible torch vers…
vkuzo Aug 29, 2025
fbe3df9
Fix Float8Tensor quantize op kernrel preference dispatch (#2883)
jerryzh168 Aug 29, 2025
083d0c3
[mxfp8 moe training] use dim1 cast cuda kernel in bwd (#2897)
danielvegamyhre Aug 29, 2025
1bb1a40
Remove unused cpp variable, breaking style checks (#2909)
andrewor14 Aug 29, 2025
568c193
[moe training] update tests + benchmarks with conditional runs based …
danielvegamyhre Aug 29, 2025
ffabe80
[Intel GPU][doc] Change x86 quantizer to xpu quantizer in doc (#2916)
ZhiweiYan-96 Sep 1, 2025
1bb14f8
fix torchao version check on torch version (#2918)
vkuzo Sep 2, 2025
f5a1d08
change missing ops printout back to debug (#2921)
vkuzo Sep 2, 2025
266f749
another fix for torch version (#2922)
vkuzo Sep 2, 2025
71bfccb
SpinQuant rotate bias (#2913)
rohansjoshi Sep 2, 2025
f9f197e
[fp8 moe training] improve 3d quant kernel perf via removing annotati…
danielvegamyhre Sep 2, 2025
183068e
Update README.md with link to version compatibility matrix (#2920)
vkuzo Sep 2, 2025
bc52aa7
Added SpinQuant rotation unit test (#2925)
rohansjoshi Sep 2, 2025
8555713
Exclude libcuda.so from auditwheel replair (#2927)
atalman Sep 2, 2025
870284f
[pt2e] Avoid getting model device once per node (#2695)
andrewor14 Sep 3, 2025
4700fe8
Update README.md for mx_formats build from source (#2934)
vkuzo Sep 3, 2025
f35ae41
better check for mxfp8 cuda kernel presence (#2933)
vkuzo Sep 3, 2025
8776967
Set seed in numeric tests to make them more reliable (#2924)
metascroy Sep 3, 2025
9d01b43
Add Int4PlainInt32Tensor (#2845)
liangan1 Sep 4, 2025
aff141e
Move CPU kernels out of experimental
metascroy Sep 4, 2025
b34c103
Remove unused attributes in Float8Tensor (#2935)
jerryzh168 Sep 4, 2025
7b81460
[safetensors enablement] refactoring for huggingface integration (#2936)
liangel-02 Sep 5, 2025
1502748
Move top-level CPU kernels to csrc/cpu/aten_kernels
metascroy Sep 5, 2025
2c18dcb
Fix xnnpack export (#2941)
metascroy Sep 5, 2025
2dacd7f
Add hqq support for Int4TilePackedTo4dTensor (#2912)
jerryzh168 Sep 5, 2025
e7b310b
Float8Tensor per row quantization pass bias to fbgemm kernel (#2884)
jerryzh168 Sep 5, 2025
8901ff2
Update model script for INT8-INT4 (#2945)
metascroy Sep 5, 2025
4872c4f
Move packing format used by int4 to int4_packing_format.py (#2946)
jerryzh168 Sep 5, 2025
f1acc1e
Move packing format to intx folder (#2910)
metascroy Sep 6, 2025
73672aa
Delete copy of quantized SDPA in torchao/experimental
metascroy Sep 6, 2025
439b738
Add eval scripts for memory, latency and quality (#2943)
jerryzh168 Sep 7, 2025
a2206e9
Improve QAT fp8-int4 numerics (#2937)
andrewor14 Sep 8, 2025
e368b61
Skip QAT int4 v2 test for fbcode (#2923)
andrewor14 Sep 8, 2025
a54417d
Skip expanding scales for rowwise fp8 quantize (#2950)
andrewor14 Sep 8, 2025
2ccab32
Fix ROCM QAT test failure (#2957)
andrewor14 Sep 8, 2025
c452495
Add version=1 for calls to int4 weight only config (#2958)
jerryzh168 Sep 8, 2025
ac5ab7e
[Intel GPU] Enable llama generate.py + add unit test for quantization…
agrabow Sep 8, 2025
3760978
IntxWeightOnlyConfig/Int8DynamicIntxWeightConfig v2 migration: use ve…
metascroy Sep 9, 2025
f32431e
Use defaults CUDAExtension linker option when building mxfp8_cuda (#2…
huydhn Sep 9, 2025
861f971
Refactor Wanda for better readability (#2538)
namgyu-youn Sep 9, 2025
8b72284
[safetensor enablement] add fn to check if metadata is torchao (#2944)
liangel-02 Sep 9, 2025
b10876b
Bump `Int4WeightOnlyConfig` version to 2 (#2949)
jerryzh168 Sep 9, 2025
ecb6c4b
Remove compute target from intx_opaque_tensor (#2960)
metascroy Sep 9, 2025
10ba659
Skip QAT tests using `quantize_fp8_row` in fbcode (#2963)
andrewor14 Sep 9, 2025
ef4d0e1
Move some experimental tests (#2965)
metascroy Sep 9, 2025
d3efa39
docs: fix link in quantization overview documentation (#2962)
orangeH25 Sep 9, 2025
d35c2ce
Add support for only update models and push to a different user ID (#…
jerryzh168 Sep 9, 2025
851e2e6
Updated `update_model_card` to `poplulate_model_card_template` (#2970)
jerryzh168 Sep 10, 2025
2cb799b
[CPU] Support int8 scaled embedding bag (#2938)
LevelDownRefine Sep 10, 2025
0df571a
Move intx configs to version 2 by default (#2968)
metascroy Sep 10, 2025
b99904b
Experimental folder deprecation part 2/x (#2951)
metascroy Sep 10, 2025
83e8e60
Revert "[CPU] Support int8 scaled embedding bag" (#2974)
metascroy Sep 10, 2025
186aeb0
Update latency test script due to deprecation in vllm (#2973)
jerryzh168 Sep 10, 2025
cc35151
Make SmoothQuant more General (#2728)
namgyu-youn Sep 11, 2025
14ca521
[mxfp8 moe training] per group scale conversion to blocked format wit…
danielvegamyhre Sep 11, 2025
481be64
Add torchao_convert to PARQ's QuantOptimizer (#2947)
lisjin Sep 11, 2025
66384a9
[mxfp8 moe training] integrate mxfp8 grouped gemm and triton kernels …
danielvegamyhre Sep 11, 2025
be71434
[pt2e] Make prepare and convert faster by caching (#2983)
andrewor14 Sep 11, 2025
f1e118b
Add nvcc flags to explicitly build mxfp8 dim1 cast kernel for sm100a …
danielvegamyhre Sep 11, 2025
cffba61
Add from_int4_tensor in Int4PreshuffledTensor (#2978)
jerryzh168 Sep 11, 2025
011027c
Update ExecuTorch instructions in the model release template (#2975)
metascroy Sep 12, 2025
93030e7
Enable using HF PARQ checkpoints in torchao (#2985)
metascroy Sep 12, 2025
c4d4799
[CPU][FP8] Support FP8 SDPA for CPU backend (#2689)
Valentine233 Sep 12, 2025
f9bc52d
Updates LUT tensor and new convert API (#2984)
metascroy Sep 12, 2025
cc65dc5
hf integration doc page (#2899)
liangel-02 Sep 12, 2025
045c959
Fix FX Graph Cache issue in register_da8w4_concat_linear_cpu_pass (#2…
Sep 12, 2025
e3d9720
Replace `torch.norm` with `torch.linalg.vector_norm` (#2660)
namgyu-youn Sep 14, 2025
56ae935
Deprecate experimental part 3/x (#2976)
metascroy Sep 14, 2025
ea8c00f
Improve QAT int4 weight-only numerics (#2986)
andrewor14 Sep 15, 2025
264fd38
QAT configs
metascroy Sep 15, 2025
4dffb40
Add intx_opaque_tensor and tied embeddings to _convert_model_for_aarc…
metascroy Sep 15, 2025
9a770a5
Fix parametrized tests
metascroy Sep 16, 2025
58c3064
Rename Int4WeightPreshuffledFakeQuantizeConfig (#3005)
andrewor14 Sep 16, 2025
067b273
Support Int4OpaqueTensor for AWQ (#2997)
Sep 17, 2025
62f62d0
Deprecate config functions like `int4_weight_only` (#2994)
andrewor14 Sep 17, 2025
9e5059e
Remove internal usage of all config functions like `int4_weight_only`…
andrewor14 Sep 17, 2025
afe5cab
[mxfp8 moe training] add compile support (#2990)
danielvegamyhre Sep 17, 2025
ff3ba31
[mxfp8 moe training] use dim1 cast cuda kernel for 3d weights by resh…
danielvegamyhre Sep 17, 2025
f75b251
[moe training] add benchmarks for dsv3 236b, 671b shapes; reorganize …
danielvegamyhre Sep 17, 2025
c801f10
[sparse] Add in missing op support for FP8 Sparse (#3014)
jcaip Sep 17, 2025
122b307
Fix torchao_convert, remove StretchedAffineQuantizedTensor (#3015)
lisjin Sep 18, 2025
18dbe87
update compile arg for llama3.sh bench script (#3006)
danielvegamyhre Sep 18, 2025
ae204cc
Remove FbgemmConfig and remaining Fbgemm tensors (#3032)
jerryzh168 Sep 19, 2025
cfa39c8
Support PLAIN_INT32 for AWQ on Intel GPU (#3019)
xiaowangintel Sep 19, 2025
a951643
Add main tensor conversion API for packed tensors (#3029)
jerryzh168 Sep 19, 2025
1591603
Support Int4OpaqueTensor for HQQ (#3028)
cyxlily Sep 19, 2025
f210443
[mxfp8 moe training] add CUDA kernel to quantize 3d tensor colwise (#…
danielvegamyhre Sep 19, 2025
4bf39b0
[mxfp8 moe training] wrap 3d quantize tensor in custom ops and integr…
danielvegamyhre Sep 19, 2025
f35dcd7
[mxfp8 moe training] remove mxfp8_gemms.py (#3033)
danielvegamyhre Sep 19, 2025
ae12e42
Pass QAT learned qparams in convert (#3022)
andrewor14 Sep 19, 2025
d2fae7a
[mxfp8 moe training] update 3d quant colwise scaling kernel to use si…
danielvegamyhre Sep 20, 2025
22819f4
[Bug fix][CPU] Fix fp8 sdpa compiling issue with latest PyTorch (#2991)
Valentine233 Sep 21, 2025
8525185
[Float8] add non-decomposed version of quantize/dequantize ops for fp…
LevelDownRefine Sep 21, 2025
db46a18
merge main
LevelDownRefine Sep 22, 2025
3d3f8cf
Merge branch 'main' into wengshiy/qlinear
LevelDownRefine Sep 22, 2025
9d88c16
[mxfp8 moe training] use new 3d colwise quantization kernel (#3037)
danielvegamyhre Sep 22, 2025
bc72e1c
Update deprecated parameters in Hugging Face library (#2982)
namgyu-youn Sep 22, 2025
fb7c837
Avoid normalization layers in HF's quantization_config (#3030)
lisjin Sep 22, 2025
be4203e
Misc fixes for release scripts to make it easier to use (#3036)
jerryzh168 Sep 22, 2025
eadead5
Minor fix on TAO op to support lowering
RandySheriff Sep 22, 2025
4e7afcb
change to use non-decomposed q/dq
LevelDownRefine Sep 23, 2025
d79c3cc
Merge remote-tracking branch 'refs/remotes/origin/wengshiy/qlinear' i…
LevelDownRefine Sep 23, 2025
e417a4e
fix lint
LevelDownRefine Sep 23, 2025
c23e286
add version check
LevelDownRefine Sep 23, 2025
77da321
change version
LevelDownRefine Sep 23, 2025
8e2ca35
Unify get_block_size (#3039)
Xia-Weiwen Sep 23, 2025
7ffc616
fix attention bug; update ut
LevelDownRefine Sep 24, 2025
f1bbf13
Merge remote-tracking branch 'origin/main' into wengshiy/qlinear
LevelDownRefine Sep 24, 2025
4fb5f7a
add liftup oplist
LevelDownRefine Sep 26, 2025
1 change: 1 addition & 0 deletions .github/pytorch-probot.yml
@@ -3,3 +3,4 @@ ciflow_push_tags:
- ciflow/benchmark
- ciflow/tutorials
- ciflow/rocm
- ciflow/4xh100
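This new `ciflow/4xh100` entry registers the tag with pytorch-probot so the 4xH100 workflow can run only on tag pushes or manual dispatch (see commit 0315628 above). A minimal sketch of how a run is triggered, assuming the usual ciflow convention that the bot pushes a `ciflow/<name>/<PR number>` tag when a maintainer applies the matching label (the tag format and PR number here are illustrative):

```
# Assumed ciflow flow: pushing a tag of the form ciflow/4xh100/<PR number>
# triggers any workflow that listens for that tag pattern. Normally the bot
# pushes this tag when the ciflow/4xh100 label is added to the PR.
git tag ciflow/4xh100/2715
git push origin ciflow/4xh100/2715
```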
142 changes: 142 additions & 0 deletions .github/scripts/torchao_model_releases/README.md
@@ -0,0 +1,142 @@
# Scripts for torchao Model Release and Eval

Note: all commands below should be run from the directory `.github/scripts/torchao_model_releases/`.

## Frequently Used Commands
### Release and Eval Scripts for New Model Releases
```
MODEL=Qwen/Qwen3-8B
# Release all default quant options: FP8, INT4, INT8-INT4
sh release.sh --model_id $MODEL --push_to_hub --populate_model_card_template

# INT8-INT4 requires additional steps to export and run, so it's skipped from
# the general eval here
# Need to set QMODEL_PREFIX properly before running eval
# QMODEL_PREFIX=pytorch/Qwen3-8B
sh eval.sh --model_ids $MODEL "$QMODEL_PREFIX-FP8" "$QMODEL_PREFIX-INT4"

# Some follow-up evals
sh eval.sh --eval_type latency --batch_sizes 256 --model_ids "$QMODEL_PREFIX-FP8"
sh eval.sh --eval_type quality --batch_sizes 256 --model_ids "$QMODEL_PREFIX-INT8-INT4"

# Summarize all results
sh summarize_results.sh --model_ids $MODEL "$QMODEL_PREFIX-FP8" "$QMODEL_PREFIX-INT4" "$QMODEL_PREFIX-INT8-INT4" "$QMODEL_PREFIX-AWQ-INT4"
```

### AWQ Release and Eval
```
MODEL=Qwen/Qwen3-8B
TASK=mmlu_abstract_algebra
python quantize_and_upload.py --model_id $MODEL --quant AWQ-INT4 --push_to_hub --task $TASK --calibration_limit 10 --populate_model_card_template
sh eval.sh --model_ids $MODEL "$QMODEL_PREFIX-AWQ-INT4"
```

### Update Released Checkpoints in PyTorch
Sometimes we may have to update the checkpoints under a different user name (organization) without changing the model card, e.g. for INT4:
```
MODEL=Qwen/Qwen3-8B
sh release.sh --model_id $MODEL --quants INT4 --push_to_hub --push_to_user_id pytorch
```

Or for an AWQ checkpoint:
```
MODEL=Qwen/Qwen3-8B
TASK=mmlu_abstract_algebra
python quantize_and_upload.py --model_id $MODEL --quant AWQ-INT4 --task $TASK --calibration_limit 10 --push_to_hub --push_to_user_id pytorch
```

## Release Scripts
### default options
By default, we release FP8, INT4, and INT8-INT4 checkpoints, with model cards pre-filled with template content that can be updated later once we have eval results.

Examples:
```
# Note: first login with `huggingface-cli login`, the quantized model will be uploaded to
# the logged in user

# release with default quant options (FP8, INT4, INT8-INT4)
./release.sh --model_id Qwen/Qwen3-8B --push_to_hub

# release a custom set of quant options
./release.sh --model_id Qwen/Qwen3-8B --quants INT4 FP8 --push_to_hub
```

Note: for an initial release, please include `--populate_model_card_template` to populate the model card template.

### AWQ-INT4
[AWQ](https://arxiv.org/abs/2306.00978) is a technique for improving the accuracy of weight-only quantization. It preserves "salient" weight channels that have a high impact on the accuracy of the output by multiplying each such channel by a scale and applying the inverse scale to the corresponding activation. Since activations are not quantized, the rescaling introduces no additional activation error, while the quantization error of the weights is reduced.
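The core identity, sketched here with a per-input-channel scale vector $s$ (our notation, following the AWQ paper), is that the rescaling is exact in high precision, so only the weight quantizer $Q$ introduces error:

$$ y = W x = \big(W \,\mathrm{diag}(s)\big)\big(\mathrm{diag}(s)^{-1} x\big) \approx Q\big(W \,\mathrm{diag}(s)\big)\big(\mathrm{diag}(s)^{-1} x\big) $$

Choosing $s > 1$ for salient channels makes them larger relative to the quantization grid, shrinking their relative quantization error, while the $\mathrm{diag}(s)^{-1} x$ factor costs nothing in accuracy because activations stay in high precision.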

After the eval for an INT4 checkpoint is done, we may find that some tasks show a large accuracy drop compared to the high-precision baseline. In that case we can calibrate on that task with a few samples; tasks are selected from [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md). You can follow the [new task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to add new tasks to lm-eval.

Examples:
```
# release AWQ-INT4 model, calibrated with a specific task
# with some calibration_limit (number of samples)
python quantize_and_upload.py --model_id Qwen/Qwen3-8B --quant AWQ-INT4 --push_to_hub --task bbh --calibration_limit 2
```

### Update checkpoints for a different user_id (e.g. pytorch)
Sometimes we may want to update the checkpoints for a different user id without changing the model card. For this we can use `--push_to_user_id`, e.g.

```
sh release.sh --model_id microsoft/Phi-4-mini-instruct --quants FP8 --push_to_hub --push_to_user_id pytorch
```

This will update `pytorch/Phi-4-mini-instruct-FP8` without changing the model card.

## Eval Scripts
After we run the release script for a model, the new models appear on the Hugging Face Hub page for the user, e.g. https://huggingface.co/torchao-testing. Each model will have a model card filled in with template content, such as information about the model and eval instructions. There are a few things we still need to fill in: 1. peak memory usage, 2. latency when running the model with vllm, and 3. quality measurements using lm-eval.

### Single Script
The simplest option is to run all three evals. Please check out the `Run Single Evals` section to make sure the environment is set up correctly. This includes:
1. install [vllm](https://github.com/vllm-project/vllm) from source and set `VLLM_DIR` to the source directory of vllm
2. install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness)

```
sh eval.sh --eval_type all --model_ids Qwen/Qwen3-8B pytorch/Qwen3-8B-INT4
```

If `eval_type` is `all`, we'll also summarize results for the list of `model_ids`; the summarized results can be found in the files `summary_results_Qwen_Qwen3-8B.log` and `summary_results_pytorch_Qwen3-8B-INT4.log`.

Then we can fill in the blanks in the model cards of uploaded checkpoints.

### Separate Scripts
#### Memory Eval
```
sh eval.sh --eval_type memory --model_ids Qwen/Qwen3-8B
```

#### Latency Eval
For latency eval, make sure vllm is installed.
```
uv pip install vllm
```

Or install vllm nightly:
```
uv pip install vllm --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu126
```

After the environment is set up, we can run the eval:
```
sh eval.sh --eval_type latency --model_ids Qwen/Qwen3-8B --batch_sizes 1,256
```

#### Model Quality Eval
For model quality eval, we need to install lm-eval:
```
uv pip install lm-eval
```
After the environment is set up, we can run the eval:
```
sh eval.sh --eval_type quality --model_ids Qwen/Qwen3-8B --tasks hellaswag,mmlu
```

#### Summarize results
After we have finished all evals for each model, we can summarize the results with:
```
sh summarize_results.sh --model_ids Qwen/Qwen3-8B pytorch/Qwen3-8B-INT4
```
Summarized results files for the above command: `summary_results_Qwen_Qwen3-8B.log` and `summary_results_pytorch_Qwen3-8B-INT4.log`

It will look through the current directory to find all the result files from the memory, latency, and quality evals and combine all the result information into a single file.
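A minimal sketch of that aggregation, assuming the `<safe model id>_*.log` naming used by the eval scripts below (the actual matching logic inside `summarize_results.sh` may differ):

```
# Hypothetical sketch: collect every memory/latency/quality log for one model
# into a single summary file. SAFE_MODEL_ID mirrors the '/'-to-'_' replacement
# done in eval_latency.sh.
MODEL_ID="pytorch/Qwen3-8B-INT4"
SAFE_MODEL_ID="${MODEL_ID//\//_}"
cat "${SAFE_MODEL_ID}"_*.log > "summary_results_${SAFE_MODEL_ID}.log"
```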
114 changes: 114 additions & 0 deletions .github/scripts/torchao_model_releases/eval.sh
@@ -0,0 +1,114 @@
#!/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD 3-Clause license found in the
# LICENSE file in the root directory of this source tree.

set -e
source eval_env_checks.sh

usage() {
echo "Usage: $0 --model_ids <model1> <model2> ... [--eval_type <all|memory|latency|quality>] [--batch_sizes <batch_sizes>] [--tasks <tasks>]"
echo "Defaults:"
echo " batch_sizes: 1 256"
echo " tasks: mmlu"
exit 1
}
MODEL_ID_ARRAY=()
EVAL_TYPE="all"
# these will be parsed in the other scripts
BATCH_SIZES="1 256" # Default for latency eval
TASKS="mmlu" # Default for quality eval
# Parse arguments
while [[ $# -gt 0 ]]; do
case "$1" in
--eval_type)
shift
if [[ $# -eq 0 ]]; then
echo "Error: --eval_type requires a value"
exit 1
fi
EVAL_TYPE="$1"
shift
;;
--model_ids)
shift
# Collect all subsequent arguments that are not another flag
while [[ $# -gt 0 && ! "$1" =~ ^-- ]]; do
MODEL_ID_ARRAY+=("$1")
shift
done
;;
--batch_sizes)
shift
if [[ $# -eq 0 ]]; then
echo "Error: --batch_sizes requires a value"
exit 1
fi
BATCH_SIZES="$1"
shift
;;
--tasks)
shift
if [[ $# -eq 0 ]]; then
echo "Error: --tasks requires a value"
exit 1
fi
TASKS="$1"
shift
;;
*)
echo "Unknown argument: $1"
usage
;;
esac
done
if [[ ${#MODEL_ID_ARRAY[@]} -eq 0 ]]; then
echo "Error: --model_ids is required"
usage
fi

run_memory() {
check_torch
local model_id="$1"
sh eval_memory.sh --model_ids "$model_id"
}
run_latency() {
check_vllm
local model_id="$1"
sh eval_latency.sh --model_ids "$model_id" --batch_sizes $BATCH_SIZES
}
run_quality() {
check_lm_eval
local model_id="$1"
sh eval_quality.sh --model_ids "$model_id" --tasks $TASKS
}
for MODEL_ID in "${MODEL_ID_ARRAY[@]}"; do
case "$EVAL_TYPE" in
memory)
run_memory "$MODEL_ID"
;;
latency)
run_latency "$MODEL_ID"
;;
quality)
run_quality "$MODEL_ID"
;;
all)
run_quality "$MODEL_ID"
run_memory "$MODEL_ID"
run_latency "$MODEL_ID"
;;
*)
echo "Unknown eval_type: $EVAL_TYPE"
echo "Valid types are: all, memory, latency, quality"
exit 2
;;
esac
done

# Run summarize_results.sh with MODEL_IDS if eval_type is "all"
if [[ "$EVAL_TYPE" == "all" ]]; then
sh summarize_results.sh --model_ids "${MODEL_ID_ARRAY[@]}"
fi
26 changes: 26 additions & 0 deletions .github/scripts/torchao_model_releases/eval_env_checks.sh
@@ -0,0 +1,26 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD 3-Clause license found in the
# LICENSE file in the root directory of this source tree.

check_torch() {
if ! pip show torch > /dev/null 2>&1; then
echo "Error: torch package is NOT installed. please install with `pip install torch`" >&2
exit 1
fi
}

check_vllm() {
if ! pip show vllm > /dev/null 2>&1; then
echo "Error: vllm package is NOT installed. please install with `pip install vllm`" >&2
exit 1
fi
}

check_lm_eval() {
if ! pip show lm_eval > /dev/null 2>&1; then
echo "Error: lm_eval package is NOT installed. please install with `pip install lm_eval`" >&2
exit 1
fi
}
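These helpers are meant to be sourced by the other eval scripts, as `eval.sh` above and `eval_latency.sh` below do:

```
# Source the checks, then gate an eval step on the required package.
source eval_env_checks.sh
check_vllm   # prints an error and exits 1 if vllm is not installed
```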
85 changes: 85 additions & 0 deletions .github/scripts/torchao_model_releases/eval_latency.sh
@@ -0,0 +1,85 @@
#!/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD 3-Clause license found in the
# LICENSE file in the root directory of this source tree.

set -e
source eval_env_checks.sh
check_vllm
check_vllm

MODEL_ID_ARRAY=()
BATCH_SIZE_ARRAY=(1) # default can be overwritten by user input
INPUT_LEN="256" # default input length
OUTPUT_LEN="256" # default output length
# Parse arguments
while [[ $# -gt 0 ]]; do
case "$1" in
--model_ids)
shift
# Collect all subsequent arguments that are not another flag
while [[ $# -gt 0 && ! "$1" =~ ^-- ]]; do
MODEL_ID_ARRAY+=("$1")
shift
done
;;
--batch_sizes)
shift
BATCH_SIZE_ARRAY=()
# Collect all subsequent arguments that are not another flag
while [[ $# -gt 0 && ! "$1" =~ ^-- ]]; do
BATCH_SIZE_ARRAY+=("$1")
shift
done
;;
--input_len)
shift
if [[ $# -eq 0 ]]; then
echo "Error: --input_len requires a value"
exit 1
fi
INPUT_LEN="$1"
shift
;;
--output_len)
shift
if [[ $# -eq 0 ]]; then
echo "Error: --output_len requires a value"
exit 1
fi
OUTPUT_LEN="$1"
shift
;;
*)
echo "Unknown argument: $1"
echo "Usage: $0 --model_id <model_id> [--batch_sizes <batch_sizes>] [--input_len <input_len>] [--output_len <output_len>]"
exit 1
;;
esac
done
if [[ ${#MODEL_ID_ARRAY[@]} -eq 0 ]]; then
echo "Error: --model_ids is required"
echo "Usage: $0 --model_ids <model_id1> <model_id2> ... [--batch_sizes <batch_size1> <batch_size2> ...] [--input_len <input_len>] [--output_len <output_len>]"
exit 1
fi
# Save the original directory
ORIG_DIR="$(pwd)"
# cd to VLLM_DIR (must point to a vllm source checkout; see the README)
cd "$VLLM_DIR"
for MODEL_ID in "${MODEL_ID_ARRAY[@]}"; do
echo "======================== Eval Latency $MODEL_ID ==========================="
# Replace all '/' with '_'
SAFE_MODEL_ID="${MODEL_ID//\//_}"
# Loop over batch sizes and run the latency eval for each
for BATCH_SIZE in "${BATCH_SIZE_ARRAY[@]}"; do
OUTPUT_FILE="$ORIG_DIR/${SAFE_MODEL_ID}_latency_batch${BATCH_SIZE}_in${INPUT_LEN}_out${OUTPUT_LEN}.log"
echo "Running latency eval for model $MODEL_ID with batch size $BATCH_SIZE with input length: $INPUT_LEN and output length: $OUTPUT_LEN"
VLLM_DISABLE_COMPILE_CACHE=1 vllm bench latency --input-len $INPUT_LEN --output-len $OUTPUT_LEN --model $MODEL_ID --batch-size $BATCH_SIZE > "$OUTPUT_FILE" 2>&1
echo "Latency eval result saved to $OUTPUT_FILE"
done
echo "======================== Eval Latency $MODEL_ID End ========================="
done

# cd back to original place
cd "$ORIG_DIR"
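For reference, a direct invocation of this script (bypassing `eval.sh`) would look like the following; the model id and vllm path are illustrative:

```
# VLLM_DIR must point at a vllm source checkout (see the README above).
VLLM_DIR=~/vllm sh eval_latency.sh \
  --model_ids pytorch/Qwen3-8B-INT4 \
  --batch_sizes 1 256 \
  --input_len 256 --output_len 256
```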