Add the initial version of Claude skills (#17284)

larryliu0820 · web-flow · commit ba2516cefa4e · 2026-02-07T21:21:29.000-08:00
Add the first version of Claude skills so that Claude knows how to export
and debug a model. Many backend details are still missing, and we should
keep iterating on these files.
diff --git a/.claude/backends.md b/.claude/backends.md
@@ -0,0 +1,34 @@
+# Backends
+
+| Backend | Platform | Hardware | Location |
+|---------|----------|----------|----------|
+| XNNPACK | All | CPU | `backends/xnnpack/` |
+| CUDA | Linux/Windows | GPU | `backends/cuda/` |
+| CoreML | iOS, macOS | NPU/GPU/CPU | `backends/apple/coreml/` |
+| MPS | iOS, macOS | GPU | `backends/apple/mps/` |
+| Vulkan | Android | GPU | `backends/vulkan/` |
+| QNN | Android | NPU | `backends/qualcomm/` |
+| MediaTek | Android | NPU | `backends/mediatek/` |
+| Arm Ethos-U | Embedded | NPU | `backends/arm/` |
+| OpenVINO | Embedded | CPU/GPU/NPU | `backends/openvino/` |
+| Cadence | Embedded | DSP | See `backends-cadence.md` |
+| Samsung | Android | NPU | `backends/samsung/` |
+
+## Partitioner imports
+```python
+from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
+from executorch.backends.apple.coreml.partition.coreml_partitioner import CoreMLPartitioner
+from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
+from executorch.backends.vulkan.partition.vulkan_partitioner import VulkanPartitioner
+```
+
+## Usage pattern
+```python
+from executorch.exir import to_edge
+
+edge = to_edge(exported_program)
+edge = edge.to_backend(XnnpackPartitioner())  # or other partitioner
+exec_prog = edge.to_executorch()
+```
+
+Unsupported ops fall back to portable CPU kernels. To fall back across backends, apply multiple partitioners in priority order.
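
The priority-fallback note at the end of `backends.md` can be pictured with a toy model. The sketch below is not ExecuTorch code: the `assign_ops` helper, the backend labels, and the op-support sets are hypothetical, and real partitioners operate on graph nodes rather than strings. It only illustrates the idea that each op goes to the first partitioner that claims it, and anything unclaimed stays on the portable kernels.

```python
from typing import Dict, List, Set, Tuple


def assign_ops(
    ops: List[str],
    partitioners: List[Tuple[str, Set[str]]],
    fallback: str = "portable",
) -> Dict[str, str]:
    """Assign each op to the first backend (in priority order) that supports it."""
    assignment = {}
    for op in ops:
        for backend, supported in partitioners:
            if op in supported:
                assignment[op] = backend
                break
        else:
            # No partitioner claimed the op, so it stays on the fallback kernels.
            assignment[op] = fallback
    return assignment


# Hypothetical op-support sets, listed in priority order.
priority = [
    ("coreml", {"conv2d", "linear"}),
    ("xnnpack", {"conv2d", "linear", "softmax"}),
]
result = assign_ops(["conv2d", "softmax", "topk"], priority)
print(result)
```

Swapping the order of `priority` changes which backend receives `conv2d`, which is the practical meaning of "priority" here.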
diff --git a/.claude/faq.md b/.claude/faq.md
@@ -0,0 +1,35 @@
+# Common Errors
+
+## Error Codes
+Error codes defined in `runtime/core/error.h`.
+
+| Code | Name | Common Cause |
+|------|------|--------------|
+| 0x10 | NotSupported | Input shape mismatch - inputs don't match export shapes. Use dynamic shapes if needed. |
+| 0x14 | OperatorMissing | Selective build missing operator. Regenerate `et_operator_library` from current model. |
+| 0x20 | NotFound | Missing backend. Link with `--whole-archive`: `-Wl,--whole-archive libxnnpack_backend.a -Wl,--no-whole-archive` |
+
+## Export Issues
+
+**Missing out variants**: Custom ops need ExecuTorch implementation. See `kernel-library-custom-aten-kernel.md`.
+
+**RuntimeError: convert function not implemented**: Unsupported operator. File GitHub issue.
+
+## Runtime Issues
+
+**Slow inference**:
+1. Build with `-DCMAKE_BUILD_TYPE=Release`
+2. Ensure model is delegated (use `XnnpackPartitioner`)
+3. Set thread count: `threadpool::get_threadpool()->_unsafe_reset_threadpool(num_threads)`
+
+**Numerical accuracy**: Use devtools to debug. See `/profile` skill.
+
+**Error setting input 0x10**: Input shape mismatch. Specify dynamic shapes at export.
+
+**Duplicate kernel registration abort**: Multiple `gen_operators_lib` linked. Use only one per target.
+
+## Installation
+
+**Missing python-dev**: `sudo apt install python<version>-dev`
+
+**Missing pytorch_tokenizers**: `pip install -e ./extension/llm/tokenizers/`
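
The error table in `faq.md` lends itself to a small triage helper. The sketch below is a hypothetical convenience script, not part of ExecuTorch: `suggest_fix` and the `FIXES` mapping are illustrative names, and the fix strings are paraphrased from the table. It assumes the first hex literal in a log line is the error code, which will not hold for lines that also print addresses.

```python
import re
from typing import Optional

# Code -> suggested fix, paraphrased from the error table in faq.md.
FIXES = {
    0x10: "Input shape mismatch: match the export shapes or export with dynamic shapes.",
    0x14: "Operator missing from selective build: regenerate et_operator_library.",
    0x20: "Backend not found: link the backend library with --whole-archive.",
}


def suggest_fix(log_line: str) -> Optional[str]:
    """Return a suggested fix for the first hex error code in a log line."""
    match = re.search(r"0x[0-9a-fA-F]+", log_line)
    if match is None:
        return None
    # Unknown codes return None rather than a guess.
    return FIXES.get(int(match.group(0), 16))


print(suggest_fix("Error setting input 0x10"))
```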
diff --git a/.claude/llm-export.md b/.claude/llm-export.md
@@ -0,0 +1,65 @@
+# LLM Export
+
+High-level API for exporting LLMs to .pte format.
+
+## Supported Models
+Llama 2/3/3.1/3.2, Qwen 2.5/3, Phi 3.5/4-mini, SmolLM2
+
+Full list: `extension/llm/export/config/llm_config.py`
+
+For other models (Gemma, Mistral, BERT, Whisper): use optimum-executorch (see `/setup` skill).
+
+## Basic Usage
+
+```bash
+python -m executorch.extension.llm.export.export_llm \
+  --config path/to/config.yaml
+```
+
+## Config Structure
+
+```yaml
+base:
+  model_class: llama3_2
+  checkpoint: path/to/consolidated.00.pth
+  params: path/to/params.json
+  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
+
+model:
+  use_kv_cache: True              # recommended
+  use_sdpa_with_kv_cache: True    # recommended
+  use_attention_sink: False       # extend generation
+  quantize_kv_cache: False        # int8 KV cache
+
+quantization:
+  qmode: 8da4w                    # int8 activation + int4 weight
+  group_size: 32
+  embedding_quantize: 4,32
+
+backend:
+  xnnpack:
+    enabled: True
+    extended_ops: True
+
+debug:
+  verbose: True                   # show delegation table
+  generate_etrecord: True         # for devtools profiling
+```
+
+## Quantization Modes
+
+**TorchAO (XNNPACK)**:
+- `8da4w`: int8 dynamic activation + int4 weight
+- `int8`: int8 weight-only
+- `torchao:8da4w`: low-bit kernels for Arm
+
+**pt2e (QNN, CoreML, Vulkan)**: Use for non-CPU backends.
+
+## Config Classes
+All options in `extension/llm/export/config/llm_config.py`:
+- `LlmConfig` - top level
+- `ExportConfig` - max_seq_length, max_context_length
+- `ModelConfig` - model optimizations
+- `QuantizationConfig` - quantization options
+- `BackendConfig` - backend settings
+- `DebugConfig` - verbose, etrecord, profiling
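
To make the config layout in `llm-export.md` concrete, here is a minimal sketch that writes a few of the sections shown above to a file and builds the export command as an argument list. `to_yaml_lines` is a hypothetical helper that handles only nested dicts of plain scalars; a real script would more likely use a YAML library. The `base` section (checkpoint, params) is left out here and would be needed for an actual export.

```python
import tempfile
from pathlib import Path


def to_yaml_lines(node: dict, indent: int = 0) -> list:
    """Render nested dicts of scalars as indented `key: value` lines."""
    pad = " " * indent
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml_lines(value, indent + 2))
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


# Values copied from the "Config Structure" example.
config = {
    "model": {"use_kv_cache": True, "use_sdpa_with_kv_cache": True},
    "quantization": {"qmode": "8da4w", "group_size": 32},
    "backend": {"xnnpack": {"enabled": True, "extended_ops": True}},
}

config_path = Path(tempfile.mkdtemp()) / "config.yaml"
config_path.write_text("\n".join(to_yaml_lines(config)) + "\n")

# Same command as in "Basic Usage", as an argument list for subprocess.run.
command = [
    "python", "-m", "executorch.extension.llm.export.export_llm",
    "--config", str(config_path),
]
print(config_path.read_text())
print(" ".join(command))
```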
diff --git a/.claude/quantization.md b/.claude/quantization.md
@@ -0,0 +1,13 @@
+# Quantization
+
+Docs: https://docs.pytorch.org/ao/main/pt2e_quantization/index.html
+
+## Backend quantizers
+| Backend | Quantizer |
+|---------|-----------|
+| XNNPACK | `XNNPACKQuantizer` |
+| Qualcomm | `QnnQuantizer` |
+| CoreML | `CoreMLQuantizer` |
+
+## LLM modes
+See `examples/models/llama/source_transformation/quantize.py`: `int8`, `8da4w`, `4w`
diff --git a/.claude/runtime-api.md b/.claude/runtime-api.md
@@ -0,0 +1,28 @@
+# Runtime API
+
+## executorch.runtime (preferred)
+```python
+from executorch.runtime import Runtime, Program, Method
+runtime = Runtime.get()
+program = runtime.load_program("model.pte")
+outputs = program.load_method("forward").execute(inputs)
+```
+
+## portable_lib (low-level)
+```python
+from executorch.extension.pybindings.portable_lib import _load_for_executorch
+module = _load_for_executorch("model.pte")
+outputs = module.forward(inputs)
+```
+
+## Missing kernel fixes
+
+If runtime shows missing kernel errors, import the kernel module before loading:
+
+```python
+# Missing quantized kernels (e.g., quantized_decomposed::embedding_byte.out)
+from executorch.kernels import quantized
+
+# Missing LLM custom ops (e.g., llama::custom_sdpa.out, llama::update_cache.out)
+from executorch.extension.llm.custom_ops import custom_ops
+```
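
The missing-kernel fixes in `runtime-api.md` amount to a lookup from operator namespace to the module that registers its kernels. The sketch below encodes only the two cases listed there; `kernel_module_for` and `KERNEL_MODULES` are hypothetical names, and the mapping is an assumption drawn from those two import lines rather than a complete registry.

```python
from typing import Optional

# Operator namespace -> Python module that registers its kernels,
# taken from the two import lines in runtime-api.md.
KERNEL_MODULES = {
    "quantized_decomposed": "executorch.kernels.quantized",
    "llama": "executorch.extension.llm.custom_ops.custom_ops",
}


def kernel_module_for(op_name: str) -> Optional[str]:
    """Map an op like 'llama::custom_sdpa.out' to the module to import."""
    namespace, sep, _ = op_name.partition("::")
    if not sep:
        # Not a namespaced operator name.
        return None
    return KERNEL_MODULES.get(namespace)


print(kernel_module_for("llama::custom_sdpa.out"))
```

The returned string can be passed to `importlib.import_module` before the program is loaded.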
diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md
@@ -0,0 +1,23 @@
+---
+name: building
+description: Build ExecuTorch runners or C++ libraries. Use when compiling runners for Llama, Whisper, or other models, or building the C++ runtime.
+---
+
+# Building
+
+## Runners (Makefile)
+```bash
+make help              # list all targets
+make llama-cpu         # Llama
+make whisper-metal     # Whisper on Metal
+make gemma3-cuda       # Gemma3 on CUDA
+```
+
+Output: `cmake-out/examples/models/<model>/<runner>`
+
+## C++ Libraries (CMake)
+```bash
+cmake --list-presets                    # list presets
+cmake --workflow --preset llm-release   # LLM CPU
+cmake --workflow --preset llm-release-metal  # LLM Metal
+```
diff --git a/.claude/skills/export/SKILL.md b/.claude/skills/export/SKILL.md
@@ -0,0 +1,28 @@
+---
+name: export
+description: Export a PyTorch model to .pte format for ExecuTorch. Use when converting models, lowering to edge, or generating .pte files.
+---
+
+# Export
+
+## Basic pattern
+```python
+from executorch.exir import to_edge_transform_and_lower
+from torch.export import export
+
+exported = export(model.eval(), example_inputs)
+edge = to_edge_transform_and_lower(exported)
+with open("model.pte", "wb") as f:
+    f.write(edge.to_executorch().buffer)
+```
+
+## Model-specific scripts
+| Model | Script |
+|-------|--------|
+| Llama | `examples/models/llama/export_llama.py` |
+| Whisper | `examples/models/whisper/` |
+| Parakeet | `examples/models/parakeet/export_parakeet_tdt.py` |
+
+## Debugging
+- Non-strict export: `export(model, inputs, strict=False)`
+- tlparse: `TORCH_TRACE=/tmp/trace python script.py && tlparse /tmp/trace`
diff --git a/.claude/skills/profile/SKILL.md b/.claude/skills/profile/SKILL.md
@@ -0,0 +1,24 @@
+---
+name: profile
+description: Profile ExecuTorch model execution. Use when measuring performance, analyzing operator timing, or debugging slow models.
+---
+
+# Profile
+
+## 1. Enable ETDump when loading
+```python
+program = runtime.load_program("model.pte", enable_etdump=True, debug_buffer_size=int(1e7))
+```
+
+## 2. Execute and save
+```python
+outputs = program.load_method("forward").execute(inputs)
+program.write_etdump_result_to_file("etdump.etdp", "debug.bin")
+```
+
+## 3. Analyze with Inspector
+```python
+from executorch.devtools import Inspector
+inspector = Inspector(etrecord="model.etrecord", etdump_path="etdump.etdp")
+inspector.print_data_tabular()
+```
diff --git a/.claude/skills/setup/SKILL.md b/.claude/skills/setup/SKILL.md
@@ -0,0 +1,15 @@
+---
+name: setup
+description: Set up ExecuTorch development environment. Use when installing dependencies, setting up conda environments, or preparing to develop with ExecuTorch.
+---
+
+# Setup
+
+1. Activate conda: `conda activate executorch`
+   - If not found: `conda env list | grep -E "(executorch|et)"`
+
+2. Install executorch: `./install_executorch.sh`
+
+3. (Optional) For Hugging Face integration:
+   - Read commit from `.ci/docker/ci_commit_pins/optimum-executorch.txt`
+   - Install: `pip install git+https://github.com/huggingface/optimum-executorch.git@<COMMIT>`
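
Step 3 of the setup skill can be scripted. The sketch below assumes the pin file holds just the commit hash on one line; `pip_spec` is a hypothetical helper, and the demo uses a temporary file with a placeholder commit rather than the real pin.

```python
import tempfile
from pathlib import Path

PIN_FILE = Path(".ci/docker/ci_commit_pins/optimum-executorch.txt")
REPO_URL = "https://github.com/huggingface/optimum-executorch.git"


def pip_spec(pin_file: Path) -> str:
    """Build the pip requirement for the commit pinned in pin_file."""
    commit = pin_file.read_text().strip()
    if not commit:
        raise ValueError(f"{pin_file} is empty")
    return f"git+{REPO_URL}@{commit}"


# Demo with a temporary pin file and a placeholder commit; in a checkout,
# call pip_spec(PIN_FILE) from the repository root instead.
demo_pin = Path(tempfile.mkdtemp()) / "optimum-executorch.txt"
demo_pin.write_text("0123abc\n")
print(pip_spec(demo_pin))
```

The resulting string is what `pip install` takes in place of the `<COMMIT>` placeholder form.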
diff --git a/.claude/tokenizers.md b/.claude/tokenizers.md
@@ -0,0 +1,54 @@
+# Tokenizers
+
+C++ tokenizer implementations with Python bindings. Located in `extension/llm/tokenizers/`.
+
+## Installation
+```bash
+pip install -e ./extension/llm/tokenizers/
+```
+
+## Python API
+
+```python
+from pytorch_tokenizers import get_tokenizer
+
+# Auto-detect tokenizer type from file
+tokenizer = get_tokenizer("path/to/tokenizer.model")  # or .json
+
+# Encode/decode
+tokens = tokenizer.encode("Hello world")
+text = tokenizer.decode(tokens)
+```
+
+## Available Tokenizers
+
+| Class | Format | Use Case |
+|-------|--------|----------|
+| `HuggingFaceTokenizer` | `.json` | HuggingFace models |
+| `TiktokenTokenizer` | `.model` | OpenAI/Llama 3 |
+| `Llama2cTokenizer` | `.model` | Llama 2, SentencePiece |
+| `CppSPTokenizer` | `.model` | SentencePiece (C++) |
+
+## Direct Usage
+
+```python
+from pytorch_tokenizers import HuggingFaceTokenizer, TiktokenTokenizer, Llama2cTokenizer
+
+# HuggingFace (tokenizer.json)
+tokenizer = HuggingFaceTokenizer("tokenizer.json", "tokenizer_config.json")
+
+# Tiktoken (Llama 3, etc.)
+tokenizer = TiktokenTokenizer(model_path="tokenizer.model")
+
+# Llama2c/SentencePiece
+tokenizer = Llama2cTokenizer(model_path="tokenizer.model")
+```
+
+## C++ Tokenizers
+
+For C++ runners, include headers from `extension/llm/tokenizers/include/`:
+- `hf_tokenizer.h` - HuggingFace
+- `tiktoken.h` - Tiktoken
+- `sentencepiece.h` - SentencePiece
+- `llama2c_tokenizer.h` - Llama2c
+- `tekken.h` - Mistral Tekken v7
diff --git a/CLAUDE.md b/CLAUDE.md