| 1 | +# LLM Export |
| 2 | + |
| 3 | +High-level API for exporting LLMs to the ExecuTorch `.pte` format. |
| 4 | + |
| 5 | +## Supported Models |
| 6 | +Llama 2/3/3.1/3.2, Qwen 2.5/3, Phi 3.5/4-mini, SmolLM2 |
| 7 | + |
| 8 | +Full list: `extension/llm/export/config/llm_config.py` |
| 9 | + |
| 10 | +For other models (Gemma, Mistral, BERT, Whisper), use optimum-executorch (see the `/setup` skill). |
| 11 | + |
| 12 | +## Basic Usage |
| 13 | + |
| 14 | +```bash |
| 15 | +python -m executorch.extension.llm.export.export_llm \ |
| 16 | + --config path/to/config.yaml |
| 17 | +``` |
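When the export is driven from a script rather than a shell, the same invocation can be assembled as an argument list. This is a minimal sketch: the helper name `build_export_command` is hypothetical, and only the module path and the `--config` flag come from the command above.

```python
import shlex
import sys


def build_export_command(config_path: str) -> list[str]:
    """Return the argv list for the export_llm invocation shown above.

    `build_export_command` is an illustrative helper, not part of ExecuTorch.
    """
    return [
        sys.executable,  # use the interpreter that has executorch installed
        "-m",
        "executorch.extension.llm.export.export_llm",
        "--config",
        config_path,
    ]


cmd = build_export_command("path/to/config.yaml")
# Print a shell-safe rendering of the command for logging or copy-paste.
print(shlex.join(cmd))
```

To actually run the export, pass the list to `subprocess.run(cmd, check=True)`; keeping the arguments as a list avoids shell quoting issues with paths that contain spaces.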
| 18 | + |
| 19 | +## Config Structure |
| 20 | + |
| 21 | +```yaml |
| 22 | +base: |
| 23 | +  model_class: llama3_2 |
| 24 | +  checkpoint: path/to/consolidated.00.pth |
| 25 | +  params: path/to/params.json |
| 26 | +  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' |
| 27 | + |
| 28 | +model: |
| 29 | +  use_kv_cache: True # recommended |
| 30 | +  use_sdpa_with_kv_cache: True # recommended |
| 31 | +  use_attention_sink: False # attention sink, to extend generation past the context window |
| 32 | +  quantize_kv_cache: False # int8 KV cache |
| 33 | + |
| 34 | +quantization: |
| 35 | +  qmode: 8da4w # int8 dynamic activation + int4 weight |
| 36 | +  group_size: 32 |
| 37 | +  embedding_quantize: 4,32 # bitwidth,group_size |
| 38 | + |
| 39 | +backend: |
| 40 | +  xnnpack: |
| 41 | +    enabled: True |
| 42 | +    extended_ops: True |
| 43 | + |
| 44 | +debug: |
| 45 | +  verbose: True # show delegation table |
| 46 | +  generate_etrecord: True # for devtools profiling |
| 47 | +``` |
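The `metadata` value in the config above is a JSON document embedded in a YAML string, so a quoting mistake only surfaces when the string is parsed. The sketch below checks the string before it goes into the config. The helper `parse_metadata` and its validation rules are assumptions made for illustration, not behavior of the exporter; the key names are taken from the example config.

```python
import json


def parse_metadata(raw: str) -> dict:
    """Parse the JSON metadata string and check the two keys used above.

    The checks here are illustrative assumptions, not the exporter's rules.
    """
    meta = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(meta, dict):
        raise ValueError("metadata must be a JSON object")
    bos = meta.get("get_bos_id")
    eos = meta.get("get_eos_ids")
    if not isinstance(bos, int):
        raise ValueError("get_bos_id must be an integer token id")
    if not (isinstance(eos, list) and eos and all(isinstance(t, int) for t in eos)):
        raise ValueError("get_eos_ids must be a non-empty list of integer token ids")
    return meta


metadata_str = '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
meta = parse_metadata(metadata_str)
print(meta["get_bos_id"], meta["get_eos_ids"])
```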
| 48 | + |
| 49 | +## Quantization Modes |
| 50 | + |
| 51 | +**TorchAO (XNNPACK)**: |
| 52 | +- `8da4w`: int8 dynamic activation + int4 weight |
| 53 | +- `int8`: int8 weight-only |
| 54 | +- `torchao:8da4w`: low-bit kernels for Arm |
| 55 | + |
| 56 | +**pt2e (QNN, CoreML, Vulkan)**: Use for non-CPU backends. |
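To make the "int4 weight" and `group_size` terms concrete, the following pure-Python sketch quantizes a flat list of weights group by group, with one scale per group. It is an illustration only and not the TorchAO implementation: the symmetric scaling scheme, the clamping range, and all function names are assumptions.

```python
def quantize_group_int4(weights):
    """Symmetric int4 quantization of one group: returns (scale, ints)."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 7; fall back to 1.0 for an all-zero group.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Clamp to the signed 4-bit range [-8, 7].
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, quantized


def quantize_groupwise(weights, group_size=32):
    """Split weights into groups of group_size and quantize each one."""
    if len(weights) % group_size != 0:
        raise ValueError("number of weights must be a multiple of group_size")
    return [
        quantize_group_int4(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


def dequantize_groupwise(groups):
    """Reconstruct approximate float weights from (scale, ints) pairs."""
    return [scale * q for scale, ints in groups for q in ints]


weights = [(i - 32) / 32 for i in range(64)]  # 64 values in [-1.0, 1.0)
groups = quantize_groupwise(weights, group_size=32)
restored = dequantize_groupwise(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(groups), "groups, max reconstruction error:", max_error)
```

A smaller group size means more scales to store but a tighter fit per group, which is the trade-off the `group_size` option controls in this simplified picture.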
| 57 | + |
| 58 | +## Config Classes |
| 59 | +All options in `extension/llm/export/config/llm_config.py`: |
| 60 | +- `LlmConfig` - top level |
| 61 | +- `ExportConfig` - max_seq_length, max_context_length |
| 62 | +- `ModelConfig` - model optimizations |
| 63 | +- `QuantizationConfig` - quantization options |
| 64 | +- `BackendConfig` - backend settings |
| 65 | +- `DebugConfig` - verbose, etrecord, profiling |