llm-agent example: inference crashes on MPS, and half_lora is silently ignored (masking broken fp16 training)

**What happened**:

I ran `examples/llm-agent/singletask_learning_bench` end-to-end following its README (macOS ARM64, Python 3.12, torch with MPS). Training completes (20 epochs, train_loss 1.39), but inference crashes:

```
RuntimeError: Placeholder storage has not been allocated on MPS device!
```

The failure is in `predict()` at `testalgorithms/basemodel.py:76`, inside `self.model.generate(**inputs, ...)`. Transformers warns right before the crash:

```
UserWarning: You are calling .generate() with the `input_ids` being on a device type
different than your model's device. `input_ids` is on cpu, whereas the model is on mps.
```

**Root cause 1 — inputs never moved to the model device**

`predict()` tokenizes with `return_tensors="pt"` but leaves the tensors on CPU, while the model is loaded with `device_map` from `config.json` (`"auto"`). On CUDA, accelerate's hooks happen to paper over this; on MPS it hard-crashes. One-line fix:

```python
inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True,
                        max_length=self.MAX_LENGTH).to(self.model.device)
```

**Root cause 2 (found while debugging) — `half_lora` is a string, so fp16 never runs, and the fp16 path is broken anyway**

`train_config.json` sets `"half_lora": "True"` (a string). The code checks `if half==True:`, which is always False for a string, so `model.half()` never executes — but `adam_epsilon=(1e-4 if half else 1e-8)` treats the non-empty string as truthy, so the fp16-tuned epsilon is silently applied to fp32 training.

Worse: if you fix the comparison so fp16 actually engages, training silently collapses — `model.half()` without loss scaling underflows the gradients:

| run | half active | train_loss | rouge1 / rougeL |
|-----|-------------|------------|-----------------|
| fp32 (current effective behavior) | no | 1.39 | 10.0 / 10.0 (matches README) |
| fp16 via fixed comparison | yes | 0.0 | 0.0 / 0.0 |

So the string bug has been masking a numerically broken fp16 path on every platform, not just macOS.

**Impact**:

- The example is unrunnable on Apple Silicon, although the README lists macOS as supported.
- Anyone fixing the `half_lora` comparison locally gets silent garbage results (loss 0, all metrics 0) with no error — the worst failure mode for a benchmark.

**Proposed fix** (example-only, no ianvs core changes):

1. Move tokenized inputs to `self.model.device` in `predict()`.
2. Parse `half_lora` as a real boolean and default it to `false`, with a note that proper fp16 should go through `TrainingArguments(fp16=True)` on CUDA.

I have both fixes working locally, verified against the README's expected leaderboard. PR incoming.

**Environment**: macOS ARM64, Python 3.12.13, ianvs @ bf0f596, deps per the example's requirements.txt.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llm-agent example: inference crashes on MPS, and half_lora is silently ignored (masking broken fp16 training) #533

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

run	half active	train_loss	rouge1 / rougeL
fp32 (current effective behavior)	no	1.39	10.0 / 10.0 (matches README)
fp16 via fixed comparison	yes	0.0	0.0 / 0.0

llm-agent example: inference crashes on MPS, and half_lora is silently ignored (masking broken fp16 training) #533

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions