What happened:
I ran examples/llm-agent/singletask_learning_bench end-to-end following its README (macOS ARM64, Python 3.12, torch with MPS). Training completes (20 epochs, train_loss 1.39), but inference crashes:
RuntimeError: Placeholder storage has not been allocated on MPS device!
The failure is in predict() at testalgorithms/basemodel.py:76, inside self.model.generate(**inputs, ...). Transformers warns right before the crash:
UserWarning: You are calling .generate() with the `input_ids` being on a device type
different than your model's device. `input_ids` is on cpu, whereas the model is on mps.
Root cause 1 — inputs never moved to the model device
predict() tokenizes with return_tensors="pt" but leaves the tensors on CPU, while the model is loaded with device_map from config.json ("auto"). On CUDA, accelerate's hooks happen to paper over this; on MPS it hard-crashes. One-line fix:
inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True,
max_length=self.MAX_LENGTH).to(self.model.device)
Root cause 2 (found while debugging) — half_lora is a string, so fp16 never runs, and the fp16 path is broken anyway
train_config.json sets "half_lora": "True" (a string). The code checks if half==True:, which is always False for a string, so model.half() never executes — but adam_epsilon=(1e-4 if half else 1e-8) treats the non-empty string as truthy, so the fp16-tuned epsilon is silently applied to fp32 training.
Worse: if you fix the comparison so fp16 actually engages, training silently collapses — model.half() without loss scaling underflows the gradients:
| run |
half active |
train_loss |
rouge1 / rougeL |
| fp32 (current effective behavior) |
no |
1.39 |
10.0 / 10.0 (matches README) |
| fp16 via fixed comparison |
yes |
0.0 |
0.0 / 0.0 |
So the string bug has been masking a numerically broken fp16 path on every platform, not just macOS.
Impact:
- The example is unrunnable on Apple Silicon, although the README lists macOS as supported.
- Anyone fixing the
half_lora comparison locally gets silent garbage results (loss 0, all metrics 0) with no error — the worst failure mode for a benchmark.
Proposed fix (example-only, no ianvs core changes):
- Move tokenized inputs to
self.model.device in predict().
- Parse
half_lora as a real boolean and default it to false, with a note that proper fp16 should go through TrainingArguments(fp16=True) on CUDA.
I have both fixes working locally, verified against the README's expected leaderboard. PR incoming.
Environment: macOS ARM64, Python 3.12.13, ianvs @ bf0f596, deps per the example's requirements.txt.
What happened:
I ran
examples/llm-agent/singletask_learning_benchend-to-end following its README (macOS ARM64, Python 3.12, torch with MPS). Training completes (20 epochs, train_loss 1.39), but inference crashes:The failure is in
predict()attestalgorithms/basemodel.py:76, insideself.model.generate(**inputs, ...). Transformers warns right before the crash:Root cause 1 — inputs never moved to the model device
predict()tokenizes withreturn_tensors="pt"but leaves the tensors on CPU, while the model is loaded withdevice_mapfromconfig.json("auto"). On CUDA, accelerate's hooks happen to paper over this; on MPS it hard-crashes. One-line fix:Root cause 2 (found while debugging) —
half_lorais a string, so fp16 never runs, and the fp16 path is broken anywaytrain_config.jsonsets"half_lora": "True"(a string). The code checksif half==True:, which is always False for a string, somodel.half()never executes — butadam_epsilon=(1e-4 if half else 1e-8)treats the non-empty string as truthy, so the fp16-tuned epsilon is silently applied to fp32 training.Worse: if you fix the comparison so fp16 actually engages, training silently collapses —
model.half()without loss scaling underflows the gradients:So the string bug has been masking a numerically broken fp16 path on every platform, not just macOS.
Impact:
half_loracomparison locally gets silent garbage results (loss 0, all metrics 0) with no error — the worst failure mode for a benchmark.Proposed fix (example-only, no ianvs core changes):
self.model.deviceinpredict().half_loraas a real boolean and default it tofalse, with a note that proper fp16 should go throughTrainingArguments(fp16=True)on CUDA.I have both fixes working locally, verified against the README's expected leaderboard. PR incoming.
Environment: macOS ARM64, Python 3.12.13, ianvs @ bf0f596, deps per the example's requirements.txt.