Skip to content

llm-agent example: inference crashes on MPS, and half_lora is silently ignored (masking broken fp16 training) #533

Description

@veyron-kairo

What happened:

I ran examples/llm-agent/singletask_learning_bench end-to-end following its README (macOS ARM64, Python 3.12, torch with MPS). Training completes (20 epochs, train_loss 1.39), but inference crashes:

RuntimeError: Placeholder storage has not been allocated on MPS device!

The failure is in predict() at testalgorithms/basemodel.py:76, inside self.model.generate(**inputs, ...). Transformers warns right before the crash:

UserWarning: You are calling .generate() with the `input_ids` being on a device type
different than your model's device. `input_ids` is on cpu, whereas the model is on mps.

Root cause 1 — inputs never moved to the model device

predict() tokenizes with return_tensors="pt" but leaves the tensors on CPU, while the model is loaded with device_map from config.json ("auto"). On CUDA, accelerate's hooks happen to paper over this; on MPS it hard-crashes. One-line fix:

inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True,
                        max_length=self.MAX_LENGTH).to(self.model.device)

Root cause 2 (found while debugging) — half_lora is a string, so fp16 never runs, and the fp16 path is broken anyway

train_config.json sets "half_lora": "True" (a string). The code checks if half==True:, which is always False for a string, so model.half() never executes — but adam_epsilon=(1e-4 if half else 1e-8) treats the non-empty string as truthy, so the fp16-tuned epsilon is silently applied to fp32 training.

Worse: if you fix the comparison so fp16 actually engages, training silently collapses — model.half() without loss scaling underflows the gradients:

run half active train_loss rouge1 / rougeL
fp32 (current effective behavior) no 1.39 10.0 / 10.0 (matches README)
fp16 via fixed comparison yes 0.0 0.0 / 0.0

So the string bug has been masking a numerically broken fp16 path on every platform, not just macOS.

Impact:

  • The example is unrunnable on Apple Silicon, although the README lists macOS as supported.
  • Anyone fixing the half_lora comparison locally gets silent garbage results (loss 0, all metrics 0) with no error — the worst failure mode for a benchmark.

Proposed fix (example-only, no ianvs core changes):

  1. Move tokenized inputs to self.model.device in predict().
  2. Parse half_lora as a real boolean and default it to false, with a note that proper fp16 should go through TrainingArguments(fp16=True) on CUDA.

I have both fixes working locally, verified against the README's expected leaderboard. PR incoming.

Environment: macOS ARM64, Python 3.12.13, ianvs @ bf0f596, deps per the example's requirements.txt.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions