ONNX adapter created by "olive convert-adapters" command cannot work with ONNX model created by "olive auto-opt" #2277

@zhenchaoni

Describe the bug
I have a Hugging Face model, "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", and a PEFT adapter. I run the auto-opt command with both the HF model and the PEFT adapter as inputs to generate the ONNX model, and the convert-adapters command with the PEFT adapter as input to generate the ONNX adapter file. However, the ONNX model and the ONNX adapter do not work together. The runtime error is "RuntimeError: Invalid input name: model.layers.12.self_attn.v_proj.lora_A.weight".

To Reproduce
generate a PEFT adapter

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
model = AutoModelForCausalLM.from_pretrained(model_name)

# Create a LoRA config
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
    lora_dropout=0.0,
)

# Create a PEFT model wrapper
peft_model = get_peft_model(model, lora_config)

# Optionally train the model; training does not affect the repro of the bug

# Save the LoRA adapter
peft_model.save_pretrained("empty_lora")

generate the ONNX model

Note that both the HF model name and the PEFT adapter are inputs. auto-opt internally uses ModelBuilder and the ExtractAdapter pass, so it generates an ONNX model with adapter slots as well as an ONNX adapter file. Only the ONNX model is used for this repro.

olive auto-opt --model_name_or_path "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" --adapter_path empty_lora --device cpu --provider CPUExecutionProvider --use_model_builder --output_path basemodel-with-slots --log_level 0

generate ONNX adapter file

olive convert-adapters --adapter_path empty_lora --output_path convert_adapter_result --log_level 0

inference

I mostly reuse the inference code from the Olive example. The code is pasted below.

import onnxruntime_genai as og
import time

model_folder = "basemodel-with-slots"  # generated by olive auto-opt
# adapter_path = "basemodel-with-slots/adapter_weights.onnx_adapter"  # generated by olive auto-opt, inference works
adapter_path = "convert_adapter_result.onnx_adapter"  # generated by olive convert-adapters, inference fails

# Load the base model and tokenizer
model = og.Model(model_folder)
print(dir(model))
adapters = og.Adapters(model) #Adapter code
adapters.load(adapter_path, "en_medical_reasoning") #Adapter code
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

prompt_template = """
Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>
"""

question = """
        A 54-year-old construction worker with a long history of smoking presents with swelling in his upper extremity and face, along with 
        dilated veins in this region. After conducting a CT scan and venogram of the neck, what is the most likely diagnosis for the cause of these symptoms?
"""
prompt = prompt_template.format(question)  # the template has a single {} placeholder

# Encode the prompt using the tokenizer
input_tokens = tokenizer.encode(prompt)

# Create params and generator
params = og.GeneratorParams(model)
generator = og.Generator(model, params)
generator.set_active_adapter(adapters, "en_medical_reasoning") #Adapter code

# Append input tokens to the generator
generator.append_tokens(input_tokens)

print("")
print("Output: ", end="", flush=True)

token_times = []

# Stream the output
while True:
    start_time = time.time()
    if generator.is_done():
        break
    generator.generate_next_token()
    end_time = time.time()
    
    # Record the time for this token generation
    token_time = end_time - start_time
    token_times.append(token_time)

    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)

print()

# Calculate and display timing statistics
if token_times:
    total_tokens = len(token_times)
    avg_time = sum(token_times) / total_tokens
    
    print(f"Total tokens generated: {total_tokens}")
    print(f"Average time per token: {avg_time:.4f} seconds")
    print(f"Tokens per second: {total_tokens / sum(token_times):.2f}")

del generator

Actual behavior

  • The ONNX adapter file generated by convert-adapters does not work with the ONNX model generated by auto-opt. After inspecting the ONNX model, I think the root cause is that the adapter input names in the model and the names in the adapter file do not match.
  • The ONNX adapter file and the ONNX model that are both generated by auto-opt work together, but that combination is not what this issue is about.
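The suspected name mismatch can be checked directly. Below is a minimal sketch, not a verified diagnosis: it assumes the model file inside the auto-opt output folder is named model.onnx, and that the installed onnxruntime exposes `AdapterFormat.read_adapter` and `get_parameters` (I believe recent releases do). The helper names are my own and purely illustrative.

```python
from pathlib import Path


def model_input_names(model_path):
    """Return the graph input names of an ONNX model without loading weights."""
    import onnx  # imported lazily so compare_names works without onnx installed

    model = onnx.load(str(model_path), load_external_data=False)
    return {graph_input.name for graph_input in model.graph.input}


def adapter_parameter_names(adapter_path):
    """Return the parameter names stored in an .onnx_adapter file.

    Assumption: onnxruntime provides AdapterFormat.read_adapter/get_parameters.
    """
    import onnxruntime as ort

    adapter = ort.AdapterFormat.read_adapter(str(adapter_path))
    return set(adapter.get_parameters().keys())


def compare_names(model_inputs, adapter_names):
    """Split adapter names into those the model accepts and those it does not."""
    model_inputs, adapter_names = set(model_inputs), set(adapter_names)
    return {
        "matched": sorted(adapter_names & model_inputs),
        "unknown_to_model": sorted(adapter_names - model_inputs),
    }


if __name__ == "__main__":
    model_file = Path("basemodel-with-slots") / "model.onnx"  # assumed file name
    adapter_file = Path("convert_adapter_result.onnx_adapter")
    if model_file.exists() and adapter_file.exists():
        report = compare_names(
            model_input_names(model_file),
            adapter_parameter_names(adapter_file),
        )
        print(len(report["matched"]), "adapter names match model inputs")
        for name in report["unknown_to_model"][:10]:
            print("not a model input:", name)
```

Running the same comparison against basemodel-with-slots/adapter_weights.onnx_adapter should show how the names produced by auto-opt differ from those produced by convert-adapters.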

Expected behavior
The ONNX adapter file generated by convert-adapters should work with the ONNX model generated by auto-opt.
If this issue is fixed, I only need to create the ONNX model once with the auto-opt command. After each new fine-tuning run, I only need to convert the PEFT adapter to an ONNX adapter, without regenerating the ONNX model.

Other information

  • OS: Windows
  • Olive version: 0.10.1
  • ONNXRuntime package and version: onnxruntime 1.23.2, onnxruntime_genai 0.10.0
  • Transformers package version: [e.g. transformers 4.57.1]
