When preparing data for SFT, a lot of rows are getting filtered out.
The issue occurs when the script applies the chat template for partial conversations here (where a newline is added at the end) and then tries to check whether the string matches as a prefix for the full template here
This can be fixed by adding a .strip() after applying chat template.
# Tool and user messages need generation prompt, others don't
add_gen_prompt = messages[i]["role"] == "tool" or messages[i]["role"] == "user"
template_up_to_here = tokenizer.apply_chat_template(
messages[: i + 1],
tokenize=False,
add_generation_prompt=add_gen_prompt,
tools=tools,
chat_template_kwargs={"enable_thinking": enable_thinking},
).strip()
When preparing data for SFT, a lot of rows are getting filtered out.
The issue occurs when the script applies the chat template for partial conversations here (where a newline is added at the end) and then tries to check whether the string matches as a prefix for the full template here
This can be fixed by adding a
.strip()after applying chat template.