Skip to content

SFT data preparation filters out a lot of samples. #184

@kipraveen

Description

@kipraveen

When preparing data for SFT, a lot of rows are getting filtered out.
The issue occurs when the script applies the chat template for partial conversations here (where a newline is added at the end) and then tries to check whether the string matches as a prefix for the full template here

This can be fixed by adding a .strip() after applying chat template.

            # Tool and user messages need generation prompt, others don't
            add_gen_prompt = messages[i]["role"] == "tool" or messages[i]["role"] == "user"
            template_up_to_here = tokenizer.apply_chat_template(
                messages[: i + 1],
                tokenize=False,
                add_generation_prompt=add_gen_prompt,
                tools=tools,
                chat_template_kwargs={"enable_thinking": enable_thinking},
            ).strip()

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions