SFT data preparation filters out a lot of samples.

When preparing data for SFT, a lot of rows are getting filtered out.
The issue occurs when the script applies the chat template for partial conversations [here](https://github.com/NVIDIA-NeMo/Nemotron/blob/acdf8fc6c5be40b447ef96e5656a4e0d4f28081b/src/nemotron/data_prep/core/chat_template.py#L202) (where a newline is added at the end) and then tries to check whether the string matches as a prefix for the full template [here](https://github.com/NVIDIA-NeMo/Nemotron/blob/acdf8fc6c5be40b447ef96e5656a4e0d4f28081b/src/nemotron/data_prep/core/chat_template.py#L214)

This can be fixed by adding a `.strip()` after applying chat template.

```python
            # Tool and user messages need generation prompt, others don't
            add_gen_prompt = messages[i]["role"] == "tool" or messages[i]["role"] == "user"
            template_up_to_here = tokenizer.apply_chat_template(
                messages[: i + 1],
                tokenize=False,
                add_generation_prompt=add_gen_prompt,
                tools=tools,
                chat_template_kwargs={"enable_thinking": enable_thinking},
            ).strip()
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SFT data preparation filters out a lot of samples. #184

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

SFT data preparation filters out a lot of samples. #184

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions