[Feature Request / Question] How to achieve Token-Level Weighting for Dataset Blending in Energon? #212

@hannlp

Description

The Problem

Currently, when blending multiple pre-training datasets using Megatron-Energon (e.g., via MetadatasetV2 with blend or blend_epochized), the sampling ratios defined by weight or repetitions are strictly sample-level (document-level).
While this design works well for uniform data (like images of the same resolution), it creates a severe token imbalance when mixing text corpora from vastly different domains. Pre-training datasets inherently vary in document length—for example, short social media posts versus lengthy code repositories or books.

Example Scenario:

Suppose we want to mix a Web Crawl corpus (Dataset A) and a Books corpus (Dataset B) with a target token ratio of 1:1 (A:B) during pre-training.
Dataset A (Web): Average length = 400 tokens/document.
Dataset B (Books): Average length = 4,000 tokens/document.
If we naively set the weight to 1:1 in the metadataset.yaml, Energon will sample an equal number of documents from both. However, after the TaskEncoder packs these documents into sequences (e.g., seq_length=4096), the tokens fed into the model will be dominated by Dataset B: roughly 10 Dataset B tokens for every Dataset A token (a 1:10 A:B token ratio in reality), which breaks the intended 1:1 mixture distribution.
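To make the arithmetic concrete, here is a minimal sketch in plain Python (it does not use any Energon API). The function name `realized_token_shares` and the dataset keys `web` and `books` are hypothetical, and it assumes packing neither truncates nor drops documents, so expected tokens per dataset are simply sample weight times average document length.

```python
def realized_token_shares(sample_weights, avg_tokens):
    """Token share each dataset ends up with under sample-level weights.

    Assumption: expected tokens contributed by a dataset are proportional to
    (sample weight) * (average tokens per document).
    """
    expected = {name: sample_weights[name] * avg_tokens[name] for name in sample_weights}
    total = sum(expected.values())
    return {name: tokens / total for name, tokens in expected.items()}


# Scenario from this issue: 1:1 sample weights, 400 vs. 4,000 tokens per document.
shares = realized_token_shares(
    sample_weights={"web": 1.0, "books": 1.0},
    avg_tokens={"web": 400, "books": 4000},
)
for name, share in shares.items():
    print(f"{name}: {share:.1%} of tokens")
```

With these numbers the books corpus supplies about ten elevenths of all tokens, even though both corpora are sampled equally often.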

Current Workaround

Currently, we have to manually pre-calculate the average token length of every single dataset in our pre-training mix and scale each weight parameter in the YAML config inversely to that length to compensate for the disparity (e.g., setting weight: 10 for Dataset A and weight: 1 for Dataset B).
This workaround is highly error-prone, cumbersome, and impractical for large-scale pre-training data mixtures containing hundreds of distinct domains, especially since average lengths fluctuate as corpora are updated or re-filtered.
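For reference, the manual compensation amounts to the following, shown as a minimal sketch in plain Python. The helper name `sample_weights_for_token_targets` and the dataset keys are hypothetical and this is not an Energon feature. It assumes the average lengths were measured with the same tokenizer used for training, and the resulting numbers would still have to be copied into the weight fields by hand.

```python
def sample_weights_for_token_targets(target_token_share, avg_tokens):
    """Convert target token shares into sample-level weights.

    Each weight is proportional to (target token share) / (average tokens per
    document), then rescaled so the smallest weight equals 1.
    """
    raw = {name: target_token_share[name] / avg_tokens[name] for name in target_token_share}
    smallest = min(raw.values())
    return {name: weight / smallest for name, weight in raw.items()}


# Scenario from this issue: target 1:1 token ratio, 400 vs. 4,000 tokens per document.
weights = sample_weights_for_token_targets(
    target_token_share={"web": 0.5, "books": 0.5},
    avg_tokens={"web": 400, "books": 4000},
)
for name, weight in weights.items():
    print(f"{name}: weight ~ {weight:.2f}")
```

This reproduces the roughly 10:1 sample weighting described above, but it has to be recomputed whenever a corpus is updated or re-filtered, which is exactly the maintenance burden we would like to avoid.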

Questions / Seeking Advice

Are there any existing best practices or built-in features in Energon to handle this document length discrepancy gracefully?
If not, is there any plan to support Token-Level Blending (or Length-Aware Blending) natively in future releases?
Precise control over the token domain distribution is critical for model quality in LLM pre-training. We would love to hear your thoughts or recommended solutions on this issue.
@tabo @hartsock @jaredcasper @aaronp24 @aflat
