[Feature Request / Question] How to achieve Token-Level Weighting for Dataset Blending in Energon?

## The Problem
Currently, when blending multiple pre-training datasets using Megatron-Energon (e.g., via `MetadatasetV2` with `blend` or `blend_epochized`), the sampling ratios defined by `weight` or `repetitions` are strictly sample-level (document-level).
While this design works well for uniform data (like images of the same resolution), it creates a severe token imbalance when mixing text corpora from vastly different domains. Pre-training datasets inherently vary in document length—for example, short social media posts versus lengthy code repositories or books.
## Example Scenario
Suppose we want to mix a Web Crawl corpus (Dataset A) and a Books corpus (Dataset B) with a target token ratio of 1:1 during pre-training.

- Dataset A (Web): average length = 400 tokens/document.
- Dataset B (Books): average length = 4,000 tokens/document.

If we naively set the weights to 1:1 in `metadataset.yaml`, Energon samples an equal number of documents from both datasets. However, once the TaskEncoder packs these documents into sequences (e.g., seq_length=4096), the tokens actually fed to the model are dominated by Dataset B: the token ratio is roughly 10:1 in favor of Books (about 91% of all tokens), which completely disrupts the intended 1:1 mixture.
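
To make the arithmetic explicit, here is a minimal sketch of the expected token share under document-level sampling. It is plain Python and does not call Energon; the function name `token_shares` and the dataset keys are made up for illustration, and it assumes documents are packed without truncation.

```python
def token_shares(doc_weights, mean_tokens):
    """Expected fraction of tokens per dataset when sampling is document-level.

    doc_weights: relative document-level sampling weights (hypothetical names).
    mean_tokens: average tokens per document for the same dataset names.
    """
    total_weight = sum(doc_weights.values())
    # Expected tokens contributed per sampled document, per dataset.
    expected = {
        name: (doc_weights[name] / total_weight) * mean_tokens[name]
        for name in doc_weights
    }
    total_tokens = sum(expected.values())
    return {name: tokens / total_tokens for name, tokens in expected.items()}


# The scenario above: equal document weights, 400 vs. 4,000 tokens/document.
shares = token_shares(
    doc_weights={"web": 1, "books": 1},
    mean_tokens={"web": 400, "books": 4000},
)
print(shares)  # books ends up with roughly 10/11 of the tokens
```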
## Current Workaround
Currently, we have to manually pre-calculate the average token length of every single dataset in our pre-training mix and scale the `weight` parameter in the YAML config inversely to that length to compensate for the disparity. For the example above, that means setting `weight: 10` for Dataset A and `weight: 1` for Dataset B.
This workaround is highly error-prone, cumbersome, and impractical for large-scale pre-training data mixtures containing hundreds of distinct domains, especially since average lengths fluctuate as corpora are updated or re-filtered.
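
For reference, the compensation amounts to something like the sketch below: each document-level weight is proportional to the target token share divided by the average document length. This is plain Python, not an Energon API; `doc_weights_for_token_targets` is a hypothetical helper, and the average lengths are assumed to have been measured beforehand.

```python
def doc_weights_for_token_targets(target_token_share, mean_tokens):
    """Document-level weights that yield the target token shares in expectation.

    target_token_share: desired fraction of tokens per dataset.
    mean_tokens: measured average tokens per document (must be positive).
    """
    for name, length in mean_tokens.items():
        if length <= 0:
            raise ValueError(f"mean token length for {name!r} must be positive")
    # weight_i is proportional to target_share_i / mean_length_i
    raw = {
        name: target_token_share[name] / mean_tokens[name]
        for name in target_token_share
    }
    # Rescale so the smallest weight is 1, which is easier to read in a YAML config.
    smallest = min(raw.values())
    return {name: value / smallest for name, value in raw.items()}


# Target a 1:1 token ratio for the Web/Books example.
weights = doc_weights_for_token_targets(
    target_token_share={"web": 0.5, "books": 0.5},
    mean_tokens={"web": 400, "books": 4000},
)
print(weights)  # approximately web=10, books=1
```

The resulting values are what we then copy by hand into the `weight` fields, and they have to be recomputed every time a corpus changes.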
## Questions / Seeking Advice
Are there any existing best practices or built-in features in Energon to handle this document length discrepancy gracefully?
If not, is there any plan to support Token-Level Blending (or Length-Aware Blending) natively in future releases?
Precise control over the token domain distribution is critical for model quality in LLM pre-training. We would love to hear your thoughts or recommended solutions on this issue.
@tabo @hartsock @jaredcasper @aaronp24 @aflat 
