
Conversation

@klei22 (Collaborator) commented Jan 26, 2026

Byte Tokenization Reports

These changes add a new --byte_tokenization_report flag to prepare.py that reports the count and percentage of byte tokens vs. non-byte tokens in the dataset.

Sample scripts are located in the Flores-200 directory (data/flores200-res).

Testing:

cd data/flores200-res
bash get_dataset.sh
bash byte_report.sh
python3 utils/plot_byte_report.py tiktoken*/*.yaml --percent-plot stacked

Sample Graphs

[sample graph images attached to the PR]

Copilot AI left a comment

Pull request overview

Adds byte-tokenization reporting to the dataset preparation workflow and supporting scripts to generate/visualize byte vs non-byte token ratios.

Changes:

  • Add --report_byte_tokenization support to tokenizers + prepare.py, including per-split (train/val) report printing and optional YAML report writing.
  • Introduce a Plotly utility to visualize byte-token reports across runs.
  • Add/adjust Flores-200 helper scripts for generating byte-tokenization reports.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Summary per file:

  • data/template/utils/plot_byte_report.py: New Plotly-based report visualizer for YAML byte-tokenization outputs.
  • data/template/tokenizers.py: Adds byte-token accounting/reporting hooks and byte-token detection in relevant tokenizers; adjusts CharBPE JSON output path handling.
  • data/template/prepare.py: Adds a CLI flag plus helper functions to reset/print/write byte-tokenization reports during tokenization; ensures .bin writing respects --skip_tokenization.
  • data/flores200-res/get_dataset.sh: Removes one language entry from the sample dataset list.
  • data/flores200-res/byte_report.sh: New sample script to run prepare.py across languages/tokenizers with byte-token reporting enabled.


Comment on lines +244 to +248
_reset_byte_token_report(tokenizer)
val_ids = tokenizer.tokenize(val_data)
_print_byte_token_report(tokenizer, "val")
if args.output_tokenization_subdir:
    _write_byte_token_report(tokenizer, "val", output_dir)
Copilot AI Jan 26, 2026

Same as above for validation: report file emission is gated on --output_tokenization_subdir, so --report_byte_tokenization with a separate --val_input still won’t produce a YAML report unless -s is set. Consider writing the report based on output paths rather than subdir mode.

Comment on lines 38 to +46
def record_token(self, token_id):
    if self.token_counts is not None:
        self.token_counts[token_id] += 1
    if self.report_byte_tokenization and self.has_byte_tokens:
        if self.is_byte_token(token_id):
Copilot AI Jan 26, 2026

The byte-token reporting path (record_token + is_byte_token + get_byte_token_report) introduces new observable behavior, but the existing tokenizer test suite (data/template/tests.py) doesn’t cover it. Add a unit test that enables report_byte_tokenization, tokenizes a small known string with a byte-fallback tokenizer (and/or tiktoken), and asserts the returned counts/percentages are consistent and sum correctly.
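A minimal sketch of what such a test could look like (hedged: the factory helper and report dictionary keys below are hypothetical placeholders, not names from this PR; only get_byte_token_report is taken from the comment above):

def test_byte_token_report_sums():
    # Hypothetical construction; substitute the repo's actual tokenizer setup.
    tokenizer = make_tokenizer("sentencepiece", report_byte_tokenization=True)
    # Mixed ASCII and non-ASCII input so the byte-fallback path is exercised.
    tokenizer.tokenize("hello 世界")
    report = tokenizer.get_byte_token_report()
    total = report["byte_tokens"] + report["non_byte_tokens"]
    assert total > 0
    # Percentages should be consistent with the counts and sum to 100.
    assert abs(report["byte_percent"] + report["non_byte_percent"] - 100.0) < 1e-6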

Comment on lines +89 to +90
fig = go.Figure()
fig.add_trace(go.Pie(labels=["byte", "non-byte"], values=[sum(byte_pct), sum(non_byte_pct)]))
Copilot AI Jan 26, 2026

In pie mode you’re summing per-report percentages (sum(byte_pct) and sum(non_byte_pct)), which makes the pie values scale with the number of reports (and won’t represent an “average” distribution). Compute an average percentage (or a weighted average using token totals from the reports) so the pie reflects a single 0–100% breakdown.

Suggested change
-fig = go.Figure()
-fig.add_trace(go.Pie(labels=["byte", "non-byte"], values=[sum(byte_pct), sum(non_byte_pct)]))
+# Compute average percentages so the pie reflects a single 0–100% breakdown
+if byte_pct:
+    avg_byte = sum(byte_pct) / len(byte_pct)
+    avg_non = sum(non_byte_pct) / len(non_byte_pct)
+else:
+    avg_byte = 0.0
+    avg_non = 0.0
+fig = go.Figure()
+fig.add_trace(go.Pie(labels=["byte", "non-byte"], values=[avg_byte, avg_non]))
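If a weighted average is preferred, one sketch (assuming the byte_counts/non_byte_counts lists already collected for the counts figure hold absolute per-report token counts) is:

# Weight each report by its total token count rather than averaging percentages.
totals = [b + n for b, n in zip(byte_counts, non_byte_counts)]
grand_total = sum(totals)
avg_byte = 100.0 * sum(byte_counts) / grand_total if grand_total else 0.0
avg_non = 100.0 - avg_byte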

Comment on lines +153 to +155
counts_fig = _build_counts_figure(labels, byte_counts, non_byte_counts)
counts_png = output_dir / "byte_token_counts.png"
counts_fig.write_image(str(counts_png))
Copilot AI Jan 26, 2026

write_image() requires the optional kaleido dependency; the script currently always writes PNGs even when --write-html is used, so it will error in environments without kaleido (and requirements_cpu.txt doesn’t include it). Consider gating PNG export behind a flag, detecting kaleido and falling back to HTML-only, or updating requirements/documentation to include kaleido.
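One possible shape for that fallback (a sketch; export_figure is a hypothetical helper wrapping the calls shown above):

from pathlib import Path
import importlib.util

def export_figure(fig, output_dir: Path, stem: str) -> None:
    # Plotly's static write_image() needs the optional kaleido package;
    # degrade to a self-contained HTML file when it is not installed.
    if importlib.util.find_spec("kaleido") is not None:
        fig.write_image(str(output_dir / f"{stem}.png"))
    else:
        fig.write_html(str(output_dir / f"{stem}.html"))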

Comment on lines +94 to +95
parser.add_argument("--report_byte_tokenization", action="store_true",
                    help="Report byte vs non-byte token counts after tokenization (byte-fallback tokenizers only)")
Copilot AI Jan 26, 2026

PR description mentions a --byte_tokenization_report flag, but the CLI option added is --report_byte_tokenization. Please align the flag name with the documented interface (or update the PR description/docs) to avoid confusion for users and scripts.
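If both spellings should keep working, argparse accepts multiple option strings for a single flag (a sketch, not code from this PR):

# Accept both names; either option string sets args.report_byte_tokenization.
parser.add_argument("--report_byte_tokenization", "--byte_tokenization_report",
                    action="store_true",
                    help="Report byte vs non-byte token counts after tokenization")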

Comment on lines +225 to +229
_reset_byte_token_report(tokenizer)
train_ids = tokenizer.tokenize(train_data)
_print_byte_token_report(tokenizer, "train")
if args.output_tokenization_subdir:
    _write_byte_token_report(tokenizer, "train", output_dir)
Copilot AI Jan 26, 2026

Byte-token YAML reports are only written when --output_tokenization_subdir is enabled. With --report_byte_tokenization alone, users get console output but no report file to plot/compare. Consider writing the report next to meta.pkl/outputs by default (e.g., derived from meta_output_path), or add an explicit --byte_token_report_dir option.
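A sketch of the first suggestion (assumes meta_output_path is the existing variable prepare.py uses to locate meta.pkl, and reuses the helper signature from the snippet above):

from pathlib import Path

# Write the YAML report next to meta.pkl whenever reporting is requested,
# independent of --output_tokenization_subdir.
if args.report_byte_tokenization:
    report_dir = Path(meta_output_path).parent  # meta_output_path: assumed existing variable
    _write_byte_token_report(tokenizer, "train", report_dir)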
