Add byte tokenization report #731
base: master
Conversation
Pull request overview
Adds byte-tokenization reporting to the dataset preparation workflow and supporting scripts to generate/visualize byte vs non-byte token ratios.
Changes:
- Add `--report_byte_tokenization` support to tokenizers + `prepare.py`, including per-split (train/val) report printing and optional YAML report writing.
- Introduce a Plotly utility to visualize byte-token reports across runs.
- Add/adjust Flores-200 helper scripts for generating byte-tokenization reports.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| data/template/utils/plot_byte_report.py | New Plotly-based report visualizer for YAML byte-tokenization outputs. |
| data/template/tokenizers.py | Adds byte-token accounting/reporting hooks and byte-token detection in relevant tokenizers; adjusts CharBPE JSON output path handling. |
| data/template/prepare.py | Adds CLI flag + helper functions to reset/print/write byte-tokenization reports during tokenization; ensures .bin writing respects --skip_tokenization. |
| data/flores200-res/get_dataset.sh | Removes one language entry from the sample dataset list. |
| data/flores200-res/byte_report.sh | New sample script to run prepare.py across languages/tokenizers with byte-token reporting enabled. |
In `data/template/prepare.py` (validation split):

```python
_reset_byte_token_report(tokenizer)
val_ids = tokenizer.tokenize(val_data)
_print_byte_token_report(tokenizer, "val")
if args.output_tokenization_subdir:
    _write_byte_token_report(tokenizer, "val", output_dir)
```
Copilot AI · Jan 26, 2026
Same as above for validation: report file emission is gated on --output_tokenization_subdir, so --report_byte_tokenization with a separate --val_input still won’t produce a YAML report unless -s is set. Consider writing the report based on output paths rather than subdir mode.
In `data/template/tokenizers.py`:

```python
def record_token(self, token_id):
    if self.token_counts is not None:
        self.token_counts[token_id] += 1
    if self.report_byte_tokenization and self.has_byte_tokens:
        if self.is_byte_token(token_id):
```
Copilot AI · Jan 26, 2026
The byte-token reporting path (record_token + is_byte_token + get_byte_token_report) introduces new observable behavior, but the existing tokenizer test suite (data/template/tests.py) doesn’t cover it. Add a unit test that enables report_byte_tokenization, tokenizes a small known string with a byte-fallback tokenizer (and/or tiktoken), and asserts the returned counts/percentages are consistent and sum correctly.
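A minimal sketch of such a test, assuming the tokenizer exposes the `report_byte_tokenization` flag and a `get_byte_token_report()` accessor as described in this review; the fixture `make_byte_fallback_tokenizer()` and the report field names are hypothetical and would need to match the actual API:

```python
def test_byte_token_report_is_consistent():
    # Hypothetical fixture: any tokenizer with byte fallback enabled.
    tokenizer = make_byte_fallback_tokenizer()
    tokenizer.report_byte_tokenization = True
    # Non-ASCII characters should force at least some byte-fallback tokens.
    ids = tokenizer.tokenize("héllo wörld")
    report = tokenizer.get_byte_token_report()
    # Byte and non-byte counts must partition the token stream...
    assert report["byte_tokens"] + report["non_byte_tokens"] == len(ids)
    # ...and the percentages must sum to 100 within float tolerance.
    assert abs(report["byte_pct"] + report["non_byte_pct"] - 100.0) < 1e-6
```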
In `data/template/utils/plot_byte_report.py`:

```python
fig = go.Figure()
fig.add_trace(go.Pie(labels=["byte", "non-byte"], values=[sum(byte_pct), sum(non_byte_pct)]))
```
Copilot AI · Jan 26, 2026
In pie mode you’re summing per-report percentages (sum(byte_pct) / sum(non_byte_pct)), which makes the pie values scale with the number of reports (and won’t represent an “average” distribution). Compute an average percentage (or a weighted average using token totals from the reports) so the pie reflects a single 0–100% breakdown.
Suggested change:

```diff
-fig = go.Figure()
-fig.add_trace(go.Pie(labels=["byte", "non-byte"], values=[sum(byte_pct), sum(non_byte_pct)]))
+# Compute average percentages so the pie reflects a single 0–100% breakdown
+if byte_pct:
+    avg_byte = sum(byte_pct) / len(byte_pct)
+    avg_non = sum(non_byte_pct) / len(non_byte_pct)
+else:
+    avg_byte = 0.0
+    avg_non = 0.0
+fig = go.Figure()
+fig.add_trace(go.Pie(labels=["byte", "non-byte"], values=[avg_byte, avg_non]))
```
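A weighted variant, sketched below: since the script already parses `byte_counts` and `non_byte_counts` alongside the percentages (see the next snippet), the pie could weight each report by its token total so larger runs contribute proportionally more:

```python
# Weighted-average sketch: each report contributes in proportion to its
# total token count rather than counting equally.
total = sum(byte_counts) + sum(non_byte_counts)
if total:
    avg_byte = 100.0 * sum(byte_counts) / total
    avg_non = 100.0 * sum(non_byte_counts) / total
else:
    avg_byte = avg_non = 0.0
```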
In `data/template/utils/plot_byte_report.py`:

```python
counts_fig = _build_counts_figure(labels, byte_counts, non_byte_counts)
counts_png = output_dir / "byte_token_counts.png"
counts_fig.write_image(str(counts_png))
```
Copilot AI · Jan 26, 2026
write_image() requires the optional kaleido dependency; the script currently always writes PNGs even when --write-html is used, so it will error in environments without kaleido (and requirements_cpu.txt doesn’t include it). Consider gating PNG export behind a flag, detecting kaleido and falling back to HTML-only, or updating requirements/documentation to include kaleido.
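One possible fallback, as a sketch: detect whether kaleido is importable and only attempt PNG export when it is, otherwise write HTML, which a standard Plotly `Figure` supports with no extra dependency:

```python
import importlib.util

# Export PNG only when kaleido is available; otherwise fall back to HTML.
if importlib.util.find_spec("kaleido") is not None:
    counts_fig.write_image(str(counts_png))
else:
    counts_html = output_dir / "byte_token_counts.html"
    counts_fig.write_html(str(counts_html))
```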
In `data/template/prepare.py`:

```python
parser.add_argument("--report_byte_tokenization", action="store_true",
                    help="Report byte vs non-byte token counts after tokenization (byte-fallback tokenizers only)")
```
Copilot AI · Jan 26, 2026
PR description mentions a --byte_tokenization_report flag, but the CLI option added is --report_byte_tokenization. Please align the flag name with the documented interface (or update the PR description/docs) to avoid confusion for users and scripts.
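If both names should keep working during a transition, argparse accepts multiple option strings for a single flag; a sketch (the final name choice is of course up to the author):

```python
# Register the documented name as an alias of the implemented flag.
parser.add_argument("--report_byte_tokenization", "--byte_tokenization_report",
                    action="store_true", dest="report_byte_tokenization",
                    help="Report byte vs non-byte token counts after tokenization "
                         "(byte-fallback tokenizers only)")
```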
In `data/template/prepare.py` (train split):

```python
_reset_byte_token_report(tokenizer)
train_ids = tokenizer.tokenize(train_data)
_print_byte_token_report(tokenizer, "train")
if args.output_tokenization_subdir:
    _write_byte_token_report(tokenizer, "train", output_dir)
```
Copilot AI · Jan 26, 2026
Byte-token YAML reports are only written when --output_tokenization_subdir is enabled. With --report_byte_tokenization alone, users get console output but no report file to plot/compare. Consider writing the report next to meta.pkl/outputs by default (e.g., derived from meta_output_path), or add an explicit --byte_token_report_dir option.
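A sketch of the suggested fallback, assuming `meta_output_path` is in scope at this point in `prepare.py` (as the comment implies); the variable names are hypothetical:

```python
from pathlib import Path

# Fall back to the directory holding meta.pkl when no tokenization subdir
# is configured, so --report_byte_tokenization alone still writes a YAML file.
if args.output_tokenization_subdir:
    report_dir = output_dir
else:
    report_dir = Path(meta_output_path).parent
_write_byte_token_report(tokenizer, "train", report_dir)
```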
Byte Tokenization Reports
These changes add a new feature, `--byte_tokenization_report`, to `prepare.py` that shows the count and percentage of byte tokens vs non-byte tokens in the dataset. Current sample scripts are located in the Flores-200 directory.
Testing:
Sample Graphs