Add byte tokenization report #731
base: master
Conversation
Pull request overview
Adds byte-tokenization reporting to the dataset preparation workflow and supporting scripts to generate/visualize byte vs non-byte token ratios.
Changes:
- Add `--report_byte_tokenization` support to tokenizers + `prepare.py`, including per-split (train/val) report printing and optional YAML report writing.
- Introduce a Plotly utility to visualize byte-token reports across runs.
- Add/adjust Flores-200 helper scripts for generating byte-tokenization reports.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| data/template/utils/plot_byte_report.py | New Plotly-based report visualizer for YAML byte-tokenization outputs. |
| data/template/tokenizers.py | Adds byte-token accounting/reporting hooks and byte-token detection in relevant tokenizers; adjusts CharBPE JSON output path handling. |
| data/template/prepare.py | Adds CLI flag + helper functions to reset/print/write byte-tokenization reports during tokenization; ensures .bin writing respects --skip_tokenization. |
| data/flores200-res/get_dataset.sh | Removes one language entry from the sample dataset list. |
| data/flores200-res/byte_report.sh | New sample script to run prepare.py across languages/tokenizers with byte-token reporting enabled. |
In `data/template/prepare.py` (validation split):

```python
_reset_byte_token_report(tokenizer)
val_ids = tokenizer.tokenize(val_data)
_print_byte_token_report(tokenizer, "val")
if args.output_tokenization_subdir:
    _write_byte_token_report(tokenizer, "val", output_dir)
```
Copilot AI · Jan 26, 2026
Same as above for validation: report file emission is gated on --output_tokenization_subdir, so --report_byte_tokenization with a separate --val_input still won’t produce a YAML report unless -s is set. Consider writing the report based on output paths rather than subdir mode.
In `data/template/tokenizers.py`:

```python
def record_token(self, token_id):
    if self.token_counts is not None:
        self.token_counts[token_id] += 1
    if self.report_byte_tokenization and self.has_byte_tokens:
        if self.is_byte_token(token_id):
```
Copilot AI · Jan 26, 2026
The byte-token reporting path (record_token + is_byte_token + get_byte_token_report) introduces new observable behavior, but the existing tokenizer test suite (data/template/tests.py) doesn’t cover it. Add a unit test that enables report_byte_tokenization, tokenizes a small known string with a byte-fallback tokenizer (and/or tiktoken), and asserts the returned counts/percentages are consistent and sum correctly.
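A minimal sketch of such a test, assuming the tokenizer exposes the `report_byte_tokenization` flag and a `get_byte_token_report()` accessor as described in this review; the fixture `make_byte_fallback_tokenizer()` and the report field names are hypothetical and would need to match the actual API:

```python
def test_byte_token_report_is_consistent():
    # Hypothetical fixture: any tokenizer with byte fallback enabled.
    tokenizer = make_byte_fallback_tokenizer()
    tokenizer.report_byte_tokenization = True
    # Non-ASCII characters should force at least some byte-fallback tokens.
    ids = tokenizer.tokenize("héllo wörld")
    report = tokenizer.get_byte_token_report()
    # Byte and non-byte counts must partition the token stream...
    assert report["byte_tokens"] + report["non_byte_tokens"] == len(ids)
    # ...and the percentages must sum to 100 within float tolerance.
    assert abs(report["byte_pct"] + report["non_byte_pct"] - 100.0) < 1e-6
```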
In `data/template/utils/plot_byte_report.py`:

```python
fig = go.Figure()
fig.add_trace(go.Pie(labels=["byte", "non-byte"], values=[sum(byte_pct), sum(non_byte_pct)]))
```
Copilot AI · Jan 26, 2026
In pie mode you’re summing per-report percentages (sum(byte_pct) / sum(non_byte_pct)), which makes the pie values scale with the number of reports (and won’t represent an “average” distribution). Compute an average percentage (or a weighted average using token totals from the reports) so the pie reflects a single 0–100% breakdown.
Suggested change:

```diff
-fig = go.Figure()
-fig.add_trace(go.Pie(labels=["byte", "non-byte"], values=[sum(byte_pct), sum(non_byte_pct)]))
+# Compute average percentages so the pie reflects a single 0–100% breakdown
+if byte_pct:
+    avg_byte = sum(byte_pct) / len(byte_pct)
+    avg_non = sum(non_byte_pct) / len(non_byte_pct)
+else:
+    avg_byte = 0.0
+    avg_non = 0.0
+fig = go.Figure()
+fig.add_trace(go.Pie(labels=["byte", "non-byte"], values=[avg_byte, avg_non]))
```
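A weighted variant, sketched below: since the script already parses `byte_counts` and `non_byte_counts` alongside the percentages (see the next snippet), the pie could weight each report by its token total so larger runs contribute proportionally more:

```python
# Weighted-average sketch: each report contributes in proportion to its
# total token count rather than counting equally.
total = sum(byte_counts) + sum(non_byte_counts)
if total:
    avg_byte = 100.0 * sum(byte_counts) / total
    avg_non = 100.0 * sum(non_byte_counts) / total
else:
    avg_byte = avg_non = 0.0
```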
In `data/template/utils/plot_byte_report.py`:

```python
counts_fig = _build_counts_figure(labels, byte_counts, non_byte_counts)
counts_png = output_dir / "byte_token_counts.png"
counts_fig.write_image(str(counts_png))
```
Copilot AI · Jan 26, 2026
write_image() requires the optional kaleido dependency; the script currently always writes PNGs even when --write-html is used, so it will error in environments without kaleido (and requirements_cpu.txt doesn’t include it). Consider gating PNG export behind a flag, detecting kaleido and falling back to HTML-only, or updating requirements/documentation to include kaleido.
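One possible fallback, as a sketch: detect whether kaleido is importable and only attempt PNG export when it is, otherwise write HTML, which a standard Plotly `Figure` supports with no extra dependency:

```python
import importlib.util

# Export PNG only when kaleido is available; otherwise fall back to HTML.
if importlib.util.find_spec("kaleido") is not None:
    counts_fig.write_image(str(counts_png))
else:
    counts_html = output_dir / "byte_token_counts.html"
    counts_fig.write_html(str(counts_html))
```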
In `data/template/prepare.py`:

```python
parser.add_argument("--report_byte_tokenization", action="store_true",
                    help="Report byte vs non-byte token counts after tokenization (byte-fallback tokenizers only)")
```
Copilot AI · Jan 26, 2026
PR description mentions a --byte_tokenization_report flag, but the CLI option added is --report_byte_tokenization. Please align the flag name with the documented interface (or update the PR description/docs) to avoid confusion for users and scripts.
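If both names should keep working during a transition, argparse accepts multiple option strings for a single flag; a sketch (the final name choice is of course up to the author):

```python
# Register the documented name as an alias of the implemented flag.
parser.add_argument("--report_byte_tokenization", "--byte_tokenization_report",
                    action="store_true", dest="report_byte_tokenization",
                    help="Report byte vs non-byte token counts after tokenization "
                         "(byte-fallback tokenizers only)")
```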
In `data/template/prepare.py` (train split):

```python
_reset_byte_token_report(tokenizer)
train_ids = tokenizer.tokenize(train_data)
_print_byte_token_report(tokenizer, "train")
if args.output_tokenization_subdir:
    _write_byte_token_report(tokenizer, "train", output_dir)
```
Copilot AI · Jan 26, 2026
Byte-token YAML reports are only written when --output_tokenization_subdir is enabled. With --report_byte_tokenization alone, users get console output but no report file to plot/compare. Consider writing the report next to meta.pkl/outputs by default (e.g., derived from meta_output_path), or add an explicit --byte_token_report_dir option.
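A sketch of the suggested fallback, assuming `meta_output_path` is in scope at this point in `prepare.py` (as the comment implies); the variable names are hypothetical:

```python
from pathlib import Path

# Fall back to the directory holding meta.pkl when no tokenization subdir
# is configured, so --report_byte_tokenization alone still writes a YAML file.
if args.output_tokenization_subdir:
    report_dir = output_dir
else:
    report_dir = Path(meta_output_path).parent
_write_byte_token_report(tokenizer, "train", report_dir)
```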
Byte Tokenization Reports
These changes add a new feature, `--byte_tokenization_report`, to `prepare.py` that shows the count and percentage of byte tokens vs non-byte tokens in the dataset. Current sample scripts are located in the Flores-200 directory.
Testing:
Sample Graphs