Bump tokenizers from 0.19.1 to 0.20.1 by dependabot[bot] · Pull Request #120 · frkri/ModelRunner

dependabot · 2024-10-10T13:55:58Z

Bumps tokenizers from 0.19.1 to 0.20.1.

Release notes

Release v0.20.1

What's Changed

The long-awaited fix for the Llama offset issue is in 🥳

Update README.md by @ArthurZucker in huggingface/tokenizers#1608

fix benchmark file link by @152334H in huggingface/tokenizers#1610

Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows by @dependabot in huggingface/tokenizers#1626

[ignore_merges] Fix offsets by @ArthurZucker in huggingface/tokenizers#1640

Bump body-parser and express in /tokenizers/examples/unstable_wasm/www by @dependabot in huggingface/tokenizers#1629

Bump serve-static and express in /tokenizers/examples/unstable_wasm/www by @dependabot in huggingface/tokenizers#1630

Bump send and express in /tokenizers/examples/unstable_wasm/www by @dependabot in huggingface/tokenizers#1631

Bump webpack from 5.76.0 to 5.95.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in huggingface/tokenizers#1641

Fix documentation build by @ArthurZucker in huggingface/tokenizers#1642

style: simplify string formatting for readability by @hamirmahal in huggingface/tokenizers#1632

New Contributors

@152334H made their first contribution in huggingface/tokenizers#1610

@hamirmahal made their first contribution in huggingface/tokenizers#1632

Full Changelog: huggingface/tokenizers@v0.20.0...v0.20.1

Release v0.20.0: faster encode, better python support

This release focuses on performance and user experience.

Performance:

First off, we did a bit of benchmarking and found some room for improvement. With a few minor changes (mostly #1587), encoding got faster on Llama3 running on a g6 instance on AWS; the benchmark script is at https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py

Python API

We shipped better deserialization errors in general, and support for __str__ and __repr__ on all objects. This makes debugging much easier:
>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))
>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))
The pre_tokenizers.Sequence and normalizers.Sequence objects are also more accessible now; their items can be indexed and modified directly:
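from tokenizers import normalizers
# Chain two normalizers: Strip first, then BertNormalizer.
norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]  # index into the sequence to get its first item (the Strip normalizer)
norm[1].lowercase = False  # set an attribute on the second item (the BertNormalizer)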

... (truncated)

Commits

d98298a 0.20.1
de305f2 update to ubuntu-22.04
1053470 use --interpreter ${{ matrix.interpreter || '3.7 3.8 3.9 3.10 3.11 3.12 pypy3...
f7c33eb add Cargo
eca17be v 0.20.1-rc1
557fde7 style: simplify string formatting for readability (#1632)
3d51a16 Fix documentation build (#1642)
294ab86 Bump webpack in /tokenizers/examples/unstable_wasm/www (#1641)
2b97a5e Bump send and express in /tokenizers/examples/unstable_wasm/www (#1631)
077678d Bump serve-static and express in /tokenizers/examples/unstable_wasm/www (#1630)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot merge will merge this PR after your CI passes on it
@dependabot squash and merge will squash and merge this PR after your CI passes on it
@dependabot cancel merge will cancel a previously requested merge and block automerging
@dependabot reopen will reopen this PR if it is closed
@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
@dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [tokenizers](https://github.com/huggingface/tokenizers) from 0.19.1 to 0.20.1.
- [Release notes](https://github.com/huggingface/tokenizers/releases)
- [Changelog](https://github.com/huggingface/tokenizers/blob/main/RELEASE.md)
- [Commits](huggingface/tokenizers@v0.19.1...v0.20.1)

---
updated-dependencies:
- dependency-name: tokenizers
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
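
The `update-type: version-update:semver-minor` field in the commit metadata above matches the fact that only the second version component changes between 0.19.1 and 0.20.1. A minimal sketch of that classification follows, assuming plain `MAJOR.MINOR.PATCH` strings with no pre-release or build suffix. `parse_version` and `classify_bump` are hypothetical helpers written for illustration, not Dependabot's implementation.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Split a plain 'MAJOR.MINOR.PATCH' string into three integers.

    Assumption: no pre-release or build metadata (e.g. '-rc1') is present.
    """
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def classify_bump(old: str, new: str) -> str:
    """Name the most significant version component that differs (hypothetical helper)."""
    old_parts = parse_version(old)
    new_parts = parse_version(new)
    labels = ("semver-major", "semver-minor", "semver-patch")
    # Compare components from most to least significant and report the first difference.
    for label, old_part, new_part in zip(labels, old_parts, new_parts):
        if old_part != new_part:
            return label
    return "none"


print(classify_bump("0.19.1", "0.20.1"))
```

For a Cargo dependency this distinction matters in practice: Cargo's default caret requirements treat a minor-component change on a 0.x version as incompatible, so a requirement such as `tokenizers = "0.19"` would not resolve to 0.20.1 without a manifest change like the one in this PR.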

dependabot · 2024-11-05T13:35:34Z

Superseded by #127.

dependabot Bot added the dependencies (Pull requests that update a dependency file) and rust (Pull requests that update Rust code) labels Oct 10, 2024

dependabot Bot mentioned this pull request Oct 10, 2024

Bump tokenizers from 0.19.1 to 0.20.0 #107

Closed

dependabot Bot closed this Nov 5, 2024

dependabot Bot deleted the dependabot/cargo/tokenizers-0.20.1 branch November 5, 2024 13:35

dependabot[bot] wants to merge 1 commit into master from dependabot/cargo/tokenizers-0.20.1