Use UTF8Tokenizer

Hi! I'd like to suggest that instead of https://github.com/goombalab/hnet/blob/main/hnet/utils/tokenizers.py
you use https://github.com/sign/utf8-tokenizer (install with pip).

It comes with a fast implementation for torch, and a more standardized special tokens setup.

For example, you are using bytes 254 and 255 for BOS and EOS, but these bytes never appear in valid UTF-8 (iirc their bit patterns would start 7- and 8-byte sequences, which UTF-8 does not define), so they break decoding. Instead we use 0x02 and 0x03, the ASCII control characters STX (start of text) and ETX (end of text).
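
To illustrate the point about bytes 254 and 255, here is a minimal sketch that uses only Python's built-in UTF-8 codec, not the utf8-tokenizer API. The helper `can_decode` and the sample string are made up for this example.

```python
def can_decode(data: bytes) -> bool:
    """Return True if `data` is valid UTF-8 under strict decoding."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Sample payload (hypothetical); includes a multi-byte character.
text = "héllo".encode("utf-8")

# Framing with 254/255: these byte values never occur in valid UTF-8.
framed_254_255 = bytes([254]) + text + bytes([255])

# Framing with 0x02/0x03 (ASCII STX/ETX): ordinary one-byte code points.
framed_stx_etx = bytes([0x02]) + text + bytes([0x03])

print(can_decode(framed_254_255))  # False
print(can_decode(framed_stx_etx))  # True
print(framed_stx_etx.decode("utf-8") == "\x02héllo\x03")  # True
```

With STX/ETX framing, the whole token sequence, special tokens included, round-trips through a plain `bytes.decode("utf-8")`.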

The package also comes with fun features like bit-biasing, which can help training.

Use UTF8Tokenizer #22

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Use UTF8Tokenizer #22

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions