Use UTF8Tokenizer #22

@AmitMY

Hi! I'd like to suggest that instead of https://github.com/goombalab/hnet/blob/main/hnet/utils/tokenizers.py
you use https://github.com/sign/utf8-tokenizer (install with pip).

It comes with a fast implementation for torch, and a more standardized special tokens setup.

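For example, you are using bytes 254 and 255 (0xFE and 0xFF) for BOS and EOS, but these bytes never appear in valid UTF-8 (by the lead-byte pattern they would start 7- and 8-byte sequences, which UTF-8 does not define, iirc), so a sequence containing them can't be decoded as UTF-8. Instead, we use 0x02 and 0x03, the ASCII control characters for start of text (STX) and end of text (ETX).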
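
To make the point about bytes 254/255 versus 0x02/0x03 concrete, here is a minimal sketch that uses only Python's built-in UTF-8 codec. It does not use the utf8-tokenizer package or the hnet tokenizer, and the helper name `decodes_as_utf8` is made up for illustration.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if `data` is valid UTF-8 under strict decoding."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Sample text with a non-ASCII character, encoded to bytes.
text = "héllo".encode("utf-8")

# Wrap the text with bytes 254/255 as BOS/EOS markers.
with_high = bytes([254]) + text + bytes([255])

# Wrap the text with 0x02 (STX) and 0x03 (ETX) instead.
with_ctrl = bytes([0x02]) + text + bytes([0x03])

print(decodes_as_utf8(with_high))  # False: 0xFE and 0xFF are never valid in UTF-8
print(decodes_as_utf8(with_ctrl))  # True: STX and ETX are ordinary ASCII code points
```

With the control characters, the wrapped sequence round-trips through `bytes.decode` as-is, so the special tokens don't need to be stripped before decoding.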

The package also comes with fun features like bit-biasing, which can help training.
