Hi! I'd like to suggest that instead of https://github.com/goombalab/hnet/blob/main/hnet/utils/tokenizers.py
you use https://github.com/sign/utf8-tokenizer (installable with pip).
It comes with a fast implementation for torch, and a more standardized special tokens setup.
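For example, you are using bytes 254 and 255 for BOS and EOS, but these never occur in valid UTF-8 (the bit patterns of 0xFE and 0xFF would imply 7- and 8-byte sequences, which the encoding does not define), so any sequence containing them cannot be decoded as UTF-8. Instead we use 0x02 and 0x03, the ASCII control characters STX (start of text) and ETX (end of text).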
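A quick way to see the difference, using only the Python standard library. This is just a sketch of the decoding behaviour and does not call either tokenizer; the helper name `decodes_as_utf8` is made up for illustration:

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if `data` is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "héllo".encode("utf-8")

# Framing with 254 / 255 (0xFE / 0xFF): the result is no longer valid UTF-8.
framed_high = bytes([254]) + text + bytes([255])

# Framing with 0x02 (STX) / 0x03 (ETX): the result is still valid UTF-8.
framed_ctrl = bytes([0x02]) + text + bytes([0x03])

print(decodes_as_utf8(framed_high))  # False
print(decodes_as_utf8(framed_ctrl))  # True
```

With STX/ETX, the whole token sequence, special tokens included, round-trips through a plain `bytes.decode("utf-8")`.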
The package also comes with fun features like bit-biasing, which can help training.