Description
Hello, I'm trying to use the BlingFire tools to build a tokenization model for CLIP from the existing vocab.json/merges.txt files available here: https://huggingface.co/openai/clip-vit-base-patch32/tree/main
I tried the same approach given for RoBERTa: https://github.com/microsoft/BlingFire/tree/master/ldbsrc/roberta
However, the export_vocab script expects a Ġ prefix in the vocabulary, while CLIP's vocabulary uses </w> as a suffix, not a prefix.
I tried to modify the script to detect a trailing </w> instead of a leading Ġ when appending 0x2581: https://github.com/microsoft/BlingFire/blob/master/ldbsrc/gpt2/export_vocab.py#L91
but this gives slightly different results than the Hugging Face tokenizer when dealing with punctuation:
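For context, the change I made amounts to something like the following sketch (names here are illustrative, not the actual export_vocab.py internals; the exact marker placement is my assumption):

```python
WORD_END = "</w>"    # CLIP's end-of-word marker in vocab.json
MARKER = "\u2581"    # the 0x2581 boundary marker the script emits

def convert_token(tok: str) -> str:
    """Map a CLIP BPE token to the marker convention export_vocab expects.

    CLIP marks the *end* of a word with "</w>", whereas GPT-2/RoBERTa
    vocabularies mark the *start* with "Ġ"; here the suffix is detected
    and replaced with a trailing U+2581 instead of a leading one.
    """
    if tok.endswith(WORD_END):
        return tok[: -len(WORD_END)] + MARKER
    return tok
```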
Input string: "a photo of a really, functistaner big cat."
Hugging Face:
[49406, 320, 1125, 539, 320, 1414, 267, 8679, 555, 2203, 528, 1205, 2368, 269, 49407]
BlingFire:
320 1125 539 320 1414 11 1499 66 555 2203 517 1205 2368 13
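Comparing the two id sequences (with the special tokens 49406/49407 stripped from the Hugging Face output) confirms the first divergence is at the comma after "really":

```python
# Ids copied from the two outputs above.
hf = [320, 1125, 539, 320, 1414, 267, 8679, 555, 2203, 528, 1205, 2368, 269]
bf = [320, 1125, 539, 320, 1414, 11, 1499, 66, 555, 2203, 517, 1205, 2368, 13]

def first_divergence(a, b):
    """Return (index, id_a, id_b) of the first mismatching position."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    return None
```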
Is there some way to make BlingFire support the CLIP version of the tokenizer?
My current scripts and reproduction steps:
https://github.com/dkalinowski/BlingFire/tree/clip/ldbsrc/clip