Support for CLIP tokenizers from Hugging Face #173

@dkalinowski

Description

Hello, I'm trying to use the BlingFire tools to build a tokenization model for CLIP from the existing vocab.json/merges.txt files available here: https://huggingface.co/openai/clip-vit-base-patch32/tree/main

I tried the same approach as given for RoBERTa: https://github.com/microsoft/BlingFire/tree/master/ldbsrc/roberta
However, the export_vocab script expects a Ġ prefix in the vocabulary, while CLIP's vocabulary uses </w> as a suffix, not a prefix.
I modified the script to detect a trailing </w> instead of Ġ when appending 0x2581: https://github.com/microsoft/BlingFire/blob/master/ldbsrc/gpt2/export_vocab.py#L91
but this gives slightly different results than the Hugging Face tokenizer when handling punctuation:

Input string: "a photo of a really, functistaner big cat."

Hugging Face:
[49406, 320, 1125, 539, 320, 1414, 267, 8679, 555, 2203, 528, 1205, 2368, 269, 49407]
BlingFire:
[320, 1125, 539, 320, 1414, 11, 1499, 66, 555, 2203, 517, 1205, 2368, 13]
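For reference, a minimal sketch of the suffix-to-prefix remapping I attempted (function and variable names here are hypothetical, not the actual export_vocab.py code). It moves CLIP's word-final </w> marker to a word-initial U+2581 marker, which is the convention the GPT-2/RoBERTa export path expects for Ġ-prefixed tokens:

```python
# Hypothetical sketch: remap CLIP's end-of-word convention ("token</w>")
# to the prefix convention export_vocab.py expects ("\u2581token", as it
# does for Ġ-prefixed tokens in GPT-2-style vocabularies).
# Note the semantics differ: </w> marks the END of a word, while Ġ marks
# a token PRECEDED by a space, which may explain the punctuation mismatch.
EOW = "</w>"
BOUNDARY = "\u2581"  # ▁ (U+2581), the word-boundary marker used in the ldb build

def remap_clip_token(token: str) -> str:
    """Move CLIP's word-final marker to a word-initial one."""
    if token.endswith(EOW):
        return BOUNDARY + token[: -len(EOW)]
    return token

# Word-final pieces get the marker; mid-word pieces pass through unchanged.
print(remap_clip_token("cat</w>"))  # ▁cat
print(remap_clip_token("ca"))       # ca
```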

Is there some way to make BlingFire support the CLIP version of the tokenizer?

My current scripts and reproduction steps:
https://github.com/dkalinowski/BlingFire/tree/clip/ldbsrc/clip
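To illustrate where the punctuation mismatch seems to come from: in CLIP's suffix-marked vocabulary, a punctuation mark that ends a word should resolve to the ",</w>" entry, not the bare "," entry. A toy sketch (the vocab dict below is hypothetical; only the ids 11 and 267 are taken from the outputs above, and the real mapping lives in vocab.json):

```python
# Toy illustration of why the comma tokenizes differently.
# In a suffix-marked (CLIP-style) vocab, word-final punctuation must be
# looked up with the </w> marker attached; a prefix-based lookup misses it.
toy_vocab = {",": 11, ",</w>": 267, "cat": 1000, "cat</w>": 2368}

def lookup(piece: str, word_final: bool) -> int:
    """Return the id for a piece, honoring the end-of-word marker."""
    key = piece + "</w>" if word_final else piece
    return toy_vocab[key]

print(lookup(",", word_final=True))   # 267, matching the Hugging Face output
print(lookup(",", word_final=False))  # 11, matching the BlingFire output
```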
