Support for CLIP tokenizers from Hugging Face #173

@dkalinowski

Description

Hello, I'm trying to use the BlingFire tools to build a tokenization model for CLIP from the existing vocab.json/merges.txt files available here: https://huggingface.co/openai/clip-vit-base-patch32/tree/main

I tried the same approach as given for RoBERTa: https://github.com/microsoft/BlingFire/tree/master/ldbsrc/roberta
However, the export_vocab script expects a Ġ prefix in the vocabulary, while CLIP's vocabulary uses </w> as a suffix, not a prefix.
I modified the script to detect a trailing </w> instead of Ġ when appending 0x2581: https://github.com/microsoft/BlingFire/blob/master/ldbsrc/gpt2/export_vocab.py#L91
but this gives slightly different results than the Hugging Face tokenizer when handling punctuation:

Input string: "a photo of a really, functistaner big cat."

Hugging Face:
[49406, 320, 1125, 539, 320, 1414, 267, 8679, 555, 2203, 528, 1205, 2368, 269, 49407]
BlingFire:
[320, 1125, 539, 320, 1414, 11, 1499, 66, 555, 2203, 517, 1205, 2368, 13]
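For reference, a minimal sketch of the suffix-to-prefix remapping I attempted (function and variable names here are hypothetical, not the actual export_vocab.py code). It moves CLIP's word-final </w> marker to a word-initial U+2581 marker, which is the convention the GPT-2/RoBERTa export path expects for Ġ-prefixed tokens:

```python
# Hypothetical sketch: remap CLIP's end-of-word convention ("token</w>")
# to the prefix convention export_vocab.py expects ("\u2581token", as it
# does for Ġ-prefixed tokens in GPT-2-style vocabularies).
# Note the semantics differ: </w> marks the END of a word, while Ġ marks
# a token PRECEDED by a space, which may explain the punctuation mismatch.
EOW = "</w>"
BOUNDARY = "\u2581"  # ▁ (U+2581), the word-boundary marker used in the ldb build

def remap_clip_token(token: str) -> str:
    """Move CLIP's word-final marker to a word-initial one."""
    if token.endswith(EOW):
        return BOUNDARY + token[: -len(EOW)]
    return token

# Word-final pieces get the marker; mid-word pieces pass through unchanged.
print(remap_clip_token("cat</w>"))  # ▁cat
print(remap_clip_token("ca"))       # ca
```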

Is there some way to make BlingFire support the CLIP version of the tokenizer?

My current scripts and reproduction steps:
https://github.com/dkalinowski/BlingFire/tree/clip/ldbsrc/clip
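To illustrate where the punctuation mismatch seems to come from: in CLIP's suffix-marked vocabulary, a punctuation mark that ends a word should resolve to the ",</w>" entry, not the bare "," entry. A toy sketch (the vocab dict below is hypothetical; only the ids 11 and 267 are taken from the outputs above, and the real mapping lives in vocab.json):

```python
# Toy illustration of why the comma tokenizes differently.
# In a suffix-marked (CLIP-style) vocab, word-final punctuation must be
# looked up with the </w> marker attached; a prefix-based lookup misses it.
toy_vocab = {",": 11, ",</w>": 267, "cat": 1000, "cat</w>": 2368}

def lookup(piece: str, word_final: bool) -> int:
    """Return the id for a piece, honoring the end-of-word marker."""
    key = piece + "</w>" if word_final else piece
    return toy_vocab[key]

print(lookup(",", word_final=True))   # 267, matching the Hugging Face output
print(lookup(",", word_final=False))  # 11, matching the BlingFire output
```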
