pyo3_runtime.PanicException: AddedVocabulary bad split #1

@kalvinchang

The following code raises pyo3_runtime.PanicException ("AddedVocabulary bad split"):

from transformers import pipeline
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws-xiandai")

def word_segment(sentence):
    segmented = classifier(sentence)
    sentence = []
    for word in segmented:
        sentence.append(word['word'])
    return sentence

print(word_segment("我想去吃飯"))

This happens with both transformers 4.22.1 and 4.30.0:

thread '<unnamed>' panicked at 'AddedVocabulary bad split', tokenizers-lib/src/tokenizer/added_vocabulary.rs:360:22
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/kalvin/Research/speech/asru-23/utils/phrase_translate.py", line 104, in <module>
  File "/Users/kalvin/Research/speech/asru-23/utils/phrase_translate.py", line 98, in word_segment
    for word in segmented:
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/token_classification.py", line 192, in __call__
    return super().__call__(inputs, **kwargs)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1074, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1080, in run_single
    model_inputs = self.preprocess(inputs, **preprocess_params)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/token_classification.py", line 196, in preprocess
    model_inputs = self.tokenizer(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2484, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2590, in _call_one
    return self.encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2663, in encode_plus
    return self._encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 500, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 427, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
pyo3_runtime.PanicException: AddedVocabulary bad split
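
One thing that may help while narrowing this down: PyO3's PanicException derives from BaseException, not Exception, so a plain `except Exception` around the pipeline call will not catch it. Below is a minimal sketch of a wrapper that records which input triggered the panic. The helper names `is_rust_panic` and `call_reporting_panics` are hypothetical, not part of transformers or tokenizers, and the panic is matched by class name only as an assumption to avoid importing `pyo3_runtime` directly.

```python
def is_rust_panic(exc: BaseException) -> bool:
    """Return True if exc looks like a PyO3 panic.

    Assumption: matching on the class name is good enough for debugging;
    this avoids importing pyo3_runtime directly.
    """
    return type(exc).__name__ == "PanicException"


def call_reporting_panics(fn, *args, **kwargs):
    """Call fn and convert a Rust panic into a RuntimeError naming the input.

    Hypothetical debugging helper, not part of any library.
    """
    try:
        return fn(*args, **kwargs)
    except BaseException as exc:  # PanicException is not an Exception subclass
        if is_rust_panic(exc):
            raise RuntimeError(
                f"Rust panic for args={args!r}, kwargs={kwargs!r}: {exc}"
            ) from exc
        # Anything else (KeyboardInterrupt, ordinary errors) propagates unchanged.
        raise
```

With the snippet from this report, it could be used as `call_reporting_panics(classifier, "我想去吃飯")`, which should make it easier to see whether only specific sentences trigger the bad split.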
