pyo3_runtime.PanicException: AddedVocabulary bad split

The following code triggers `pyo3_runtime.PanicException: AddedVocabulary bad split`:

```python
from transformers import pipeline

# Load the model as a token-classification pipeline
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws-xiandai")

def word_segment(sentence):
    # Run the pipeline and collect the 'word' field of every returned token
    segmented = classifier(sentence)
    return [token["word"] for token in segmented]

print(word_segment("我想去吃飯"))  # raises the PanicException shown below
```

The error occurs with both transformers 4.22.1 and 4.30.0.

Full output:

thread '<unnamed>' panicked at 'AddedVocabulary bad split', tokenizers-lib/src/tokenizer/added_vocabulary.rs:360:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/kalvin/Research/speech/asru-23/utils/phrase_translate.py", line 104, in <module>
  File "/Users/kalvin/Research/speech/asru-23/utils/phrase_translate.py", line 98, in word_segment
    for word in segmented:
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/token_classification.py", line 192, in __call__
    return super().__call__(inputs, **kwargs)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1074, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1080, in run_single
    model_inputs = self.preprocess(inputs, **preprocess_params)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/token_classification.py", line 196, in preprocess
    model_inputs = self.tokenizer(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2484, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2590, in _call_one
    return self.encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2663, in encode_plus
    return self._encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 500, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 427, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
pyo3_runtime.PanicException: AddedVocabulary bad split
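
The traceback ends in `self._tokenizer.encode_batch`, which suggests the panic is raised by the fast (Rust) tokenizer before the model runs. A minimal sketch for narrowing this down is below. The helper names `is_rust_panic`, `try_encode`, and `check_tokenizers` are hypothetical, matching the exception by class name is an assumption made so that no private module needs to be imported, and it is also assumed that a slow (pure Python) tokenizer can be loaded for this checkpoint. Note that pyo3's `PanicException` derives from `BaseException`, so a plain `except Exception` would not catch it.

```python
def is_rust_panic(exc):
    """Return True if exc looks like a pyo3 PanicException (matched by class name)."""
    return type(exc).__name__ == "PanicException"


def try_encode(encode, text):
    """Call encode(text); return (result, None), or (None, message) on a Rust panic."""
    try:
        return encode(text), None
    except BaseException as exc:  # PanicException is not a subclass of Exception
        if is_rust_panic(exc):
            return None, str(exc)
        raise  # anything else (KeyboardInterrupt, real bugs) propagates unchanged


def check_tokenizers(model_name, text):
    """Encode the same text with the fast and the slow tokenizer and report both."""
    # Imported lazily so the helpers above can be used without transformers installed
    from transformers import AutoTokenizer

    results = {}
    for label, use_fast in (("fast", True), ("slow", False)):
        tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=use_fast)
        results[label] = try_encode(tokenizer, text)
    return results
```

Calling `check_tokenizers("ckiplab/bert-base-han-chinese-ws-xiandai", "我想去吃飯")` would then show whether only the fast tokenizer fails. If it does, the problem is likely in the tokenizer files or the `tokenizers` library rather than in the pipeline code.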


