- Python: 3.10
- Setup and run: `pipenv install`, then `pipenv run python run.py`
- Data [1]:
- wikipedia: 15 documents sampled from Japanese Wikipedia.
- cc100: 15 documents sampled from CC-100 (web text).
- emoji
  - Example input: `"もちろん大丈夫です👍よろしくお願いします。"`
  - Expected output: `["もちろん大丈夫です👍", "よろしくお願いします。"]`
- kaomoji
  - Example input: `"いいですよ^^よろしくお願いします。"`
  - Expected output: `["いいですよ^^", "よろしくお願いします。"]`
- named_entity
  - Example input: `"モーニング娘。は日本のアイドルグループです。"`
  - Expected output: `["モーニング娘。は日本のアイドルグループです。"]`
- new_line
  - Example input: `"時間は現在調整中ですので決まり次第\nご連絡差し上げます。"`
  - Expected output: `["時間は現在調整中ですので決まり次第\nご連絡差し上げます。"]`
- Evaluation metric: F1 (micro average)
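
As a rough illustration of the metric, the sketch below computes a micro-averaged F1 by pooling counts over all documents before computing precision and recall. It assumes that a predicted sentence counts as correct only when its character span exactly matches a gold sentence span. That matching rule, and the names `to_spans` and `micro_f1`, are assumptions made for this example; `run.py` may define the score differently.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int]


def to_spans(sentences: List[str]) -> Set[Span]:
    """Convert a list of sentences into (start, end) character spans.

    Assumes the sentences concatenate back to the original text,
    i.e. the splitter did not strip or alter any characters.
    """
    spans, start = set(), 0
    for sentence in sentences:
        end = start + len(sentence)
        spans.add((start, end))
        start = end
    return spans


def micro_f1(docs: Iterable[Tuple[List[str], List[str]]]) -> float:
    """Micro-averaged F1 over (gold, predicted) sentence lists.

    Counts are summed over all documents first, so long documents
    weigh more than short ones (unlike a macro average).
    """
    tp = n_gold = n_pred = 0
    for gold, pred in docs:
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        tp += len(gold_spans & pred_spans)  # exact span matches
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# The emoji example, with a splitter that fails to split after the emoji.
gold = ["もちろん大丈夫です👍", "よろしくお願いします。"]
pred = ["もちろん大丈夫です👍よろしくお願いします。"]
print(micro_f1([(gold, pred)]))  # 0.0
```

The function returns a value between 0 and 1; multiply by 100 to compare it with the percentage scale used in the results table.
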
| Tool | Method | wikipedia | cc100 | emoji | kaomoji | named_entity | new_line |
|---|---|---|---|---|---|---|---|
| pysbd | Rule-based | 100.0 | 85.5 | 0.0 | 0.0 | 0.0 | 44.4 |
| rhoknp | Rule-based | 100.0 | 88.4 | 0.0 | 0.0 | 0.0 | 44.4 |
| kuzukiri | Rule-based | 100.0 | 85.3 | 0.0 | 0.0 | 72.7 | 44.4 |
| hasami | Rule-based | 94.8 | 86.2 | 0.0 | 0.0 | 72.7 | 44.4 |
| sengiri | Rule-based | 55.7 | 68.1 | 12.9 | 0.0 | 56.0 | 44.4 |
| bunkai | Rule-based + Model-based | 93.7 | 83.7 | 100.0 | 66.7 | 0.0 | 100.0 |
| ginza (ja_ginza_electra) | Model-based | 95.7 | 85.7 | 66.7 | 84.2 | 75.0 | 70.0 |
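
To illustrate why the emoji, kaomoji, and named_entity cases are hard for purely punctuation-based rules, here is a deliberately naive splitter that cuts after every run of `。`, `！`, or `？`. It is a hypothetical baseline written for this example (the name `naive_split` is made up) and is not the implementation of any tool in the table.

```python
import re
from typing import List

# A sentence is any run of non-terminator characters followed by one or
# more terminators, or a trailing run with no terminator at all.
_SENTENCE = re.compile(r"[^。！？]*[。！？]+|[^。！？]+$")


def naive_split(text: str) -> List[str]:
    """Split text after sentence-final punctuation only (naive rule)."""
    return _SENTENCE.findall(text)


# emoji: no punctuation after the emoji, so the boundary is missed.
print(naive_split("もちろん大丈夫です👍よろしくお願いします。"))

# named_entity: the "。" inside the group name causes a spurious split.
print(naive_split("モーニング娘。は日本のアイドルグループです。"))
```

The first call returns the input as a single sentence, and the second returns two pieces, so both disagree with the expected outputs listed for those categories.
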
Footnotes

1. Annotation was done by the repository owner.