Tags are joined with a comma and padded with asterisks #3491
Conversation
|
Thanks for tagging me on this. I think there is an issue with your Mecab configuration and this PR should not be accepted, but it raises a point that can be fixed in the spaCy use of Mecab. The format with the dashes is the This format was used in the draft Unidic/UD mapping I received in 2017 and mentioned in the paper Universal Dependencies Version 2 for Japanese (there's not a correspondence table, but look for "名詞" in the text). Can you post the output of Because Unidic ships a default format and it's not obvious how to change it (see taku910/mecab#38), I didn't think to set the format explicitly in spaCy when loading Mecab, but that would be a good idea. I'll see about adding it; it should just involve passing a format string to the Tagger when it's created. Looking at Unidic v2.3.0, the latest version, there's a |
|
Thank you for your prompt review. The output of I haven't changed I looked at my Ubuntu/18.04 environment and confirmed that it also had exactly the same output from By the way, I had to manually downgrade unidic-mecab on Ubuntu as the latest |
|
It looks to me that spaCy/spacy/lang/ja/__init__.py Lines 52 to 68 in 9e14b2b |
|
You're absolutely right. I confirmed that the tests are currently failing but run correctly with your patch. I guess there must have been a formatting change at some point and this test wasn't updated... Thanks for catching it! |
|
@HiromuHota @polm Thanks for your work on Japanese and the analysis! So just to confirm: this PR should be merged then, right? Btw, in case you haven't seen it, I ended up making a small modification to the way the Mecab tags are stored on the
get_mecab_tag = lambda token: token.doc.user_data["mecab_tags"][token.i]
Token.set_extension('mecab_tag', getter=get_mecab_tag) |
|
I think this PR is good to merge 👍 Also noted on the tags change, that looks good too, thanks for the heads up! |
Description
Fix a bug in the test of JapaneseTokenizer.
This PR may require @polm's review.
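For reference, here is a rough sketch of the two tag formats discussed in this thread: the comma-joined form padded with asterisks (as in the PR title) and the dash-joined form. The helper names `comma_tag` and `dash_tag` are hypothetical and not part of spaCy or Mecab, the feature string is a made-up example, and the assumption that the part-of-speech fields are the first four comma-separated fields may not hold for every dictionary or output format.

```python
def comma_tag(feature, n_fields=4):
    """Keep the first n_fields fields of a comma-separated feature string.

    Assumption: the part-of-speech fields come first, and unused fields
    are filled with "*".
    """
    parts = feature.split(",")
    return ",".join(parts[:n_fields])


def dash_tag(tag):
    """Drop the "*" padding and join the remaining fields with dashes."""
    return "-".join(part for part in tag.split(",") if part != "*")


# Made-up feature string for illustration only, not real Mecab output.
feature = "助詞,格助詞,*,*,*,*"

tag = comma_tag(feature)
print(tag)            # comma-joined, padded with asterisks
print(dash_tag(tag))  # dash-joined, padding removed
```

A test that compares tags as plain strings has to expect whichever of the two forms the tokenizer actually produces, so a mismatch between them would make such a test fail even when the tagging itself is correct.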
Types of change
Bug fix
Checklist