Skip to content

Tags are joined with a comma and padded with asterisks#3491

Merged
ines merged 1 commit into
explosion:masterfrom
HiromuHota:fix/JapaneseTest
Mar 28, 2019
Merged

Tags are joined with a comma and padded with asterisks#3491
ines merged 1 commit into
explosion:masterfrom
HiromuHota:fix/JapaneseTest

Conversation

@HiromuHota

@HiromuHota HiromuHota commented Mar 26, 2019

Copy link
Copy Markdown
Contributor

Description

Fix a bug in the test of JapaneseTokenizer.
This PR may require @polm's review.

Types of change

Bug fix

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@ines ines added bug Bugs and behaviour differing from documentation lang / ja Japanese language data and models tests New, missing or incorrect tests labels Mar 26, 2019
@polm

polm commented Mar 27, 2019

Copy link
Copy Markdown
Contributor

Thanks for tagging me on this. I think there is an issue with your Mecab configuration and this PR should not be accepted, but it raises a point that can be fixed in the spaCy use of Mecab.

The format with the dashes is the unidic output format for Mecab and is specified with the dicrc file distributed with Unidic (for v2.1.2 anyway...). In the dicrc on my system it looks like this:

node-format-unidic = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n

This format was used in the draft Unidic/UD mapping I received in 2017 and mentioned in the paper Universal Dependencies Version 2 for Japanese (there's not a correspondence table, but look for "名詞" in the text).

Can you post the output of mecab -P on your system? Your output looks like the raw lex.csv data from Unidic, which corresponds to %H in a Mecab output format string. Maybe your operating system package modified Unidic's dicrc?

Because Unidic ships a default format and it's not obvious how to change it (see taku910/mecab#38) I didn't think to set the format explicitly in spaCy when loading Mecab but that would be a good idea. I'll see about adding it, it should just involve passing a format string to the Tagger when it's created.

Looking at Unidic v2.3.0, the latest version, there's a unidic22 format which is closer to the CSV format but not the same, but this format doesn't seem to be referenced in any official UD docs I've found, so I'll need to look into that separately.

@HiromuHota

Copy link
Copy Markdown
Contributor Author

Thank you for your prompt review.

The output of mecab -P is as follows:

$ mecab -P
bos-feature: BOS/EOS,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
bos-format: 
bos-format-chamame: B
bos-format-unidic: 
cost-factor: 700
dicdir: /usr/local/lib/mecab/dic/unidic
dump-config: 1
eon-format: 
eos-format: EOS\n
eos-format-chamame: 
eos-format-unidic: EOS\n
lattice-level: 0
max-grouping-size: 24
nbest: 1
node-format: %m\t%H\n
node-format-chamame: \t%m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
node-format-unidic: %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
output-format-type: unidic
theta: 0.75
unk-format: %m\t%H\n
unk-format-chamame: \t%m\t\t\t%m\t%F-[0,1,2,3]\t\t\n
unk-format-unidic: %m\t%m\t%m\t%m\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n

I haven't changed dicrc at all. I've installed mecab-unidic through Brew.

$ brew info mecab-unidic
mecab-unidic: stable 2.1.2 (bottled)
Morphological analyzer for MeCab
https://osdn.net/projects/unidic/
/usr/local/Cellar/mecab-unidic/2.1.2 (12 files, 196.2MB) *
  Poured from bottle on 2019-03-25 at 14:19:26
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/mecab-unidic.rb
==> Dependencies
Required: mecab ✔
==> Caveats
To enable mecab-unidic dictionary, add to /usr/local/etc/mecabrc:
  dicdir = /usr/local/lib/mecab/dic/unidic
==> Analytics
install: 13 (30 days), 39 (90 days), 193 (365 days)
install_on_request: 13 (30 days), 39 (90 days), 192 (365 days)
build_error: 0 (30 days)

I looked at my Ubuntu/18.04 environment and confirmed that it also had exact the same output from mecab -P (except dicdir: /usr/local/lib/mecab/dic/unidic on macOS and /var/lib/mecab/dic/debian, which is symbolic-linked to /var/lib/mecab/dic/unidic, on Ubuntu).

By the way, I had to manually downgrade unidic-mecab on Ubuntu as the latest 2.2.0-1 for 18.04 does not include binary dictionary files (see here for details).

$ sudo apt-cache policy unidic-mecab
unidic-mecab:
  Installed: 2.1.2~dfsg-7
  Candidate: 2.2.0-1
  Version table:
     2.2.0-1 500
        500 http://us.archive.ubuntu.com/ubuntu bionic/universe amd64 Packages
        500 http://us.archive.ubuntu.com/ubuntu bionic/universe i386 Packages
 *** 2.1.2~dfsg-7 100
        100 /var/lib/dpkg/status

@HiromuHota

Copy link
Copy Markdown
Contributor Author

It looks to me that spacy.lang.ja.detailed_tokens joins tags, returned from MeCab, with a comma (i.e., pos = ",".join(parts[0:4])).

def detailed_tokens(tokenizer, text):
"""Format Mecab output into a nice data structure, based on Janome."""
node = tokenizer.parseToNode(text)
node = node.next # first node is beginning of sentence and empty, skip it
words = []
while node.posid != 0:
surface = node.surface
base = surface # a default value. Updated if available later.
parts = node.feature.split(",")
pos = ",".join(parts[0:4])
if len(parts) > 7:
# this information is only available for words in the tokenizer
# dictionary
base = parts[7]
words.append(ShortUnitWord(surface, base, pos))
node = node.next
return words

@polm

polm commented Mar 28, 2019

Copy link
Copy Markdown
Contributor

You're absolutely right. parseToNode is used to get data from Mecab in order to avoid issues with formatting incosistency, but I forgot the POS tags are passed in as fields rather than as a single unit as well, so the node-format isn't relevant.

I confirmed that the tests are currently failing but run correctly with your patch. I guess there must have been a formatting change at some point and this test wasn't updated... Thanks for catching it!

@ines

ines commented Mar 28, 2019

Copy link
Copy Markdown
Member

@HiromuHota @polm Thanks for your work on Japanese and the analysis! So just to confirm: this PR should be merged then, right?

Btw, in case you haven't seen it, I ended up making a small modification to the way the Mecab tags are stored on the Doc: 2912ddc. This means we don't have to set an extension attribute in the global scope (which can cause problems with reloading). But if a user wants the token._.mecab attribute, they can add it themselves with a getter:

get_mecab_tag = lambda token: token.doc.user_data["mecab_tags"][token.i]
Token.set_extension('mecab_tag', getter=get_mecab_tag)

@polm

polm commented Mar 28, 2019

Copy link
Copy Markdown
Contributor

I think this PR is good to merge 👍

Also noted on the tags change, that looks good too, thanks for the heads up!

@ines ines merged commit 914b9ff into explosion:master Mar 28, 2019
@HiromuHota HiromuHota deleted the fix/JapaneseTest branch March 28, 2019 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Bugs and behaviour differing from documentation lang / ja Japanese language data and models tests New, missing or incorrect tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants