Tags are joined with a comma and padded with asterisks by HiromuHota · Pull Request #3491 · explosion/spaCy

HiromuHota · 2019-03-26T17:47:49Z

Description

Fix a bug in the test of JapaneseTokenizer.
This PR may require @polm's review.

Types of change

Bug fix

Checklist

I have submitted the spaCy Contributor Agreement.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

polm · 2019-03-27T14:29:51Z

Thanks for tagging me on this. I think there is an issue with your Mecab configuration and this PR should not be accepted, but it raises a point that can be fixed in the spaCy use of Mecab.

The format with the dashes is the unidic output format for Mecab and is specified with the dicrc file distributed with Unidic (for v2.1.2 anyway...). In the dicrc on my system it looks like this:

node-format-unidic = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n

This format was used in the draft Unidic/UD mapping I received in 2017 and mentioned in the paper Universal Dependencies Version 2 for Japanese (there's not a correspondence table, but look for "名詞" in the text).

Can you post the output of mecab -P on your system? Your output looks like the raw lex.csv data from Unidic, which corresponds to %H in a Mecab output format string. Maybe your operating system package modified Unidic's dicrc?

Because Unidic ships a default format and it's not obvious how to change it (see taku910/mecab#38) I didn't think to set the format explicitly in spaCy when loading Mecab but that would be a good idea. I'll see about adding it, it should just involve passing a format string to the Tagger when it's created.

Looking at Unidic v2.3.0, the latest version, there's a unidic22 format which is closer to the CSV format but not the same, but this format doesn't seem to be referenced in any official UD docs I've found, so I'll need to look into that separately.

HiromuHota · 2019-03-27T16:33:30Z

Thank you for your prompt review.

The output of mecab -P is as follows:

$ mecab -P
bos-feature: BOS/EOS,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
bos-format: 
bos-format-chamame: B
bos-format-unidic: 
cost-factor: 700
dicdir: /usr/local/lib/mecab/dic/unidic
dump-config: 1
eon-format: 
eos-format: EOS\n
eos-format-chamame: 
eos-format-unidic: EOS\n
lattice-level: 0
max-grouping-size: 24
nbest: 1
node-format: %m\t%H\n
node-format-chamame: \t%m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
node-format-unidic: %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
output-format-type: unidic
theta: 0.75
unk-format: %m\t%H\n
unk-format-chamame: \t%m\t\t\t%m\t%F-[0,1,2,3]\t\t\n
unk-format-unidic: %m\t%m\t%m\t%m\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n

I haven't changed dicrc at all. I've installed mecab-unidic through Brew.

$ brew info mecab-unidic
mecab-unidic: stable 2.1.2 (bottled)
Morphological analyzer for MeCab
https://osdn.net/projects/unidic/
/usr/local/Cellar/mecab-unidic/2.1.2 (12 files, 196.2MB) *
  Poured from bottle on 2019-03-25 at 14:19:26
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/mecab-unidic.rb
==> Dependencies
Required: mecab ✔
==> Caveats
To enable mecab-unidic dictionary, add to /usr/local/etc/mecabrc:
  dicdir = /usr/local/lib/mecab/dic/unidic
==> Analytics
install: 13 (30 days), 39 (90 days), 193 (365 days)
install_on_request: 13 (30 days), 39 (90 days), 192 (365 days)
build_error: 0 (30 days)

I looked at my Ubuntu/18.04 environment and confirmed that it also had exact the same output from mecab -P (except dicdir: /usr/local/lib/mecab/dic/unidic on macOS and /var/lib/mecab/dic/debian, which is symbolic-linked to /var/lib/mecab/dic/unidic, on Ubuntu).

By the way, I had to manually downgrade unidic-mecab on Ubuntu as the latest 2.2.0-1 for 18.04 does not include binary dictionary files (see here for details).

$ sudo apt-cache policy unidic-mecab
unidic-mecab:
  Installed: 2.1.2~dfsg-7
  Candidate: 2.2.0-1
  Version table:
     2.2.0-1 500
        500 http://us.archive.ubuntu.com/ubuntu bionic/universe amd64 Packages
        500 http://us.archive.ubuntu.com/ubuntu bionic/universe i386 Packages
 *** 2.1.2~dfsg-7 100
        100 /var/lib/dpkg/status

HiromuHota · 2019-03-27T18:02:37Z

It looks to me that spacy.lang.ja.detailed_tokens joins tags, returned from MeCab, with a comma (i.e., pos = ",".join(parts[0:4])).

spaCy/spacy/lang/ja/__init__.py

Lines 52 to 68 in 9e14b2b

    
           def detailed_tokens(tokenizer, text): 
        
               """Format Mecab output into a nice data structure, based on Janome.""" 
        
               node = tokenizer.parseToNode(text) 
        
               node = node.next  # first node is beginning of sentence and empty, skip it 
        
               words = [] 
        
               while node.posid != 0: 
        
                   surface = node.surface 
        
                   base = surface  # a default value. Updated if available later. 
        
                   parts = node.feature.split(",") 
        
                   pos = ",".join(parts[0:4]) 
        
                   if len(parts) > 7: 
        
                       # this information is only available for words in the tokenizer 
        
                       # dictionary 
        
                       base = parts[7] 
        
                   words.append(ShortUnitWord(surface, base, pos)) 
        
                   node = node.next 
        
               return words

polm · 2019-03-28T08:28:25Z

You're absolutely right. parseToNode is used to get data from Mecab in order to avoid issues with formatting incosistency, but I forgot the POS tags are passed in as fields rather than as a single unit as well, so the node-format isn't relevant.

I confirmed that the tests are currently failing but run correctly with your patch. I guess there must have been a formatting change at some point and this test wasn't updated... Thanks for catching it!

ines · 2019-03-28T11:54:46Z

@HiromuHota @polm Thanks for your work on Japanese and the analysis! So just to confirm: this PR should be merged then, right?

Btw, in case you haven't seen it, I ended up making a small modification to the way the Mecab tags are stored on the Doc: 2912ddc. This means we don't have to set an extension attribute in the global scope (which can cause problems with reloading). But if a user wants the token._.mecab attribute, they can add it themselves with a getter:

get_mecab_tag = lambda token: token.doc.user_data["mecab_tags"][token.i]
Token.set_extension('mecab_tag', getter=get_mecab_tag)

polm · 2019-03-28T15:15:10Z

I think this PR is good to merge 👍

Also noted on the tags change, that looks good too, thanks for the heads up!

Tags are joined with a comma and padded with asterisks

d1af0e4

ines added bug Bugs and behaviour differing from documentation lang / ja Japanese language data and models tests New, missing or incorrect tests labels Mar 26, 2019

polm mentioned this pull request Mar 27, 2019

Is it possible to unset output-format-type? SamuraiT/mecab-python3#25

Closed

ines merged commit 914b9ff into explosion:master Mar 28, 2019

HiromuHota deleted the fix/JapaneseTest branch March 28, 2019 15:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Tags are joined with a comma and padded with asterisks#3491

Tags are joined with a comma and padded with asterisks#3491
ines merged 1 commit into
explosion:masterfrom
HiromuHota:fix/JapaneseTest

HiromuHota commented Mar 26, 2019 •

edited

Loading

Uh oh!

polm commented Mar 27, 2019 •

edited

Loading

Uh oh!

HiromuHota commented Mar 27, 2019

Uh oh!

HiromuHota commented Mar 27, 2019

Uh oh!

polm commented Mar 28, 2019

Uh oh!

ines commented Mar 28, 2019

Uh oh!

polm commented Mar 28, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Uh oh!

Conversation

HiromuHota commented Mar 26, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Types of change

Checklist

Uh oh!

polm commented Mar 27, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HiromuHota commented Mar 27, 2019

Uh oh!

HiromuHota commented Mar 27, 2019

Uh oh!

polm commented Mar 28, 2019

Uh oh!

ines commented Mar 28, 2019

Uh oh!

polm commented Mar 28, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

HiromuHota commented Mar 26, 2019 •

edited

Loading

polm commented Mar 27, 2019 •

edited

Loading