Different output formats possible via constructor arguments? #99

Description

@jzohrab

Hello, thank you very much for your work on this project. I'm using MeCab for a language-learning program, and would like to use this library if possible.

The mecab binary accepts arguments that change its output format. For example:

$ mecab -F %m\\t%t\\t%h\\n -U %m\\t%t\\t%h\\n -E EOP\\t3\\t7\\n
太郎はこの本を女性に渡した。
太郎	2	44
は	6	16
この	6	68
本	2	38
を	6	13
女性	2	38
に	6	13
渡し	2	31
た	6	25
。	3	7
EOP	3	7

Is there a way to get the same output with this Python library? I tried a few obvious approaches, e.g.

import MeCab
t = MeCab.Tagger('-F %m\\t%t\\t%h\\n -U %m\\t%t\\t%h\\n -E EOP\\t3\\t7\\n -r ./mecabrc_dummy.txt -d ./.venv/lib/python3.11/site-packages/unidic_lite/dicdir')   # also tried single \ instead of \\
sentence = "太郎はこの本を女性に渡した。"
print(t.parse(sentence))

but this still outputs the same as the default Tagger output:

$ python main.py 
太郎	タロー	タロウ	タロウ	名詞-固有名詞-人名-名			1
は	ワ	ハ	は	助詞-係助詞			
この	コノ	コノ	此の	連体詞			0
...
渡し	ワタシ	ワタス	渡す	動詞-一般	五段-サ行	連用形-一般	0
た	タ	タ	た	助動詞	助動詞-タ	終止形-一般	
。			。	補助記号-句点			
EOS
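
A note on the escaping question above (single \ versus \\): in a regular Python string literal, '\\t' is a backslash followed by the letter t, while '\t' is a real tab character. The sketch below builds the argument string from raw strings so that MeCab receives the literal backslash escapes. The helper name build_tagger_args is hypothetical, and the whitespace check assumes the argument string is split on whitespace the way a command line would be. This only rules out an escaping mistake; it does not by itself explain why the format options are ignored.

```python
def build_tagger_args(node_fmt, unk_fmt, eos_fmt, extra_args=""):
    """Join MeCab format options into one argument string (hypothetical helper)."""
    for fmt in (node_fmt, unk_fmt, eos_fmt):
        # A real tab, newline, or space here means Python already interpreted
        # the escape; assumption: that would split the option when parsed.
        if any(ch in fmt for ch in "\t\n "):
            raise ValueError(f"format contains real whitespace: {fmt!r}")
    parts = ["-F", node_fmt, "-U", unk_fmt, "-E", eos_fmt]
    if extra_args:
        parts.append(extra_args)
    return " ".join(parts)


# Raw strings keep the backslashes, so r"\t" is identical to "\\t".
args = build_tagger_args(r"%m\t%t\t%h\n", r"%m\t%t\t%h\n", r"EOP\t3\t7\n")
print(args)
```

The resulting string can then be passed to MeCab.Tagger together with the -r and -d options used in the attempt above.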

I edited unidic_lite/dicdir/dicrc:

output-format-type = custom

; output custom - new three-column output
node-format-custom = %m\t%t\t%h\n
unk-format-custom  = %m\t%t\t%h\n
bos-format-custom  =
eos-format-custom  = EOP\t3\t7\n

With that, the output was more or less what I expected (the third column is different, but that doesn't matter):

$ python main.py 
太郎	2	1
は	6	1
この	6	1
本	2	1
...
た	6	1
。	3	1
EOP	3	7
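
A possible alternative to editing dicrc, sketched under assumptions: build the three columns in Python by walking the node list that Tagger.parseToNode returns. The sketch assumes the node attributes surface, char_type, posid, stat, and next correspond to %m, %t, and %h and to MeCab's node status codes (2 for BOS, 3 for EOS). The function name format_nodes and the fake nodes are illustrative only; the fake nodes stand in for real MeCab nodes so the example runs without a dictionary.

```python
from types import SimpleNamespace

# Node status codes as defined in MeCab's C header (assumed unchanged here).
MECAB_BOS_NODE = 2
MECAB_EOS_NODE = 3


def format_nodes(node, eos_line="EOP\t3\t7"):
    """Render a linked list of nodes as surface, char type, POS id columns."""
    lines = []
    while node is not None:
        if node.stat == MECAB_EOS_NODE:
            # Mirrors the -E option: emit a fixed end-of-sentence line.
            lines.append(eos_line)
        elif node.stat != MECAB_BOS_NODE:
            # Mirrors -F / -U: %m, %t, %h separated by tabs.
            lines.append(f"{node.surface}\t{node.char_type}\t{node.posid}")
        node = node.next
    return "\n".join(lines) + "\n"


# Fake nodes with values taken from the mecab binary output shown earlier.
eos = SimpleNamespace(surface="", char_type=0, posid=0, stat=3, next=None)
n2 = SimpleNamespace(surface="は", char_type=6, posid=16, stat=0, next=eos)
n1 = SimpleNamespace(surface="太郎", char_type=2, posid=44, stat=0, next=n2)
bos = SimpleNamespace(surface="", char_type=0, posid=0, stat=2, next=n1)

demo_output = format_nodes(bos)
print(demo_output, end="")
```

With the real library, the call would look like format_nodes(t.parseToNode(sentence)), where t is the Tagger built with only the -r and -d options. The POS id values depend on the dictionary, so they may differ from the mecab binary's output.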

I also tried unidic instead of unidic_lite:

t = MeCab.Tagger('-r ./mecabrc_dummy.txt -d ./.venv/lib/python3.11/site-packages/unidic/dicdir -F %m\\t%t\\t%h\\n -U %m\\t%t\\t%h\\n -E EOP\\t3\\t7\\n')

and got the default unidic output:

太郎	名詞,固有名詞,人名,名,,,タロウ,タロウ,太郎,タロー,太郎,タロー,固,"","","","","","",名,タロウ,タロウ,タロウ,タロウ,"1","","",6252931250790912,22748
は	助詞,係助詞,,,,,ハ,は,は,ワ,は,ワ,和,"","","","","","",係助,ハ,ハ,ハ,ハ,"","動詞%F2@0,名詞%F1,形容詞%F2@-1","",8059703733133824,29321
この	連体詞,,,,,,コノ,此の,この,コノ,この,コノ,和,"","","","","","",相,コノ,コノ,コノ,コノ,"0","","",3547308012741120,12905
...
。	補助記号,句点,,,,,,。,。,,。,,記号,"","","","","","",補助,,,,,"","","",6880571302400,25
EOS

Thank you again!