Conversation
The link problem comes from a MediaWiki feature called a blend link, but Japanese text somehow doesn't behave this way. It's also called "word-ending links" here: https://www.mediawiki.org/wiki/Help:Links#Internal_links. The document says these links 'Follow so-called "linktrail rules" localised per each language.' I have no idea what linktrail rules are.
I think this is the linktrail rule for English: https://github.com/wikimedia/mediawiki/blob/0f67e1045f35a8c854f3c6a3e2712c2d8b0b54d6/languages/messages/MessagesEn.php#L571-L575 The MessagesJa.php file doesn't have this rule.
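To make the behavior concrete, here is a rough sketch of what English linktrail blending does, assuming the [a-z]+ rule from MessagesEn.php linked above. The function name blend is made up for this demo; it is not a MediaWiki or wikitextprocessor API.

```python
# Rough sketch of English linktrail blending: letters immediately after a
# closing ]] are absorbed into the link text, per the assumed [a-z]+ rule.
import re

LINKTRAIL_EN = re.compile(r"[a-z]+")

def blend(link_text: str, tail: str) -> tuple[str, str]:
    """Return (displayed link text, remaining plain text) after blending."""
    m = LINKTRAIL_EN.match(tail)
    if m:
        return link_text + m.group(0), tail[m.end():]
    return link_text, tail
```

So [[help]]ers renders as a single link "helpers", while Japanese tail text such as 易い人 never matches [a-z]+ and would be left alone.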
Interesting, I didn't know. I wonder how to approach that part. It makes sense not to have the blend for Japanese (and probably Chinese and other languages without space separation), or any link would result in the remainder of the sentence being highlighted for no reason. I also noticed that I'm not entirely sure if the desired output should have been

Edit: that is, doing something like (with some safety checks for the index): form_of = clean_node(wxr, None, link_node.largs[0]) # This extracts appellare

Edit2: the test suite passes with that change, yet, as stated above, I am worried about introducing a regression in quality, even though it works for this case.
IMO the displayed text "appellāre" should be used, and it should probably be fixed somewhere inside "clean_node()".
The alternative for comparison (the idea above of using largs with checks):

def clean_node_no_trail(wxr: WiktextractContext, node: WikiNode) -> str:
    assert node.largs
    if len(node.largs) >= 2:
        # [[泣く|泣き]]易い人 > 泣き
        return clean_node(wxr, None, node.largs[1])
    elif len(node.largs) == 1:
        # [[感情]]が傷つき > 感情
        return clean_node(wxr, None, node.largs[0])
    else:
        print("WARN: called clean_node_no_trail with no largs")
        return ""
def find_form_of_data(
    wxr: WiktextractContext,
    word_entry: WordEntry,
    sense: Sense,
    list_item_node: WikiNode,
) -> None:
    for node in list_item_node.find_child(NodeKind.TEMPLATE):
        if node.template_name.endswith(" of"):
            expanded_node = wxr.wtp.parse(
                wxr.wtp.node_to_wikitext(node), expand_all=True
            )
            for link_node in expanded_node.find_child_recursively(
                NodeKind.LINK
            ):
                form_of = clean_node(wxr, None, link_node)
                if form_of != "":
                    sense.form_of.append(AltForm(word=form_of))
                    break
    if "form-of" in word_entry.tags and len(sense.form_of) == 0:
        for link_node in list_item_node.find_child(NodeKind.LINK):
            form_of = clean_node_no_trail(wxr, link_node)  # < CHANGE HERE
            if form_of != "":
                sense.form_of.append(AltForm(word=form_of))
                sense.tags.append("form-of")
                break
Sorry, I misjudged the problem. I think the bug is in the wikitextprocessor package: the tail text should not be parsed into a link node if it doesn't match "[a-z]+". The code here https://github.com/tatuylonen/wikitextprocessor/blob/59dc20b855d58a5d3bc1f5f0efa4e1efa958395c/src/wikitextprocessor/parser.py#L1032 should be changed accordingly.
See wiktextract issue #1604 tatuylonen/wiktextract#1604 https://en.wikipedia.org/wiki/Help:Wikitext#Blend_link

This should not be merged as is, because it will create problems in other extractors that might rely on different behavior. In the best-case scenario, there might be two different camps:

1) languages that use spaces and want linktrailing
2) languages without spaces that can't do linktrailing

If this is the case, we might be able to get away with a kludge that checks whether the script of the last character in the link matches the script of the first character after the link.
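That kludge could be sketched as follows, using a crude heuristic: two characters count as same-script when the first word of their Unicode character names matches (LATIN, HIRAGANA, CJK, ...). This is an assumption for illustration only, not how MediaWiki decides linktrailing, and same_script is a hypothetical helper.

```python
# Hedged sketch of the "same script" check: compare the first word of
# each character's Unicode name as a rough stand-in for its script.
import unicodedata

def same_script(a: str, b: str) -> bool:
    try:
        return unicodedata.name(a).split()[0] == unicodedata.name(b).split()[0]
    except ValueError:  # character has no name (e.g. control characters)
        return False
```

Under this check, [[help]]ers would still blend (LATIN followed by LATIN), while [[泣く|泣き]]易い人 would not (HIRAGANA followed by CJK).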
If the linktrail rules differ between editions, changing this in wikitextprocessor could create bugs in other extractors. I already made a PR with the regex fix from above, but this needs to be figured out before merging.
The rule is indeed different in some editions. For example, Greek uses this rule: https://github.com/wikimedia/mediawiki/blob/0f67e1045f35a8c854f3c6a3e2712c2d8b0b54d6/languages/messages/MessagesEl.php#L309C2-L309C11 and the Japanese edition doesn't define the rule, so I guess it uses the English [a-z] rule.
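A per-edition lookup with an English fallback might look like the sketch below. Only the English [a-z]+ pattern is taken from MessagesEn.php as linked earlier; treating editions without a rule (like ja) as falling back to English is this thread's guess, and the LINKTRAIL dict and trail_pattern helper are hypothetical.

```python
# Hypothetical per-edition linktrail table with an English fallback.
import re

LINKTRAIL = {
    "en": re.compile(r"[a-z]+"),
    # "el" and others would be filled in from their MessagesXx.php entries.
}

def trail_pattern(lang_code: str) -> re.Pattern:
    """Return the edition's linktrail pattern, defaulting to English."""
    return LINKTRAIL.get(lang_code, LINKTRAIL["en"])
```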

Partially* fixes #1602
New json:

{
  "word": "appellantur",
  "lang_code": "la",
  "lang": "ラテン語",
  "pos": "verb",
  "pos_title": "動詞",
  "senses": [
    {
      "glosses": ["appellāreの直説法所相現在第三人称複数形。"],
      "tags": ["form-of"],
      "form_of": [{"word": "appellāreの直説法所相現在第三人称複数形"}]
    }
  ],
  "categories": ["ラテン語", "ラテン語 動詞 定形"],
  "tags": ["form-of"]
}

The solution may not be ideal, but I tried to follow what @xxyzz suggested in the issue.
There need to be two for loops because extract_header_nodes happened after processing glosses, but we need the form-of tag from extract_header_nodes beforehand in order for find_form_of_data to succeed. The test only covers half the logic: that the form-of tag appears when a category ending in 定形 is found. If you have any feedback, let me know.
*I am not sure why the raw text is appended to the LINK; is it because there are no spaces on Wiktionary?
wiktextract/src/wiktextract/extractor/ja/pos.py
Line 137 in d146717
This causes:
instead of the desired: