
[ja] Infer form-of from category #1604

Draft
daxida wants to merge 2 commits into tatuylonen:master from daxida:ja-form-of

Conversation

@daxida
Contributor

@daxida daxida commented Mar 3, 2026

Partially* fixes #1602

New json:

{
  "word": "appellantur",
  "lang_code": "la",
  "lang": "ラテン語",
  "pos": "verb",
  "pos_title": "動詞",
  "senses": [
    {
      "glosses": [
        "appellāreの直説法所相現在第三人称複数形。"
      ],
      "tags": [
        "form-of"
      ],
      "form_of": [
        {
          "word": "appellāreの直説法所相現在第三人称複数形"
        }
      ]
    }
  ],
  "categories": [
    "ラテン語",
    "ラテン語 動詞 定形"
  ],
  "tags": [
    "form-of"
  ]
}

The solution may not be ideal, but I tried to follow what @xxyzz suggested in the issue.

There need to be two for loops: extract_header_nodes used to run after the glosses were processed, but find_form_of_data can only succeed if the form-of tag from extract_header_nodes is already set, so the header pass now has to come first.
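
A toy sketch of that ordering constraint. These are simplified stand-ins, not the real wiktextract functions (the actual ones take wxr, WordEntry, Sense, and WikiNode arguments):

```python
def extract_header_nodes(word_entry: dict) -> None:
    # Pass 1: in the real code this inspects header templates/categories;
    # here we just look for a category ending in 定形.
    if any(cat.endswith("定形") for cat in word_entry["categories"]):
        word_entry["tags"].append("form-of")

def find_form_of_data(word_entry: dict, sense: dict) -> None:
    # Pass 2: only fires if pass 1 already added the "form-of" tag.
    if "form-of" in word_entry["tags"] and not sense["form_of"]:
        sense["form_of"].append({"word": "appellāre"})
        sense["tags"].append("form-of")

entry = {"categories": ["ラテン語", "ラテン語 動詞 定形"], "tags": []}
sense = {"form_of": [], "tags": []}
extract_header_nodes(entry)      # must run first...
find_form_of_data(entry, sense)  # ...so this pass sees the tag
print(sense["form_of"])          # [{'word': 'appellāre'}]
```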

The test only covers half of the logic: that the form-of tag appears when a category ending in 定形 is found.

If you have any feedback let me know.


*I am not sure why the raw text is appended to the LINK. Is it because there are no spaces in the Wiktionary text?

    if "form-of" in word_entry.tags and len(sense.form_of) == 0:
        for link_node in list_item_node.find_child(NodeKind.LINK):
            print(link_node) # < <LINK(['appellare'], ['appellāre']){} 'の直説法所相現在第三人称複数形'>
            form_of = clean_node(wxr, None, link_node)
            if form_of != "":
                sense.form_of.append(AltForm(word=form_of))
                sense.tags.append("form-of")
                break

This causes:

      "form_of": [
        {
          "word": "appellāreの直説法所相現在第三人称複数形"
        }
      ]

instead of the desired:

      "form_of": [
       {
         "word": "appellāre"
       }
     ]

@xxyzz
Collaborator

xxyzz commented Mar 4, 2026

The link problem is a feature of MediaWiki called a blend link, but the Japanese text somehow doesn't behave like this. It's also called "Word-ending links" here: https://www.mediawiki.org/wiki/Help:Links#Internal_links. That document says these links 'Follow so-called "linktrail rules" localised per each language.' I have no idea what linktrail rules are.
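
A minimal Python illustration of the behavior, assuming the English [a-z]+ linktrail rule (this only mimics the rule; the real logic lives in MediaWiki's messages files and in wikitextprocessor's parser):

```python
import re

# English linktrail rule: a run of lowercase ASCII letters right after a
# link is blended into the link's display text.
LINKTRAIL = re.compile(r"(?s)^([a-z]+)(.*)")

def blend(link_text: str, tail: str) -> tuple[str, str]:
    """Return (display_text, remaining_tail) after applying the trail rule."""
    m = LINKTRAIL.match(tail)
    if m:
        return link_text + m.group(1), m.group(2)
    return link_text, tail

print(blend("appellāre", "s, a verb"))  # ('appellāres', ', a verb')
# Japanese text does not match [a-z]+, so nothing should be blended:
print(blend("appellāre", "の直説法所相現在第三人称複数形。"))
```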

@xxyzz
Collaborator

xxyzz commented Mar 4, 2026

I think this is the linktrail rule for English: https://github.com/wikimedia/mediawiki/blob/0f67e1045f35a8c854f3c6a3e2712c2d8b0b54d6/languages/messages/MessagesEn.php#L571-L575

MessagesJa.php file doesn't have this rule.

@daxida
Contributor Author

daxida commented Mar 4, 2026

Interesting, I didn't know. I wonder how to approach that part.

It makes sense to not have the blend for Japanese (and probably Chinese and other languages without space separation), or any link would result in the remainder of the sentence being highlighted for no reason.

I also realized I'm not entirely sure whether the desired output should have been appellare rather than appellāre. Do you think taking the link target (i.e. appellare) could be a solution? It would remove the blending, but I am not sure of the consequences.


Edit: that is, doing something like (with some safety checks for the index):

            form_of = clean_node(wxr, None, link_node.largs[0]) # This extracts appellare

Edit2: the test suite passes with that change. Yet, as stated above, I am worried about introducing a quality regression, even though it works for this case.

@xxyzz
Collaborator

xxyzz commented Mar 4, 2026

IMO the displayed text "appellāre" should be used, and it should probably be fixed somewhere inside "clean_node()".

@daxida
Contributor Author

daxida commented Mar 4, 2026

I have tried a couple of things, but looking at the failing tests I don't see how I could change clean_node without unwanted side effects.

If I remove the trailing text in clean_node with, let's say:

    def clean_node_handler_fn_default(
        node: WikiNode,
    ) -> Optional[list[Union[str, WikiNode]]]:
        assert isinstance(node, WikiNode)
        match node.kind:
            case NodeKind.TABLE_CELL | NodeKind.TABLE_HEADER_CELL:
                return node.children
            case NodeKind.LINK:
                # <LINK(['appellare'], ['appellāre']){} 'の直説法所相現在第三人称複数形'> ['の直説法所相現在第三人称複数形']
                # we don't want the trailing text > return ['appellāre'] only
                if (
                    wxr.wtp.lang_code == "ja"
                    and len(node.largs) >= 2
                    and node.largs[1]
                    and node.children
                    and all(isinstance(ch, str) for ch in node.children)
                ):
                    node.children.clear()
                    
                    ...

Then it also removes the text from places where we want to keep it (even though I check that all children are str so that only the last link passes the if).

For example, it will remove しものなり:

NODE= <LINK(['からかう'], ['からかひ']){} 'しものなり'> ['しものなり']

at the end of this quote (in the tests):

[image: screenshot of the quoted test passage]

@daxida
Contributor Author

daxida commented Mar 4, 2026

The alternative for comparison (the idea above of using largs with checks):

def clean_node_no_trail(wxr: WiktextractContext, node: WikiNode) -> str:
    assert node.largs
    if len(node.largs) >= 2:
        # [[泣く|泣き]]易い人 > 泣き
        return clean_node(wxr, None, node.largs[1])
    elif len(node.largs) == 1:
        # [[感情]]が傷つき    > 感情
        return clean_node(wxr, None, node.largs[0])
    else:
        print("WARN: called clean_node_no_trail with no largs")
        return ""


def find_form_of_data(
    wxr: WiktextractContext,
    word_entry: WordEntry,
    sense: Sense,
    list_item_node: WikiNode,
) -> None:
    for node in list_item_node.find_child(NodeKind.TEMPLATE):
        if node.template_name.endswith(" of"):
            expanded_node = wxr.wtp.parse(
                wxr.wtp.node_to_wikitext(node), expand_all=True
            )
            for link_node in expanded_node.find_child_recursively(
                NodeKind.LINK
            ):
                form_of = clean_node(wxr, None, link_node)
                if form_of != "":
                    sense.form_of.append(AltForm(word=form_of))
                    break
    if "form-of" in word_entry.tags and len(sense.form_of) == 0:
        for link_node in list_item_node.find_child(NodeKind.LINK):
            form_of = clean_node_no_trail(wxr, link_node) # < CHANGE HERE
            if form_of != "":
                sense.form_of.append(AltForm(word=form_of))
                sense.tags.append("form-of")
                break               

@xxyzz
Collaborator

xxyzz commented Mar 4, 2026

Sorry, I misjudged the problem. I think the bug is in the wikitextprocessor package: the tail text should not be parsed into a link node if it doesn't match "[a-z]+".

The code here https://github.com/tatuylonen/wikitextprocessor/blob/59dc20b855d58a5d3bc1f5f0efa4e1efa958395c/src/wikitextprocessor/parser.py#L1032 should be m = re.match(r"(?s)([a-z]+)(.*)", token)
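
A quick check of how that pattern behaves (only the regex itself is taken from the comment above; the surrounding parser code is not shown):

```python
import re

# Only a run of ASCII lowercase letters counts as a link trail.
m1 = re.match(r"(?s)([a-z]+)(.*)", "s and more text")
m2 = re.match(r"(?s)([a-z]+)(.*)", "の直説法所相現在第三人称複数形")
print(m1.groups())  # ('s', ' and more text')
print(m2)           # None: the Japanese tail would no longer become link text
```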

kristian-clausal added a commit to tatuylonen/wikitextprocessor that referenced this pull request Mar 5, 2026
See wiktextract issue #1604
tatuylonen/wiktextract#1604
https://en.wikipedia.org/wiki/Help:Wikitext#Blend_link

This should not be merged as is, because it will create problems in
other extractors that might rely on different behavior.

In the best-case scenario, there might be two different camps:
1) Languages that use spaces that want to do linktrailing
2) Languages without spaces that can't do linktrailing

If this is the case, we might be able to get away with a
kludge that checks whether the script of the last character
in the link matches the script of the first character after
the link.
@kristian-clausal
Collaborator

If the linktrail rules differ between editions, changing them in wikitextprocessor can create bugs in other extractors.

I already made a PR with the regex fix from above, but this needs to be figured out before merging.

@xxyzz
Collaborator

xxyzz commented Mar 5, 2026

The rule is indeed different in some editions, for example, Greek uses this rule: https://github.com/wikimedia/mediawiki/blob/0f67e1045f35a8c854f3c6a3e2712c2d8b0b54d6/languages/messages/MessagesEl.php#L309C2-L309C11

and Japanese doesn't define the rule, so I guess it falls back to the English [a-z] rule.



Development

Successfully merging this pull request may close these issues.

[ja] Form-of from template
