
[ja] Infer form-of from category #1604

Draft
daxida wants to merge 2 commits into tatuylonen:master from daxida:ja-form-of

Conversation

@daxida
Contributor

@daxida daxida commented Mar 3, 2026

Partially* fixes #1602

New json:

{
  "word": "appellantur",
  "lang_code": "la",
  "lang": "ラテン語",
  "pos": "verb",
  "pos_title": "動詞",
  "senses": [
    {
      "glosses": [
        "appellāreの直説法所相現在第三人称複数形。"
      ],
      "tags": [
        "form-of"
      ],
      "form_of": [
        {
          "word": "appellāreの直説法所相現在第三人称複数形"
        }
      ]
    }
  ],
  "categories": [
    "ラテン語",
    "ラテン語 動詞 定形"
  ],
  "tags": [
    "form-of"
  ]
}

The solution may not be ideal, but I tried to follow what @xxyzz suggested in the issue.

There need to be two for loops: extract_header_nodes used to run after the glosses were processed, but find_form_of_data can only succeed if the form-of tag from extract_header_nodes is already set, so the header pass now has to come first.
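
A toy sketch of that ordering constraint. These are simplified stand-ins, not the real wiktextract functions (the actual ones take wxr, WordEntry, Sense, and WikiNode arguments):

```python
def extract_header_nodes(word_entry: dict) -> None:
    # Pass 1: in the real code this inspects header templates/categories;
    # here we just look for a category ending in 定形.
    if any(cat.endswith("定形") for cat in word_entry["categories"]):
        word_entry["tags"].append("form-of")

def find_form_of_data(word_entry: dict, sense: dict) -> None:
    # Pass 2: only fires if pass 1 already added the "form-of" tag.
    if "form-of" in word_entry["tags"] and not sense["form_of"]:
        sense["form_of"].append({"word": "appellāre"})
        sense["tags"].append("form-of")

entry = {"categories": ["ラテン語", "ラテン語 動詞 定形"], "tags": []}
sense = {"form_of": [], "tags": []}
extract_header_nodes(entry)      # must run first...
find_form_of_data(entry, sense)  # ...so this pass sees the tag
print(sense["form_of"])          # [{'word': 'appellāre'}]
```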

The test only covers half of the logic: that the form-of tag appears when a category ending in 定形 is found.

If you have any feedback let me know.


*I am not sure why the raw text is appended to the LINK. Is it because there are no spaces in the Wiktionary text?

    if "form-of" in word_entry.tags and len(sense.form_of) == 0:
        for link_node in list_item_node.find_child(NodeKind.LINK):
            print(link_node) # < <LINK(['appellare'], ['appellāre']){} 'の直説法所相現在第三人称複数形'>
            form_of = clean_node(wxr, None, link_node)
            if form_of != "":
                sense.form_of.append(AltForm(word=form_of))
                sense.tags.append("form-of")
                break

This causes:

      "form_of": [
        {
          "word": "appellāreの直説法所相現在第三人称複数形"
        }
      ]

instead of the desired:

      "form_of": [
       {
         "word": "appellāre"
       }
     ]

@xxyzz
Collaborator

xxyzz commented Mar 4, 2026

The link problem is a feature of MediaWiki called a blend link, but the Japanese text somehow doesn't behave like this. It's also called "Word-ending links" here: https://www.mediawiki.org/wiki/Help:Links#Internal_links. That document says these links 'Follow so-called "linktrail rules" localised per each language.' I have no idea what linktrail rules are.
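
A minimal Python illustration of the behavior, assuming the English [a-z]+ linktrail rule (this only mimics the rule; the real logic lives in MediaWiki's messages files and in wikitextprocessor's parser):

```python
import re

# English linktrail rule: a run of lowercase ASCII letters right after a
# link is blended into the link's display text.
LINKTRAIL = re.compile(r"(?s)^([a-z]+)(.*)")

def blend(link_text: str, tail: str) -> tuple[str, str]:
    """Return (display_text, remaining_tail) after applying the trail rule."""
    m = LINKTRAIL.match(tail)
    if m:
        return link_text + m.group(1), m.group(2)
    return link_text, tail

print(blend("appellāre", "s, a verb"))  # ('appellāres', ', a verb')
# Japanese text does not match [a-z]+, so nothing should be blended:
print(blend("appellāre", "の直説法所相現在第三人称複数形。"))
```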

@xxyzz
Collaborator

xxyzz commented Mar 4, 2026

I think this is the linktrail rule for English: https://github.com/wikimedia/mediawiki/blob/0f67e1045f35a8c854f3c6a3e2712c2d8b0b54d6/languages/messages/MessagesEn.php#L571-L575

MessagesJa.php file doesn't have this rule.

@daxida
Contributor Author

daxida commented Mar 4, 2026

Interesting, I didn't know. I wonder how to approach that part.

It makes sense to not have the blend for Japanese (and probably Chinese and other languages without space separation), or any link would result in the remainder of the sentence being highlighted for no reason.

I also realized I'm not entirely sure whether the desired output should have been appellare rather than appellāre. Do you think taking the link target (i.e. appellare) could be a solution? It would remove the blending, but I am not sure of the consequences.


Edit: that is, doing something like (with some safety checks for the index):

            form_of = clean_node(wxr, None, link_node.largs[0]) # This extracts appellare

Edit2: the test suite passes with that change. Yet, as stated above, I am worried about introducing a quality regression, even though it works for this case.

@xxyzz
Collaborator

xxyzz commented Mar 4, 2026

IMO the displayed text "appellāre" should be used, and it should probably be fixed somewhere inside "clean_node()".

@daxida
Contributor Author

daxida commented Mar 4, 2026

I have tried a couple of things, but looking at the failing tests I don't see how I could change clean_node without unwanted side effects.

If I remove the trailing text in clean_node with, let's say:

    def clean_node_handler_fn_default(
        node: WikiNode,
    ) -> Optional[list[Union[str, WikiNode]]]:
        assert isinstance(node, WikiNode)
        match node.kind:
            case NodeKind.TABLE_CELL | NodeKind.TABLE_HEADER_CELL:
                return node.children
            case NodeKind.LINK:
                # <LINK(['appellare'], ['appellāre']){} 'の直説法所相現在第三人称複数形'> ['の直説法所相現在第三人称複数形']
                # we don't want the trailing text > return ['appellāre'] only
                if (
                    wxr.wtp.lang_code == "ja"
                    and len(node.largs) >= 2
                    and node.largs[1]
                    and node.children
                    and all(isinstance(ch, str) for ch in node.children)
                ):
                    node.children.clear()
                    
                    ...

Then it also removes the text from places where we want to keep it (even though I check that all children are str so that only the last link passes the if).

For example, it will remove しものなり:

NODE= <LINK(['からかう'], ['からかひ']){} 'しものなり'> ['しものなり']

at the end of this quote (in the tests):

[image: screenshot of the quoted test passage]

@daxida
Contributor Author

daxida commented Mar 4, 2026

The alternative for comparison (the idea above of using largs with checks):

def clean_node_no_trail(wxr: WiktextractContext, node: WikiNode) -> str:
    assert node.largs
    if len(node.largs) >= 2:
        # [[泣く|泣き]]易い人 > 泣き
        return clean_node(wxr, None, node.largs[1])
    elif len(node.largs) == 1:
        # [[感情]]が傷つき    > 感情
        return clean_node(wxr, None, node.largs[0])
    else:
        print("WARN: called clean_node_no_trail with no largs")
        return ""


def find_form_of_data(
    wxr: WiktextractContext,
    word_entry: WordEntry,
    sense: Sense,
    list_item_node: WikiNode,
) -> None:
    for node in list_item_node.find_child(NodeKind.TEMPLATE):
        if node.template_name.endswith(" of"):
            expanded_node = wxr.wtp.parse(
                wxr.wtp.node_to_wikitext(node), expand_all=True
            )
            for link_node in expanded_node.find_child_recursively(
                NodeKind.LINK
            ):
                form_of = clean_node(wxr, None, link_node)
                if form_of != "":
                    sense.form_of.append(AltForm(word=form_of))
                    break
    if "form-of" in word_entry.tags and len(sense.form_of) == 0:
        for link_node in list_item_node.find_child(NodeKind.LINK):
            form_of = clean_node_no_trail(wxr, link_node) # < CHANGE HERE
            if form_of != "":
                sense.form_of.append(AltForm(word=form_of))
                sense.tags.append("form-of")
                break               

@xxyzz
Collaborator

xxyzz commented Mar 4, 2026

Sorry, I misjudged the problem. I think the bug is in the wikitextprocessor package: the tail text should not be parsed into a link node if it doesn't match "[a-z]+".

The code here https://github.com/tatuylonen/wikitextprocessor/blob/59dc20b855d58a5d3bc1f5f0efa4e1efa958395c/src/wikitextprocessor/parser.py#L1032 should be m = re.match(r"(?s)([a-z]+)(.*)", token)
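
A quick check of how that pattern behaves (only the regex itself is taken from the comment above; the surrounding parser code is not shown):

```python
import re

# Only a run of ASCII lowercase letters counts as a link trail.
m1 = re.match(r"(?s)([a-z]+)(.*)", "s and more text")
m2 = re.match(r"(?s)([a-z]+)(.*)", "の直説法所相現在第三人称複数形")
print(m1.groups())  # ('s', ' and more text')
print(m2)           # None: the Japanese tail would no longer become link text
```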

kristian-clausal added a commit to tatuylonen/wikitextprocessor that referenced this pull request Mar 5, 2026
See wiktextract issue #1604
tatuylonen/wiktextract#1604
https://en.wikipedia.org/wiki/Help:Wikitext#Blend_link

This should not be merged as is, because it will create problems in
other extractors that might rely on different behavior.

In the best-case scenario, there might be two different camps:
1) Languages that use spaces that want to do linktrailing
2) Languages without spaces that can't do linktrailing

If this is the case, we might be able to get away with a
kludge that checks whether the script of the last character
in the link matches the script of the first character after
the link.
@kristian-clausal
Collaborator

If the linktrail rules differ between editions, changing them in wikitextprocessor can create bugs in other extractors.

I already made a PR with the regex fix from above, but this needs to be figured out before merging.

@xxyzz
Collaborator

xxyzz commented Mar 5, 2026

The rule is indeed different in some editions, for example, Greek uses this rule: https://github.com/wikimedia/mediawiki/blob/0f67e1045f35a8c854f3c6a3e2712c2d8b0b54d6/languages/messages/MessagesEl.php#L309C2-L309C11

and Japanese doesn't define the rule, so I guess it falls back to the English [a-z] rule.



Development

Successfully merging this pull request may close these issues.

[ja] Form-of from template
