Add auto_identifiers support to Man Reader #11675

Merged
jgm merged 30 commits into jgm:main from smc181002:issue-8852
May 30, 2026

@smc181002
Contributor

This PR enables the auto_identifiers extension for the Man reader, allowing parsed headers to receive identifier attributes.

These identifiers are used by writers for section links and table-of-contents generation. The auto_identifiers extension is now enabled by default for Man.

pandoc --toc -s --from man --to html < test-man-8852.1 | tail -n 17
<body>
<header id="title-block-header">
<h1 class="title">TEST</h1>
<p class="date">MAY 2023</p>
</header>
<nav id="TOC" role="doc-toc">
<ul>
<li><a href="#name" id="toc-name">NAME</a></li>
<li><a href="#synopsis" id="toc-synopsis">SYNOPSIS</a></li>
</ul>
</nav>
<h1 id="name">NAME</h1>
<p>text</p>
<h1 id="synopsis">SYNOPSIS</h1>
<p>text</p>
</body>
</html>
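
The ids in the output above (`name`, `synopsis`) are derived from the heading text. As a rough illustration only, the sketch below approximates the auto_identifiers rules described in the pandoc manual on plain strings. `autoId` and `uniquify` are hypothetical names, not pandoc functions; the real reader works on parsed inlines and goes through registerHeader.

```haskell
import Data.Char (isAlpha, isAlphaNum, isSpace, toLower)
import Data.List (intercalate)

-- Hypothetical helper (not pandoc's code): approximate the documented
-- auto_identifiers rules on a plain heading string.
autoId :: String -> String
autoId heading =
  case dropWhile (not . isAlpha) slug of
    "" -> "section"   -- fallback when nothing usable is left
    s  -> s
  where
    -- keep letters, digits, whitespace, and the characters _ - .
    keep c = isAlphaNum c || isSpace c || c `elem` "_-."
    -- join words with hyphens, then lowercase everything
    slug   = map toLower (intercalate "-" (words (filter keep heading)))

-- Hypothetical helper: keep the first occurrence of an identifier and
-- append -1, -2, ... to later duplicates.
uniquify :: [String] -> [String]
uniquify = go []
  where
    go _ [] = []
    go seen (x : xs) =
      let candidates = x : [x ++ "-" ++ show n | n <- [1 :: Int ..]]
          chosen = case filter (`notElem` seen) candidates of
                     (c : _) -> c
                     []      -> x
      in chosen : go (chosen : seen) xs
```

Under these assumptions, `map autoId ["NAME", "SYNOPSIS"]` gives `["name", "synopsis"]`, matching the ids in the HTML above.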

New test cases have been added to verify:

  • auto_identifiers
  • gfm_auto_identifiers
  • ascii_identifiers

Closes #8852

smc181002 and others added 25 commits May 17, 2026 14:10
Add support for the auto_identifiers, gfm_auto_identifiers, and
ascii_identifiers extensions in the man reader. Section headings
parsed from .SH and .SS macros now receive auto-generated id
attributes when the extension is enabled, enabling --toc to
produce working anchor links.

- Add autoIdExtensions to default man extensions
- Add HasReaderOptions, HasLogMessages and HasIdentifierList to
  ManState to run registerHeader
- Use headerWith instead of header to attach the computed Attr with
  identifiers

Closes jgm#8852
- The auto identifiers are verified with 3 different test cases.
- The same 3 test cases are verified with the GFM auto identifiers algorithm.
- The ascii_identifiers option is additionally tested with one test case.

Closes jgm#8852
E.g. for zh-Hant-TW look for (in order) zh-Hant-TW.yaml,
zh-Hant.yaml, zh.yaml.
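
The lookup order described here can be sketched as follows. This is a minimal illustration, not the code from the commit: `localeCandidates` and `splitOn` are hypothetical helpers, and the sketch only drops trailing subtags one at a time.

```haskell
import Data.List (inits, intercalate)

-- Hypothetical helper: candidate translation files for a language tag,
-- most specific first, obtained by dropping trailing subtags.
localeCandidates :: String -> [FilePath]
localeCandidates tag =
  [ intercalate "-" parts ++ ".yaml"
  | parts <- reverse (drop 1 (inits (splitOn '-' tag))) ]

-- Split a string on a separator character (base has no such function).
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest
```

With this sketch, `localeCandidates "zh-Hant-TW"` evaluates to `["zh-Hant-TW.yaml", "zh-Hant.yaml", "zh.yaml"]`; a caller would try each file in turn and use the first one that exists.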

Closes jgm#11648.
If tblHeader exists but has `w:val="0"`, then don't consider
the element a header.

See jgm#8299...but this change doesn't seem to fix things completely.
This led to some table rows being wrongly considered header rows.

We now correctly handle the example from
jgm#8299 (comment)

See jgm#8299.
(Instead of using raw HTML.)

The "aside" class is added to the Div.
Also, add "header" class to Divs created from headers.

See jgm#11626.
...if otherwise the label doesn't come after anything.
(In this case typst will raise an error.)

Closes jgm#11568.
This change ensures that raw content marked `epub2` will appear in (only) EPUBv2 output
and content marked `epub3` will appear in (only) EPUBv3 output.
This fixes a bug which produced too-narrow columns in some cases.

Closes jgm#11664.
`stringify` returns the empty string for a MetaString, so each keyword
in the `cp:keywords` list of `docProps/core.xml` was rendered as empty.
Convert each metadata value like `lookupMetaString` does instead.

Signed-off-by: Sai Asish Y <say.apm35@gmail.com>
We parse these as DefinitionList items, but we previously
sometimes stopped prematurely in including material in the
definition.  We should include everything until we hit a new
indentation-changing macro.

Closes jgm#11668.
Previously the OpenDocument writer emitted a fresh automatic style
(L1..Ln, P1..Pn, T1..Tn) for nearly every list, list-item paragraph,
block quote, preformatted block, and inline text style.  This produced
large ODT files, made `--reference-doc` customization ineffective (the
user's predefined styles were never referenced), and gave each list its
own indentation independent of any containing block quote.

This commit teaches the writer to reference the predefined styles that
LibreOffice ships and that pandoc's reference.odt now exports:

- Bullet lists use `List_20_1`; ordered lists with default start and
  decimal format use `Numbering_20_1`.  Non-default ordered lists
  generate a single named override style (`Pandoc_Numbering_N`)
  memoised by (ListNumberStyle, ListNumberDelim); a non-default start
  value with the default format is expressed via `text:start-value`
  on the `text:list` element instead of a new style.
- List-item paragraphs use `List_20_Bullet[_Tight]` and
  `List_20_Number[_Tight]`.  The Tight variants are pandoc-specific
  (zero top/bottom margin) and are injected into the user's
  reference.odt if missing, just like the Skylighting token styles.
- Block quotes use the predefined `Quotations` paragraph style
  directly.  Nested block quotes use a single automatic style that
  inherits from Quotations and only adds extra margin-left, so a list
  inside a block quote now inherits its container's indent (jgm#2747).
- Preformatted blocks use `Preformatted_20_Text` directly.
- Emphasis, Strong, Strikeout, Subscript, Superscript and Code spans
  use the predefined `Emphasis`, `Strong_20_Emphasis`, `Strikeout`,
  `Subscript`, `Superscript` and `Source_20_Text` text styles.
- `paraStyle`/`paraStyleFromParent` no longer emit a wrapper automatic
  style when its only attribute would be `parent-style-name`; the
  parent name is returned directly.
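
The memoisation of override styles mentioned in the first bullet could look roughly like the sketch below. It is an assumption-laden illustration, not the writer's code: `NumStyle` and `NumDelim` are simplified stand-ins for pandoc's ListNumberStyle and ListNumberDelim, `styleFor` is a hypothetical name, numbering from 1 is a guess, and the default case that maps to `Numbering_20_1` is not modelled.

```haskell
import qualified Data.Map.Strict as Map

-- Simplified stand-ins for pandoc's ListNumberStyle and ListNumberDelim.
data NumStyle = Decimal | LowerAlpha | UpperRoman deriving (Eq, Ord, Show)
data NumDelim = Period | OneParen deriving (Eq, Ord, Show)

-- Cache from list format to the override style already allocated for it.
type StyleCache = Map.Map (NumStyle, NumDelim) String

-- Hypothetical helper: return the style name for a non-default list
-- format, allocating a new Pandoc_Numbering_N name only the first time
-- that format is seen.
styleFor :: (NumStyle, NumDelim) -> StyleCache -> (String, StyleCache)
styleFor key cache =
  case Map.lookup key cache of
    Just name -> (name, cache)
    Nothing ->
      let name = "Pandoc_Numbering_" ++ show (Map.size cache + 1)
      in (name, Map.insert key name cache)
```

Two lists with the same style and delimiter then share one named style instead of each producing its own automatic style.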

Closes jgm#9136.
Closes jgm#5086.
Closes jgm#2747.
Closes jgm#3426.
Closes jgm#7336.

Co-authored by: Claude Opus 4.7.
Like the other table syntaxes (pipe, simple, and multiline tables) and
block-level constructs generally, a grid table may now be indented by up
to three spaces and still be recognized as a table.  Previously the
grid-table parser required the table to begin at the left margin, so an
indented grid table was parsed as a paragraph.

The leading indentation is stripped uniformly from each line before the
table is parsed, so an indented grid table produces the same AST as its
non-indented equivalent.
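
One way to picture the uniform stripping step is the sketch below. It is a guess at the idea, not the parser's code: `stripTableIndent` is a hypothetical name, and the sketch assumes the first line's indentation (capped at three spaces) determines how much is removed from every line. It does not check whether the input is actually a grid table.

```haskell
-- Hypothetical helper: remove the first line's indentation (at most
-- three spaces) from every line of a candidate grid table.
stripTableIndent :: [String] -> [String]
stripTableIndent [] = []
stripTableIndent ls@(firstLine : _) = map dropIndent ls
  where
    -- indentation of the first line, capped at three spaces
    indent = min 3 (length (takeWhile (== ' ') firstLine))
    -- drop only spaces, and only within the first `indent` columns
    dropIndent l =
      let (lead, rest) = splitAt indent l
      in dropWhile (== ' ') lead ++ rest
```

After this step an indented table such as `["  +---+", "  | a |", "  +---+"]` becomes `["+---+", "| a |", "+---+"]`, so it can be handed to the same parser as a table at the left margin.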

Adds a command test.
@smc181002
Contributor Author

Looks like there were new commits on the main branch with new tests.

I will rebase on main and make the changes.

Additionally, I wrote the previous test cases in the old test suite style. I will move them to test/command/8852.md.

Comment on lines +437 to +438
attr <- registerHeader nullAttr contents

Owner

Note that registerHeader doesn't emit log messages directly with report; it adds them to a list of log messages in state (using addLogMessage); to make sure that items in this list are actually output, you need to call reportLogMessages after parsing is finished. (I actually no longer remember why we had to do this indirect thing in the markdown reader, rather than using report directly, but there was some reason registerHeader was designed this way.)

Contributor Author

registerHeader would not write any log messages for ManState, because it only generates them when there are duplicate identifiers.

In Markdown, log messages are generated when duplicate identifiers are defined in the document. Man has no option for defining identifiers.

I can still add reportLogMessages to the ParseMan function if required.

Owner

That's true. OK.

Comment thread test/Tests/Readers/Man.hs Outdated
  , "H1" =:
    ".SH The header\n"
-   =?> header 1 (text "The header")
+   =?> headerWith ("",[],[]) 1 (text "The header")
Owner

I didn't understand why this was needed. Isn't header == headerWith ("",[],[])?

Comment thread test/Tests/Old.hs Outdated
]
, testGroup "man"
[ test' "reader" ["-r", "man", "-w", "native", "-s"]
[ test' "reader" ["-r", "man-auto_identifiers", "-w", "native", "-s"]
Owner

I think it makes more sense to test man, and adjust the expected output.
You can do this quickly with make TESTARGS=--accept.

Owner

The test of the -auto_identifiers case could be moved to the command test above.

Contributor Author

Right, I agree. I will move the old cases to command tests.

@jgm jgm merged commit 1ffeb85 into jgm:main May 30, 2026
6 of 10 checks passed
@jgm
Owner

jgm commented May 30, 2026

Looks good - I squashed and merged in one commit.

Development

Successfully merging this pull request may close these issues.

enable auto_identifiers for man reader

5 participants