Merged
28 changes: 14 additions & 14 deletions doclang/doclang.sch
@@ -18,10 +18,10 @@

<sch:pattern id="list-structure">
<sch:rule context="dl:list[*]">
<sch:let name="first-non-header" value="*[not(self::dl:label or self::dl:thread or self::dl:xref or self::dl:href or self::dl:layer or self::dl:location or self::dl:caption or self::dl:custom)][1]"/>
<sch:let name="first-non-header" value="*[not(self::dl:label or self::dl:thread or self::dl:xref or self::dl:href or self::dl:layer or self::dl:location or self::dl:caption or self::dl:description or self::dl:summary or self::dl:custom)][1]"/>

<sch:assert test="not($first-non-header) or $first-non-header[self::dl:ldiv]">
List must have ldiv as first element after optional element head (property elements: label, thread, xref, href, layer, location, caption, custom).
List must have ldiv as first element after optional element head (property elements: label, thread, xref, href, layer, location, caption, description, summary, custom).
Found: <sch:value-of select="if ($first-non-header) then name($first-non-header) else 'nothing'"/>
</sch:assert>
</sch:rule>
@@ -33,13 +33,13 @@

<sch:pattern id="table-structure">
<sch:rule context="dl:table[*] | dl:index[*]">
<sch:let name="first-non-header" value="*[not(self::dl:label or self::dl:thread or self::dl:xref or self::dl:href or self::dl:layer or self::dl:location or self::dl:caption or self::dl:custom)][1]"/>
<sch:let name="first-non-header" value="*[not(self::dl:label or self::dl:thread or self::dl:xref or self::dl:href or self::dl:layer or self::dl:location or self::dl:caption or self::dl:description or self::dl:summary or self::dl:custom)][1]"/>

<sch:assert test="not($first-non-header) or
$first-non-header[self::dl:fcel or self::dl:ecel or self::dl:ched or
self::dl:rhed or self::dl:corn or self::dl:srow or
self::dl:lcel or self::dl:ucel or self::dl:xcel]">
Table and index must have cell-starting token as first element after optional element head (property elements: label, thread, xref, href, layer, location, caption, custom).
Table and index must have cell-starting token as first element after optional element head (property elements: label, thread, xref, href, layer, location, caption, description, summary, custom).
Found: <sch:value-of select="if ($first-non-header) then name($first-non-header) else 'nothing'"/>
</sch:assert>
</sch:rule>
@@ -71,21 +71,21 @@

<!-- ============================================ -->
<!-- ELEMENT HEAD: Text must not precede property elements -->
<!-- Property elements: label, thread, xref, href, layer, location, caption, custom (per XSD element_head group) -->
<!-- Property elements: label, thread, xref, href, layer, location, caption, description, summary, custom (per XSD element_head group) -->
<!-- This rule applies to regular semantic elements AND virtual <text> in lists/tables -->
<!-- ============================================ -->

<sch:pattern id="element-head-placement">
<sch:rule context="dl:text | dl:heading | dl:code | dl:formula | dl:caption |
<sch:rule context="dl:text | dl:heading | dl:code | dl:formula | dl:caption | dl:description | dl:summary |
dl:page_header | dl:page_footer | dl:footnote | dl:picture | dl:marker |
dl:field_region | dl:field_heading | dl:field_item | dl:key | dl:value |
dl:list | dl:table | dl:index | dl:group">
<sch:let name="header-elements" value="dl:label | dl:thread | dl:xref | dl:href | dl:layer | dl:location | dl:caption | dl:custom"/>
<sch:let name="header-elements" value="dl:label | dl:thread | dl:xref | dl:href | dl:layer | dl:location | dl:caption | dl:description | dl:summary | dl:custom"/>

<sch:let name="text-before-header" value="text()[following-sibling::*[self::dl:label or self::dl:thread or self::dl:xref or self::dl:href or self::dl:layer or self::dl:location or self::dl:caption or self::dl:custom]]"/>
<sch:let name="text-before-header" value="text()[following-sibling::*[self::dl:label or self::dl:thread or self::dl:xref or self::dl:href or self::dl:layer or self::dl:location or self::dl:caption or self::dl:description or self::dl:summary or self::dl:custom]]"/>

<sch:assert test="every $t in $text-before-header satisfies normalize-space($t) = ''">
Property elements in the element head (label, thread, xref, href, layer, location, caption, custom) must appear before any non-whitespace text content.
Property elements in the element head (label, thread, xref, href, layer, location, caption, description, summary, custom) must appear before any non-whitespace text content.
Found non-whitespace text before element head: '<sch:value-of select="normalize-space(string-join($text-before-header, ''))"/>'
</sch:assert>
</sch:rule>
@@ -234,7 +234,7 @@
then following-sibling::node()[following-sibling::dl:ldiv[1] is $next-ldiv]
else following-sibling::node()"/>

<sch:let name="header-elements" value="$item-content[self::dl:label or self::dl:thread or self::dl:xref or self::dl:href or self::dl:layer or self::dl:location or self::dl:caption or self::dl:custom]"/>
<sch:let name="header-elements" value="$item-content[self::dl:label or self::dl:thread or self::dl:xref or self::dl:href or self::dl:layer or self::dl:location or self::dl:caption or self::dl:description or self::dl:summary or self::dl:custom]"/>

<sch:let name="first-header-index" value="if ($header-elements)
then index-of($item-content, $header-elements[1])[1]
@@ -246,7 +246,7 @@
else ()"/>

<sch:assert test="empty($text-before-header)">
In list items (virtual text), property elements in the element head (label, thread, xref, href, layer, location, caption, custom) must appear before any non-whitespace text content.
In list items (virtual text), property elements in the element head (label, thread, xref, href, layer, location, caption, description, summary, custom) must appear before any non-whitespace text content.
Found non-whitespace text before element head: '<sch:value-of select="normalize-space(string-join($text-before-header, ''))"/>'
</sch:assert>
</sch:rule>
@@ -274,7 +274,7 @@
then following-sibling::node()[following-sibling::*[. is $next-token]]
else following-sibling::node()[not(following-sibling::dl:nl)]"/>

<sch:let name="header-elements" value="$cell-content[self::dl:label or self::dl:thread or self::dl:xref or self::dl:href or self::dl:layer or self::dl:location or self::dl:caption or self::dl:custom]"/>
<sch:let name="header-elements" value="$cell-content[self::dl:label or self::dl:thread or self::dl:xref or self::dl:href or self::dl:layer or self::dl:location or self::dl:caption or self::dl:description or self::dl:summary or self::dl:custom]"/>

<sch:let name="first-header-index" value="if ($header-elements)
then index-of($cell-content, $header-elements[1])[1]
@@ -286,7 +286,7 @@
else ()"/>

<sch:assert test="empty($text-before-header)">
In table and index cells (virtual text), property elements in the element head (label, thread, xref, href, layer, location, caption, custom) must appear before any non-whitespace text content.
In table and index cells (virtual text), property elements in the element head (label, thread, xref, href, layer, location, caption, description, summary, custom) must appear before any non-whitespace text content.
Found non-whitespace text before element head: '<sch:value-of select="normalize-space(string-join($text-before-header, ''))"/>'
</sch:assert>
</sch:rule>
@@ -337,7 +337,7 @@

<sch:pattern id="picture-body">
<sch:rule context="dl:picture">
<sch:let name="first-body" value="*[not(self::dl:label or self::dl:thread or self::dl:xref or self::dl:href or self::dl:layer or self::dl:location or self::dl:caption or self::dl:custom)][1]"/>
<sch:let name="first-body" value="*[not(self::dl:label or self::dl:thread or self::dl:xref or self::dl:href or self::dl:layer or self::dl:location or self::dl:caption or self::dl:description or self::dl:summary or self::dl:custom)][1]"/>

<sch:assert test="empty(dl:tabular) or @class = 'chart'">
Element tabular is only allowed in picture with class="chart".
18 changes: 16 additions & 2 deletions doclang/doclang.xsd
@@ -218,7 +218,7 @@
</xs:element>

<!-- ============================================ -->
<!-- ELEMENT HEAD (element_head): optional label, thread, xref or href, layer, location_block, caption, custom -->
<!-- ELEMENT HEAD (element_head): optional label, thread, xref or href, layer, location_block, caption, description, summary, custom -->
<!-- ============================================ -->

<xs:group name="element_head">
@@ -232,6 +232,8 @@
<xs:element ref="dl:layer" minOccurs="0"/>
<xs:group ref="dl:location_block" minOccurs="0"/>
<xs:element ref="dl:caption" minOccurs="0"/>
<xs:element ref="dl:description" minOccurs="0"/>
<xs:element ref="dl:summary" minOccurs="0"/>
<xs:element ref="dl:custom" minOccurs="0"/>
</xs:sequence>
</xs:group>
@@ -396,6 +398,18 @@

<xs:element name="caption" type="dl:component_with_semantic_seq"/>

<!-- ============================================ -->
<!-- DESCRIPTION ELEMENT: raw text only (optionally wrapped in content) -->
<!-- ============================================ -->

<xs:element name="description" type="dl:content_cat"/>

<!-- ============================================ -->
<!-- SUMMARY ELEMENT: raw text only (optionally wrapped in content) -->
<!-- ============================================ -->

<xs:element name="summary" type="dl:content_cat"/>

<!-- ============================================ -->
<!-- PAGE_HEADER ELEMENT: uses component_with_semantic_seq -->
<!-- ============================================ -->
@@ -442,7 +456,7 @@
<!-- ============================================ -->
<!-- TOP_LEVEL_CAT: reusable group for top-level document elements -->
<!-- Includes: text, heading, code, formula, page_header, page_footer, footnote, list, field_region, field_heading, field_item, key, value, picture, table, index, group -->
<!-- Note: caption appears only in the element head (element_head), not in body content -->
<!-- Note: caption, description, and summary appear only in the element head (element_head), not in body content -->
<!-- ============================================ -->

<xs:group name="top_level_cat">
Binary file modified exports/doclang-styled.docx
Binary file not shown.
Binary file modified exports/doclang.docx
Binary file not shown.
Binary file modified reference/input/reference.xlsx
Binary file not shown.
66 changes: 63 additions & 3 deletions spec.md
@@ -153,6 +153,8 @@ The XML content of a semantic element begins with an *element head*, which is a
- [`<layer>`](#layer) (optional)
- optional sequence of 4 [`<location>`](#location)s, whereby values are interpreted in alternating axis order, as `x_min, y_min, x_max, y_max` (after resolution normalization), w.r.t. the top-left corner of the page
- [`<caption>`](#caption) (optional)
- [`<description>`](#description) (optional)
- [`<summary>`](#summary) (optional)
- [`<custom>`](#custom) (optional)
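
The ordering constraint on the element head can be illustrated with a small standalone check. This is only a sketch using Python's standard `xml.etree.ElementTree`, not the project's validator (the XSD `element_head` group remains authoritative). The names `HEAD_RANK`, `local_name`, and `head_order_violations` are hypothetical, and the sketch assumes the head order label, thread, xref or href, layer, location, caption, description, summary, custom. It does not check cardinality, for example that there are exactly four `<location>` elements.

```python
import xml.etree.ElementTree as ET

# Relative position of each head element (assumed order, see lead-in).
# xref and href share a slot; repeated location elements share one as well.
HEAD_RANK = {
    "label": 0, "thread": 1, "xref": 2, "href": 2, "layer": 3,
    "location": 4, "caption": 5, "description": 6, "summary": 7, "custom": 8,
}


def local_name(elem):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return elem.tag.rsplit("}", 1)[-1]


def head_order_violations(element):
    """Return (earlier, later) name pairs for head children that are out of order."""
    violations = []
    highest_rank, highest_name = -1, None
    for child in element:
        name = local_name(child)
        if name not in HEAD_RANK:
            break  # the first non-head child starts the element body
        if HEAD_RANK[name] < highest_rank:
            violations.append((highest_name, name))
        else:
            highest_rank, highest_name = HEAD_RANK[name], name
    return violations


doc = ET.fromstring(
    '<doclang xmlns="https://www.doclang.ai/ns/v0">'
    "<picture>"
    "<description>Derived description of the figure.</description>"
    "<caption>FIG. 1. Document caption</caption>"
    '<src uri="fig1.png"/>'
    "</picture>"
    "</doclang>"
)
print(head_order_violations(doc[0]))  # prints [('description', 'caption')]
```

An empty result means the head children appear in the assumed order; each reported pair names the element that came too early and the one that should have preceded it.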

#### Element Body
@@ -338,6 +340,20 @@ Bar chart using [recommended label](#recommendations) and [`<tabular>`](#tabular
</picture>
```

Picture with document [caption](#caption), [description](#description), and [summary](#summary):

```xml
<picture>
<caption>FIG. 2. System architecture</caption>
<description>
A block diagram with a browser client, an application server, and a database.
Arrows show HTTP requests from client to server and SQL queries from server to database.
</description>
<summary>The system uses a three-tier architecture.</summary>
<src uri="fig2.png"/>
</picture>
```
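
A consumer may want to keep document content (the caption) apart from derived meta-information (description and summary). The following is a minimal sketch of reading the three head elements from the example above with Python's standard `xml.etree.ElementTree`. The helper name `head_texts` is hypothetical, and the `<doclang>` wrapper with its namespace is assumed here so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

NS = {"dl": "https://www.doclang.ai/ns/v0"}

SAMPLE = """<doclang xmlns="https://www.doclang.ai/ns/v0">
  <picture>
    <caption>FIG. 2. System architecture</caption>
    <description>
      A block diagram with a browser client, an application server, and a database.
      Arrows show HTTP requests from client to server and SQL queries from server to database.
    </description>
    <summary>The system uses a three-tier architecture.</summary>
    <src uri="fig2.png"/>
  </picture>
</doclang>"""


def head_texts(element):
    """Map caption/description/summary to their whitespace-normalized text."""
    result = {}
    for name in ("caption", "description", "summary"):
        child = element.find(f"dl:{name}", NS)
        if child is not None:
            # itertext() also collects text nested inside child elements
            result[name] = " ".join("".join(child.itertext()).split())
    return result


picture = ET.fromstring(SAMPLE).find("dl:picture", NS)
for name, text in head_texts(picture).items():
    print(f"{name}: {text}")
```

Elements that are absent are simply left out of the returned dictionary, which matches their optional status in the element head.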

### Code snippets

Code content is captured with `<code>`, either as a standalone block or inlined within a semantic element. For language classification, use a [`<label>`](#label) in the element head (see [Recommendations](#recommendations)).
@@ -2441,7 +2457,7 @@ None

##### `<caption>`

Optional part of the element head for capturing an associated caption.
Optional part of the element head for capturing an associated caption. Unlike [`<description>`](#description) or [`<summary>`](#summary), a [`<caption>`](#caption) is an actual document component and can carry its own location information, for example a caption printed underneath a chart.

###### Allowed Context

Expand Down Expand Up @@ -2539,6 +2555,46 @@ Can only be part of the [element head](#element-head) of a semantic element.
| Raw text | Not allowed |
| Primary semantic elements | Not allowed |

##### `<description>`

Optional part of the element head for capturing a derived textual account of what the host component is or what it shows. Unlike [`<caption>`](#caption), [`<description>`](#description) is meta-information rather than an actual document component, for example a picture description inferred by a model.

###### Allowed Context

Can only be part of the [element head](#element-head) of a semantic element.

###### Attributes

None

###### Allowed Content Types

| Content Type | Allowed / Not allowed |
| --- | --- |
| Element head | Not allowed |
| Raw text | Allowed |
| Primary semantic elements | Not allowed |

##### `<summary>`

Optional part of the element head for capturing a derived textual distillation of what the host component conveys. Unlike [`<caption>`](#caption), [`<summary>`](#summary) is meta-information and not part of the original document content.

###### Allowed Context

Can only be part of the [element head](#element-head) of a semantic element.

###### Attributes

None

###### Allowed Content Types

| Content Type | Allowed / Not allowed |
| --- | --- |
| Element head | Not allowed |
| Raw text | Allowed |
| Primary semantic elements | Not allowed |
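
The raw-text-only restriction on `<description>` and `<summary>` can be spot-checked outside the schema. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `rich_meta_elements` is hypothetical. It assumes that a `<content>` wrapper is the only child element tolerated, based on the XSD comment "optionally wrapped in content", and it does not inspect what `<content>` itself contains.

```python
import xml.etree.ElementTree as ET

DL = "{https://www.doclang.ai/ns/v0}"


def rich_meta_elements(root):
    """Return local names of description/summary elements that hold markup."""
    offenders = []
    for name in ("description", "summary"):
        for elem in root.iter(DL + name):
            # Assumption: <content> is the only child element tolerated here.
            if any(child.tag != DL + "content" for child in elem):
                offenders.append(name)
    return offenders


doc = ET.fromstring(
    '<doclang xmlns="https://www.doclang.ai/ns/v0">'
    "<table>"
    "<description><bold>This</bold> should not be possible.</description>"
    "<summary>Quarterly revenue.</summary>"
    "<fcel/><text>Q1</text><fcel/><text>100</text><nl/>"
    "</table>"
    "</doclang>"
)
print(rich_meta_elements(doc))  # prints ['description']
```

Schema validation against the XSD remains the authoritative check; a helper like this is only useful for quick diagnostics in a pipeline that produces derived descriptions or summaries.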

##### `<custom>`

Optional part of the element head; custom metadata, e.g. for application-specific purposes. See [Recommendations](#recommendations) for naming and namespacing guidance for custom vocabularies.
@@ -3221,6 +3277,10 @@ The token vocabulary trades off size and inference cost:
| `</hint>` | [`hint`](#hint) end |
| `<caption>` | [`caption`](#caption) start |
| `</caption>` | [`caption`](#caption) end |
| `<description>` | [`description`](#description) start |
| `</description>` | [`description`](#description) end |
| `<summary>` | [`summary`](#summary) start |
| `</summary>` | [`summary`](#summary) end |
| `<thread thread_id="` | [`thread`](#thread) with `thread_id` attribute start |
| `<xref thread_id="` | [`xref`](#xref) with `thread_id` attribute start |
| `<href uri="` | [`href`](#href) with `uri` attribute start |
@@ -3264,7 +3324,7 @@ The token vocabulary trades off size and inference cost:
| `<ucel/>` | [`ucel`](#ucel) |
| `<xcel/>` | [`xcel`](#xcel) |
| `<nl/>` | [`nl`](#nl) |
| `<ldiv/>` | [`lddiv`](#ldiv) |
| `<ldiv/>` | [`ldiv`](#ldiv) |
| `<ldiv><marker>` | start of [`ldiv`](#ldiv) with [`marker`](#marker) |
| `</marker></ldiv>` | end of [`ldiv`](#ldiv) with [`marker`](#marker) |
| `<location value="0"/>`, `<location value="1"/>`, ..., `<location value="511"/>` | [`location`](#location) tokens with values from 0 to 511 |
@@ -3292,7 +3352,7 @@ Below we list the reserved core metadata elements to be used within `<head>`:
- `language`, Identifies the (human) language of the document, e.g., English, German, French, Spanish, Japanese. The content MUST be an [ISO 639-3](https://iso639-3.sil.org/about) language identifier. Optional attributes: `classifier` (the tool/method used, e.g., fastText) and `score` (confidence in [0, 1]). Multiple `language` entries MAY be provided.
- `generated_by`, upstream pipeline information, e.g. VLM ID
- `topic`, the topic the document most likely falls under, such as Science and Technology, Legal, etc. Topics should preferably come from a taxonomy. `classifier` identifies the classifier used to assign the topic, and `score` is that classifier's confidence, with 0 <= score <= 1. One or more `topic` entries may be provided.
- `summary`, a summary of the document
- `summary`, a summary of the document (document-level; distinct from element-head [`<summary>`](#summary) on individual components)
- `document_hash`, hash of the document; `hash_function` names the algorithm used to compute it, e.g., SHA2. One or more `document_hash` entries may be provided.

Here is an example:
11 changes: 11 additions & 0 deletions tests/data/invalid/nok_description_before_caption.dclg
@@ -0,0 +1,11 @@
<?xml version="1.0" encoding="UTF-8"?>
<doclang xmlns="https://www.doclang.ai/ns/v0">

<!-- description must follow caption in the element head -->
<picture>
<description>Derived description of the figure.</description>
<caption>FIG. 1. Document caption</caption>
<src uri="fig1.png"/>
</picture>

</doclang>
10 changes: 10 additions & 0 deletions tests/data/invalid/nok_description_in_body.dclg
@@ -0,0 +1,10 @@
<?xml version="1.0" encoding="UTF-8"?>
<doclang xmlns="https://www.doclang.ai/ns/v0">

<!-- description must be in the element head, not after body content -->
<group>
<text>Body content first.</text>
<description>Late description</description>
</group>

</doclang>
7 changes: 7 additions & 0 deletions tests/data/invalid/nok_description_in_doclang.dclg
@@ -0,0 +1,7 @@
<?xml version="1.0" encoding="UTF-8"?>
<doclang xmlns="https://www.doclang.ai/ns/v0">

<!-- description must be in a semantic element's element head, not a direct child of doclang -->
<description>Standalone description</description>

</doclang>
13 changes: 13 additions & 0 deletions tests/data/invalid/nok_rich_description.dclg
@@ -0,0 +1,13 @@
<?xml version="1.0" encoding="UTF-8"?>
<doclang xmlns="https://www.doclang.ai/ns/v0">

<table>
<description><bold>This</bold> should not be possible.</description>
<fcel/>
<text>Q1</text>
<fcel/>
<text>100</text>
<nl/>
</table>

</doclang>
13 changes: 13 additions & 0 deletions tests/data/invalid/nok_rich_summary.dclg
@@ -0,0 +1,13 @@
<?xml version="1.0" encoding="UTF-8"?>
<doclang xmlns="https://www.doclang.ai/ns/v0">

<table>
<summary><bold>This</bold> should not be possible.</summary>
<fcel/>
<text>Q1</text>
<fcel/>
<text>100</text>
<nl/>
</table>

</doclang>
12 changes: 12 additions & 0 deletions tests/data/invalid/nok_summary_before_description.dclg
@@ -0,0 +1,12 @@
<?xml version="1.0" encoding="UTF-8"?>
<doclang xmlns="https://www.doclang.ai/ns/v0">

<!-- summary must follow description in the element head -->
<picture>
<summary>Derived gist.</summary>
<description>Derived account of the figure.</description>
<caption>FIG. 1. Document caption</caption>
<src uri="fig1.png"/>
</picture>

</doclang>
10 changes: 10 additions & 0 deletions tests/data/invalid/nok_summary_in_body.dclg
@@ -0,0 +1,10 @@
<?xml version="1.0" encoding="UTF-8"?>
<doclang xmlns="https://www.doclang.ai/ns/v0">

<!-- summary must be in the element head, not after body content -->
<group>
<text>Body content first.</text>
<summary>Late summary</summary>
</group>

</doclang>
7 changes: 7 additions & 0 deletions tests/data/invalid/nok_summary_in_doclang.dclg
@@ -0,0 +1,7 @@
<?xml version="1.0" encoding="UTF-8"?>
<doclang xmlns="https://www.doclang.ai/ns/v0">

<!-- summary must be in a semantic element's element head, not a direct child of doclang -->
<summary>Standalone summary</summary>

</doclang>