Merged
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -1,6 +1,6 @@
# Contributing

Thanks for helping improve the DocLang standard and reference validator.
Thanks for helping improve the DocLang standard and reference toolkit.

## Prerequisites

@@ -20,7 +20,7 @@ CI installs only the `ci` group (`uv sync --frozen --no-default-groups --group c
## Repository layout

- **`spec.md`** — normative specification
- **`doclang/`** — reference validator (XSD, Schematron, CLI); see [doclang/README.md](./doclang/README.md) for package usage
- **`doclang/`** — reference toolkit (Python package, CLI); see [doclang/README.md](./doclang/README.md) for usage
- **`reference/`** — source data for Appendix A (Excel, examples)
- **`exports/`** — generated Word exports from `spec.md`
- **`utils/`** — maintenance scripts (version sync, reference generation, DOCX export, release preparation)
14 changes: 10 additions & 4 deletions README.md
@@ -16,28 +16,34 @@

**[DocLang](https://www.doclang.ai/) is the AI-native markup format for unstructured content** — including documents, images, and more. It maps cleanly to LLM tokens while preserving structure, semantics, layout, and geometry in a single, unambiguous representation.

This repository is the home of the normative specification and the reference validator for DocLang. If you build with LLMs and VLMs on real-world content, this is where the standard lives.
This repository is the home of the normative specification and the reference toolkit for DocLang. If you build with LLMs and VLMs on real-world content, this is where the standard lives.

## Specification

The source of the specification is available in [spec.md](https://github.com/doclang-project/doclang/blob/main/spec.md)
and exports to different formats can be found in the [exports/](https://github.com/doclang-project/doclang/tree/main/exports)
directory.

## Reference Validator
## Reference Toolkit

You can install the validator from PyPI:
You can install the toolkit from PyPI:

```bash
pip install doclang
```

You can then validate a DocLang document as follows:
### Validation

```bash
doclang validate -n my_document.dclg
```

### Packaging

```bash
doclang pack my_document.dclg
```

For more details, see the [doclang/README.md](https://github.com/doclang-project/doclang/blob/main/doclang/README.md).

## Citation
44 changes: 38 additions & 6 deletions doclang/README.md
@@ -1,22 +1,22 @@
# DocLang Validation
# DocLang Toolkit

Validate DocLang XML documents against XSD schema and Schematron rules.
Official Python toolkit for working with DocLang — CLI commands and library APIs.

## Installation

```bash
pip install doclang
```

## Usage
## CLI

### Basic CLI Usage
### Validation

```bash
doclang validate my_document.dclg
```

### More CLI Usage Scenarios
#### More validation scenarios

```bash
## Inject DocLang namespace if document doesn't declare it:
@@ -38,7 +38,26 @@ doclang validate my_document.dclg --quiet
doclang --help
```

### Python API
### Packaging

```bash
doclang pack markup.dclg
```

#### More packaging scenarios

```bash
doclang pack markup.dclg -o report.dclx
doclang pack markup.dclg --pages screenshots/
doclang pack markup.dclg --page a.png --page b.png
doclang pack markup.dclg --asset chart.svg=exports/diagram.svg
doclang pack markup.dclg --assets payload/
doclang pack markup.dclg --validate
```

## Python API

### Validation

```python
from doclang import validate, ValidationError
@@ -52,6 +71,19 @@ except ValidationError as exc:
    print(f"{exc.schematron_errors=}")
```

### Packaging

```python
from doclang import pack, PackagingError

path = pack(
    "markup.dclg",
    pages="screenshots/",
    assets={"chart.svg": "exports/diagram.svg"},
)
print(f"Created {path}")
```
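
Judging from `_place_pages` in this change, page files passed as a sequence are numbered from 1 in order, page files passed as a mapping keep their explicit page numbers, and a directory is copied with its file names unchanged. The sketch below only illustrates that naming rule. `planned_page_names` is a hypothetical helper, not part of the `doclang` API, and it skips the checks the real code performs (file existence, positive page numbers).

```python
from collections.abc import Mapping
from pathlib import Path


def planned_page_names(pages) -> list[str]:
    """Hypothetical helper: archive names for sequence or mapping page inputs."""
    if isinstance(pages, Mapping):
        # Mapping input: keys are explicit page numbers.
        items = pages.items()
    elif isinstance(pages, (str, Path)):
        # Directory input is copied as-is, so there is nothing to rename.
        raise ValueError("directory input keeps its own file names")
    else:
        # Sequence input: pages are numbered in order, starting at 1.
        items = enumerate(pages, start=1)
    # Each page keeps the suffix of its source file.
    return [f"pages/{number}{Path(source).suffix}" for number, source in items]


sequence_names = planned_page_names(["a.png", "b.jpg"])
mapping_names = planned_page_names({3: "scan.webp", 1: "cover.png"})
print(sequence_names)
print(mapping_names)
```

Under this rule the original base name of a page file is discarded; only its extension survives in the archive.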

## Validation Rules

### XSD Validation (doclang.xsd)
5 changes: 3 additions & 2 deletions doclang/__init__.py
@@ -1,5 +1,6 @@
"""DocLang reference validator."""
"""DocLang reference toolkit."""

from doclang.packaging import PackagingError, pack
from doclang.validation import ValidationError, validate

__all__ = ["ValidationError", "validate"]
__all__ = ["PackagingError", "ValidationError", "pack", "validate"]
184 changes: 184 additions & 0 deletions doclang/_packaging.py
@@ -0,0 +1,184 @@
"""Internal implementation for DocLang archive packaging."""

from __future__ import annotations

import shutil
import tempfile
import zipfile
from collections.abc import Mapping, Sequence
from pathlib import Path
from typing import Union

_CONTENT_TYPES_XML = """\
<?xml version="1.0" encoding="UTF-8"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="png" ContentType="image/png"/>
<Default Extension="jpg" ContentType="image/jpeg"/>
<Default Extension="jpeg" ContentType="image/jpeg"/>
<Default Extension="webp" ContentType="image/webp"/>
<Override PartName="/document.xml" ContentType="application/vnd.doclang.document+xml"/>
</Types>
"""

_RELS_XML = """\
<?xml version="1.0" encoding="UTF-8"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1"
Type="http://doclang.ai/ns/package/2026/relationships/document"
Target="document.xml"/>
</Relationships>
"""

PagesInput = Union[
    str,
    Path,
    Sequence[Union[str, Path]],
    Mapping[int, Union[str, Path]],
]

AssetsInput = Union[
    str,
    Path,
    Mapping[str, Union[str, Path]],
]


class PackagingError(Exception):
    """Raised when DocLang archive packaging fails."""


def _require_file(path: Path, *, label: str) -> None:
    if not path.is_file():
        raise PackagingError(f"{label} not found or not a file: {path}")


def _require_directory(path: Path, *, label: str) -> None:
    if not path.is_dir():
        raise PackagingError(f"{label} not found or not a directory: {path}")


def _validate_archive_relative_path(path: str, *, label: str) -> None:
    if not path or path.startswith("/") or "\\" in path:
        raise PackagingError(f"Invalid {label} path: {path!r}")
    parts = Path(path).parts
    if ".." in parts or path in {".", ".."}:
        raise PackagingError(f"Invalid {label} path: {path!r}")


def _copy_tree_into(source: Path, destination: Path) -> None:
    destination.mkdir(parents=True, exist_ok=True)
    for item in source.iterdir():
        target = destination / item.name
        if item.is_dir():
            shutil.copytree(item, target, dirs_exist_ok=True)
        else:
            shutil.copy2(item, target)


def _place_document(stage: Path, document: Path) -> None:
    _require_file(document, label="Document")
    shutil.copy2(document, stage / "document.xml")


def _place_pages(stage: Path, pages: PagesInput) -> None:
    pages_dir = stage / "pages"
    if isinstance(pages, Mapping):
        for page_number, source in pages.items():
            if not isinstance(page_number, int) or page_number < 1:
                raise PackagingError(f"Page numbers must be positive integers, got {page_number!r}")
            source_path = Path(source)
            _require_file(source_path, label="Page file")
            destination = pages_dir / f"{page_number}{source_path.suffix}"
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source_path, destination)
        return

    # Tuple form is equivalent to `str | Path` and also works before Python 3.10.
    if isinstance(pages, (str, Path)):
        source_dir = Path(pages)
        _require_directory(source_dir, label="Pages directory")
        _copy_tree_into(source_dir, pages_dir)
        return

    for index, source in enumerate(pages, start=1):
        source_path = Path(source)
        _require_file(source_path, label="Page file")
        destination = pages_dir / f"{index}{source_path.suffix}"
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_path, destination)


def _place_assets(stage: Path, assets: AssetsInput) -> None:
    assets_dir = stage / "assets"
    if isinstance(assets, Mapping):
        for archive_path, source in assets.items():
            _validate_archive_relative_path(archive_path, label="asset")
            source_path = Path(source)
            _require_file(source_path, label="Asset file")
            destination = assets_dir / archive_path
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source_path, destination)
        return

    source_dir = Path(assets)
    _require_directory(source_dir, label="Assets directory")
    _copy_tree_into(source_dir, assets_dir)


def _write_opc_metadata(stage: Path) -> None:
    (stage / "[Content_Types].xml").write_text(_CONTENT_TYPES_XML, encoding="utf-8")
    rels_dir = stage / "_rels"
    rels_dir.mkdir(parents=True, exist_ok=True)
    (rels_dir / ".rels").write_text(_RELS_XML, encoding="utf-8")


def _should_exclude_zip_member(arcname: str) -> bool:
    parts = arcname.split("/")
    if "__MACOSX" in parts:
        return True
    name = parts[-1]
    return name == ".DS_Store" or name.startswith("._")


def _create_zip(stage: Path, output: Path) -> None:
    output.parent.mkdir(parents=True, exist_ok=True)
    if output.exists():
        output.unlink()
    with zipfile.ZipFile(output, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(stage.rglob("*")):
            if not path.is_file():
                continue
            arcname = path.relative_to(stage).as_posix()
            if _should_exclude_zip_member(arcname):
                continue
            archive.write(path, arcname)


def _pack(
    document: Union[str, Path],
    *,
    output: Union[str, Path, None] = None,
    pages: PagesInput | None = None,
    assets: AssetsInput | None = None,
    validate: bool = False,
) -> Path:
    document_path = Path(document)
    output_path = Path(output) if output is not None else document_path.with_suffix(".dclx")

    with tempfile.TemporaryDirectory() as temp_dir:
        stage = Path(temp_dir)
        _place_document(stage, document_path)
        if pages is not None:
            _place_pages(stage, pages)
        if assets is not None:
            _place_assets(stage, assets)
        _write_opc_metadata(stage)

        if validate:
            from doclang.validation import validate as validate_document

            validate_document(stage / "document.xml")

        _create_zip(stage, output_path)

    return output_path.resolve()
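
The staging-then-zip flow above implies a fixed archive layout: `document.xml`, `[Content_Types].xml` and `_rels/.rels` at the root, plus optional `pages/` and `assets/` folders, with macOS metadata files filtered out. The following is a minimal standalone sketch of that layout using only the standard library. It does not import `doclang`; `build_demo_archive` and all file contents are placeholders invented for illustration, and the exclusion check is a simplified copy of `_should_exclude_zip_member` (it omits the `__MACOSX` case).

```python
import tempfile
import zipfile
from pathlib import Path


def build_demo_archive(workdir: Path) -> list[str]:
    """Stage a tiny package by hand and zip it in the style of _create_zip."""
    stage = workdir / "stage"
    (stage / "pages").mkdir(parents=True)
    (stage / "assets").mkdir()
    (stage / "_rels").mkdir()

    # Placeholder payloads; a real package holds actual markup and images.
    (stage / "document.xml").write_text("<doclang/>", encoding="utf-8")
    (stage / "[Content_Types].xml").write_text("<Types/>", encoding="utf-8")
    (stage / "_rels" / ".rels").write_text("<Relationships/>", encoding="utf-8")
    (stage / "pages" / "1.png").write_bytes(b"not-a-real-png")
    (stage / "assets" / "chart.svg").write_text("<svg/>", encoding="utf-8")
    (stage / "assets" / ".DS_Store").write_bytes(b"")  # expected to be skipped

    output = workdir / "demo.dclx"
    with zipfile.ZipFile(output, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(stage.rglob("*")):
            if not path.is_file():
                continue  # directories are implied by member names
            arcname = path.relative_to(stage).as_posix()
            name = arcname.split("/")[-1]
            if name == ".DS_Store" or name.startswith("._"):
                continue  # simplified version of the macOS metadata filter
            archive.write(path, arcname)

    # Read the archive back to see which members were actually written.
    with zipfile.ZipFile(output) as archive:
        return archive.namelist()


with tempfile.TemporaryDirectory() as tmp:
    members = build_demo_archive(Path(tmp))
print(members)
```

Sorting the staged paths before writing keeps the member order deterministic across runs, which makes archives easier to compare.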