and contributors"]
version = "0.3.8"
-
-[deps]
-Mmap = "a63ad114-7e13-5084-954f-fe012c677804"
-OrderedCollections = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
+authors = ["Josh Day and contributors"]
[compat]
-OrderedCollections = "1.4, 1.5"
julia = "1.6"
diff --git a/README.md b/README.md
index ddb1156..b446f61 100644
--- a/README.md
+++ b/README.md
@@ -4,39 +4,8 @@
Read and write XML in pure Julia.
-
-
-# Introduction
-
-This package offers fast data structures for reading and writing XML files with a consistent interface:
-
-### `Node`/`LazyNode` Interface:
-
-```
-nodetype(node) → XML.NodeType (an enum type)
-tag(node) → String or Nothing
-attributes(node) → OrderedDict{String, String} or Nothing
-value(node) → String or Nothing
-children(node) → Vector{typeof(node)}
-is_simple(node) → Bool (whether node is simple .e.g. <tag>item</tag>)
-simple_value(node) → e.g. "item" from <tag>item</tag>)
-```
-
-
-
-### Extended Interface for `LazyNode`
-
-```
-depth(node) → Int
-next(node) → typeof(node)
-prev(node) → typeof(node)
-parent(node) → typeof(node)
-```
-
-
-
# Quickstart
```julia
@@ -58,79 +27,76 @@ doc[end][2] # Second child of root
# Node Element (6 children)
```
-
-
-# Data Structures that Represent XML Nodes
+
-## Preliminary: `NodeType`
+# `Node` Interface
-- Each item in an XML DOM is classified by its `NodeType`.
-- Every `XML.jl` struct defines a `nodetype(x)` method that returns its `NodeType`.
+Every node in the XML DOM is represented by `Node`, a single type parametrized on its string storage.
-| NodeType | XML Representation | `Node` Constructor |
-|----------|--------------------|------------------|
-| `Document` | An entire document | `Document(children...)`
-| `DTD` | `<!DOCTYPE ...>` | `DTD(...) `
-| `Declaration` | `<?xml attributes... ?>` | `Declaration(; attrs...)`
-| `ProcessingInstruction` | `<?NAME attributes... ?>` | `ProcessingInstruction(tag; attrs...)`
-| `Comment` | `<!-- ... -->` | `Comment(text)`
-| `CData` | `<![CData[...]]>` | `CData(text)`
-| `Element` | `<NAME attributes... > children... </NAME>` | `Element(tag, children...; attrs...)`
-| `Text` | the `text` part of `<name>text</name>` | `Text(text)`
+```
+nodetype(node) -> XML.NodeType (an enum)
+tag(node) -> String or Nothing
+attributes(node) -> Dict{String, String} or Nothing
+value(node) -> String or Nothing
+children(node) -> Vector{Node}
+is_simple(node) -> Bool (e.g. <tag>text</tag>)
+simple_value(node) -> e.g. "text" from <tag>text</tag>
+```
-## `Node`: Probably What You're Looking For
+## `NodeType`
-- `read`-ing a `Node` loads the entire XML DOM in memory.
-- See the table above for convenience constructors.
-- `Node`s have some additional methods that aid in construction/mutation:
+Each item in an XML DOM is classified by its `NodeType`:
-```julia
-# Add a child:
-push!(parent::Node, child::Node)
+| NodeType | XML Representation | Constructor |
+|----------|--------------------|-------------|
+| `Document` | An entire document | `Document(children...)` |
+| `DTD` | `<!DOCTYPE ...>` | `DTD(...)` |
+| `Declaration` | `<?xml attributes... ?>` | `Declaration(; attrs...)` |
+| `ProcessingInstruction` | `<?NAME content... ?>` | `ProcessingInstruction(tag, content)` |
+| `Comment` | `<!-- ... -->` | `Comment(text)` |
+| `CData` | `<![CDATA[...]]>` | `CData(text)` |
+| `Element` | `<NAME attributes... > children... </NAME>` | `Element(tag, children...; attrs...)` |
+| `Text` | the `text` part of `<name>text</name>` | `Text(text)` |
-# Replace a child:
-parent[2] = child
-
-# Add/change an attribute:
-node["key"] = value
-
-node["key"]
-```
+
-- `Node` is an immutable type. However, you can easily create a copy with one or more field values changed by using the `Node(::Node, children...; attrs...)` constructor where `children` are appended to the source node's children and `attrs` are appended to the node's attributes.
+## Mutation
```julia
-node = XML.Element("tag", "child")
-# Node Element (1 child)
+push!(parent, child) # Add a child
+parent[2] = child # Replace a child
+node["key"] = "value" # Add/change an attribute
+node["key"] # Get an attribute
+```
-simple_value(node)
-# "child"
+
-node2 = Node(node, "added"; id="my-id")
-# Node Element (2 children)
+## Tree Navigation
-node2.children
-# 2-element Vector{Node}:
-# Node Text "child"
-# Node Text "added"
+```julia
+depth(child, root) # Depth of child relative to root
+parent(child, root) # Parent of child within root's tree
+siblings(child, root) # Siblings of child within root's tree
```
-### Writing `Element` `Node`s with `XML.h`
+
+
+## Writing Elements with `XML.h`
Similar to [Cobweb.jl](https://github.com/JuliaComputing/Cobweb.jl#-creating-nodes-with-cobwebh), `XML.h` enables you to write elements with a simpler syntax:
```julia
using XML: h
-julia> node = h.parent(
- h.child("first child content", id="id1"),
- h.child("second child content", id="id2")
- )
+node = h.parent(
+ h.child("first child content", id="id1"),
+ h.child("second child content", id="id2")
+)
# Node Element (2 children)
-julia> print(XML.write(node))
+print(XML.write(node))
# <parent>
#   <child id="id1">first child content</child>
#   <child id="id2">second child content</child>
@@ -139,111 +105,95 @@ julia> print(XML.write(node))
-## `XML.LazyNode`: For Fast Iteration through an XML File
-
-A lazy data structure that just keeps track of the position in the raw data (`Vector{UInt8}`) to read from.
-
-- You can iterate over a `LazyNode` to "read" through an XML file:
-
-```julia
-doc = read(filename, LazyNode)
-
-foreach(println, doc)
-# LazyNode Declaration
-# LazyNode Element
-# LazyNode Element
-# LazyNode Element
-# LazyNode Text "Gambardella, Matthew"
-# LazyNode Element
-# ⋮
-```
-
-
-
# Reading
```julia
-# Reading from file:
+# From a file:
read(filename, Node)
-read(filename, LazyNode)
-
-# Parsing from string:
-parse(Node, str)
-parse(LazyNode, str)
+# From a string:
+parse(str, Node)
```
-
+
# Writing
```julia
XML.write(filename::String, node) # write to file
-
-XML.write(io::IO, node) # write to stream
-
-XML.write(node) # String
+XML.write(io::IO, node) # write to stream
+XML.write(node) # return String
```
+`XML.write` respects `xml:space="preserve"` on elements, suppressing automatic indentation.
-
-
-# Performance
-
-- XML.jl performs comparatively to [EzXML.jl](https://github.com/JuliaIO/EzXML.jl), which wraps the C library [libxml2](https://gitlab.gnome.org/GNOME/libxml2/-/wikis/home).
-- See the `benchmarks/suite.jl` for the code to produce these results.
-- The following output was generated in a Julia session with the following `versioninfo`:
-
-```
-julia> versioninfo()
-Julia Version 1.9.4
-Commit 8e5136fa297 (2023-11-14 08:46 UTC)
-Build Info:
- Official https://julialang.org/ release
-Platform Info:
- OS: macOS (arm64-apple-darwin22.4.0)
- CPU: 10 × Apple M1 Pro
- WORD_SIZE: 64
- LIBM: libopenlibm
- LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
- Threads: 8 on 8 virtual cores
-```
+
+# XPath
-### Reading an XML File
+Query nodes using a subset of XPath 1.0 via `xpath(node, path)`:
-```
- XML.LazyNode 0.009583
- XML.Node ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 1071.32
- EzXML.readxml ■■■■■■■■■ 284.346
- XMLDict.xml_dict ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 1231.47
-```
+```julia
+doc = parse("""
+<root>
+  <a id="1"><b>hello</b></a>
+  <a id="2"><b>world</b></a>
+</root>
+""", Node)
+
+root = doc[end]
+
+xpath(root, "//b") # All <b> descendants
+xpath(root, "a[@id='2']/b") # <b> inside <a id="2">
+xpath(root, "a[1]") # First <a> child
+xpath(root, "//b/text()") # Text nodes inside all <b>s
+```
+
+### Supported syntax
+
+| Expression | Description |
+|------------|-------------|
+| `/` | Root / path separator |
+| `tag` | Child element by name |
+| `*` | Any child element |
+| `//` | Descendant-or-self (recursive) |
+| `.` | Current node |
+| `..` | Parent node |
+| `[n]` | Positional predicate (1-based) |
+| `[@attr]` | Has-attribute predicate |
+| `[@attr='v']` | Attribute-value predicate |
+| `text()` | Text node children |
+| `node()` | All node children |
+| `@attr` | Attribute value (returns strings) |
-### Writing an XML File
+
-```
- Write: XML ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 289.638
- Write: EzXML ■■■■■■■■■■■■■ 93.4631
-```
+# Streaming Tokenizer
-### Lazily Iterating over Each Node
-```
- LazyNode ■■■■■■■■■ 51.752
- EzXML.StreamReader ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 226.271
-```
+For large files or when you need fine-grained control, `XML.XMLTokenizer` provides a streaming tokenizer that yields tokens without building a DOM:
-### Collecting All Names/Tags in an XML File
-```
- XML.LazyNode ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 210.482
- EzXML.StreamReader ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 276.238
- EzXML.readxml ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 263.269
+```julia
+using XML.XMLTokenizer
+
+for token in tokenize("text")
+ println(token.kind, " => ", repr(token.raw))
+end
+# TOKEN_OPEN_TAG_START => " ">"
+# TOKEN_OPEN_TAG_START => " "attr"
+# TOKEN_ATTR_VALUE => "\"val\""
+# TOKEN_CLOSE_TAG => ">"
+# TOKEN_TEXT => "text"
+# TOKEN_END_TAG => ""
+# TOKEN_END_TAG => ""
```
-
-# Possible Gotchas
+# Escaping
+
+XML.jl doesn't automatically escape special characters (`<`, `>`, `&`, `"`, `'`) for you. Use the provided utility functions:
-- XML.jl doesn't automatically escape special characters (`<`, `>`, `&`, `"`, and `'` ) for you. However, we provide utility functions for doing the conversions back and forth:
- - `XML.escape(::String)` and `XML.unescape(::String)`
- - `XML.escape!(::Node)` and `XML.unescape!(::Node)`.
+- `XML.escape(::String)` / `XML.unescape(::String)` -- transform strings.
+- `XML.escape!(::Node)` / `XML.unescape!(::Node)` -- transform an entire node tree in-place.
diff --git a/benchmarks/Project.toml b/benchmarks/Project.toml
index ed90996..0598016 100644
--- a/benchmarks/Project.toml
+++ b/benchmarks/Project.toml
@@ -2,7 +2,7 @@
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
EzXML = "8f5d6c58-4d21-5cfd-889c-e3ad7ee6a615"
-OrderedCollections = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
+LightXML = "9c8b4983-aa76-5018-a973-4c85ecc9e179"
UnicodePlots = "b8865327-cd53-5732-bb35-84acbb429228"
XML = "72c71f33-b9b6-44de-8c94-c961784809e2"
XMLDict = "228000da-037f-5747-90a9-8195ccbf91a5"
diff --git a/benchmarks/benchmarks.jl b/benchmarks/benchmarks.jl
new file mode 100644
index 0000000..aa558b9
--- /dev/null
+++ b/benchmarks/benchmarks.jl
@@ -0,0 +1,99 @@
+using XML
+using XML: Element, nodetype, tag, children
+using EzXML: EzXML
+using XMLDict: XMLDict
+using LightXML: LightXML
+using BenchmarkTools
+using DataFrames
+using UnicodePlots
+
+BenchmarkTools.DEFAULT_PARAMETERS.seconds = 10
+BenchmarkTools.DEFAULT_PARAMETERS.samples = 20000
+
+#-----------------------------------------------------------------------------# Test data
+# Small file (~120 lines)
+small_file = joinpath(@__DIR__, "..", "test", "data", "books.xml")
+small_xml = read(small_file, String)
+
+df = DataFrame(kind=String[], name=String[], bench=BenchmarkTools.Trial[])
+
+macro add_benchmark(kind, name, expr...)
+ esc(:(let
+ @info string($kind, " - ", $name)
+ bench = @benchmark $(expr...)
+ push!(df, (; kind=$kind, name=$name, bench))
+ end))
+end
+
+#-----------------------------------------------------------------------------# Parse (small)
+@add_benchmark "Parse (small)" "XML.jl" parse($small_xml, Node)
+@add_benchmark "Parse (small)" "EzXML" EzXML.parsexml($small_xml)
+@add_benchmark "Parse (small)" "LightXML" LightXML.parse_string($small_xml)
+@add_benchmark "Parse (small)" "XMLDict" XMLDict.xml_dict($small_xml)
+
+#-----------------------------------------------------------------------------# Write (small)
+@add_benchmark "Write (small)" "XML.jl" XML.write(o) setup=(o = parse(small_xml, Node))
+@add_benchmark "Write (small)" "EzXML" sprint(print, o) setup=(o = EzXML.parsexml(small_xml))
+@add_benchmark "Write (small)" "LightXML" LightXML.save_file(o, f) setup=(o = LightXML.parse_string(small_xml); f = tempname()) teardown=(LightXML.free(o); rm(f, force=true))
+
+#-----------------------------------------------------------------------------# Collect element tags
+function xml_collect_tags(node)
+ out = String[]
+ _xml_collect_tags!(out, node)
+ out
+end
+function _xml_collect_tags!(out, node)
+ for c in children(node)
+ if nodetype(c) === Element
+ push!(out, tag(c))
+ _xml_collect_tags!(out, c)
+ end
+ end
+end
+
+function ezxml_collect_tags(node::EzXML.Node)
+ out = String[]
+ _ezxml_collect_tags!(out, node)
+ out
+end
+function _ezxml_collect_tags!(out, node::EzXML.Node)
+ for child in EzXML.eachelement(node)
+ push!(out, child.name)
+ _ezxml_collect_tags!(out, child)
+ end
+end
+
+function lightxml_collect_tags(root::LightXML.XMLElement)
+ out = String[]
+ _lightxml_collect_tags!(out, root)
+ out
+end
+function _lightxml_collect_tags!(out, el::LightXML.XMLElement)
+ for child in LightXML.child_elements(el)
+ push!(out, LightXML.name(child))
+ _lightxml_collect_tags!(out, child)
+ end
+end
+
+@add_benchmark "Collect tags" "XML.jl" xml_collect_tags(o) setup=(o = parse(small_xml, Node))
+@add_benchmark "Collect tags" "EzXML" ezxml_collect_tags(o.root) setup=(o = EzXML.parsexml(small_xml))
+@add_benchmark "Collect tags" "LightXML" lightxml_collect_tags(LightXML.root(o)) setup=(o = LightXML.parse_string(small_xml)) teardown=(LightXML.free(o))
+
+#-----------------------------------------------------------------------------# Results
+function plot_group(df, kind)
+ g = groupby(df, :kind)
+ haskey(g, (;kind)) || return
+ sub = g[(;kind)]
+ x = map(row -> "$(row.name)", eachrow(sub))
+ y = map(x -> median(x).time / 1e6, sub.bench)
+ display(barplot(x, y, title = "$kind — median time (ms)", border=:none, width=50))
+ println()
+end
+
+println("\n", "="^60)
+println(" BENCHMARK RESULTS")
+println("="^60, "\n")
+
+for kind in unique(df.kind)
+ plot_group(df, kind)
+end
diff --git a/benchmarks/suite.jl b/benchmarks/suite.jl
deleted file mode 100644
index e06dc61..0000000
--- a/benchmarks/suite.jl
+++ /dev/null
@@ -1,74 +0,0 @@
-using Pkg
-Pkg.activate(@__DIR__)
-
-using XML
-using EzXML: EzXML
-using XMLDict: XMLDict
-using BenchmarkTools
-using DataFrames
-using UnicodePlots
-using OrderedCollections: OrderedDict
-
-
-BenchmarkTools.DEFAULT_PARAMETERS.seconds = 10
-BenchmarkTools.DEFAULT_PARAMETERS.samples = 20000
-
-
-# nasa.xml was downloaded from:
-# http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/www/repository.html#nasa
-file = joinpath(@__DIR__, "nasa.xml")
-
-df = DataFrame(kind=String[], name=String[], bench=BenchmarkTools.Trial[])
-
-macro add_benchmark(kind, name, expr...)
- esc(:(let
- @info string($kind, " - ", $name)
- bench = @benchmark $(expr...)
- push!(df, (; kind=$kind, name=$name, bench))
- end))
-end
-
-#-----------------------------------------------------------------------------# Write
-@add_benchmark "Write" "XML.write" XML.write($(tempname()), o) setup = (o = read(file, Node))
-@add_benchmark "Write" "EzXML.writexml" EzXML.write($(tempname()), o) setup = (o = EzXML.readxml(file))
-
-#-----------------------------------------------------------------------------# Read
-@add_benchmark "Read" "XML.LazyNode" read($file, LazyNode)
-@add_benchmark "Read" "XML.Node" read($file, Node)
-@add_benchmark "Read" "EzXML.readxml" EzXML.readxml($file)
-@add_benchmark "Read" "XMLDict.xml_dict" XMLDict.xml_dict(read($file, String))
-
-#-----------------------------------------------------------------------------# Lazy Iteration
-@add_benchmark "Lazy Iteration" "LazyNode" for x in read($file, LazyNode); end
-@add_benchmark "Lazy Iteration" "EzXML.StreamReader" (reader = open(EzXML.StreamReader, $file); for x in reader; end; close(reader))
-
-#-----------------------------------------------------------------------------# Lazy Iteration: Collect Tags
-@add_benchmark "Collect Tags" "LazyNode" [tag(x) for x in o] setup = (o = read(file, LazyNode))
-@add_benchmark "Collect Tags" "EzXML.StreamReader" [r.name for x in r if x == EzXML.READER_ELEMENT] setup=(r=open(EzXML.StreamReader, file)) teardown=(close(r))
-
-function get_tags(o::EzXML.Node)
- out = String[]
- for node in EzXML.eachelement(o)
- push!(out, node.name)
- for tag in get_tags(node)
- push!(out, tag)
- end
- end
- out
-end
-@add_benchmark "Collect Tags" "EzXML.readxml" get_tags(o.root) setup=(o = EzXML.readxml(file))
-
-
-#-----------------------------------------------------------------------------# Plots
-function plot(df, kind)
- g = groupby(df, :kind)
- sub = g[(;kind)]
- x = map(row -> "$(row.name)", eachrow(sub))
- y = map(x -> median(x).time / 1000^2, sub.bench)
- display(barplot(x, y, title = "$kind Time (ms)", border=:none, width=50))
-end
-
-plot(df, "Read")
-plot(df, "Write")
-plot(df, "Lazy Iteration")
-plot(df, "Collect Tags")
diff --git a/src/XML.jl b/src/XML.jl
index 273bfda..8954140 100644
--- a/src/XML.jl
+++ b/src/XML.jl
@@ -1,31 +1,72 @@
module XML
-using Mmap
-using OrderedCollections: OrderedDict
+include("tokenizer.jl")
+using .XMLTokenizer
export
- # Core Types:
- Node, LazyNode,
- # Interface:
- children, nodetype, tag, attributes, value, is_simple, simplevalue, simple_value,
- # Extended Interface for LazyNode:
- parent, depth, next, prev
+ Node, NodeType,
+ CData, Comment, Declaration, Document, DTD, Element, ProcessingInstruction, Text,
+ nodetype, tag, attributes, value, children,
+ is_simple, simple_value,
+ depth, siblings,
+ xpath,
+ h
#-----------------------------------------------------------------------------# escape/unescape
-const escape_chars = ('&' => "&amp;", '<' => "&lt;", '>' => "&gt;", "'" => "&apos;", '"' => "&quot;")
+const escape_chars = ('&' => "&amp;", '<' => "&lt;", '>' => "&gt;", '\'' => "&apos;", '"' => "&quot;")
+
+"""
+ escape(x::AbstractString) -> String
+
+Escape the five XML predefined entities: `&` `<` `>` `'` `"`.
+
+!!! note "Changed in v0.4"
+ `escape` is no longer idempotent. In previous versions, already-escaped sequences like
+ `&amp;` were left untouched. Now every `&` is escaped, so `escape("&amp;")` produces
+ `"&amp;amp;"`. Call `escape` only on raw, unescaped text.
+"""
+escape(x::AbstractString) = replace(x, escape_chars...)
+
function unescape(x::AbstractString)
- result = x
- for (pat, r) in reverse.(escape_chars)
- result = replace(result, pat => r)
- end
- return result
-end
-function escape(x::String)
- result = replace(x, r"&(?!amp;|quot;|apos;|gt;|lt;)" => "&amp;")
- for (pat, r) in escape_chars[2:end]
- result = replace(result, pat => r)
+ occursin('&', x) || return string(x)
+ s = string(x)
+ io = IOBuffer(sizehint=ncodeunits(s))
+ i = 1
+ while i <= ncodeunits(s)
+ if s[i] == '&'
+ j = findnext(';', s, i + 1)
+ if !isnothing(j)
+ ref = SubString(s, i, j)
+ if ref == "&amp;"
+ print(io, '&')
+ elseif ref == "&lt;"
+ print(io, '<')
+ elseif ref == "&gt;"
+ print(io, '>')
+ elseif ref == "&apos;"
+ print(io, '\'')
+ elseif ref == "&quot;"
+ print(io, '"')
+ elseif startswith(ref, "&#")
+ is_hex = length(ref) > 3 && (ref[3] == 'x' || ref[3] == 'X')
+ digits = SubString(s, i + (is_hex ? 3 : 2), j - 1)
+ cp = tryparse(UInt32, digits; base = is_hex ? 16 : 10)
+ if !isnothing(cp) && isvalid(Char, cp)
+ print(io, Char(cp))
+ else
+ print(io, ref)
+ end
+ else
+ print(io, ref)
+ end
+ i = j + 1
+ continue
+ end
+ end
+ print(io, s[i])
+ i = nextind(s, i)
end
- return result
+ String(take!(io))
end
#-----------------------------------------------------------------------------# NodeType
@@ -34,9 +75,9 @@ end
- Document # prolog & root Element
- DTD #
- Declaration #
- - ProcessingInstruction #
+ - ProcessingInstruction #
- Comment #
- - CData #
+ - CData #
- Element # children...
- Text # text
@@ -45,381 +86,1261 @@ NodeTypes can be used to construct XML.Nodes:
Document(children...)
DTD(value)
Declaration(; attributes)
- ProcessingInstruction(tag, attributes)
+ ProcessingInstruction(tag, content)
Comment(text)
CData(text)
Element(tag, children...; attributes)
Text(text)
"""
-@enum(NodeType, CData, Comment, Declaration, Document, DTD, Element, ProcessingInstruction, Text)
+@enum NodeType::UInt8 CData Comment Declaration Document DTD Element ProcessingInstruction Text
+#-----------------------------------------------------------------------------# Node
+struct Node{S}
+ nodetype::NodeType
+ tag::Union{Nothing, S}
+ attributes::Union{Nothing, Vector{Pair{S, S}}}
+ value::Union{Nothing, S}
+ children::Union{Nothing, Vector{Node{S}}}
-#-----------------------------------------------------------------------------# includes
-include("raw.jl")
-include("dtd.jl")
+ function Node{S}(nodetype::NodeType, tag, attributes, value, children) where {S}
+ if nodetype in (Text, Comment, CData, DTD)
+ isnothing(tag) && isnothing(attributes) && !isnothing(value) && isnothing(children) ||
+ error("$nodetype nodes only accept a value.")
+ elseif nodetype === Element
+ !isnothing(tag) && isnothing(value) ||
+ error("Element nodes require a tag and no value.")
+ elseif nodetype === Declaration
+ isnothing(tag) && isnothing(value) && isnothing(children) ||
+ error("Declaration nodes only accept attributes.")
+ elseif nodetype === ProcessingInstruction
+ !isnothing(tag) && isnothing(attributes) && isnothing(children) ||
+ error("ProcessingInstruction nodes require a tag and only accept a value.")
+ elseif nodetype === Document
+ isnothing(tag) && isnothing(attributes) && isnothing(value) ||
+ error("Document nodes only accept children.")
+ end
+ new{S}(nodetype, tag, attributes, value, children)
+ end
+end
-abstract type AbstractXMLNode end
+#-----------------------------------------------------------------------------# interface
+nodetype(o::Node) = o.nodetype
+tag(o::Node) = o.tag
-#-----------------------------------------------------------------------------# LazyNode
"""
- LazyNode(file::AbstractString)
- LazyNode(data::XML.Raw)
+ attributes(node::Node) -> Union{Nothing, Dict{String, String}}
-A Lazy representation of an XML node.
+Return the attributes of an `Element` or `Declaration` node as a `Dict`, or `nothing` if the
+node has no attributes.
+
+!!! note "Changed in v0.4"
+ In previous versions, `attributes` returned an `OrderedDict` from OrderedCollections.jl.
+ It now returns a standard `Dict`. Attribute order is preserved internally but not exposed
+ by this function. Use `node["key"]` for key-based access and `keys(node)` for ordered keys.
"""
-mutable struct LazyNode <: AbstractXMLNode
- raw::Raw
- tag::Union{Nothing, String}
- attributes::Union{Nothing, OrderedDict{String, String}}
- value::Union{Nothing, String}
-end
-LazyNode(raw::Raw) = LazyNode(raw, nothing, nothing, nothing)
+attributes(o::Node) = isnothing(o.attributes) ? nothing : Dict(o.attributes)
-function Base.getproperty(o::LazyNode, x::Symbol)
- x === :raw && return getfield(o, :raw)
- x === :nodetype && return nodetype(o.raw)
- x === :tag && return isnothing(getfield(o, x)) ? setfield!(o, x, tag(o.raw)) : getfield(o, x)
- x === :attributes && return isnothing(getfield(o, x)) ? setfield!(o, x, attributes(o.raw)) : getfield(o, x)
- x === :value && return isnothing(getfield(o, x)) ? setfield!(o, x, value(o.raw)) : getfield(o, x)
- x === :depth && return depth(o.raw)
- x === :children && return LazyNode.(children(o.raw))
- error("type LazyNode has no field $(x)")
-end
-Base.propertynames(o::LazyNode) = (:raw, :nodetype, :tag, :attributes, :value, :depth, :children)
+value(o::Node) = o.value
+children(o::Node) = something(o.children, ())
+
+is_simple(o::Node) = o.nodetype === Element &&
+ (isnothing(o.attributes) || isempty(o.attributes)) &&
+ !isnothing(o.children) && length(o.children) == 1 &&
+ o.children[1].nodetype in (Text, CData)
-Base.show(io::IO, o::LazyNode) = _show_node(io, o)
+simple_value(o::Node) = is_simple(o) ? o.children[1].value :
+ error("`simple_value` is only defined for simple nodes.")
-Base.read(io::IO, ::Type{LazyNode}) = LazyNode(read(io, Raw))
-Base.read(filename::AbstractString, ::Type{LazyNode}) = LazyNode(read(filename, Raw))
-Base.parse(x::AbstractString, ::Type{LazyNode}) = LazyNode(parse(x, Raw))
+#-----------------------------------------------------------------------------# tree navigation
-children(o::LazyNode) = LazyNode.(children(o.raw))
-parent(o::LazyNode) = LazyNode(parent(o.raw))
-depth(o::LazyNode) = depth(o.raw)
+"""
+ parent(child::Node, root::Node) -> Node
+
+Return the parent of `child` within the tree rooted at `root`.
-Base.IteratorSize(::Type{LazyNode}) = Base.SizeUnknown()
-Base.eltype(::Type{LazyNode}) = LazyNode
+Since `Node` does not store parent pointers, this performs a tree search from `root`.
+Throws an error if `child` is not found or if `child === root`.
+"""
+function Base.parent(child::Node, root::Node)
+ child === root && error("Root node has no parent.")
+ result = _find_parent(child, root)
+ isnothing(result) && error("Node not found in tree.")
+ result
+end
-function Base.iterate(o::LazyNode, state=o)
- n = next(state)
- return isnothing(n) ? nothing : (n, n)
+function _find_parent(child::Node, current::Node)
+ for c in children(current)
+ c === child && return current
+ result = _find_parent(child, c)
+ isnothing(result) || return result
+ end
+ nothing
end
-function next(o::LazyNode)
- n = next(o.raw)
- isnothing(n) && return nothing
- n.type === RawElementClose ? next(LazyNode(n)) : LazyNode(n)
+"""
+ depth(child::Node, root::Node) -> Int
+
+Return the depth of `child` within the tree rooted at `root` (root has depth 0).
+
+Since `Node` does not store parent pointers, this performs a tree search from `root`.
+Throws an error if `child` is not found in the tree.
+"""
+function depth(child::Node, root::Node)
+ child === root && return 0
+ result = _find_depth(child, root, 0)
+ isnothing(result) && error("Node not found in tree.")
+ result
end
-function prev(o::LazyNode)
- n = prev(o.raw)
- isnothing(n) && return nothing
- n.type === RawElementClose ? prev(LazyNode(n)) : LazyNode(n)
+
+function _find_depth(child::Node, current::Node, d::Int)
+ for c in children(current)
+ c === child && return d + 1
+ result = _find_depth(child, c, d + 1)
+ isnothing(result) || return result
+ end
+ nothing
end
-#-----------------------------------------------------------------------------# Node
"""
- Node(nodetype, tag, attributes, value, children)
- Node(node::Node; kw...) # copy node with keyword overrides
- Node(node::LazyNode) # un-lazy the LazyNode
+ siblings(child::Node, root::Node) -> Vector{Node}
-A representation of an XML DOM node. For simpler construction, use `(::NodeType)(args...)`
+Return the siblings of `child` (other children of the same parent) within the tree rooted
+at `root`. The returned vector does not include `child` itself.
+
+Throws an error if `child` is the root or is not found in the tree.
"""
-struct Node <: AbstractXMLNode
- nodetype::NodeType
- tag::Union{Nothing, String}
- attributes::Union{Nothing, OrderedDict{String, String}}
- value::Union{Nothing, String}
- children::Union{Nothing, Vector{Node}}
-
- function Node(nodetype::NodeType, tag=nothing, attributes=nothing, value=nothing, children=nothing)
- new(nodetype,
- isnothing(tag) ? nothing : string(tag),
- isnothing(attributes) ? nothing : OrderedDict(string(k) => string(v) for (k, v) in pairs(attributes)),
- isnothing(value) ? nothing : string(value),
- isnothing(children) ? nothing :
- children isa Node ? [children] :
- children isa Vector{Node} ? children :
- children isa Vector ? map(Node, children) :
- children isa Tuple ? map(Node, collect(children)) :
- [Node(children)]
- )
+function siblings(child::Node, root::Node)
+ p = parent(child, root)
+ [c for c in children(p) if c !== child]
+end
+
+include("xpath.jl")
+
+#-----------------------------------------------------------------------------# _to_node
+_to_node(n::Node{String}) = n
+_to_node(n::Node) = throw(ArgumentError("Expected Node{String}, got $(typeof(n))"))
+_to_node(x) = Node{String}(Text, nothing, nothing, string(x), nothing)
+
+#-----------------------------------------------------------------------------# NodeType constructors
+function (T::NodeType)(args...; attrs...)
+ S = String
+ if T in (Text, Comment, CData, DTD)
+ length(args) == 1 || error("$T nodes require exactly one value argument.")
+ !isempty(attrs) && error("$T nodes do not accept attributes.")
+ Node{S}(T, nothing, nothing, string(only(args)), nothing)
+ elseif T === Element
+ isempty(args) && error("Element nodes require at least a tag.")
+ t = string(first(args))
+ a = Pair{S,S}[String(k) => String(v) for (k, v) in pairs(attrs)]
+ c = Node{S}[_to_node(x) for x in args[2:end]]
+ Node{S}(T, t, a, nothing, c)
+ elseif T === Declaration
+ !isempty(args) && error("Declaration nodes only accept keyword attributes.")
+ a = isempty(attrs) ? nothing : [String(k) => String(v) for (k, v) in pairs(attrs)]
+ Node{S}(T, nothing, a, nothing, nothing)
+ elseif T === ProcessingInstruction
+ length(args) >= 1 || error("ProcessingInstruction nodes require a target.")
+ length(args) <= 2 || error("ProcessingInstruction nodes accept a target and optional content.")
+ !isempty(attrs) && error("ProcessingInstruction nodes do not accept attributes.")
+ t = string(args[1])
+ v = length(args) == 2 ? string(args[2]) : nothing
+ Node{S}(T, t, nothing, v, nothing)
+ elseif T === Document
+ !isempty(attrs) && error("Document nodes do not accept attributes.")
+ c = Node{S}[_to_node(x) for x in args]
+ Node{S}(T, nothing, nothing, nothing, c)
end
end
-function Node(o::Node, x...; kw...)
- attrs = !isnothing(kw) ?
- merge(
- OrderedDict(string(k) => string(v) for (k, v) in pairs(kw)),
- isnothing(o.attributes) ? OrderedDict{String,String}() : o.attributes
- ) :
- o.attributes
- children = isempty(x) ? o.children : vcat(isnothing(o.children) ? [] : o.children, collect(x))
- Node(o.nodetype, o.tag, attrs, o.value, children)
+#-----------------------------------------------------------------------------# equality
+_eq(::Nothing, ::Nothing) = true
+_eq(::Nothing, b) = isempty(b)
+_eq(a, ::Nothing) = isempty(a)
+_eq(a, b) = a == b
+
+# Attribute equality is order-insensitive per XML spec
+function _attrs_eq(a, b)
+ a_empty = isnothing(a) || isempty(a)
+ b_empty = isnothing(b) || isempty(b)
+ a_empty && b_empty && return true
+ (a_empty != b_empty) && return false
+ length(a) != length(b) && return false
+ for p in a
+ p in b || return false
+ end
+ true
end
-function Node(node::LazyNode)
- nodetype = node.nodetype
- tag = node.tag
- attributes = node.attributes
- value = node.value
- c = XML.children(node)
- Node(nodetype, tag, attributes, value, isempty(c) ? nothing : map(Node, c))
+function Base.:(==)(a::Node, b::Node)
+ a.nodetype == b.nodetype &&
+ a.tag == b.tag &&
+ _attrs_eq(a.attributes, b.attributes) &&
+ a.value == b.value &&
+ _eq(a.children, b.children)
end
-Node(data::Raw) = Node(LazyNode(data))
+#-----------------------------------------------------------------------------# indexing
+Base.getindex(o::Node, i::Integer) = children(o)[i]
+Base.getindex(o::Node, ::Colon) = children(o)
+Base.lastindex(o::Node) = lastindex(children(o))
+Base.only(o::Node) = only(children(o))
+Base.length(o::Node) = length(children(o))
-# Anything that's not Vector{UInt8} or a (Lazy)Node is converted to a Text Node
-Node(x) = Node(Text, nothing, nothing, string(x), nothing)
+function Base.get(o::Node, key::AbstractString, default)
+ isnothing(o.attributes) && return default
+ for (k, v) in o.attributes
+ k == key && return v
+ end
+ default
+end
-h(tag::Union{Symbol, String}, children...; kw...) = Node(Element, tag, kw, nothing, children)
-Base.getproperty(::typeof(h), tag::Symbol) = h(tag)
-(o::Node)(children...; kw...) = Node(o, Node.(children)...; kw...)
+const _MISSING_ATTR = gensym(:missing_attr)
-# NOT in-place for Text Nodes
-function escape!(o::Node, warn::Bool=true)
- if o.nodetype == Text
- warn && @warn "escape!() called on a Text Node creates a new node."
- return Text(escape(o.value))
+function Base.getindex(o::Node, key::AbstractString)
+ val = get(o, key, _MISSING_ATTR)
+ val === _MISSING_ATTR && throw(KeyError(key))
+ val
+end
+
+function Base.haskey(o::Node, key::AbstractString)
+ get(o, key, _MISSING_ATTR) !== _MISSING_ATTR
+end
+
+Base.keys(o::Node) = isnothing(o.attributes) ? () : first.(o.attributes)
+
+#-----------------------------------------------------------------------------# mutation
+function Base.setindex!(o::Node, val, i::Integer)
+ isnothing(o.children) && error("Node has no children.")
+ o.children[i] = _to_node(val)
+end
+
+function Base.setindex!(o::Node, val, key::AbstractString)
+ isnothing(o.attributes) && error("Node has no attributes.")
+ v = string(val)
+ for i in eachindex(o.attributes)
+ if first(o.attributes[i]) == key
+ o.attributes[i] = key => v
+ return v
+ end
end
- isnothing(o.children) && return o
- map!(x -> escape!(x, false), o.children, o.children)
- o
+ push!(o.attributes, key => v)
+ v
end
-function unescape!(o::Node, warn::Bool=true)
- if o.nodetype == Text
- warn && @warn "unescape!() called on a Text Node creates a new node."
- return Text(unescape(o.value))
+
+function Base.push!(a::Node, b)
+ isnothing(a.children) && error("Node does not accept children.")
+ push!(a.children, _to_node(b))
+ a
+end
+
+function Base.pushfirst!(a::Node, b)
+ isnothing(a.children) && error("Node does not accept children.")
+ pushfirst!(a.children, _to_node(b))
+ a
+end
+
+#-----------------------------------------------------------------------------# show (REPL)
+function Base.show(io::IO, o::Node)
+ nt = o.nodetype
+ printstyled(io, nt; color=:light_green)
+ if nt === Text
+ printstyled(io, ' ', repr(o.value))
+ elseif nt === Element
+ printstyled(io, " <", o.tag; color=:light_cyan)
+ if !isnothing(o.attributes)
+ for (k, v) in o.attributes
+ print(io, ' ', k, '=', '"', v, '"')
+ end
+ end
+ printstyled(io, '>'; color=:light_cyan)
+ n = length(children(o))
+ n > 0 && printstyled(io, n == 1 ? " (1 child)" : " ($n children)"; color=:light_black)
+ elseif nt === DTD
+ printstyled(io, " <!DOCTYPE ", o.value, '>'; color=:light_cyan)
+ elseif nt === Declaration
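+ printstyled(io, " <?xml"; color=:light_cyan)
+ if !isnothing(o.attributes)
+ for (k, v) in o.attributes
+ print(io, ' ', k, '=', '"', v, '"')
+ end
+ end
+ printstyled(io, "?>"; color=:light_cyan)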
+ elseif nt === ProcessingInstruction
+ printstyled(io, " <?", o.tag; color=:light_cyan)
+ !isnothing(o.value) && print(io, ' ', o.value)
+ printstyled(io, "?>"; color=:light_cyan)
+ elseif nt === Comment
+ printstyled(io, " <!--", o.value, "-->"; color=:light_cyan)
+ elseif nt === CData
+ printstyled(io, " <![CDATA[", o.value, "]]>"; color=:light_cyan)
+ elseif nt === Document
+ n = length(children(o))
+ n > 0 && printstyled(io, n == 1 ? " (1 child)" : " ($n children)"; color=:light_black)
end
- isnothing(o.children) && return o
- map!(x -> unescape!(x, false), o.children, o.children)
- o
end
+#-----------------------------------------------------------------------------# show (text/xml)
+function _print_attrs(io::IO, attributes)
+ isnothing(attributes) && return
+ for (k, v) in attributes
+ print(io, ' ', k, '=', '"', escape(v), '"')
+ end
+end
-Base.read(filename::AbstractString, ::Type{Node}) = Node(read(filename, Raw))
-Base.read(io::IO, ::Type{Node}) = Node(read(io, Raw))
-Base.parse(x::AbstractString, ::Type{Node}) = Node(parse(x, Raw))
+function _write_xml(io::IO, node::Node, depth::Int=0, indent::Int=2, preserve::Bool=false)
+ pad = preserve ? "" : ' ' ^ (indent * depth)
+ nt = node.nodetype
+ if nt === Text
+ print(io, escape(node.value))
+ elseif nt === Element
+ # Check xml:space on this element
+ child_preserve = preserve
+ if !isnothing(node.attributes)
+ for (k, v) in node.attributes
+ k == "xml:space" && (child_preserve = v == "preserve")
+ end
+ end
+ print(io, pad, '<', node.tag)
+ _print_attrs(io, node.attributes)
+ ch = node.children
+ if isnothing(ch) || isempty(ch)
+ print(io, "/>")
+ elseif length(ch) == 1 && only(ch).nodetype === Text
+ print(io, '>')
+ _write_xml(io, only(ch), 0, 0, child_preserve)
+ print(io, "</", node.tag, '>')
+ else
+ child_preserve ? print(io, '>') : println(io, '>')
+ for child in ch
+ _write_xml(io, child, depth + 1, indent, child_preserve)
+ child_preserve || println(io)
+ end
+ print(io, child_preserve ? "" : pad, "</", node.tag, '>')
+ end
+ elseif nt === Declaration
+ print(io, pad, "<?xml")
+ _print_attrs(io, node.attributes)
+ print(io, "?>")
+ elseif nt === ProcessingInstruction
+ print(io, pad, "<?", node.tag)
+ isnothing(node.value) || print(io, ' ', node.value)
+ print(io, "?>")
+ elseif nt === Comment
+ print(io, pad, "<!--", node.value, "-->")
+ elseif nt === CData
+ print(io, pad, "<![CDATA[", node.value, "]]>")
+ elseif nt === DTD
+ print(io, pad, "<!DOCTYPE ", node.value, '>')
+ elseif nt === Document
+ ch = node.children
+ if !isnothing(ch)
+ for (i, child) in enumerate(ch)
+ _write_xml(io, child, 0, indent, preserve)
+ i < length(ch) && println(io)
+ end
+ end
+ end
+end
-Base.setindex!(o::Node, val, i::Integer) = o.children[i] = Node(val)
-Base.push!(a::Node, b::Node) = push!(a.children, b)
-Base.pushfirst!(a::Node, b::Node) = pushfirst!(a.children, b)
+Base.show(io::IO, ::MIME"text/xml", node::Node) = _write_xml(io, node)
-Base.setindex!(o::Node, val, key::AbstractString) = (o.attributes[key] = string(val))
-Base.getindex(o::Node, val::AbstractString) = o.attributes[val]
-Base.haskey(o::Node, key::AbstractString) = isnothing(o.attributes) ? false : haskey(o.attributes, key)
-Base.keys(o::Node) = isnothing(o.attributes) ? () : keys(o.attributes)
+#-----------------------------------------------------------------------------# write / read
+write(node::Node; indentsize::Int=2) = (io = IOBuffer(); _write_xml(io, node, 0, indentsize); String(take!(io)))
+write(filename::AbstractString, node::Node; kw...) = open(io -> write(io, node; kw...), filename, "w")
+write(io::IO, node::Node; indentsize::Int=2) = _write_xml(io, node, 0, indentsize)
-Base.show(io::IO, o::Node) = _show_node(io, o)
+Base.read(filename::AbstractString, ::Type{Node}) = parse(read(filename, String), Node)
+Base.read(io::IO, ::Type{Node}) = parse(read(io, String), Node)
-#-----------------------------------------------------------------------------# Node Constructors
-function (T::NodeType)(args...; attr...)
- if T === Document
- !isempty(attr) && error("Document nodes do not have attributes.")
- Node(T, nothing, nothing, nothing, args)
- elseif T === DTD
- !isempty(attr) && error("DTD nodes only accept a value.")
- length(args) > 1 && error("DTD nodes only accept a value.")
- Node(T, nothing, nothing, only(args))
- elseif T === Declaration
- !isempty(args) && error("Declaration nodes only accept attributes")
- Node(T, nothing, attr)
- elseif T === ProcessingInstruction
- length(args) == 1 || error("ProcessingInstruction nodes require a tag and attributes.")
- Node(T, only(args), attr)
- elseif T === Comment
- !isempty(attr) && error("Comment nodes do not have attributes.")
- length(args) > 1 && error("Comment nodes only accept a single input.")
- Node(T, nothing, nothing, only(args))
- elseif T === CData
- !isempty(attr) && error("CData nodes do not have attributes.")
- length(args) > 1 && error("CData nodes only accept a single input.")
- Node(T, nothing, nothing, only(args))
- elseif T === Text
- !isempty(attr) && error("Text nodes do not have attributes.")
- length(args) > 1 && error("Text nodes only accept a single input.")
- Node(T, nothing, nothing, only(args))
- elseif T === Element
- tag = first(args)
- Node(T, tag, attr, nothing, args[2:end])
- else
- error("Unreachable reached while trying to create a Node via (::NodeType)(args...; kw...).")
+#-----------------------------------------------------------------------------# parse
+Base.parse(::Type{Node}, xml::AbstractString) = parse(xml, Node)
+
+function Base.parse(xml::AbstractString, ::Type{Node})
+ _parse(String(xml), String, unescape)
+end
+
+function Base.parse(xml::AbstractString, ::Type{Node{SubString{String}}})
+ _parse(String(xml), SubString{String}, identity)
+end
+
+_to(::Type{String}, s::AbstractString) = String(s)
+_to(::Type{SubString{String}}, s::SubString{String}) = s
+
+function _parse(xml::String, ::Type{S}, convert_text::F) where {S, F}
+ tags = S[]
+ attrs_stack = Vector{Union{Nothing, Vector{Pair{S,S}}}}()
+ children_stack = Vector{Vector{Node{S}}}()
+ push!(children_stack, Node{S}[])
+
+ pending_attr_name = SubString(xml, 1, 0)
+ decl_attrs = nothing
+ pending_pi_tag = SubString(xml, 1, 0)
+ pending_pi_value = nothing
+ in_close_tag = false
+
+ for token in tokenize(xml)
+ k = token.kind
+
+ if k === TOKEN_TEXT
+ push!(last(children_stack), Node{S}(Text, nothing, nothing, convert_text(token.raw), nothing))
+
+ elseif k === TOKEN_OPEN_TAG
+ push!(tags, _to(S, tag_name(token)))
+ push!(attrs_stack, nothing)
+ push!(children_stack, Node{S}[])
+
+ elseif k === TOKEN_SELF_CLOSE
+ t = pop!(tags)
+ a = pop!(attrs_stack)
+ pop!(children_stack)
+ push!(last(children_stack), Node{S}(Element, t, a, nothing, nothing))
+
+ elseif k === TOKEN_TAG_CLOSE
+ in_close_tag && (in_close_tag = false)
+
+ elseif k === TOKEN_CLOSE_TAG
+ close_name = tag_name(token)
+ isempty(tags) && error("Closing tag </$close_name> with no matching open tag.")
+ t = pop!(tags)
+ t == close_name || error("Mismatched tags: expected </$t>, got </$close_name>.")
+ a = pop!(attrs_stack)
+ c = pop!(children_stack)
+ push!(last(children_stack), Node{S}(Element, t, a, nothing, isempty(c) ? nothing : c))
+ in_close_tag = true
+
+ elseif k === TOKEN_ATTR_NAME
+ pending_attr_name = token.raw
+
+ elseif k === TOKEN_ATTR_VALUE
+ val = convert_text(attr_value(token))
+ name = _to(S, pending_attr_name)
+ if decl_attrs !== nothing
+ any(p -> first(p) == name, decl_attrs) && error("Duplicate attribute: $name")
+ push!(decl_attrs, name => val)
+ elseif !isempty(attrs_stack)
+ if isnothing(last(attrs_stack))
+ attrs_stack[end] = Pair{S,S}[]
+ end
+ any(p -> first(p) == name, last(attrs_stack)) && error("Duplicate attribute: $name")
+ push!(last(attrs_stack), name => val)
+ end
+
+ elseif k === TOKEN_XML_DECL_OPEN
+ decl_attrs = Pair{S,S}[]
+
+ elseif k === TOKEN_XML_DECL_CLOSE
+ a = isempty(decl_attrs) ? nothing : decl_attrs
+ push!(last(children_stack), Node{S}(Declaration, nothing, a, nothing, nothing))
+ decl_attrs = nothing
+
+ elseif k === TOKEN_COMMENT_CONTENT
+ push!(last(children_stack), Node{S}(Comment, nothing, nothing, _to(S, token.raw), nothing))
+
+ elseif k === TOKEN_CDATA_CONTENT
+ push!(last(children_stack), Node{S}(CData, nothing, nothing, _to(S, token.raw), nothing))
+
+ elseif k === TOKEN_DOCTYPE_CONTENT
+ push!(last(children_stack), Node{S}(DTD, nothing, nothing, _to(S, lstrip(token.raw)), nothing))
+
+ elseif k === TOKEN_PI_OPEN
+ pending_pi_tag = pi_target(token)
+ pending_pi_value = nothing
+
+ elseif k === TOKEN_PI_CONTENT
+ content = strip(token.raw)
+ pending_pi_value = isempty(content) ? nothing : _to(S, content)
+
+ elseif k === TOKEN_PI_CLOSE
+ push!(last(children_stack), Node{S}(ProcessingInstruction, _to(S, pending_pi_tag), nothing, pending_pi_value, nothing))
+ end
end
+
+ !isempty(tags) && error("Unclosed tags: $(join(tags, ", "))")
+ doc_children = only(children_stack)
+ Node{S}(Document, nothing, nothing, nothing, isempty(doc_children) ? nothing : doc_children)
end
-#-----------------------------------------------------------------------------# !!! common !!!
-# Everything below here is common to all data structures
+#-----------------------------------------------------------------------------# h (HTML/XML element builder)
+"""
+ h(tag, children...; attrs...)
+ h.tag(children...; attrs...)
+Convenience constructor for `Element` nodes.
-#-----------------------------------------------------------------------------# interface fallbacks
-nodetype(o) = o.nodetype
-tag(o) = o.tag
-attributes(o) = o.attributes
-value(o) = o.value
-children(o::T) where {T} = isnothing(o.children) ? () : o.children
+ h("div", "hello"; class="main") # <div class="main">hello</div>
+ h.div("hello"; class="main") # same thing
+"""
+function h(tag::Union{Symbol, AbstractString}, children...; attrs...)
+ t = String(tag)
+ a = Pair{String,String}[String(k) => String(v) for (k, v) in pairs(attrs)]
+ c = Node{String}[_to_node(x) for x in children]
+ Node{String}(Element, t, a, nothing, c)
+end
-depth(o) = missing
-parent(o) = missing
-next(o) = missing
-prev(o) = missing
+Base.getproperty(::typeof(h), tag::Symbol) = h(tag)
-is_simple(o) = nodetype(o) == Element && (isnothing(attributes(o)) || isempty(attributes(o))) &&
- length(children(o)) == 1 && nodetype(only(o)) in (Text, CData)
+function (o::Node)(args...; attrs...)
+ o.nodetype === Element || error("Only Element nodes are callable.")
+ old_children = something(o.children, ())
+ old_attrs = isnothing(o.attributes) ? () : (Symbol(k) => v for (k, v) in o.attributes)
+ h(o.tag, old_children..., args...; old_attrs..., attrs...)
+end
-simple_value(o) = is_simple(o) ? value(only(o)) : error("`XML.simple_value` is only defined for simple nodes.")
+#-----------------------------------------------------------------------------# DTD parsing
+struct ElementDecl
+ name::String
+ content::String # "EMPTY", "ANY", or content model like "(#PCDATA)" or "(a,b,c)*"
+end
-Base.@deprecate_binding simplevalue simple_value
+struct AttDecl
+ element::String
+ name::String
+ type::String # "CDATA", "ID", "(val1|val2)", "NOTATION (a|b)", etc.
+ default::String # "#REQUIRED", "#IMPLIED", "#FIXED \"val\"", or "\"val\""
+end
-#-----------------------------------------------------------------------------# nodes_equal
-function nodes_equal(a, b)
- out = XML.tag(a) == XML.tag(b)
- out &= XML.nodetype(a) == XML.nodetype(b)
- out &= XML.attributes(a) == XML.attributes(b)
- out &= XML.value(a) == XML.value(b)
- out &= length(XML.children(a)) == length(XML.children(b))
- out &= all(nodes_equal(ai, bi) for (ai,bi) in zip(XML.children(a), XML.children(b)))
- return out
+struct EntityDecl
+ name::String
+ value::Union{Nothing, String} # replacement text (internal entities)
+ external_id::Union{Nothing, String} # "SYSTEM \"uri\"" or "PUBLIC \"pubid\" \"uri\""
+ parameter::Bool
end
-Base.:(==)(a::AbstractXMLNode, b::AbstractXMLNode) = nodes_equal(a, b)
+struct NotationDecl
+ name::String
+ external_id::String
+end
-#-----------------------------------------------------------------------------# parse
-Base.parse(::Type{T}, str::AbstractString) where {T <: AbstractXMLNode} = parse(str, T)
+struct ParsedDTD
+ root::String
+ system_id::Union{Nothing, String}
+ public_id::Union{Nothing, String}
+ elements::Vector{ElementDecl}
+ attributes::Vector{AttDecl}
+ entities::Vector{EntityDecl}
+ notations::Vector{NotationDecl}
+end
-#-----------------------------------------------------------------------------# indexing
-Base.getindex(o::Union{Raw, AbstractXMLNode}) = o
-Base.getindex(o::Union{Raw, AbstractXMLNode}, i::Integer) = children(o)[i]
-Base.getindex(o::Union{Raw, AbstractXMLNode}, ::Colon) = children(o)
-Base.lastindex(o::Union{Raw, AbstractXMLNode}) = lastindex(children(o))
-
-Base.only(o::Union{Raw, AbstractXMLNode}) = only(children(o))
-
-Base.length(o::AbstractXMLNode) = length(children(o))
-
-#-----------------------------------------------------------------------------# printing
-function _show_node(io::IO, o)
- printstyled(io, typeof(o), ' '; color=:light_black)
- !ismissing(depth(o)) && printstyled(io, "(depth=", depth(o), ") ", color=:light_black)
- printstyled(io, nodetype(o), ; color=:light_green)
- if o.nodetype === Text
- printstyled(io, ' ', repr(value(o)))
- elseif o.nodetype === Element
- printstyled(io, " <", tag(o), color=:light_cyan)
- _print_attrs(io, o; color=:light_yellow)
- printstyled(io, '>', color=:light_cyan)
- _print_n_children(io, o)
- elseif o.nodetype === DTD
- printstyled(io, " <!DOCTYPE ", value(o), '>', color=:light_cyan)
- elseif o.nodetype === Declaration
- printstyled(io, " <?xml", color=:light_cyan)
- _print_attrs(io, o; color=:light_yellow)
- printstyled(io, "?>", color=:light_cyan)
- elseif o.nodetype === ProcessingInstruction
- printstyled(io, " <?", tag(o), color=:light_cyan)
- _print_attrs(io, o; color=:light_yellow)
- printstyled(io, "?>", color=:light_cyan)
- elseif o.nodetype === Comment
- printstyled(io, " <!--", value(o), "-->", color=:light_cyan)
- elseif o.nodetype === CData
- printstyled(io, " <![CDATA[", value(o), "]]>", color=:light_cyan)
- elseif o.nodetype === Document
- _print_n_children(io, o)
- elseif o.nodetype === UNKNOWN
- printstyled(io, "Unknown", color=:light_cyan)
- _print_n_children(io, o)
- else
- error("Unreachable reached")
+# DTD parsing helpers
+@inline _dtd_is_name_char(c::Char) =
+ ('a' <= c <= 'z') || ('A' <= c <= 'Z') || ('0' <= c <= '9') ||
+ c == '_' || c == '-' || c == '.' || c == ':'
+
+function _dtd_skip_ws(s, pos)
+ while pos <= ncodeunits(s) && isspace(s[pos])
+ pos += 1
end
+ pos
end
-function _print_attrs(io::IO, o; color=:normal)
- attr = attributes(o)
- isnothing(attr) && return nothing
- for (k,v) in attr
- # printstyled(io, ' ', k, '=', '"', v, '"'; color)
- print(io, ' ', k, '=', '"', v, '"')
+function _dtd_read_name(s, pos)
+ pos = _dtd_skip_ws(s, pos)
+ start = pos
+ while pos <= ncodeunits(s) && _dtd_is_name_char(s[pos])
+ pos += 1
end
+ start == pos && error("Expected name at position $pos in DTD")
+ SubString(s, start, pos - 1), pos
end
-function _print_n_children(io::IO, o::Node)
- n = length(children(o))
- text = n == 0 ? "" : n == 1 ? " (1 child)" : " ($n children)"
- printstyled(io, text, color=:light_black)
-end
-_print_n_children(io::IO, o) = nothing
-
-#-----------------------------------------------------------------------------# write_xml
-write(x; kw...) = (io = IOBuffer(); write(io, x; kw...); String(take!(io)))
-
-write(filename::AbstractString, x; kw...) = open(io -> write(io, x; kw...), filename, "w")
-
-function write(io::IO, x, ctx::Vector{Bool}=[false]; indentsize::Int=2, depth::Int=1)
- indent = ' ' ^ indentsize
- nodetype = XML.nodetype(x)
- tag = XML.tag(x)
- value = XML.value(x)
- children = XML.children(x)
-
- padding = indent ^ max(0, depth - 1)
- !ctx[end] && print(io, padding)
-
- if nodetype === Text
- print(io, value)
-
- elseif nodetype === Element
- push!(ctx, ctx[end])
- update_ctx!(ctx, x)
- print(io, '<', tag)
- _print_attrs(io, x)
- print(io, isempty(children) ? '/' : "", '>')
- if !isempty(children)
- if length(children) == 1 && XML.nodetype(only(children)) === Text
- write(io, only(children), ctx; indentsize=0)
- print(io, "</", tag, '>')
- else
- !ctx[end] && println(io)
- foreach(children) do child
- write(io, child, ctx; indentsize, depth=depth + 1)
- !ctx[end] && println(io)
- end
- print(io, !ctx[end] ? padding : "", "</", tag, '>')
+
+function _dtd_read_quoted(s, pos)
+ pos = _dtd_skip_ws(s, pos)
+ q = s[pos]
+ (q == '"' || q == '\'') || error("Expected quoted string at position $pos in DTD")
+ pos += 1
+ start = pos
+ while pos <= ncodeunits(s) && s[pos] != q
+ pos += 1
+ end
+ val = SubString(s, start, pos - 1)
+ pos += 1
+ val, pos
+end
+
+function _dtd_read_parens(s, pos)
+ pos = _dtd_skip_ws(s, pos)
+ s[pos] == '(' || error("Expected '(' at position $pos in DTD")
+ depth = 1
+ start = pos
+ pos += 1
+ while pos <= ncodeunits(s) && depth > 0
+ c = s[pos]
+ if c == '('
+ depth += 1
+ elseif c == ')'
+ depth -= 1
+ elseif c == '"' || c == '\''
+ pos += 1
+ while pos <= ncodeunits(s) && s[pos] != c
+ pos += 1
end
end
- pop!(ctx)
+ pos += 1
+ end
+ SubString(s, start, pos - 1), pos
+end
- elseif nodetype === DTD
- print(io, "<!DOCTYPE ", value, '>')
+function _dtd_skip_to_close(s, pos)
+ while pos <= ncodeunits(s) && s[pos] != '>'
+ c = s[pos]
+ if c == '"' || c == '\''
+ pos += 1
+ while pos <= ncodeunits(s) && s[pos] != c
+ pos += 1
+ end
+ end
+ pos += 1
+ end
+ pos <= ncodeunits(s) ? pos + 1 : pos
+end
- elseif nodetype === Declaration
- print(io, "<?xml")
- _print_attrs(io, x)
- print(io, "?>")
+function _dtd_parse_element(s, pos)
+ name, pos = _dtd_read_name(s, pos)
+ pos = _dtd_skip_ws(s, pos)
+ if s[pos] == '('
+ content, pos = _dtd_read_parens(s, pos)
+ if pos <= ncodeunits(s) && s[pos] in ('*', '+', '?')
+ content = string(content, s[pos])
+ pos += 1
+ end
+ else
+ content, pos = _dtd_read_name(s, pos)
+ end
+ pos = _dtd_skip_to_close(s, pos)
+ ElementDecl(String(name), String(content)), pos
+end
- elseif nodetype === ProcessingInstruction
- print(io, "<?", tag)
- _print_attrs(io, x)
- print(io, "?>")
+function _dtd_parse_attlist(s, pos)
+ element, pos = _dtd_read_name(s, pos)
+ atts = AttDecl[]
+ while true
+ pos = _dtd_skip_ws(s, pos)
+ (pos > ncodeunits(s) || s[pos] == '>') && break
+
+ name, pos = _dtd_read_name(s, pos)
+ pos = _dtd_skip_ws(s, pos)
+
+ # Attribute type
+ if s[pos] == '('
+ atype, pos = _dtd_read_parens(s, pos)
+ else
+ atype, pos = _dtd_read_name(s, pos)
+ if atype == "NOTATION"
+ pos = _dtd_skip_ws(s, pos)
+ parens, pos = _dtd_read_parens(s, pos)
+ atype = string("NOTATION ", parens)
+ end
+ end
+ pos = _dtd_skip_ws(s, pos)
- elseif nodetype === Comment
- print(io, "<!--", value, "-->")
+ # Default declaration
+ if s[pos] == '#'
+ pos += 1
+ keyword, pos = _dtd_read_name(s, pos)
+ if keyword == "FIXED"
+ pos = _dtd_skip_ws(s, pos)
+ val, pos = _dtd_read_quoted(s, pos)
+ default = string("#FIXED \"", val, "\"")
+ else
+ default = string("#", keyword)
+ end
+ elseif s[pos] == '"' || s[pos] == '\''
+ val, pos = _dtd_read_quoted(s, pos)
+ default = string("\"", val, "\"")
+ else
+ error("Expected default declaration at position $pos in DTD")
+ end
+ push!(atts, AttDecl(String(element), String(name), String(atype), default))
+ end
+ pos <= ncodeunits(s) && s[pos] == '>' && (pos += 1)
+ atts, pos
+end
- elseif nodetype === CData
- print(io, "<![CDATA[", value, "]]>")
+function _dtd_parse_entity(s, pos)
+ pos = _dtd_skip_ws(s, pos)
+ parameter = false
+ if pos <= ncodeunits(s) && s[pos] == '%'
+ parameter = true
+ pos += 1
+ end
+ name, pos = _dtd_read_name(s, pos)
+ pos = _dtd_skip_ws(s, pos)
- elseif nodetype === Document
- foreach(children) do child
- write(io, child, ctx; indentsize)
- !ctx[end] && println(io)
+ value = nothing
+ external_id = nothing
+ if s[pos] == '"' || s[pos] == '\''
+ v, pos = _dtd_read_quoted(s, pos)
+ value = String(v)
+ else
+ keyword, pos = _dtd_read_name(s, pos)
+ pos = _dtd_skip_ws(s, pos)
+ if keyword == "SYSTEM"
+ uri, pos = _dtd_read_quoted(s, pos)
+ external_id = string("SYSTEM \"", uri, "\"")
+ elseif keyword == "PUBLIC"
+ pubid, pos = _dtd_read_quoted(s, pos)
+ pos = _dtd_skip_ws(s, pos)
+ uri, pos = _dtd_read_quoted(s, pos)
+ external_id = string("PUBLIC \"", pubid, "\" \"", uri, "\"")
+ else
+ error("Expected SYSTEM, PUBLIC, or quoted value in ENTITY declaration")
end
+ end
+ pos = _dtd_skip_to_close(s, pos)
+ EntityDecl(String(name), value, external_id, parameter), pos
+end
+function _dtd_parse_notation(s, pos)
+ name, pos = _dtd_read_name(s, pos)
+ pos = _dtd_skip_ws(s, pos)
+ keyword, pos = _dtd_read_name(s, pos)
+ pos = _dtd_skip_ws(s, pos)
+ if keyword == "SYSTEM"
+ uri, pos = _dtd_read_quoted(s, pos)
+ external_id = string("SYSTEM \"", uri, "\"")
+ elseif keyword == "PUBLIC"
+ pubid, pos = _dtd_read_quoted(s, pos)
+ pos = _dtd_skip_ws(s, pos)
+ if pos <= ncodeunits(s) && (s[pos] == '"' || s[pos] == '\'')
+ uri, pos = _dtd_read_quoted(s, pos)
+ external_id = string("PUBLIC \"", pubid, "\" \"", uri, "\"")
+ else
+ external_id = string("PUBLIC \"", pubid, "\"")
+ end
else
- error("Unreachable case reached during XML.write")
+ error("Expected SYSTEM or PUBLIC in NOTATION declaration")
+ end
+ pos = _dtd_skip_to_close(s, pos)
+ NotationDecl(String(name), external_id), pos
+end
+
+"""
+ parse_dtd(value::AbstractString) -> ParsedDTD
+ parse_dtd(node::Node) -> ParsedDTD
+
+Parse a DTD value string (from a `DTD` node) into structured declarations.
+"""
+function parse_dtd(value::AbstractString)
+ s = String(value)
+ pos = 1
+
+ root, pos = _dtd_read_name(s, pos)
+ pos = _dtd_skip_ws(s, pos)
+
+ # External ID
+ system_id = nothing
+ public_id = nothing
+ if pos <= ncodeunits(s) && _dtd_is_name_char(s[pos])
+ keyword, kpos = _dtd_read_name(s, pos)
+ if keyword == "SYSTEM"
+ pos = kpos
+ uri, pos = _dtd_read_quoted(s, pos)
+ system_id = String(uri)
+ elseif keyword == "PUBLIC"
+ pos = kpos
+ pubid, pos = _dtd_read_quoted(s, pos)
+ public_id = String(pubid)
+ pos = _dtd_skip_ws(s, pos)
+ if pos <= ncodeunits(s) && (s[pos] == '"' || s[pos] == '\'')
+ uri, pos = _dtd_read_quoted(s, pos)
+ system_id = String(uri)
+ end
+ end
+ end
+
+ elements = ElementDecl[]
+ attributes = AttDecl[]
+ entities = EntityDecl[]
+ notations = NotationDecl[]
+
+ # Internal subset
+ pos = _dtd_skip_ws(s, pos)
+ if pos <= ncodeunits(s) && s[pos] == '['
+ pos += 1
+ while pos <= ncodeunits(s)
+ pos = _dtd_skip_ws(s, pos)
+ pos > ncodeunits(s) && break
+ s[pos] == ']' && break
+
+ rest = SubString(s, pos)
+ if startswith(rest, "<!--")
+ i = findnext("-->", s, pos + 4)
+ isnothing(i) && error("Unterminated comment in DTD")
+ pos = last(i) + 1
+ elseif startswith(rest, "<?")
+ i = findnext("?>", s, pos + 2)
+ isnothing(i) && error("Unterminated PI in DTD")
+ pos = last(i) + 1
+ elseif startswith(rest, " string(v) for (k, v) in pairs(attributes)),
+# isnothing(value) ? nothing : string(value),
+# isnothing(children) ? nothing :
+# children isa Node ? [children] :
+# children isa Vector{Node} ? children :
+# children isa Vector ? map(Node, children) :
+# children isa Tuple ? map(Node, collect(children)) :
+# [Node(children)]
+# )
+# end
+# end
+
+# function Node(o::Node, x...; kw...)
+# attrs = !isnothing(kw) ?
+# merge(
+# OrderedDict(string(k) => string(v) for (k, v) in pairs(kw)),
+# isnothing(o.attributes) ? OrderedDict{String,String}() : o.attributes
+# ) :
+# o.attributes
+# children = isempty(x) ? o.children : vcat(isnothing(o.children) ? [] : o.children, collect(x))
+# Node(o.nodetype, o.tag, attrs, o.value, children)
+# end
+
+# function Node(node::LazyNode)
+# nodetype = node.nodetype
+# tag = node.tag
+# attributes = node.attributes
+# value = node.value
+# c = XML.children(node)
+# Node(nodetype, tag, attributes, value, isempty(c) ? nothing : map(Node, c))
+# end
+
+# Node(data::Raw) = Node(LazyNode(data))
+
+# # Anything that's not Vector{UInt8} or a (Lazy)Node is converted to a Text Node
+# Node(x) = Node(Text, nothing, nothing, string(x), nothing)
+
+# h(tag::Union{Symbol, String}, children...; kw...) = Node(Element, tag, kw, nothing, children)
+# Base.getproperty(::typeof(h), tag::Symbol) = h(tag)
+# (o::Node)(children...; kw...) = Node(o, Node.(children)...; kw...)
+
+# # NOT in-place for Text Nodes
+# function escape!(o::Node, warn::Bool=true)
+# if o.nodetype == Text
+# warn && @warn "escape!() called on a Text Node creates a new node."
+# return Text(escape(o.value))
+# end
+# isnothing(o.children) && return o
+# map!(x -> escape!(x, false), o.children, o.children)
+# o
+# end
+# function unescape!(o::Node, warn::Bool=true)
+# if o.nodetype == Text
+# warn && @warn "unescape!() called on a Text Node creates a new node."
+# return Text(unescape(o.value))
+# end
+# isnothing(o.children) && return o
+# map!(x -> unescape!(x, false), o.children, o.children)
+# o
+# end
+
+
+# Base.read(filename::AbstractString, ::Type{Node}) = Node(read(filename, Raw))
+# Base.read(io::IO, ::Type{Node}) = Node(read(io, Raw))
+# Base.parse(x::AbstractString, ::Type{Node}) = Node(parse(x, Raw))
+
+# Base.setindex!(o::Node, val, i::Integer) = o.children[i] = Node(val)
+# Base.push!(a::Node, b::Node) = push!(a.children, b)
+# Base.pushfirst!(a::Node, b::Node) = pushfirst!(a.children, b)
+
+# Base.setindex!(o::Node, val, key::AbstractString) = (o.attributes[key] = string(val))
+# Base.getindex(o::Node, val::AbstractString) = o.attributes[val]
+# Base.haskey(o::Node, key::AbstractString) = isnothing(o.attributes) ? false : haskey(o.attributes, key)
+# Base.keys(o::Node) = isnothing(o.attributes) ? () : keys(o.attributes)
+
+# Base.show(io::IO, o::Node) = _show_node(io, o)
+
+# #-----------------------------------------------------------------------------# Node Constructors
+# function (T::NodeType)(args...; attr...)
+# if T === Document
+# !isempty(attr) && error("Document nodes do not have attributes.")
+# Node(T, nothing, nothing, nothing, args)
+# elseif T === DTD
+# !isempty(attr) && error("DTD nodes only accept a value.")
+# length(args) > 1 && error("DTD nodes only accept a value.")
+# Node(T, nothing, nothing, only(args))
+# elseif T === Declaration
+# !isempty(args) && error("Declaration nodes only accept attributes")
+# Node(T, nothing, attr)
+# elseif T === ProcessingInstruction
+# length(args) == 1 || error("ProcessingInstruction nodes require a tag and attributes.")
+# Node(T, only(args), attr)
+# elseif T === Comment
+# !isempty(attr) && error("Comment nodes do not have attributes.")
+# length(args) > 1 && error("Comment nodes only accept a single input.")
+# Node(T, nothing, nothing, only(args))
+# elseif T === CData
+# !isempty(attr) && error("CData nodes do not have attributes.")
+# length(args) > 1 && error("CData nodes only accept a single input.")
+# Node(T, nothing, nothing, only(args))
+# elseif T === Text
+# !isempty(attr) && error("Text nodes do not have attributes.")
+# length(args) > 1 && error("Text nodes only accept a single input.")
+# Node(T, nothing, nothing, only(args))
+# elseif T === Element
+# tag = first(args)
+# Node(T, tag, attr, nothing, args[2:end])
+# else
+# error("Unreachable reached while trying to create a Node via (::NodeType)(args...; kw...).")
+# end
+# end
+
+# #-----------------------------------------------------------------------------# !!! common !!!
+# # Everything below here is common to all data structures
+
+
+# #-----------------------------------------------------------------------------# interface fallbacks
+# nodetype(o) = o.nodetype
+# tag(o) = o.tag
+# attributes(o) = o.attributes
+# value(o) = o.value
+# children(o::T) where {T} = isnothing(o.children) ? () : o.children
+
+# depth(o) = missing
+# parent(o) = missing
+# next(o) = missing
+# prev(o) = missing
+
+# is_simple(o) = nodetype(o) == Element && (isnothing(attributes(o)) || isempty(attributes(o))) &&
+# length(children(o)) == 1 && nodetype(only(o)) in (Text, CData)
+
+# simple_value(o) = is_simple(o) ? value(only(o)) : error("`XML.simple_value` is only defined for simple nodes.")
+
+# Base.@deprecate_binding simplevalue simple_value
+
+# #-----------------------------------------------------------------------------# nodes_equal
+# function nodes_equal(a, b)
+# out = XML.tag(a) == XML.tag(b)
+# out &= XML.nodetype(a) == XML.nodetype(b)
+# out &= XML.attributes(a) == XML.attributes(b)
+# out &= XML.value(a) == XML.value(b)
+# out &= length(XML.children(a)) == length(XML.children(b))
+# out &= all(nodes_equal(ai, bi) for (ai,bi) in zip(XML.children(a), XML.children(b)))
+# return out
+# end
+
+# Base.:(==)(a::AbstractXMLNode, b::AbstractXMLNode) = nodes_equal(a, b)
+
+# #-----------------------------------------------------------------------------# parse
+# Base.parse(::Type{T}, str::AbstractString) where {T <: AbstractXMLNode} = parse(str, T)
+
+# #-----------------------------------------------------------------------------# indexing
+# Base.getindex(o::Union{Raw, AbstractXMLNode}) = o
+# Base.getindex(o::Union{Raw, AbstractXMLNode}, i::Integer) = children(o)[i]
+# Base.getindex(o::Union{Raw, AbstractXMLNode}, ::Colon) = children(o)
+# Base.lastindex(o::Union{Raw, AbstractXMLNode}) = lastindex(children(o))
+
+# Base.only(o::Union{Raw, AbstractXMLNode}) = only(children(o))
+
+# Base.length(o::AbstractXMLNode) = length(children(o))
+
+# #-----------------------------------------------------------------------------# printing
+# function _show_node(io::IO, o)
+# printstyled(io, typeof(o), ' '; color=:light_black)
+# !ismissing(depth(o)) && printstyled(io, "(depth=", depth(o), ") ", color=:light_black)
+# printstyled(io, nodetype(o), ; color=:light_green)
+# if o.nodetype === Text
+# printstyled(io, ' ', repr(value(o)))
+# elseif o.nodetype === Element
+# printstyled(io, " <", tag(o), color=:light_cyan)
+# _print_attrs(io, o; color=:light_yellow)
+# printstyled(io, '>', color=:light_cyan)
+# _print_n_children(io, o)
+# elseif o.nodetype === DTD
+# printstyled(io, " <!DOCTYPE ", value(o), '>', color=:light_cyan)
+# elseif o.nodetype === Declaration
+# printstyled(io, " <?xml", color=:light_cyan)
+# _print_attrs(io, o; color=:light_yellow)
+# printstyled(io, "?>", color=:light_cyan)
+# elseif o.nodetype === ProcessingInstruction
+# printstyled(io, " <?", tag(o), color=:light_cyan)
+# _print_attrs(io, o; color=:light_yellow)
+# printstyled(io, "?>", color=:light_cyan)
+# elseif o.nodetype === Comment
+# printstyled(io, " <!--", value(o), "-->", color=:light_cyan)
+# elseif o.nodetype === CData
+# printstyled(io, " <![CDATA[", value(o), "]]>", color=:light_cyan)
+# elseif o.nodetype === Document
+# _print_n_children(io, o)
+# elseif o.nodetype === UNKNOWN
+# printstyled(io, "Unknown", color=:light_cyan)
+# _print_n_children(io, o)
+# else
+# error("Unreachable reached")
+# end
+# end
+
+# function _print_attrs(io::IO, o; color=:normal)
+# attr = attributes(o)
+# isnothing(attr) && return nothing
+# for (k,v) in attr
+# # printstyled(io, ' ', k, '=', '"', v, '"'; color)
+# print(io, ' ', k, '=', '"', v, '"')
+# end
+# end
+# function _print_n_children(io::IO, o::Node)
+# n = length(children(o))
+# text = n == 0 ? "" : n == 1 ? " (1 child)" : " ($n children)"
+# printstyled(io, text, color=:light_black)
+# end
+# _print_n_children(io::IO, o) = nothing
+
+# #-----------------------------------------------------------------------------# write_xml
+# write(x; kw...) = (io = IOBuffer(); write(io, x; kw...); String(take!(io)))
+
+# write(filename::AbstractString, x; kw...) = open(io -> write(io, x; kw...), filename, "w")
+
+# function write(io::IO, x, ctx::Vector{Bool}=[false]; indentsize::Int=2, depth::Int=1)
+# indent = ' ' ^ indentsize
+# nodetype = XML.nodetype(x)
+# tag = XML.tag(x)
+# value = XML.value(x)
+# children = XML.children(x)
+
+# padding = indent ^ max(0, depth - 1)
+# !ctx[end] && print(io, padding)
+
+# if nodetype === Text
+# print(io, value)
+
+# elseif nodetype === Element
+# push!(ctx, ctx[end])
+# update_ctx!(ctx, x)
+# print(io, '<', tag)
+# _print_attrs(io, x)
+# print(io, isempty(children) ? '/' : "", '>')
+# if !isempty(children)
+# if length(children) == 1 && XML.nodetype(only(children)) === Text
+# write(io, only(children), ctx; indentsize=0)
+# print(io, "</", tag, '>')
+# else
+# !ctx[end] && println(io)
+# foreach(children) do child
+# write(io, child, ctx; indentsize, depth=depth + 1)
+# !ctx[end] && println(io)
+# end
+# print(io, !ctx[end] ? padding : "", "</", tag, '>')
+# end
+# end
+# pop!(ctx)
+
+# elseif nodetype === DTD
+# print(io, "<!DOCTYPE ", value, '>')
+
+# elseif nodetype === Declaration
+# print(io, "<?xml")
+# _print_attrs(io, x)
+# print(io, "?>")
+
+# elseif nodetype === ProcessingInstruction
+# print(io, "<?", tag)
+# _print_attrs(io, x)
+# print(io, "?>")
+
+# elseif nodetype === Comment
+# print(io, "<!--", value, "-->")
+
+# elseif nodetype === CData
+# print(io, "<![CDATA[", value, "]]>")
+
+# elseif nodetype === Document
+# foreach(children) do child
+# write(io, child, ctx; indentsize)
+# !ctx[end] && println(io)
+# end
+
+# else
+# error("Unreachable case reached during XML.write")
+# end
+
+# end
+
+#-----------------------------------------------------------------------------# deprecations
+Base.@deprecate_binding simplevalue simple_value false
+Base.@deprecate_binding LazyNode Node false
+
+# Removed types — informative errors
+struct Raw
+ Raw(args...; kw...) = error("""
+ `XML.Raw` has been removed in XML.jl v0.4.
+ Use `parse(str, Node)` or `read(filename, Node)` instead.
+ The streaming Raw/LazyNode API has been replaced by a token-based parser.
+ See `?XML.Node` for the new API.""")
+end
+
+struct AbstractXMLNode
+ AbstractXMLNode(args...; kw...) = error("""
+ `XML.AbstractXMLNode` has been removed in XML.jl v0.4.
+ `Node` is no longer a subtype of an abstract type.
+ Dispatch on `Node` directly instead.""")
+end
+
+# Removed functions — informative errors
+const _REMOVED_LAZYNODE_MSG = """
+ This function was part of the LazyNode API, which has been removed in XML.jl v0.4.
+ Use `parse(str, Node)` to get a full DOM tree and navigate with `children`, `tag`,
+ `attributes`, `value`, and integer indexing (e.g. `node[1]`)."""
+
+for f in (:next, :prev)
+ msg = "`XML.$f` has been removed. $_REMOVED_LAZYNODE_MSG"
+ @eval function $f(o::Node)
+ Base.depwarn($msg, $(QuoteNode(f)))
+ error($msg)
+ end
+end
+
+# 1-arg parent/depth were part of LazyNode API; 2-arg versions are defined above
+const _PARENT_1ARG_MSG = "`XML.parent(node)` (single-argument) has been removed. $_REMOVED_LAZYNODE_MSG\n Use `parent(child, root)` instead to search from a known root node."
+function Base.parent(o::Node)
+ Base.depwarn(_PARENT_1ARG_MSG, :parent)
+ error(_PARENT_1ARG_MSG)
+end
+
+const _DEPTH_1ARG_MSG = "`XML.depth(node)` (single-argument) has been removed. $_REMOVED_LAZYNODE_MSG\n Use `depth(child, root)` instead to search from a known root node."
+function depth(o::Node)
+ Base.depwarn(_DEPTH_1ARG_MSG, :depth)
+ error(_DEPTH_1ARG_MSG)
+end
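+
+# Migration sketch for the removed one-argument navigation functions. The XML string
+# and variable names are illustrative only; the calls are the replacements named in
+# the messages above.
+#
+#     doc   = parse("<a><b/></a>", Node)
+#     child = doc[1][1]      # navigate by integer indexing instead of `next`/`prev`
+#     parent(child, doc)     # replaces `parent(child)`
+#     depth(child, doc)      # replaces `depth(child)`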
+
+function nodes_equal(a, b)
+ msg = """`XML.nodes_equal` has been removed in XML.jl v0.4. Use `==` instead:
+ a == b"""
+ Base.depwarn(msg, :nodes_equal)
+ error(msg)
+end
+
+function escape!(o::Node, warn::Bool=true)
+ msg = """`XML.escape!` has been removed in XML.jl v0.4.
+ Text is now escaped automatically during `XML.write`."""
+ Base.depwarn(msg, :escape!)
+ error(msg)
+end
+
+function unescape!(o::Node, warn::Bool=true)
+ msg = """`XML.unescape!` has been removed in XML.jl v0.4.
+ Text is now unescaped automatically during `parse`."""
+ Base.depwarn(msg, :unescape!)
+ error(msg)
end
end # module XML
diff --git a/src/tokenizer.jl b/src/tokenizer.jl
new file mode 100644
index 0000000..355036d
--- /dev/null
+++ b/src/tokenizer.jl
@@ -0,0 +1,480 @@
+"""
+ XMLTokenizer
+
+A self-contained module for tokenizing XML documents into a fine-grained stream of tokens.
+
+# Usage
+
+```julia
+using .XMLTokenizer: tokenize, tag_name, attr_value, pi_target
+
+for token in tokenize(\"\"\"text\"\"\")
+ println(token)
+end
+```
+"""
+module XMLTokenizer
+
+export tokenize, tag_name, attr_value, pi_target, TokenKind, Token,
+ TOKEN_TEXT,
+ TOKEN_OPEN_TAG, TOKEN_CLOSE_TAG, TOKEN_TAG_CLOSE, TOKEN_SELF_CLOSE,
+ TOKEN_ATTR_NAME, TOKEN_ATTR_VALUE,
+ TOKEN_CDATA_OPEN, TOKEN_CDATA_CONTENT, TOKEN_CDATA_CLOSE,
+ TOKEN_COMMENT_OPEN, TOKEN_COMMENT_CONTENT, TOKEN_COMMENT_CLOSE,
+ TOKEN_PI_OPEN, TOKEN_PI_CONTENT, TOKEN_PI_CLOSE,
+ TOKEN_XML_DECL_OPEN, TOKEN_XML_DECL_CLOSE,
+ TOKEN_DOCTYPE_OPEN, TOKEN_DOCTYPE_CONTENT, TOKEN_DOCTYPE_CLOSE
+
+#-----------------------------------------------------------------------# TokenKind
+@enum TokenKind::UInt8 begin
+ # Character data
+ TOKEN_TEXT # text content between markup
+
+ # Element tags
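+ TOKEN_OPEN_TAG # <name
+ TOKEN_CLOSE_TAG # </name
+ TOKEN_TAG_CLOSE # >
+ TOKEN_SELF_CLOSE # />
+ TOKEN_ATTR_NAME # attribute name
+ TOKEN_ATTR_VALUE # "value" or 'value' (with quotes in raw)
+
+ # CDATA sections
+ TOKEN_CDATA_OPEN # <![CDATA[
+ TOKEN_CDATA_CONTENT # text between the delimiters
+ TOKEN_CDATA_CLOSE # ]]>
+
+ # Comments
+ TOKEN_COMMENT_OPEN # <!--
+ TOKEN_COMMENT_CONTENT # text between the delimiters
+ TOKEN_COMMENT_CLOSE # -->
+
+ # Processing instructions
+ TOKEN_PI_OPEN # <?target
+ TOKEN_PI_CONTENT # text after the target
+ TOKEN_PI_CLOSE # ?>
+
+ # XML declaration (<?xml ... ?>)
+ TOKEN_XML_DECL_OPEN # <?xml
+ TOKEN_XML_DECL_CLOSE # ?>
+ # (reuses TOKEN_ATTR_NAME / TOKEN_ATTR_VALUE for pseudo-attributes)
+
+ # DOCTYPE
+ TOKEN_DOCTYPE_OPEN # <!DOCTYPE
+ TOKEN_DOCTYPE_CONTENT # text up to the closing >
+ TOKEN_DOCTYPE_CLOSE # >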
+end
+
+#-----------------------------------------------------------------------# Token
+struct Token
+ kind::TokenKind
+ raw::SubString{String}
+end
+
+function Base.show(io::IO, t::Token)
+ print(io, t.kind, ": ", repr(String(t.raw)))
+end
+
+#-----------------------------------------------------------------------# Tokenizer state
+@enum _State::UInt8 begin
+ _S_DEFAULT # normal content mode
+ _S_TAG # inside open tag, reading attributes
+ _S_TAG_VALUE # expecting quoted attribute value
+ _S_CLOSE_TAG # inside close tag, expecting >
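+ _S_XML_DECL # inside <?xml ... ?>, reading pseudo-attributes
+ _S_XML_DECL_VALUE # expecting quoted pseudo-attribute value
+ _S_COMMENT # inside <!-- ... -->
+ _S_CDATA # inside <![CDATA[ ... ]]>
+ _S_PI # inside <?target ... ?>
+ _S_DOCTYPE # inside <!DOCTYPE ... >
+end
+
+mutable struct Tokenizer
+ data::String
+ pos::Int
+ state::_State
+ pending::Union{Nothing, Token}
+end
+
+"""
+    tokenize(xml::String) -> Tokenizer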
+
+Return a lazy iterator of `Token`s over the XML string `xml`.
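+
+# Examples
+
+A minimal sketch: input with no markup yields a single `TOKEN_TEXT` token whose `raw`
+field views the whole string. The input text is made up for illustration.
+
+```julia
+toks = collect(tokenize("plain text"))
+length(toks)   # 1
+toks[1].kind   # TOKEN_TEXT
+toks[1].raw    # "plain text"
+```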
+"""
+tokenize(xml::String) = Tokenizer(xml, 1, _S_DEFAULT, nothing)
+
+Base.IteratorSize(::Type{Tokenizer}) = Base.SizeUnknown()
+Base.eltype(::Type{Tokenizer}) = Token
+
+function Base.iterate(t::Tokenizer, _=nothing)
+ tok = _next_token!(t)
+ tok === nothing ? nothing : (tok, nothing)
+end
+
+#-----------------------------------------------------------------------# Internal helpers
+@inline _iseof(t::Tokenizer) = t.pos > ncodeunits(t.data)
+@inline _peek(t::Tokenizer) = @inbounds codeunit(t.data, t.pos)
+@inline _peek(t::Tokenizer, offset::Int) = @inbounds codeunit(t.data, t.pos + offset)
+@inline _canpeek(t::Tokenizer, offset::Int) = t.pos + offset <= ncodeunits(t.data)
+
+@inline function _is_name_byte(b::UInt8)
+ (UInt8('a') <= b <= UInt8('z')) || (UInt8('A') <= b <= UInt8('Z')) ||
+ (UInt8('0') <= b <= UInt8('9')) || b == UInt8('_') || b == UInt8('-') ||
+ b == UInt8('.') || b == UInt8(':')
+end
+
+@inline function _is_whitespace(b::UInt8)
+ b == UInt8(' ') || b == UInt8('\t') || b == UInt8('\n') || b == UInt8('\r')
+end
+
+function _skip_whitespace!(t::Tokenizer)
+ while !_iseof(t) && _is_whitespace(_peek(t))
+ t.pos += 1
+ end
+end
+
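+function _skip_quoted!(t::Tokenizer)
+ start = t.pos
+ q = _peek(t)
+ t.pos += 1
+ while !_iseof(t)
+ _peek(t) == q && (t.pos += 1; return)
+ t.pos += 1
+ end
+ # Report through `_err` like every other tokenizer failure, pointing at the opening quote
+ _err("unterminated quoted string", start)
+end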
+
+@noinline _err(msg, pos) = throw(ArgumentError("XML tokenizer error at position $pos: $msg"))
+
+#-----------------------------------------------------------------------# Main dispatch
+function _next_token!(t::Tokenizer)
+ if t.pending !== nothing
+ tok = t.pending::Token
+ t.pending = nothing
+ return tok
+ end
+ _iseof(t) && return nothing
+
+ s = t.state
+ if s == _S_DEFAULT
+ _peek(t) == UInt8('<') ? _read_markup!(t) : _read_text!(t)
+ elseif s == _S_TAG || s == _S_XML_DECL
+ _read_in_tag!(t)
+ elseif s == _S_TAG_VALUE || s == _S_XML_DECL_VALUE
+ _read_attr_value!(t)
+ elseif s == _S_CLOSE_TAG
+ _read_close_tag_end!(t)
+ elseif s == _S_COMMENT
+ _read_comment_body!(t)
+ elseif s == _S_CDATA
+ _read_cdata_body!(t)
+ elseif s == _S_PI
+ _read_pi_body!(t)
+ else # _S_DOCTYPE
+ _read_doctype_body!(t)
+ end
+end
+
+#-----------------------------------------------------------------------# S_DEFAULT tokens
+function _read_text!(t::Tokenizer)
+ start = t.pos
+ while !_iseof(t) && _peek(t) != UInt8('<')
+ t.pos += 1
+ end
+ Token(TOKEN_TEXT, SubString(t.data, start, prevind(t.data, t.pos)))
+end
+
+function _read_markup!(t::Tokenizer)
+ start = t.pos
+ t.pos += 1 # skip '<'
+ _iseof(t) && _err("unexpected end of input after '<'", start)
+
+ b = _peek(t)
+ if b == UInt8('!')
+ _read_bang!(t, start)
+ elseif b == UInt8('?')
+ _read_pi_start!(t, start)
+ elseif b == UInt8('/')
+ _read_close_tag_start!(t, start)
+ else
+ _read_open_tag_start!(t, start)
+ end
+end
+
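+#-----------------------------------------------------------------------# Close tag end
+function _read_close_tag_end!(t::Tokenizer)
+ _skip_whitespace!(t)
+ (!_iseof(t) && _peek(t) == UInt8('>')) || _err("expected '>'", t.pos)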
+ start = t.pos
+ t.pos += 1
+ t.state = _S_DEFAULT
+ Token(TOKEN_TAG_CLOSE, SubString(t.data, start, start))
+end
+
+#-----------------------------------------------------------------------# Attributes (shared by S_TAG and S_XML_DECL)
+function _read_in_tag!(t::Tokenizer)
+ _skip_whitespace!(t)
+ _iseof(t) && _err("unterminated tag", t.pos)
+
+ b = _peek(t)
+ is_decl = (t.state == _S_XML_DECL)
+
+ # Check for end delimiters
+ if is_decl
+ if b == UInt8('?') && _canpeek(t, 1) && _peek(t, 1) == UInt8('>')
+ start = t.pos; t.pos += 2; t.state = _S_DEFAULT
+ return Token(TOKEN_XML_DECL_CLOSE, SubString(t.data, start, t.pos - 1))
+ end
+ else
+ if b == UInt8('>')
+ start = t.pos; t.pos += 1; t.state = _S_DEFAULT
+ return Token(TOKEN_TAG_CLOSE, SubString(t.data, start, start))
+ end
+ if b == UInt8('/') && _canpeek(t, 1) && _peek(t, 1) == UInt8('>')
+ start = t.pos; t.pos += 2; t.state = _S_DEFAULT
+ return Token(TOKEN_SELF_CLOSE, SubString(t.data, start, t.pos - 1))
+ end
+ end
+
+ # Attribute name
+ name_start = t.pos
+ while !_iseof(t) && _is_name_byte(_peek(t))
+ t.pos += 1
+ end
+ name_end = t.pos - 1
+ name_start > name_end && _err("expected attribute name or tag close", t.pos)
+
+ # Consume '=' and surrounding whitespace (not part of any token)
+ _skip_whitespace!(t)
+ (!_iseof(t) && _peek(t) == UInt8('=')) || _err("expected '=' after attribute name", t.pos)
+ t.pos += 1
+ _skip_whitespace!(t)
+
+ t.state = is_decl ? _S_XML_DECL_VALUE : _S_TAG_VALUE
+ Token(TOKEN_ATTR_NAME, SubString(t.data, name_start, name_end))
+end
+
+function _read_attr_value!(t::Tokenizer)
+ _iseof(t) && _err("expected attribute value", t.pos)
+
+ q = _peek(t)
+ (q == UInt8('"') || q == UInt8('\'')) || _err("expected quoted attribute value", t.pos)
+
+ start = t.pos
+ t.pos += 1 # skip opening quote
+ while !_iseof(t) && _peek(t) != q
+ t.pos += 1
+ end
+ _iseof(t) && _err("unterminated attribute value", start)
+ t.pos += 1 # skip closing quote
+
+ t.state = (t.state == _S_XML_DECL_VALUE) ? _S_XML_DECL : _S_TAG
+ Token(TOKEN_ATTR_VALUE, SubString(t.data, start, prevind(t.data, t.pos)))
+end
+
+#-----------------------------------------------------------------------# Content bodies (comment, CDATA, PI, DOCTYPE)
+function _read_comment_body!(t::Tokenizer)
+ start = t.pos
+ while !_iseof(t)
+ if _peek(t) == UInt8('-') &&
+ _canpeek(t, 1) && _peek(t, 1) == UInt8('-') &&
+ _canpeek(t, 2) && _peek(t, 2) == UInt8('>')
+ content_end = prevind(t.data, t.pos)
+ close_start = t.pos
+ t.pos += 3
+ t.state = _S_DEFAULT
+ t.pending = Token(TOKEN_COMMENT_CLOSE, SubString(t.data, close_start, t.pos - 1))
+ return Token(TOKEN_COMMENT_CONTENT, SubString(t.data, start, content_end))
+ end
+ t.pos += 1
+ end
+ _err("unterminated comment", start)
+end
+
+function _read_cdata_body!(t::Tokenizer)
+ start = t.pos
+ while !_iseof(t)
+ if _peek(t) == UInt8(']') &&
+ _canpeek(t, 1) && _peek(t, 1) == UInt8(']') &&
+ _canpeek(t, 2) && _peek(t, 2) == UInt8('>')
+ content_end = prevind(t.data, t.pos)
+ close_start = t.pos
+ t.pos += 3
+ t.state = _S_DEFAULT
+ t.pending = Token(TOKEN_CDATA_CLOSE, SubString(t.data, close_start, t.pos - 1))
+ return Token(TOKEN_CDATA_CONTENT, SubString(t.data, start, content_end))
+ end
+ t.pos += 1
+ end
+ _err("unterminated CDATA section", start)
+end
+
+function _read_pi_body!(t::Tokenizer)
+ start = t.pos
+ while !_iseof(t)
+ if _peek(t) == UInt8('?') && _canpeek(t, 1) && _peek(t, 1) == UInt8('>')
+ content_end = prevind(t.data, t.pos)
+ close_start = t.pos
+ t.pos += 2
+ t.state = _S_DEFAULT
+ t.pending = Token(TOKEN_PI_CLOSE, SubString(t.data, close_start, t.pos - 1))
+ return Token(TOKEN_PI_CONTENT, SubString(t.data, start, content_end))
+ end
+ t.pos += 1
+ end
+ _err("unterminated processing instruction", start)
+end
+
+function _read_doctype_body!(t::Tokenizer)
+ start = t.pos
+ depth = 0
+ while !_iseof(t)
+ b = _peek(t)
+ if b == UInt8('-') && _canpeek(t, 1) && _peek(t, 1) == UInt8('-') &&
+ t.pos >= 3 &&
+ codeunit(t.data, t.pos - 1) == UInt8('!') &&
+ codeunit(t.data, t.pos - 2) == UInt8('<')
+ # Inside a <!-- ... --> comment: skip to its end so quotes and brackets in it are ignored
+ t.pos += 2 # skip "--"
+ while !_iseof(t)
+ if _peek(t) == UInt8('-') && _canpeek(t, 1) && _peek(t, 1) == UInt8('-') &&
+ _canpeek(t, 2) && _peek(t, 2) == UInt8('>')
+ t.pos += 3 # skip "-->"
+ break
+ end
+ t.pos += 1
+ end
+ elseif b == UInt8('"') || b == UInt8('\'')
+ _skip_quoted!(t)
+ elseif b == UInt8('[')
+ depth += 1
+ t.pos += 1
+ elseif b == UInt8(']')
+ depth -= 1
+ t.pos += 1
+ elseif b == UInt8('>') && depth == 0
+ content_end = prevind(t.data, t.pos)
+ close_start = t.pos
+ t.pos += 1
+ t.state = _S_DEFAULT
+ t.pending = Token(TOKEN_DOCTYPE_CLOSE, SubString(t.data, close_start, t.pos - 1))
+ return Token(TOKEN_DOCTYPE_CONTENT, SubString(t.data, start, content_end))
+ else
+ t.pos += 1
+ end
+ end
+ _err("unterminated DOCTYPE", start)
+end
+
+#-----------------------------------------------------------------------# Utility functions
+
+"""
+ tag_name(token::Token) -> SubString{String}
+
+Extract the element name from a `TOKEN_OPEN_TAG` or `TOKEN_CLOSE_TAG` token.
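+
+# Examples
+
+A minimal sketch that builds tokens by hand rather than through `tokenize`. The `item`
+element name is made up for illustration, and the `raw` layouts (`<name` for an open
+tag, `</name` for a close tag) are assumed from the offsets used in the method body.
+
+```julia
+tag_name(Token(TOKEN_OPEN_TAG,  SubString("<item")))   # "item"
+tag_name(Token(TOKEN_CLOSE_TAG, SubString("</item")))  # "item"
+```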
+"""
+function tag_name(token::Token)
+ if token.kind == TOKEN_OPEN_TAG
+ SubString(token.raw, 2, ncodeunits(token.raw)) # skip '<'
+ elseif token.kind == TOKEN_CLOSE_TAG
+ SubString(token.raw, 3, ncodeunits(token.raw)) # skip '</'
+ else
+ throw(ArgumentError("tag_name requires OPEN_TAG or CLOSE_TAG, got $(token.kind)"))
+ end
+end
+
+"""
+ attr_value(token::Token) -> SubString{String}
+
+Strip the surrounding quotes from a `TOKEN_ATTR_VALUE` token.
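+
+# Examples
+
+A minimal sketch with a hand-built token; the value `v1` is made up for illustration.
+Either quote style works, since only the first and last characters are dropped.
+
+```julia
+attr_value(Token(TOKEN_ATTR_VALUE, SubString("'v1'")))  # "v1"
+```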
+"""
+function attr_value(token::Token)
+ token.kind == TOKEN_ATTR_VALUE ||
+ throw(ArgumentError("attr_value requires ATTR_VALUE, got $(token.kind)"))
+ SubString(token.raw, 2, prevind(token.raw, lastindex(token.raw)))
+end
+
+"""
+ pi_target(token::Token) -> SubString{String}
+
+Extract the target name from a `TOKEN_PI_OPEN` or `TOKEN_XML_DECL_OPEN` token.
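+
+# Examples
+
+A minimal sketch with a hand-built token; the `xml-stylesheet` target is only an
+illustration, and the `<?target` layout of `raw` is assumed from the offset used in
+the method body.
+
+```julia
+pi_target(Token(TOKEN_PI_OPEN, SubString("<?xml-stylesheet")))  # "xml-stylesheet"
+```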
+"""
+function pi_target(token::Token)
+ (token.kind == TOKEN_PI_OPEN || token.kind == TOKEN_XML_DECL_OPEN) ||
+ throw(ArgumentError("pi_target requires PI_OPEN or XML_DECL_OPEN, got $(token.kind)"))
+ SubString(token.raw, 3, ncodeunits(token.raw)) # skip '<?'
+end
+
+end # module XMLTokenizer
diff --git a/src/xpath.jl b/src/xpath.jl
new file mode 100644
index 0000000..b0a9725
--- /dev/null
+++ b/src/xpath.jl
@@ -0,0 +1,300 @@
+#-----------------------------------------------------------------------------# XPath
+# A subset of XPath 1.0 for querying XML.Node trees.
+#
+# Supported syntax:
+# / root (absolute path)
+# tag child element by name
+# * any child element
+# // descendant-or-self (recursive)
+# . current node
+# .. parent node
+# [n] positional predicate (1-based)
+# [@attr] has-attribute predicate
+# [@attr='v'] attribute-value predicate
+# text() text node children
+# node() all node children
+# @attr attribute value (returns strings)
+
+#-----------------------------------------------------------------------------# Token types
+
+@enum XPathTokenKind::UInt8 begin
+ XPATH_ROOT # /
+ XPATH_DESCENDANT # //
+ XPATH_NAME # tag name
+ XPATH_WILDCARD # *
+ XPATH_DOT # .
+ XPATH_DOTDOT # ..
+ XPATH_TEXT_FN # text()
+ XPATH_NODE_FN # node()
+ XPATH_PREDICATE # [...]
+ XPATH_ATTRIBUTE # @attr (in result position)
+end
+
+struct XPathToken
+ kind::XPathTokenKind
+ value::String
+end
+
+#-----------------------------------------------------------------------------# Tokenizer
+
+function _xpath_tokenize(expr::AbstractString)
+ tokens = XPathToken[]
+ s = String(expr)
+ i = 1
+ n = ncodeunits(s)
+
+ while i <= n
+ c = s[i]
+
+ if c == '/'
+ if i < n && s[i+1] == '/'
+ push!(tokens, XPathToken(XPATH_DESCENDANT, "//"))
+ i += 2
+ else
+ push!(tokens, XPathToken(XPATH_ROOT, "/"))
+ i += 1
+ end
+
+ elseif c == '.'
+ if i < n && s[i+1] == '.'
+ push!(tokens, XPathToken(XPATH_DOTDOT, ".."))
+ i += 2
+ else
+ push!(tokens, XPathToken(XPATH_DOT, "."))
+ i += 1
+ end
+
+ elseif c == '*'
+ push!(tokens, XPathToken(XPATH_WILDCARD, "*"))
+ i += 1
+
+ elseif c == '['
+ j = findnext(']', s, i + 1)
+ isnothing(j) && error("Unterminated predicate in XPath: $(repr(s))")
+ push!(tokens, XPathToken(XPATH_PREDICATE, SubString(s, i + 1, j - 1)))
+ i = j + 1
+
+ elseif c == '@'
+ j = i + 1
+ # Advance with nextind so multi-byte characters in the name do not hit an invalid index
+ while j <= n && (isletter(s[j]) || s[j] == '-' || s[j] == '_' || s[j] == ':' || isdigit(s[j]))
+ j = nextind(s, j)
+ end
+ j == i + 1 && error("Empty attribute name after @ in XPath: $(repr(s))")
+ push!(tokens, XPathToken(XPATH_ATTRIBUTE, SubString(s, i + 1, prevind(s, j))))
+ i = j
+
+ elseif isletter(c) || c == '_'
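+ # Advance with nextind so multi-byte characters in the name do not hit an invalid index
+ j = nextind(s, i)
+ while j <= n && (isletter(s[j]) || s[j] == '-' || s[j] == '_' || s[j] == ':' || isdigit(s[j]) || s[j] == '.')
+ j = nextind(s, j)
+ end
+ name = SubString(s, i, prevind(s, j))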
+ # Check for function calls: text(), node()
+ if j <= n && s[j] == '('
+ j2 = findnext(')', s, j + 1)
+ isnothing(j2) && error("Unterminated function call in XPath: $(repr(s))")
+ if name == "text"
+ push!(tokens, XPathToken(XPATH_TEXT_FN, "text()"))
+ elseif name == "node"
+ push!(tokens, XPathToken(XPATH_NODE_FN, "node()"))
+ else
+ error("Unknown XPath function: $name()")
+ end
+ i = j2 + 1
+ else
+ push!(tokens, XPathToken(XPATH_NAME, String(name)))
+ i = j
+ end
+
+ elseif isspace(c)
+ i = nextind(s, i) # whitespace may be a multi-byte character
+
+ else
+ error("Unexpected character '$(c)' in XPath: $(repr(s))")
+ end
+ end
+ tokens
+end
+
+#-----------------------------------------------------------------------------# Predicate evaluation
+
+function _eval_predicate(predicate::AbstractString, nodes::Vector{Node{S}}, root::Node{S}) where S
+ s = strip(predicate)
+
+ # Positional: [n]
+ pos = tryparse(Int, s)
+ if !isnothing(pos)
+ 1 <= pos <= length(nodes) || return Node{S}[]
+ return [nodes[pos]]
+ end
+
+ # last()
+ if s == "last()"
+ isempty(nodes) && return Node{S}[]
+ return [nodes[end]]
+ end
+
+ # [@attr] — has attribute
+ m = match(r"^@([A-Za-z_:][\w.\-:]*)$", s)
+ if !isnothing(m)
+ attr_name = m.captures[1]
+ return filter(n -> n.nodetype === Element && haskey(n, attr_name), nodes)
+ end
+
+ # [@attr='value'] or [@attr="value"]
+ m = match(r"^@([A-Za-z_:][\w.\-:]*)\s*=\s*['\"]([^'\"]*)['\"]$", s)
+ if !isnothing(m)
+ attr_name = m.captures[1]
+ attr_val = m.captures[2]
+ return filter(n -> n.nodetype === Element && get(n, attr_name, nothing) == attr_val, nodes)
+ end
+
+ error("Unsupported XPath predicate: [$predicate]")
+end
+
+#-----------------------------------------------------------------------------# Step evaluation
+
+function _xpath_step(nodes::Vector{Node{S}}, token::XPathToken, root::Node{S}) where S
+ result = Node{S}[]
+ k = token.kind
+
+ if k === XPATH_NAME
+ for n in nodes
+ for c in children(n)
+ c.nodetype === Element && c.tag == token.value && push!(result, c)
+ end
+ end
+
+ elseif k === XPATH_WILDCARD
+ for n in nodes
+ for c in children(n)
+ c.nodetype === Element && push!(result, c)
+ end
+ end
+
+ elseif k === XPATH_DOT
+ append!(result, nodes)
+
+ elseif k === XPATH_DOTDOT
+ for n in nodes
+ n === root && continue
+ p = _find_parent(n, root)
+ isnothing(p) || push!(result, p)
+ end
+
+ elseif k === XPATH_TEXT_FN
+ for n in nodes
+ for c in children(n)
+ c.nodetype === Text && push!(result, c)
+ end
+ end
+
+ elseif k === XPATH_NODE_FN
+ for n in nodes
+ append!(result, children(n))
+ end
+
+ elseif k === XPATH_DESCENDANT
+ # Handled by caller — collects all descendants before next step
+ error("XPATH_DESCENDANT should be handled by the evaluator, not _xpath_step")
+ end
+
+ result
+end
+
+function _descendants!(out::Vector{Node{S}}, node::Node{S}) where S
+ for c in children(node)
+ push!(out, c)
+ _descendants!(out, c)
+ end
+end
+
+function _descendants(nodes::Vector{Node{S}}) where S
+ result = Node{S}[]
+ for n in nodes
+ push!(result, n) # descendant-or-self includes self
+ _descendants!(result, n)
+ end
+ result
+end
+
+#-----------------------------------------------------------------------------# Main evaluator
+
+"""
+ xpath(node::Node, expr::AbstractString) -> Vector{Node}
+
+Evaluate an XPath expression against a `Node` tree and return matching nodes.
+
+Supports a practical subset of XPath 1.0:
+- Absolute (`/root/child`) and relative (`child/sub`) paths
+- Recursive descent (`//tag`)
+- Wildcards (`*`), self (`.`), parent (`..`)
+- Positional predicates (`[1]`, `[last()]`)
+- Attribute predicates (`[@attr]`, `[@attr='value']`)
+- `text()` and `node()` functions
+- Attribute selection (`@attr`) — returns `Text` nodes containing attribute values
+
+# Examples
+```julia
+doc = parse("<root><a x='1'/><a x='2'/><b/></root>", Node)
+xpath(doc, "/root/a") # both <a> elements
+xpath(doc, "/root/a[1]") # first <a>
+xpath(doc, "//a[@x='2']") # <a x="2"/>
+xpath(doc, "/root/b/@x") # attribute value as Text node (empty here)
+```
+"""
+function xpath(node::Node{S}, expr::AbstractString) where S
+ tokens = _xpath_tokenize(expr)
+ isempty(tokens) && return Node{S}[]
+
+ # `..` navigation is resolved relative to the node passed in
+ root = node
+
+ # Absolute (leading `/`) and relative paths both start from `node`;
+ # a leading `/` token is simply consumed.
+ current = Node{S}[node]
+ i = tokens[1].kind === XPATH_ROOT ? 2 : 1
+
+ while i <= length(tokens)
+ tok = tokens[i]
+
+ if tok.kind === XPATH_PREDICATE
+ current = _eval_predicate(tok.value, current, root)
+ i += 1
+
+ elseif tok.kind === XPATH_DESCENDANT
+ current = _descendants(current)
+ # // must be followed by a step
+ i += 1
+
+ elseif tok.kind === XPATH_ROOT
+ # / as separator between steps — skip
+ i += 1
+
+ elseif tok.kind === XPATH_ATTRIBUTE
+ # @attr in result position — return attribute values as Text nodes
+ result = Node{S}[]
+ for n in current
+ v = get(n, tok.value, nothing)
+ !isnothing(v) && push!(result, Node{S}(Text, nothing, nothing, v, nothing))
+ end
+ current = result
+ i += 1
+
+ else
+ current = _xpath_step(current, tok, root)
+ i += 1
+ end
+ end
+
+ current
+end
diff --git a/test/runtests.jl b/test/runtests.jl
index 89978eb..1304245 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -1,646 +1,2682 @@
using XML
-using XML: Document, Element, Declaration, Comment, CData, DTD, ProcessingInstruction, Text, escape, unescape, OrderedDict, h
-using Downloads: download
+using XML: Document, Element, Declaration, Comment, CData, DTD, ProcessingInstruction, Text
+using XML: escape, unescape, h, parse_dtd
+using XML: ParsedDTD, ElementDecl, AttDecl, EntityDecl, NotationDecl
using Test
-import AbstractTrees
-
-AbstractTrees.children(x::Node) = children(x)
-
-#-----------------------------------------------------------------------------# files
-xml_xsd = joinpath("data", "xml.xsd")
-kml_xsd = joinpath("data", "kml.xsd")
-books_xml = joinpath("data", "books.xml")
-example_kml = joinpath("data", "example.kml")
-simple_dtd = joinpath("data", "simple_dtd.xml")
-
-all_files = [xml_xsd, kml_xsd, books_xml, example_kml, simple_dtd]
-
-#-----------------------------------------------------------------------------# h
-@testset "h function" begin
- @test h.tag == XML.Element("tag")
- @test h.tag(id="id") == XML.Element("tag"; id="id")
- @test h.tag(1, 2, a="a", b="b") == XML.Element("tag", 1, 2; a="a", b="b")
-end
-
-#-----------------------------------------------------------------------------# escaping/unescaping
-@testset "escaping/unescaping" begin
- s = "This > string < has & some \" special ' characters"
- @test escape(s) == "This &gt; string &lt; has &amp; some &quot; special &apos; characters"
- @test escape(escape(s)) == escape(s)
- @test s == unescape(escape(s))
- @test s == unescape(unescape(escape(s)))
-
- n = Element("tag", Text(s))
- @test XML.simple_value(n) == s
-
- XML.escape!(n)
- @test XML.simple_value(n) == escape(s)
-
- XML.unescape!(n)
- @test XML.simple_value(n) == s
-end
-
-#-----------------------------------------------------------------------------# DTD
-# @testset "DTDBody and friends" begin
-# s = read(simple_dtd, String)
-# data = read(simple_dtd)
-
-# dtd = XML.DTDBody(data)
-# dtd2 = parse(s, XML.DTDBody)
-
-# @test length(dtd.elements) == length(dtd2.elements) == 0
-# @test length(dtd.attributes) == length(dtd2.attributes) == 0
-# @test length(dtd.entities) == length(dtd2.entities) == 3
-
-# o = read("data/tv.dtd", XML.DTDBody)
-# end
-
-#-----------------------------------------------------------------------------# Raw
-@testset "Raw tag/attributes/value" begin
- examples = [
- (xml = "<!DOCTYPE html>",
- nodetype = DTD,
- tag=nothing,
- attributes=nothing,
- value="html"),
- (xml = "<?xml version=\"1.0\" key=\"value\"?>",
- nodetype = Declaration,
- tag=nothing,
- attributes=Dict("version" => "1.0", "key" => "value"),
- value=nothing),
- (xml = "<tag _id=\"1\" x=\"abc\"/>",
- nodetype = Element,
- tag="tag",
- attributes=Dict("_id" => "1", "x" => "abc"),
- value=nothing),
- (xml = "<!-- comment -->",
- nodetype = Comment,
- tag=nothing,
- attributes=nothing,
- value=" comment "),
- (xml = "<![CDATA[cdata test]]>",
- nodetype = CData,
- tag=nothing,
- attributes=nothing,
- value="cdata test"),
- ]
- for x in examples
- # @info "Testing: $(x.xml)"
- data = XML.next(XML.parse(x.xml, XML.Raw))
- @test XML.nodetype(data) == x.nodetype
- @test XML.tag(data) == x.tag
- @test XML.attributes(data) == x.attributes
- @test XML.value(data) == x.value
- end
-end
-
-@testset "Raw with books.xml" begin
- data = read(books_xml, XML.Raw)
- doc = collect(data)
- @test length(doc) > countlines(books_xml)
- # Check that the first 5 lines are correct
- first_5_lines = [
- XML.RawDeclaration => """<?xml version="1.0"?>""",
- XML.RawElementOpen => "<catalog>",
- XML.RawElementOpen => "<book id=\"bk101\">",
- XML.RawElementOpen => "<author>",
- XML.RawText => "Gambardella, Matthew"
- ]
- for (i, (typ, str)) in enumerate(first_5_lines)
- dt = doc[i]
- @test dt.type == typ
- @test String(dt) == str
- end
- # Check that the last line is correct
- @test doc[end].type == XML.RawElementClose
- @test String(doc[end]) == "</catalog>"
-
- @testset "next and prev" begin
- @test XML.prev(doc[1]) == data # can't use === here because prev returns a copy of ctx
- @test prev(data) === nothing
- @test XML.next(doc[end]) === nothing
-
- n = length(doc)
- next_res = [doc[1]]
- foreach(_ -> push!(next_res, XML.next(next_res[end])), 1:n-1)
-
- prev_res = [doc[end]]
- foreach(_ -> pushfirst!(prev_res, XML.prev(prev_res[1])), 1:n-1)
-
- idx = findall(next_res .!= prev_res)
-
- for (a,b) in zip(next_res, prev_res)
- @test a == b
- end
-
- lzxml = """ hello hello preserve """
- lz = XML.parse(XML.LazyNode, lzxml)
- n=XML.next(lz)
- n=XML.next(n)
- text_content = XML.write(n)
- @test text_content == ""
- n=XML.next(n)
- text_content = XML.write(n)
- @test text_content == "hello"
- n=XML.next(n)
- text_content = XML.write(n)
- @test text_content == "hello"
- n=XML.next(n)
- text_content = XML.write(n)
- @test text_content == " hello preserve "
- n=XML.prev(n)
- text_content = XML.write(n)
- @test text_content == "hello"
- n=XML.next(n)
- text_content = XML.write(n)
- @test text_content == " hello preserve "
- n=XML.next(n)
- text_content = XML.write(n)
- @test text_content == " hello "
- n=XML.next(n)
- text_content = XML.write(n)
- @test text_content == " preserve "
- n=XML.next(n)
- text_content = XML.write(n)
- @test text_content == " preserve "
- n=XML.next(n)
- text_content = XML.write(n)
- @test text_content == ""
- n=XML.next(n)
- text_content = XML.write(n)
- @test text_content == ""
- n=XML.prev(n)
- text_content = XML.write(n)
- @test text_content == ""
- n=XML.prev(n)
- text_content = XML.write(n)
- @test text_content == " preserve "
- n=XML.prev(n)
- text_content = XML.write(n)
- @test text_content == " preserve "
- n=XML.prev(n)
- text_content = XML.write(n)
- @test text_content == " hello "
- n=XML.prev(n)
- text_content = XML.write(n)
- @test text_content == " hello preserve "
- n=XML.next(n)
- text_content = XML.write(n)
- @test text_content == " hello "
- n=XML.prev(n)
- text_content = XML.write(n)
- @test text_content == " hello preserve "
- n=XML.prev(n)
- text_content = XML.write(n)
- @test text_content == "hello"
- n=XML.prev(n)
- text_content = XML.write(n)
- @test text_content == "hello"
- n=XML.prev(n)
- text_content = XML.write(n)
- @test text_content == ""
- n=XML.prev(n)
- text_content = XML.write(n)
- @test text_content == "\n \n hello\n hello preserve \n \n \n"
- end
-
- @testset "depth and parent" begin
- @test XML.depth(data) == 0
- @test isnothing(XML.parent(data))
- @test XML.depth(doc[1]) == 1
- @test XML.parent(doc[1]) == data
- @test XML.depth(doc[2]) == 1
- @test XML.depth(doc[3]) == 2
- @test XML.parent(doc[3]) == doc[2]
- @test XML.depth(doc[end]) == 1
- @test XML.parent(doc[end]) == data
- end
-
- @testset "tag/attributes/value" begin
- x = doc[1] # <?xml version="1.0"?>
- @test XML.tag(x) === nothing
- @test XML.attributes(x) == Dict("version" => "1.0")
- @test XML.value(x) === nothing
-
- x = XML.next(x) # <catalog>
- @test XML.tag(x) == "catalog"
- @test XML.attributes(x) === nothing
- @test XML.value(x) === nothing
-
- x = XML.next(x) # <book id="bk101">
- @test XML.tag(x) == "book"
- @test XML.attributes(x) == Dict("id" => "bk101")
- @test XML.value(x) === nothing
-
- x = XML.next(x) # <author>
- @test XML.tag(x) == "author"
- @test XML.attributes(x) === nothing
- @test XML.value(x) === nothing
-
- x = XML.next(x) # Gambardella, Matthew
- @test XML.tag(x) === nothing
- @test XML.attributes(x) === nothing
- @test XML.value(x) == "Gambardella, Matthew"
- end
-end
-
-#-----------------------------------------------------------------------------# Preserve whitespace
-@testset "xml:space" begin
- @testset "Basic xml:space functionality" begin
-
- # Test 1: xml:space="preserve" should preserve entirely empty whitespace
- xml1 = """ """
- doc1 = parse(XML.Node, xml1)
- text_content = XML.value(doc1[1][1][1])
- @test text_content == " "
-
- # Test 2: xml:space="preserve" should preserve leading and trailing whitespace
- xml2 = """ leading and trailing spaces """
- doc2 = parse(XML.Node, xml2)
- text_content = XML.value(doc2[1][1][1])
- @test text_content == " leading and trailing spaces "
-
- # Test 3: Entirely empty tags with and without xml:space="preserve" become self-closing
- xml3 = """ """
- doc3 = XML.parse(XML.Node, xml3)
- text_content = XML.write(doc3[1][1])
- @test text_content == "" # without xml:space="preserve", empty text becomes self-closing
- text_content = XML.value(doc3[1][2][1])
- @test text_content == " " # with xml:space, whitespace is preserved
- text_content = XML.write(doc3[1][3])
- @test text_content == "" # with xml:space="preserve", empty text becomes self-closing
-
- # Test 4: Without xml:space, whitespace should be normalized
- xml4 = """ gets normalized """
- doc4 = XML.parse(XML.Node, xml4)
- text_content = XML.value(doc4[1][1][1])
- @test text_content == "gets normalized"
-
- # Test 5: xml:space="default" should normalize even with preserve_xml_space=true
- xml5 = """ gets normalized """
- doc5 = XML.parse(XML.Node, xml5)
- text_content = XML.value(doc5[1][1][1])
- @test text_content == "gets normalized"
- end
-
- @testset "xml:space inheritance" begin
- # Test 6: Children inherit parent's xml:space="preserve"
- xml6 = """
- parent text
- child text
-
- """
- doc6 = XML.parse(XML.Node, xml6)
- # Both parent and child should preserve whitespace
- @test contains(XML.value(doc6[1][2][1]), "parent text \n")
- @test XML.value(doc6[1][2][2][1]) == " child text "
-
- # Test 7: xml:space="default" overrides parent's "preserve"
- xml7 = """
- normalized despite parent
- """
- doc7 = XML.parse(XML.Node, xml7)
- @test XML.value(doc7[1][2][1]) == "normalized despite parent"
- end
-
- @testset "Nesting scenarios" begin
- # Test 8: Multiple levels of xml:space changes
- xml8 = """
- preserved
- normalized
- preserved again
-
-
- """
- doc8 = XML.parse(XML.Node, xml8)
-
- # level1 should preserve (inherits from root)
- level1_text = XML.value(doc8[1][2][1])
- @test level1_text == " preserved \n "
-
- # level2 should normalize (explicit xml:space="default")
- level2_text = XML.value(doc8[1][2][2][1])
- @test level2_text == "normalized"
-
- # level3 should preserve (explicit xml:space="preserve")
- level3_text = XML.value(doc8[1][2][2][2][1])
- @test level3_text == " preserved again "
-
- # Test 9: repeated multiple levels of xml:space changes
- xml9 = """
- preserved
- normalized
- preserved again
-
-
- preserved b
- normalized b
- preserved again b
-
-
- """
- doc9 = XML.parse(XML.Node, xml9)
-
- # level1b should preserve (inherits from root)
- level1b_text = XML.value(doc9[1][4][1])
- @test level1b_text == " preserved b \n "
-
- # level2 should normalize (explicit xml:space="default")
- level2b_text = XML.value(doc9[1][4][2][1])
- @test level2b_text == "normalized b"
-
- # level3 should preserve (explicit xml:space="preserve")
- level3b_text = XML.value(doc9[1][4][2][2][1])
- @test level3b_text == " preserved again b "
-
- # Test 10: futher repeated multiple levels of xml:space changes
- xml10 = """
- normalized
- normalized b
- preserved
-
-
- normalized c
- preserved b
- normalized again b
- preserved c
-
-
-
- normalized d
- """
- doc10 = XML.parse(XML.Node, xml10)
-
- # level1 should normalize (as root)
- level1_text = XML.value(doc10[end][1][1])
- @test level1_text == "normalized"
-
- # level2 should normalize (as root and level1)
- level2_text = XML.value(doc10[end][1][2][1])
- @test level2_text == "normalized b"
-
- # level3 should preserve (explicit xml:space="preserve")
- level3_text = XML.value(doc10[end][1][2][2][1])
- @test level3_text == " preserved "
-
- # level1b should normalize (as root)
- level1b_text = XML.value(doc10[end][2][1])
- @test level1b_text == "normalized c"
-
- # level2b should preserve (explicit xml:space="preserve")
- level2b_text = XML.value(doc10[end][2][2][1])
- @test level2b_text == " preserved b \n "
-
- # level3 should normalize (explicit xml:space="default")
- level3b_text = XML.value(doc10[end][2][2][2][1])
- @test level3b_text == "normalized again b"
-
- # level3c should preserve (inherited from level2b)
- level3c_text = XML.value(doc10[end][2][2][4][1])
- @test level3c_text == " preserved c \n "
-
- # level1c should normalize (as root)
- level1c_text = XML.value(doc10[end][3][1])
- @test level1c_text == "normalized d"
- end
- @testset "inter-element gap semantics" begin
- # Default parent: gap between siblings should be dropped
- s1 = """ x
- y """
- d1 = XML.parse(XML.Node, s1)
- @test length(d1[1]) == 2
- @test XML.value(d1[1][1][1]) == "x"
- @test XML.value(d1[1][2][1]) == "y"
-
- # Preserve parent, default child ends: gap after default child dropped
- s2 = """
- keep
- norm
- after default gap
- """
- d2 = XML.parse(XML.Node, s2)
- @test length(d2[1]) == 7
- @test XML.value(d2[1][1]) == "\n "
- @test XML.value(d2[1][2][1]) == " keep "
- @test XML.value(d2[1][3]) == "\n "
- @test XML.value(d2[1][4][1]) == "norm"
- @test XML.value(d2[1][5]) == "\n "
- @test XML.value(d2[1][6][1]) == " after default gap "
- @test XML.value(d2[1][7]) == "\n"
- end
- @testset "XML whitespace vs Unicode whitespace" begin
+
+#==============================================================================#
+# ESCAPE / UNESCAPE #
+#==============================================================================#
+@testset "escape / unescape" begin
+ @testset "all five predefined entities" begin
+ @test escape("&") == "&amp;"
+ @test escape("<") == "&lt;"
+ @test escape(">") == "&gt;"
+ @test escape("'") == "&apos;"
+ @test escape("\"") == "&quot;"
+ end
+
+ @testset "unescape reverses escape" begin
+ @test unescape("&amp;") == "&"
+ @test unescape("&lt;") == "<"
+ @test unescape("&gt;") == ">"
+ @test unescape("&apos;") == "'"
+ @test unescape("&quot;") == "\""
+ end
+
+ @testset "roundtrip on mixed strings" begin
+ s = "This > string < has & some \" special ' characters"
+ @test unescape(escape(s)) == s
+ end
+
+ @testset "idempotent unescape" begin
+ s = "plain text with no entities"
+ @test unescape(s) == s
+ end
+
+ @testset "multiple entities in one string" begin
+ @test escape("a < b & c > d") == "a &lt; b &amp; c &gt; d"
+ @test unescape("a &lt; b &amp; c &gt; d") == "a < b & c > d"
+ end
+
+ @testset "empty string" begin
+ @test escape("") == ""
+ @test unescape("") == ""
+ end
+end
+
+#==============================================================================#
+# XML 1.0 SPEC SECTION 2.1: Well-Formed XML Documents #
+#==============================================================================#
+@testset "Spec 2.1: Well-Formed XML Documents" begin
+ # The spec's simplest example:
+ # <?xml version="1.0"?>
+ # <greeting>Hello, world!</greeting>
+ xml = """<?xml version="1.0"?><greeting>Hello, world!</greeting>"""
+ doc = parse(xml, Node)
+ @test nodetype(doc) == Document
+ @test length(doc) == 2 # Declaration + Element
+ @test nodetype(doc[1]) == Declaration
+ @test nodetype(doc[2]) == Element
+ @test tag(doc[2]) == "greeting"
+ @test simple_value(doc[2]) == "Hello, world!"
+end
+
+#==============================================================================#
+# XML 1.0 SPEC SECTION 2.4: Character Data and Markup #
+#==============================================================================#
+@testset "Spec 2.4: Character Data and Markup" begin
+ @testset "text content between tags" begin
+ doc = parse("Hello", Node)
+ @test simple_value(doc[1]) == "Hello"
+ end
+
+ @testset "entity references in text are unescaped" begin
+ doc = parse("& < > ' "", Node)
+ @test simple_value(doc[1]) == "& < > ' \""
+ end
+
+ @testset "mixed text and child elements" begin
+ doc = parse("<p>Hello <b>world</b>!</p>", Node)
+ root = doc[1]
+ @test length(root) == 3
+ @test nodetype(root[1]) == Text
+ @test value(root[1]) == "Hello "
+ @test nodetype(root[2]) == Element
+ @test tag(root[2]) == "b"
+ @test simple_value(root[2]) == "world"
+ @test nodetype(root[3]) == Text
+ @test value(root[3]) == "!"
+ end
+
+ @testset "empty element has no text" begin
+ doc = parse("", Node)
+ @test length(children(doc[1])) == 0
+ end
+end
+
+#==============================================================================#
+# XML 1.0 SPEC SECTION 2.5: Comments #
+#==============================================================================#
+@testset "Spec 2.5: Comments" begin
+ @testset "basic comment (spec example)" begin
+ # Spec example:
+ doc = parse("", Node)
+ c = doc[1][1]
+ @test nodetype(c) == Comment
+ @test value(c) == " declarations for <head> & <body> "
+ end
+
+ @testset "empty comment" begin
+ doc = parse("", Node)
+ c = doc[1][1]
+ @test nodetype(c) == Comment
+ @test value(c) == ""
+ end
+
+ @testset "comment before root element" begin
+ doc = parse("", Node)
+ @test nodetype(doc[1]) == Comment
+ @test value(doc[1]) == " before "
+ @test nodetype(doc[2]) == Element
+ end
+
+ @testset "comment after root element" begin
+ doc = parse("", Node)
+ @test nodetype(doc[1]) == Element
+ @test nodetype(doc[2]) == Comment
+ end
+
+ @testset "comment with markup-like content preserved verbatim" begin
+ doc = parse("", Node)
+ @test value(doc[1][1]) == " not a tag "
+ end
+
+ @testset "multiple comments" begin
+ doc = parse("", Node)
+ @test length(doc[1]) == 2
+ @test value(doc[1][1]) == " A "
+ @test value(doc[1][2]) == " B "
+ end
+end
+
+#==============================================================================#
+# XML 1.0 SPEC SECTION 2.6: Processing Instructions #
+#==============================================================================#
+@testset "Spec 2.6: Processing Instructions" begin
+ @testset "xml-stylesheet PI (spec example)" begin
+ doc = parse("""""", Node)
+ pi = doc[1]
+ @test nodetype(pi) == ProcessingInstruction
+ @test tag(pi) == "xml-stylesheet"
+ @test contains(value(pi), "type=\"text/xsl\"")
+ end
+
+ @testset "PI with no content" begin
+ doc = parse("", Node)
+ pi = doc[1]
+ @test nodetype(pi) == ProcessingInstruction
+ @test tag(pi) == "target"
+ @test value(pi) === nothing
+ end
+
+ @testset "PI inside element" begin
+ doc = parse("", Node)
+ pi = doc[1][1]
+ @test nodetype(pi) == ProcessingInstruction
+ @test tag(pi) == "mypi"
+ @test value(pi) == "some data"
+ end
+
+ @testset "PI after root element" begin
+ doc = parse("", Node)
+ @test nodetype(doc[2]) == ProcessingInstruction
+ @test tag(doc[2]) == "post-process"
+ end
+end
+
+#==============================================================================#
+# XML 1.0 SPEC SECTION 2.7: CDATA Sections #
+#==============================================================================#
+@testset "Spec 2.7: CDATA Sections" begin
+ @testset "CDATA preserves markup characters" begin
+ # Spec example
+ doc = parse("Hello, world!]]>", Node)
+ cd = doc[1][1]
+ @test nodetype(cd) == CData
+ @test value(cd) == "<greeting>Hello, world!</greeting>"
+ end
+
+ @testset "empty CDATA" begin
+ doc = parse("", Node)
+ cd = doc[1][1]
+ @test nodetype(cd) == CData
+ @test value(cd) == ""
+ end
+
+ @testset "CDATA with ampersands and less-thans" begin
+ doc = parse(" d]]>", Node)
+ @test value(doc[1][1]) == "a < b && c > d"
+ end
+
+ @testset "CDATA with special characters" begin
+ doc = parse("", Node)
+ @test value(doc[1][1]) == "line1\nline2\ttab"
+ end
+
+ @testset "CDATA mixed with text" begin
+ doc = parse("beforeafter", Node)
+ @test length(doc[1]) == 3
+ @test nodetype(doc[1][1]) == Text
+ @test value(doc[1][1]) == "before"
+ @test nodetype(doc[1][2]) == CData
+ @test value(doc[1][2]) == "inside"
+ @test nodetype(doc[1][3]) == Text
+ @test value(doc[1][3]) == "after"
+ end
+end
+
+#==============================================================================#
+# XML 1.0 SPEC SECTION 2.8: Prolog and Document Type Declaration #
+#==============================================================================#
+@testset "Spec 2.8: Prolog and Document Type Declaration" begin
+ @testset "XML declaration - version only" begin
+ doc = parse("""""", Node)
+ decl = doc[1]
+ @test nodetype(decl) == Declaration
+ @test decl["version"] == "1.0"
+ end
+
+ @testset "XML declaration - version and encoding" begin
+ doc = parse("""""", Node)
+ decl = doc[1]
+ @test decl["version"] == "1.0"
+ @test decl["encoding"] == "UTF-8"
+ end
+
+ @testset "XML declaration - all three pseudo-attributes" begin
+ doc = parse("""""", Node)
+ decl = doc[1]
+ @test decl["version"] == "1.0"
+ @test decl["encoding"] == "UTF-8"
+ @test decl["standalone"] == "yes"
+ end
+
+ @testset "XML declaration with single quotes" begin
+ doc = parse("", Node)
+ @test doc[1]["version"] == "1.0"
+ end
+
+ @testset "no XML declaration" begin
+ doc = parse("", Node)
+ @test length(doc) == 1
+ @test nodetype(doc[1]) == Element
+ end
+
+ @testset "DOCTYPE - SYSTEM" begin
+ # Spec example
+ doc = parse("""""", Node)
+ dtd = doc[1]
+ @test nodetype(dtd) == DTD
+ @test contains(value(dtd), "greeting")
+ @test contains(value(dtd), "SYSTEM")
+ @test contains(value(dtd), "hello.dtd")
+ end
+
+ @testset "DOCTYPE - with internal subset" begin
+ xml = """
+]>Hello, world!"""
+ doc = parse(xml, Node)
+ dtd = doc[1]
+ @test nodetype(dtd) == DTD
+ @test contains(value(dtd), "greeting")
+ @test contains(value(dtd), "
+
+
+]>"""
+ doc = parse(xml, Node)
+ @test nodetype(doc[1]) == DTD
+ @test contains(value(doc[1]), "ENTITY")
+ end
+
+ @testset "full prolog: declaration + DOCTYPE" begin
+ xml = """"""
+ doc = parse(xml, Node)
+ @test nodetype(doc[1]) == Declaration
+ @test nodetype(doc[2]) == DTD
+ @test nodetype(doc[3]) == Element
+ end
+end
+
+#==============================================================================#
+# XML 1.0 SPEC SECTION 2.9: Standalone Document Declaration #
+#==============================================================================#
+@testset "Spec 2.9: Standalone Document Declaration" begin
+ doc = parse("""""", Node)
+ @test doc[1]["standalone"] == "yes"
+
+ doc2 = parse("""""", Node)
+ @test doc2[1]["standalone"] == "no"
+end
+
+#==============================================================================#
+# XML 1.0 SPEC SECTION 2.10: White Space Handling #
+#==============================================================================#
+@testset "Spec 2.10: White Space Handling" begin
+ @testset "parser preserves all text content verbatim" begin
+ doc = parse(" hello ", Node)
+ @test simple_value(doc[1]) == " hello "
+ end
+
+ @testset "parser preserves whitespace-only text" begin
+ doc = parse(" ", Node)
+ @test simple_value(doc[1]) == " "
+ end
+
+ @testset "parser preserves inter-element whitespace as Text nodes" begin
+ xml = "x\n y"
+ doc = parse(xml, Node)
+ @test length(doc[1]) == 3
+ @test value(doc[1][1][1]) == "x"
+ @test nodetype(doc[1][2]) == Text
+ @test value(doc[1][2]) == "\n "
+ @test value(doc[1][3][1]) == "y"
+ end
+
+ @testset "xml:space attribute is preserved during parsing" begin
+ doc = parse(""" text """, Node)
+ @test doc[1]["xml:space"] == "preserve"
+ @test value(doc[1][1][1]) == " text "
+ end
+
+ @testset "xml:space='preserve' affects write formatting" begin
+ # When xml:space="preserve", writer doesn't add indentation
+ el = Element("s", XML.Text(" pre "), Element("t"), XML.Text(" post "); var"xml:space"="preserve")
+ @test XML.write(el) == "<s xml:space=\"preserve\"> pre <t/> post </s>"
+ end
+
+ @testset "write formats with indentation by default" begin
+ el = Element("root", Element("a"), Element("b"))
+ s = XML.write(el)
+ @test contains(s, " ") # indented
+ @test contains(s, " ") # indented
+ end
+
+ @testset "Unicode non-breaking space is NOT XML whitespace" begin
nbsp = "\u00A0"
- s = """
- x\t\n
- $(nbsp) y $(nbsp)
- $(nbsp) z $(nbsp)
- """
- d = XML.parse(XML.Node, s)
- @test XML.value(d[1][1][1]) == "x"
- @test XML.value(d[1][2][1]) == "$(nbsp) y $(nbsp)"
- @test XML.value(d[1][3][1]) == "$(nbsp) z $(nbsp)"
- end
-
- @testset "CDATA/Comment/PI boundaries" begin
- s = """
- pre post
- pre post
-
- """
- d = XML.parse(XML.Node, s)
- @test XML.value(d[1][1][1]) == "pre"
- @test nodetype(d[1][1][2]) == XML.CData
- @test XML.value(d[1][1][3]) == "post"
- @test XML.value(d[1][2][1]) == " pre "
- @test nodetype(d[1][2][2]) == XML.Comment
- @test XML.value(d[1][2][3]) == " post "
- @test nodetype(d[1][3]) == XML.ProcessingInstruction
- end
-
- @testset "nested toggles and sibling sequences" begin
- s = """
- a
- b
- c
-
- d
- e
-
- """
- d = XML.parse(XML.Node, s)
- @test XML.value(d[1][2][1]) == " a \n "
- @test XML.value(d[1][2][2][1]) == "b"
- @test XML.value(d[1][2][2][2][1]) == " c "
- @test d[1][2][4].tag == "y2"
- @test XML.value(d[1][2][4][1]) == "d"
- @test d[1][2][6].tag == "w"
- @test XML.value(d[1][2][6][1]) == " e "
- end
-
- @testset "root/document boundaries" begin
- s = "\n \n a \n \t "
- d = XML.parse(XML.Node, s)
- @test length(d) == 1
- @test XML.value(d[1][1]) == "a"
- end
-
- @testset "entities expanding to whitespace" begin
- chr1="\u0020"
- chr2="\u000A"
- chr3="\u00A0"
-
- s = """
- $(chr1) a $(chr2)
- $(chr1) b $(chr2)
- $(chr3)c$(chr3)
- """
- d = XML.parse(XML.Node, s)
- @test XML.value(d[1][1][1]) == "a"
- @test XML.value(d[1][2][1]) == " b \n"
- @test XML.value(d[1][3][1]) == "$(chr3)c$(chr3)"
- end
-
- @testset "invalid values and placement" begin
- s_bad = """ t """
- @test_throws ErrorException XML.parse(XML.Node, s_bad)
-
- s_pi = """ t """
- d = XML.parse(XML.Node, s_pi)
- @test XML.value(d[end][1]) == "t"
-
- s_dup = """ t """
-# @test_throws ErrorException XML.parse(XML.Node, s_dup)
- end
-
- @testset "prev()/next() symmetry" begin
- xml = """
- a b c
- d e f
- i
- """
- r = XML.parse(XML.LazyNode, xml).raw
- toks=XML.Raw[]
- while true
- n = XML.next(r)
- n === nothing && break
- push!(toks, n)
- r=n
- end
- back = XML.Raw[]
- r = toks[end]
- while true
- p = XML.prev(r)
- p === nothing && break
- push!(back, p)
- r = p
- end
- @test reverse(back)[2:end] == toks[1:end-1]
- end
-
- @testset "write/read roundtrip extremes" begin
- xml = """
-
-
- r
- pre post
- """
- n = XML.parse(XML.Node, xml)
- io = IOBuffer(); XML.write(io, n)
- n2 = XML.parse(XML.Node, String(take!(io)))
- @test n == n2
- @test XML.write(n2[1][1]) == "
"
- @test XML.write(n2[1][2]) == ""
- @test XML.value(n2[1][3][1]) == "r"
- @test XML.write(n2[1][4]) == " pre post "
- end
-
- @testset "self-closing/empty/whitespace-only children" begin
- s = """
-
-
-
-
- x y
- """
- d = XML.parse(XML.Node, s)
- @test XML.write(d[1][1]) == ""
- @test XML.write(d[1][2]) == ""
- @test XML.value(d[1][3][1]) == " "
- @test XML.value(d[1][5][1]) == "x"
- @test XML.value(d[1][5][3]) == "y"
- end
-
- @testset "allocation guard: small xml:space doc" begin
- xml = " x y "
- f() = XML.parse(XML.Node, xml)
- a = @allocated f()
- @test a < 500_000 # tune for CI
- end
-
-end
-
-#-----------------------------------------------------------------------------# roundtrip
-@testset "read/write/read roundtrip" begin
+ xml = "$(nbsp) y $(nbsp)"
+ doc = parse(xml, Node)
+ @test simple_value(doc[1]) == "$(nbsp) y $(nbsp)"
+ end
+end
+
+#==============================================================================#
+# XML 1.0 SPEC SECTION 3.1: Start-Tags, End-Tags, Empty-Element Tags #
+#==============================================================================#
+@testset "Spec 3.1: Start-Tags, End-Tags, Empty-Element Tags" begin
+ @testset "element with attributes (spec example)" begin
+ # <termdef id="dt-dog" term="dog">
+ doc = parse("""<termdef id="dt-dog" term="dog">A dog.</termdef>""", Node)
+ el = doc[1]
+ @test tag(el) == "termdef"
+ @test el["id"] == "dt-dog"
+ @test el["term"] == "dog"
+ @test value(el[1]) == "A dog."
+ end
+
+ @testset "self-closing tag (spec example)" begin
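+ # <IMG align="left" src="http://www.w3.org/Icons/WWW/w3c_home" />
+ doc = parse("""<IMG align="left" src="http://www.w3.org/Icons/WWW/w3c_home" />""", Node)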
+ el = doc[1]
+ @test tag(el) == "IMG"
+ @test el["align"] == "left"
+ @test el["src"] == "http://www.w3.org/Icons/WWW/w3c_home"
+ @test length(children(el)) == 0
+ end
+
+ @testset "simple self-closing tag" begin
+ doc = parse("<br/>", Node)
+ @test tag(doc[1]) == "br"
+ @test length(children(doc[1])) == 0
+ end
+
+ @testset "self-closing tag with space before />" begin
+ doc = parse("<br />", Node)
+ @test tag(doc[1]) == "br"
+ end
+
+ @testset "empty element with start and end tag" begin
+ doc = parse("<empty></empty>", Node)
+ el = doc[1]
+ @test tag(el) == "empty"
+ @test isnothing(el.children)
+ end
+
+ @testset "nested elements" begin
+ doc = parse("<a><b><c/></b></a>", Node)
+ @test tag(doc[1]) == "a"
+ @test tag(doc[1][1]) == "b"
+ @test tag(doc[1][1][1]) == "c"
+ end
+
+ @testset "sibling elements" begin
+ doc = parse("", Node)
+ @test length(doc[1]) == 3
+ @test tag(doc[1][1]) == "a"
+ @test tag(doc[1][2]) == "b"
+ @test tag(doc[1][3]) == "c"
+ end
+
+ @testset "attributes with single quotes" begin
+ doc = parse("", Node)
+ @test doc[1]["a"] == "val"
+ end
+
+ @testset "attributes with double quotes" begin
+ doc = parse("""""", Node)
+ @test doc[1]["a"] == "val"
+ end
+
+ @testset "mixed quote styles in attributes" begin
+ doc = parse("""""", Node)
+ @test doc[1]["a"] == "1"
+ @test doc[1]["b"] == "2"
+ end
+
+ @testset "attribute with > in value" begin
+ doc = parse("""""", Node)
+ @test doc[1]["a"] == "1>2"
+ end
+
+ @testset "attribute with entity reference" begin
+ doc = parse("""""", Node)
+ @test doc[1]["a"] == "a&b"
+ end
+
+ @testset "multiple attributes accessible via attributes()" begin
+ doc = parse("""""", Node)
+ attrs = attributes(doc[1])
+ @test attrs isa Dict
+ @test attrs["first"] == "1"
+ @test attrs["second"] == "2"
+ @test attrs["third"] == "3"
+ end
+
+ @testset "whitespace around = in attributes" begin
+ doc = parse("""""", Node)
+ @test doc[1]["a"] == "1"
+ end
+end
+
+#==============================================================================#
+# XML 1.0 SPEC SECTION 4.1: Entity References #
+#==============================================================================#
+@testset "Spec 4.1: Character and Entity References" begin
+ @testset "predefined entity references in text" begin
+ doc = parse("<", Node)
+ @test simple_value(doc[1]) == "<"
+
+ doc = parse(">", Node)
+ @test simple_value(doc[1]) == ">"
+
+ doc = parse("&", Node)
+ @test simple_value(doc[1]) == "&"
+
+ doc = parse("'", Node)
+ @test simple_value(doc[1]) == "'"
+
+ doc = parse(""", Node)
+ @test simple_value(doc[1]) == "\""
+ end
+
+ @testset "predefined entities in attribute values" begin
+ doc = parse("""""", Node)
+ @test doc[1]["a"] == "<>&'\""
+ end
+
+ @testset "multiple entity references in one text node" begin
+ doc = parse("<tag> & "value"", Node)
+ @test simple_value(doc[1]) == "<tag> & \"value\""
+ end
+end
+
+#==============================================================================#
+# NAMESPACES (Colon in Tag and Attribute Names) #
+#==============================================================================#
+@testset "Namespaces" begin
+ @testset "namespaced element" begin
+ doc = parse("""""", Node)
+ @test tag(doc[1]) == "ns:root"
+ @test doc[1]["xmlns:ns"] == "http://example.com"
+ @test tag(doc[1][1]) == "ns:child"
+ end
+
+ @testset "default namespace" begin
+ doc = parse("""""", Node)
+ @test doc[1]["xmlns"] == "http://example.com"
+ end
+
+ @testset "multiple namespace prefixes" begin
+ xml = """"""
+ doc = parse(xml, Node)
+ @test tag(doc[1][1]) == "a:x"
+ @test tag(doc[1][2]) == "b:y"
+ end
+end
+
+#==============================================================================#
+# NODE CONSTRUCTORS #
+#==============================================================================#
+@testset "Node Constructors" begin
+ @testset "Text" begin
+ t = Text("hello")
+ @test nodetype(t) == Text
+ @test value(t) == "hello"
+ @test tag(t) === nothing
+ @test attributes(t) === nothing
+ end
+
+ @testset "Comment" begin
+ c = Comment(" a comment ")
+ @test nodetype(c) == Comment
+ @test value(c) == " a comment "
+ end
+
+ @testset "CData" begin
+ cd = CData("raw ")
+ @test nodetype(cd) == CData
+ @test value(cd) == "raw "
+ end
+
+ @testset "DTD" begin
+ d = DTD("html")
+ @test nodetype(d) == DTD
+ @test value(d) == "html"
+ end
+
+ @testset "Declaration" begin
+ decl = Declaration(; version="1.0", encoding="UTF-8")
+ @test nodetype(decl) == Declaration
+ @test decl["version"] == "1.0"
+ @test decl["encoding"] == "UTF-8"
+ end
+
+ @testset "Declaration with no attributes" begin
+ decl = Declaration()
+ @test nodetype(decl) == Declaration
+ @test attributes(decl) === nothing
+ end
+
+ @testset "ProcessingInstruction with content" begin
+ pi = ProcessingInstruction("target", "data here")
+ @test nodetype(pi) == ProcessingInstruction
+ @test tag(pi) == "target"
+ @test value(pi) == "data here"
+ end
+
+ @testset "ProcessingInstruction without content" begin
+ pi = ProcessingInstruction("target")
+ @test nodetype(pi) == ProcessingInstruction
+ @test tag(pi) == "target"
+ @test value(pi) === nothing
+ end
+
+ @testset "Element with tag only" begin
+ el = Element("div")
+ @test nodetype(el) == Element
+ @test tag(el) == "div"
+ @test length(children(el)) == 0
+ end
+
+ @testset "Element with children" begin
+ el = Element("div", Text("hello"), Element("span"))
+ @test length(el) == 2
+ @test nodetype(el[1]) == Text
+ @test nodetype(el[2]) == Element
+ end
+
+ @testset "Element with attributes" begin
+ el = Element("div"; class="main", id="content")
+ @test el["class"] == "main"
+ @test el["id"] == "content"
+ end
+
+ @testset "Element with children and attributes" begin
+ el = Element("a", "click here"; href="http://example.com")
+ @test tag(el) == "a"
+ @test el["href"] == "http://example.com"
+ @test value(el[1]) == "click here"
+ end
+
+ @testset "Element auto-converts non-Node children to Text" begin
+ el = Element("p", 42)
+ @test nodetype(el[1]) == Text
+ @test value(el[1]) == "42"
+ end
+
+ @testset "Document" begin
+ doc = Document(
+ Declaration(; version="1.0"),
+ Element("root")
+ )
+ @test nodetype(doc) == Document
+ @test length(doc) == 2
+ @test nodetype(doc[1]) == Declaration
+ @test nodetype(doc[2]) == Element
+ end
+
+ @testset "Document with all node types" begin
+ doc = Document(
+ Declaration(; version="1.0"),
+ DTD("root"),
+ Comment("comment"),
+ ProcessingInstruction("pi", "data"),
+ Element("root", CData("cdata"), Text("text"))
+ )
+ @test map(nodetype, children(doc)) == [Declaration, DTD, Comment, ProcessingInstruction, Element]
+ @test length(doc[end]) == 2
+ @test nodetype(doc[end][1]) == CData
+ @test value(doc[end][1]) == "cdata"
+ @test nodetype(doc[end][2]) == Text
+ @test value(doc[end][2]) == "text"
+ end
+
+ @testset "invalid constructions" begin
+ @test_throws Exception Text("a", "b") # too many args
+ @test_throws Exception Comment("a"; x="1") # no attrs
+ @test_throws Exception CData("a"; x="1") # no attrs
+ @test_throws Exception DTD("a"; x="1") # no attrs
+ @test_throws Exception Element() # need tag
+ @test_throws Exception Declaration("bad") # no positional args
+ @test_throws Exception Document(; x="1") # no attrs
+ @test_throws Exception ProcessingInstruction() # need target
+ @test_throws Exception ProcessingInstruction("a", "b", "c") # too many args
+ end
+end
+
+#==============================================================================#
+# h CONSTRUCTOR #
+#==============================================================================#
+@testset "h constructor" begin
+ @testset "h(tag)" begin
+ el = h("div")
+ @test nodetype(el) == Element
+ @test tag(el) == "div"
+ end
+
+ @testset "h(tag, children...)" begin
+ el = h("div", "hello")
+ @test simple_value(el) == "hello"
+ end
+
+ @testset "h(tag; attrs...)" begin
+ el = h("div"; class="main")
+ @test el["class"] == "main"
+ end
+
+ @testset "h(tag, children...; attrs...)" begin
+ el = h("div", "hello"; class="main")
+ @test el["class"] == "main"
+ @test value(el[1]) == "hello"
+ end
+
+ @testset "h.tag syntax" begin
+ el = h.div("hello"; class="main")
+ @test tag(el) == "div"
+ @test el["class"] == "main"
+ @test value(el[1]) == "hello"
+ end
+
+ @testset "h.tag with no args" begin
+ el = h.br()
+ @test tag(el) == "br"
+ @test length(children(el)) == 0
+ end
+
+ @testset "h.tag with only attrs" begin
+ el = h.img(; src="image.png")
+ @test tag(el) == "img"
+ @test el["src"] == "image.png"
+ end
+
+ @testset "nested h constructors" begin
+ el = h.div(
+ h.h1("Title"),
+ h.p("Paragraph")
+ )
+ @test tag(el) == "div"
+ @test length(el) == 2
+ @test tag(el[1]) == "h1"
+ @test tag(el[2]) == "p"
+ end
+
+ @testset "h with symbol tag" begin
+ el = h(:div)
+ @test tag(el) == "div"
+ end
+end
+
+#==============================================================================#
+# NODE INTERFACE #
+#==============================================================================#
+@testset "Node Interface" begin
+ doc = parse("""text""", Node)
+
+ @testset "nodetype" begin
+ @test nodetype(doc) == Document
+ @test nodetype(doc[1]) == Declaration
+ @test nodetype(doc[2]) == Element
+ end
+
+ @testset "tag" begin
+ @test tag(doc) === nothing
+ @test tag(doc[2]) == "root"
+ @test tag(doc[2][1]) == "child"
+ end
+
+ @testset "attributes" begin
+ @test attributes(doc) === nothing
+ @test attributes(doc[2])["attr"] == "val"
+ end
+
+ @testset "value" begin
+ @test value(doc) === nothing
+ @test value(doc[2][1][1]) == "text"
+ end
+
+ @testset "children" begin
+ @test length(children(doc)) == 2
+ @test length(children(doc[2])) == 1
+ end
+
+ @testset "is_simple" begin
+ @test is_simple(doc[2][1]) == true
+ @test is_simple(doc[2]) == false
+ end
+
+ @testset "simple_value" begin
+ @test simple_value(doc[2][1]) == "text"
+ @test_throws ErrorException simple_value(doc[2])
+ end
+
+ @testset "simple_value for CData child" begin
+ el = Element("x", CData("data"))
+ @test is_simple(el)
+ @test simple_value(el) == "data"
+ end
+end
+
+#==============================================================================#
+# NODE INDEXING #
+#==============================================================================#
+@testset "Node Indexing" begin
+ doc = parse("", Node)
+ root = doc[1]
+
+ @testset "integer indexing" begin
+ @test tag(root[1]) == "a"
+ @test tag(root[2]) == "b"
+ @test tag(root[3]) == "c"
+ end
+
+ @testset "colon indexing" begin
+ all = root[:]
+ @test length(all) == 3
+ end
+
+ @testset "lastindex" begin
+ @test tag(root[end]) == "c"
+ end
+
+ @testset "only" begin
+ single = parse("", Node)
+ @test tag(only(single[1])) == "only"
+ end
+
+ @testset "length" begin
+ @test length(root) == 3
+ end
+
+ @testset "attribute indexing" begin
+ el = parse("""""", Node)[1]
+ @test el["a"] == "1"
+ @test el["b"] == "2"
+ @test_throws KeyError el["nonexistent"]
+ end
+
+ @testset "haskey" begin
+ el = parse("""""", Node)[1]
+ @test haskey(el, "a") == true
+ @test haskey(el, "b") == false
+ end
+
+ @testset "keys" begin
+ el = parse("""""", Node)[1]
+ @test collect(keys(el)) == ["a", "b"]
+ end
+
+ @testset "keys on element with no attributes" begin
+ el = parse("", Node)[1]
+ @test isempty(keys(el))
+ end
+end
+
+#==============================================================================#
+# NODE MUTATION #
+#==============================================================================#
+@testset "Node Mutation" begin
+ @testset "setindex! child" begin
+ el = Element("root", Element("old"))
+ el[1] = Element("new")
+ @test tag(el[1]) == "new"
+ end
+
+ @testset "setindex! child with auto-conversion" begin
+ el = Element("root", Text("old"))
+ el[1] = "new text"
+ @test value(el[1]) == "new text"
+ end
+
+ @testset "setindex! attribute" begin
+ el = Element("root"; a="1")
+ el["a"] = "2"
+ @test el["a"] == "2"
+ end
+
+ @testset "setindex! new attribute" begin
+ el = Element("root"; a="1")
+ el["b"] = "2"
+ @test el["b"] == "2"
+ end
+
+ @testset "push! child" begin
+ el = Element("root")
+ push!(el, Element("child"))
+ @test length(el) == 1
+ @test tag(el[1]) == "child"
+ end
+
+ @testset "push! with auto-conversion" begin
+ el = Element("root")
+ push!(el, "text")
+ @test nodetype(el[1]) == Text
+ @test value(el[1]) == "text"
+ end
+
+ @testset "pushfirst! child" begin
+ el = Element("root", Element("second"))
+ pushfirst!(el, Element("first"))
+ @test tag(el[1]) == "first"
+ @test tag(el[2]) == "second"
+ end
+
+ @testset "push! on non-container node errors" begin
+ t = Text("hello")
+ @test_throws ErrorException push!(t, "more")
+ end
+end
+
+#==============================================================================#
+# NODE EQUALITY #
+#==============================================================================#
+@testset "Node Equality" begin
+ @testset "identical elements are equal" begin
+ a = Element("div", Text("hello"); class="main")
+ b = Element("div", Text("hello"); class="main")
+ @test a == b
+ end
+
+ @testset "different tag names are not equal" begin
+ @test Element("a") != Element("b")
+ end
+
+ @testset "different attributes are not equal" begin
+ @test Element("a"; x="1") != Element("a"; x="2")
+ end
+
+ @testset "different children are not equal" begin
+ @test Element("a", Text("x")) != Element("a", Text("y"))
+ end
+
+ @testset "different node types are not equal" begin
+ @test Text("x") != Comment("x")
+ end
+
+ @testset "empty attributes vs nothing" begin
+ a = Element("a")
+ b = Element("a")
+ @test a == b
+ end
+
+ @testset "parse equality" begin
+ xml = "text"
+ @test parse(xml, Node) == parse(xml, Node)
+ end
+end
+
+#==============================================================================#
+# XML WRITING #
+#==============================================================================#
+@testset "XML Writing" begin
+ @testset "write Text" begin
+ el = Element("p", "hello & goodbye")
+ @test XML.write(el) == "<p>hello &amp; goodbye</p>"
+ end
+
+ @testset "write Element with attributes" begin
+ el = Element("div"; class="main", id="content")
+ s = XML.write(el)
+ @test contains(s, "<div class=\"main\" id=\"content\"/>")
+ end
+
+ @testset "write self-closing element" begin
+ @test XML.write(Element("br")) == "<br/>"
+ end
+
+ @testset "write element with single text child (inline)" begin
+ @test XML.write(Element("p", "hello")) == "<p>hello</p>"
+ end
+
+ @testset "write element with multiple children (indented)" begin
+ el = Element("div", Element("a"), Element("b"))
+ s = XML.write(el)
+ @test contains(s, "<div>")
+ @test contains(s, "<a/>")
+ @test contains(s, "<b/>")
+ @test contains(s, "</div>")
+ end
+
+ @testset "write Comment" begin
+ el = Element("root", Comment(" comment "))
+ @test contains(XML.write(el), "<!-- comment -->")
+ end
+
+ @testset "write CData" begin
+ el = Element("root", CData("raw "))
+ @test contains(XML.write(el), "]]>")
+ end
+
+ @testset "write ProcessingInstruction with content" begin
+ pi = ProcessingInstruction("target", "data")
+ @test XML.write(pi) == "<?target data?>"
+ end
+
+ @testset "write ProcessingInstruction without content" begin
+ pi = ProcessingInstruction("target")
+ @test XML.write(pi) == "<?target?>"
+ end
+
+ @testset "write Declaration" begin
+ decl = Declaration(; version="1.0", encoding="UTF-8")
+ s = XML.write(decl)
+ @test contains(s, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>")
+ end
+
+ @testset "write DTD" begin
+ dtd = DTD("html")
+ @test XML.write(dtd) == "<!DOCTYPE html>"
+ end
+
+ @testset "write Document" begin
+ doc = Document(Declaration(; version="1.0"), Element("root"))
+ s = XML.write(doc)
+ @test startswith(s, "<?xml version=\"1.0\"?>")
+ end
+
+ @testset "write escapes special characters in text" begin
+ el = Element("p", "a < b & c > d")
+ @test XML.write(el) == "<p>a &lt; b &amp; c &gt; d</p>"
+ end
+
+ @testset "write escapes special characters in attribute values" begin
+ el = Element("x"; a="a\"b")
+ @test contains(XML.write(el), "a=\"a&quot;b\"")
+ end
+
+ @testset "indentsize parameter" begin
+ el = Element("root", Element("child"))
+ s2 = XML.write(el; indentsize=2)
+ s4 = XML.write(el; indentsize=4)
+ @test contains(s2, "  <child/>")
+ @test contains(s4, "    <child/>")
+ end
+
+ @testset "write xml:space='preserve' respects whitespace" begin
+ el = Element("root", Element("p", Text(" hello "); var"xml:space"="preserve"))
+ s = XML.write(el)
+ @test contains(s, "> hello