Incorrect unescape result. #53

@TimG1964

julia> XML.unescape("&amp;quot; &amp;")
"\" &"

I think the correct result should be &quot; &

The unescape function replaces one entity at a time, and each replacement runs on the output of the previous one.

So, in order, &amp;quot; => &quot; and then, in a second bite, &quot; => "

I think this is incorrect.

Claude suggests the following:

const escape_chars = ['&' => "&amp;", '<' => "&lt;", '>' => "&gt;", '"' => "&quot;", '\'' => "&apos;"]

function unescape(x::AbstractString)
    result = x
    for (char, entity) in reverse(escape_chars)
        result = replace(result, entity => char)
    end
    return result
end
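
An alternative that does not depend on the order of the table is to unescape in a single pass, so that replacement text is never scanned again. The following is only a sketch of that idea: `ENTITY_TO_CHAR` and `unescape_once` are hypothetical names for illustration, not part of XML.jl, and only the five predefined entities are handled.

```julia
# Hypothetical lookup table (not XML.jl's): the five predefined XML entities.
const ENTITY_TO_CHAR = Dict(
    "&amp;"  => "&",
    "&lt;"   => "<",
    "&gt;"   => ">",
    "&quot;" => "\"",
    "&apos;" => "'",
)

# Single pass over the input: each matched entity is replaced exactly once,
# and the text produced by a replacement is not examined for further matches.
function unescape_once(x::AbstractString)
    return replace(x, r"&(?:amp|lt|gt|quot|apos);" => m -> ENTITY_TO_CHAR[m])
end

println(unescape_once("&amp;quot; &amp;"))  # prints: &quot; &
```

With this sketch the example from the top of the issue yields &quot; & because the & produced from &amp; is never combined with the following quot; into a new entity.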

Further, about the escape function, Claude says:
"This approach is clever but has a subtle bug — the regex r"&(?!amp;|quot;|apos;|gt;|lt;)" is intended to only escape & characters that aren't already part of an XML entity, but this means the function assumes the input may contain already-escaped XML entities and tries to preserve them. That's an unusual contract for an escape function, which normally treats its input as plain text and escapes everything unconditionally. If the lookahead behaviour is intentional, it's worth documenting clearly that the function is idempotent by design."

It therefore suggests this for escape:

function escape(x::AbstractString)
    result = replace(x, '&' => "&amp;")
    for (char, entity) in escape_chars[2:end]
        result = replace(result, char => entity)
    end
    return result
end

This also restores AbstractString from your original for generality.
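
To make the difference between the two escape contracts concrete, here is a small round-trip check. It is a sketch under assumptions: `PAIRS`, `escape_all`, `escape_lookahead` and `unescape_amp_last` are hypothetical names written for this comparison, and `escape_lookahead` only imitates the lookahead regex quoted above rather than reproducing the package's actual function.

```julia
# Hypothetical table for illustration, mirroring escape_chars above.
const PAIRS = ['&' => "&amp;", '<' => "&lt;", '>' => "&gt;", '"' => "&quot;", '\'' => "&apos;"]

# Unconditional escape: '&' goes first so the '&' of later entities is not re-escaped.
function escape_all(x::AbstractString)
    result = String(x)
    for (char, entity) in PAIRS
        result = replace(result, char => entity)
    end
    return result
end

# Imitation of the lookahead behaviour: an '&' that already starts an entity is left alone.
function escape_lookahead(x::AbstractString)
    result = replace(x, r"&(?!amp;|quot;|apos;|gt;|lt;)" => "&amp;")
    for (char, entity) in PAIRS[2:end]
        result = replace(result, char => entity)
    end
    return result
end

# Unescape with &amp; handled last, as in the suggestion above.
function unescape_amp_last(x::AbstractString)
    result = String(x)
    for (char, entity) in reverse(PAIRS)
        result = replace(result, entity => char)
    end
    return result
end

s = "a &amp; b"  # plain text that happens to contain the characters "&amp;"
println(unescape_amp_last(escape_all(s)) == s)        # prints: true
println(unescape_amp_last(escape_lookahead(s)) == s)  # prints: false
```

The unconditional version round-trips any input, while the lookahead version loses information whenever the plain text already contains something that looks like an entity.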

As I said before, I don't have a view about the behaviour of escape but I do think the unescape behaviour is wrong and should be fixed.
