[LOW] ProfileSanitizer.stripUnsafe misses supplementary-plane and generic format (Cf) characters — invisible Unicode TAG block survives #184

@erskingardner

Description

Severity: LOW

Summary

ProfileSanitizer.stripUnsafe iterates the string as UTF-16 Char units and removes control chars (Cc) plus an explicit allow/deny list of specific BMP code points. It does not strip:

  • Supplementary-plane (non-BMP) format characters, most notably the Unicode TAG block U+E0000–U+E007F (its assigned code points, U+E0001 and U+E0020–U+E007F, are Cf), a documented invisible-text / spoofing / steganography vector. These are encoded as surrogate pairs, and Character.getType(Char) on a single surrogate returns SURROGATE, never FORMAT or CONTROL, so both halves fall through to the else -> append(char) branch unchanged.
  • BMP format characters outside the explicit list. The code only catches an enumerated set, so any omitted or future-assigned Cf code point also survives.
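
The surrogate behaviour described in the first bullet can be checked directly. This is a minimal sketch, assuming a JVM or Android runtime; the helper names `typesPerChar` and `typesPerCodePoint` are illustrative and do not exist in the repository.

```kotlin
// Classify each UTF-16 Char on its own, as the current stripUnsafe loop does.
fun typesPerChar(s: String): List<Int> = s.map { Character.getType(it) }

// Classify each full code point, so a surrogate pair is treated as one unit.
fun typesPerCodePoint(s: String): List<Int> {
    val out = mutableListOf<Int>()
    var i = 0
    while (i < s.length) {
        val cp = s.codePointAt(i)
        out += Character.getType(cp)
        i += Character.charCount(cp) // 2 for supplementary-plane code points
    }
    return out
}

fun main() {
    // U+E0001 LANGUAGE TAG, stored as the surrogate pair \uDB40\uDC01.
    val tag = String(Character.toChars(0xE0001))

    // Per Char: both halves are reported as SURROGATE, so a Cc/Cf check never matches.
    println(typesPerChar(tag).all { it == Character.SURROGATE.toInt() })

    // Per code point: the same character is reported as FORMAT (Cf).
    println(typesPerCodePoint(tag) == listOf(Character.FORMAT.toInt()))
}
```

The manual `codePointAt` / `charCount` loop is used here only to keep the sketch free of stream APIs; it visits the same code points as `codePoints()`.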

Evidence

app/src/main/java/dev/ipf/darkmatter/core/ProfileSanitizer.kt

fun stripUnsafe(value: String): String =
    buildString(value.length) {
        value.forEach { char ->                                  // per UTF-16 Char
            when {
                char == '\n' || char == '\t' || char == '\r' -> append(char)
                Character.getType(char) == Character.CONTROL.toInt() -> Unit
                char.code == 0x200E || char.code == 0x200F -> Unit
                ...                                              // explicit BMP list only
                else -> append(char)                            // non-BMP Cf (e.g. U+E0001) lands here
            }
        }
    }

Impact

Display names, about text, and message bodies (all routed through stripUnsafe) can carry invisible TAG-block characters and other supplementary-plane / unlisted format characters, enabling hidden-text and homoglyph-style spoofing that the sanitizer is meant to prevent.

Suggested fix

Iterate by code point and classify generically:

fun stripUnsafe(value: String): String =
    buildString(value.length) {
        value.codePoints().forEach { cp ->                       // per code point, not per UTF-16 Char
            val type = Character.getType(cp)
            when {
                cp == '\n'.code || cp == '\t'.code || cp == '\r'.code -> appendCodePoint(cp)
                type == Character.CONTROL.toInt() || type == Character.FORMAT.toInt() -> {
                    // keep only the deliberately-allowed shaping joiners (ZWNJ, ZWJ)
                    if (cp == 0x200C || cp == 0x200D) appendCodePoint(cp)
                }
                else -> appendCodePoint(cp)
            }
        }
    }

This covers non-BMP format characters (the TAG block) and any unlisted Cf code point in one rule, while preserving ZWNJ/ZWJ.

Validation

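Add a test asserting that stripUnsafe("a\uDB40\uDC01b") (U+E0001, LANGUAGE TAG) and a TAG-block string (U+E0020–U+E007F) return only the visible characters. Variation selectors such as U+E0101 are category Mn, not Cf, so the rule above leaves them in place; if they should also be stripped, they need their own check and their own test.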
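
One way such a test could look is sketched below. It assumes the code-point-based rule from the suggested fix; `stripUnsafeSketch` and `fromCodePoint` are hypothetical stand-ins rather than the project's functions, and plain `check` calls stand in for whatever test framework the project uses.

```kotlin
// Stand-in for the proposed fix (hypothetical name, not the repository's code):
// drop Cc and Cf code points except \n, \t, \r, ZWNJ (U+200C) and ZWJ (U+200D).
fun stripUnsafeSketch(value: String): String = buildString(value.length) {
    var i = 0
    while (i < value.length) {
        val cp = value.codePointAt(i)
        i += Character.charCount(cp)
        val type = Character.getType(cp)
        val controlOrFormat =
            type == Character.CONTROL.toInt() || type == Character.FORMAT.toInt()
        val allowed = cp == '\n'.code || cp == '\t'.code || cp == '\r'.code ||
            cp == 0x200C || cp == 0x200D
        if (!controlOrFormat || allowed) appendCodePoint(cp)
    }
}

// Helper to build a one-code-point string without typing invisible characters.
fun fromCodePoint(codePoint: Int): String = String(Character.toChars(codePoint))

fun main() {
    // U+E0001 LANGUAGE TAG between two visible letters is removed.
    check(stripUnsafeSketch("a" + fromCodePoint(0xE0001) + "b") == "ab")

    // Every TAG character U+E0020..U+E007F appended to "hi" is removed.
    val tagged = "hi" + (0xE0020..0xE007F).joinToString("") { fromCodePoint(it) }
    check(stripUnsafeSketch(tagged) == "hi")

    // The allowed joiner and whitespace controls are preserved.
    check(stripUnsafeSketch("a\u200Db\n") == "a\u200Db\n")

    // U+E0101 is a variation selector (Mn), so the Cc/Cf rule keeps it.
    val vs = "a" + fromCodePoint(0xE0101) + "b"
    check(stripUnsafeSketch(vs) == vs)
}
```

Building the inputs from numeric code points keeps the invisible characters out of the test source, which makes the test easier to review.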
