Severity: LOW
Summary
ProfileSanitizer.stripUnsafe iterates the string as UTF-16 Char units and removes control chars (Cc) plus an explicit allow/deny list of specific BMP code points. It does not strip:
- Supplementary-plane (non-BMP) format characters, most notably the Unicode TAG block U+E0000–U+E007F, whose assigned code points (U+E0001 and U+E0020–U+E007F) are Cf and a documented invisible-text / spoofing / steganography vector. These are encoded as surrogate pairs, and Character.getType(Char) on a single surrogate Char returns SURROGATE, never FORMAT/CONTROL, so both halves fall through to the else -> append(char) branch unchanged.
- BMP format chars not in the explicit list (the code only catches an enumerated set), so any future-assigned or omitted Cf code point also survives.
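The surrogate behaviour described above can be checked directly. The following is a minimal sketch; the helper names charTypes and codePointTypes are hypothetical and are not part of ProfileSanitizer.

```kotlin
// Hypothetical helpers (not from ProfileSanitizer) that compare the two views of a string.

// General category of every UTF-16 Char, which is what a per-Char loop sees.
fun charTypes(s: String): List<Int> = s.map { Character.getType(it) }

// General category of every code point, which is what a code-point loop sees.
fun codePointTypes(s: String): List<Int> =
    s.codePoints().toArray().map { Character.getType(it) }
```

For String(Character.toChars(0xE0001)) (U+E0001 LANGUAGE TAG), charTypes yields two SURROGATE entries, while codePointTypes yields a single FORMAT entry. Only the second view lets a generic Cf rule match.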
Evidence
app/src/main/java/dev/ipf/darkmatter/core/ProfileSanitizer.kt
    fun stripUnsafe(value: String): String =
        buildString(value.length) {
            value.forEach { char -> // per UTF-16 Char
                when {
                    char == '\n' || char == '\t' || char == '\r' -> append(char)
                    Character.getType(char) == Character.CONTROL.toInt() -> Unit
                    char.code == 0x200E || char.code == 0x200F -> Unit
                    ... // explicit BMP list only
                    else -> append(char) // non-BMP Cf (e.g. U+E0001) lands here
                }
            }
        }
Impact
Display names, about text, and message bodies (all routed through stripUnsafe) can carry invisible TAG-block characters and other supplementary-plane / unlisted format characters, enabling hidden-text and homoglyph-style spoofing that the sanitizer is meant to prevent.
Suggested fix
Iterate by code point and classify generically:
    fun stripUnsafe(value: String): String =
        buildString(value.length) {
            value.codePoints().forEach { cp -> // per code point, so surrogate pairs stay together
                val type = Character.getType(cp)
                when {
                    cp == '\n'.code || cp == '\t'.code || cp == '\r'.code -> appendCodePoint(cp)
                    type == Character.CONTROL.toInt() || type == Character.FORMAT.toInt() -> {
                        // keep only the deliberately-allowed shaping joiners
                        if (cp == 0x200C || cp == 0x200D) appendCodePoint(cp)
                    }
                    else -> appendCodePoint(cp)
                }
            }
        }
This covers non-BMP format characters (the TAG block) and any unlisted Cf code point in one rule, while preserving ZWNJ/ZWJ.
Validation
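Add a test asserting that stripUnsafe("a\uDB40\uDC01b") (U+E0001 LANGUAGE TAG) and a TAG-block string return only the visible characters. Use U+E0001 rather than U+E0101: the latter is VARIATION SELECTOR-18 (category Mn, not Cf), so the Cc/Cf rule above does not strip it.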
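A minimal sketch of such a test, assuming the code-point-based version from the suggested fix. The function is restated locally so the sketch is self-contained. The helper fromCodePoints and the function runStripUnsafeChecks are hypothetical names introduced for illustration. U+E0001 (LANGUAGE TAG) is used as the single-character case.

```kotlin
// Local restatement of the suggested fix so this sketch is self-contained;
// in the project the function under test would be ProfileSanitizer.stripUnsafe.
fun stripUnsafe(value: String): String =
    buildString(value.length) {
        value.codePoints().forEach { cp ->
            val type = Character.getType(cp)
            when {
                cp == '\n'.code || cp == '\t'.code || cp == '\r'.code -> appendCodePoint(cp)
                type == Character.CONTROL.toInt() || type == Character.FORMAT.toInt() -> {
                    // keep only the deliberately-allowed shaping joiners
                    if (cp == 0x200C || cp == 0x200D) appendCodePoint(cp)
                }
                else -> appendCodePoint(cp)
            }
        }
    }

// Hypothetical helper: builds a string from raw code points so the
// invisible characters are explicit in the test source.
fun fromCodePoints(vararg cps: Int): String =
    buildString { cps.forEach { appendCodePoint(it) } }

// Hypothetical check function; in a real suite each check would be its own test case.
fun runStripUnsafeChecks() {
    // Single supplementary-plane Cf character: U+E0001 LANGUAGE TAG.
    check(stripUnsafe(fromCodePoints('a'.code, 0xE0001, 'b'.code)) == "ab")
    // TAG-block payload (U+E0068, U+E0069) followed by CANCEL TAG (U+E007F).
    check(stripUnsafe("ok" + fromCodePoints(0xE0068, 0xE0069, 0xE007F)) == "ok")
    // A BMP Cf character: U+202E RIGHT-TO-LEFT OVERRIDE.
    check(stripUnsafe("a\u202Eb") == "ab")
    // Deliberately allowed: ZWNJ/ZWJ and the whitespace controls.
    check(stripUnsafe("a\u200C\u200Db") == "a\u200C\u200Db")
    check(stripUnsafe("a\n\tb\r") == "a\n\tb\r")
}
```

Building the inputs from numeric code points keeps the invisible characters reviewable in source control, where a pasted literal would be indistinguishable from an empty gap.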