Invisible Unicode bidirectional characters can make source code display differently than it executes. This is CVE-2021-42574 — the Trojan Source attack.
An attacker embeds characters that reverse text direction. In a code review, you see:
```python
access_level = "admin"
```

But the compiler sees:

```python
access_level = "user"
```

The characters are invisible. Your editor won't show them. Your reviewer won't catch them. bidi-guard will.
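To see what a reviewer is missing, here is a minimal sketch that uses only Python's standard `unicodedata` module. The `reveal` helper and the sample line are hypothetical illustrations, not part of bidi-guard. The sample hides a Right-to-Left Override (U+202E) and a Pop Directional Formatting (U+202C) inside a string literal.

```python
import unicodedata

# Hypothetical sample: U+202E (RLO) and U+202C (PDF) hidden in a string literal.
# A bidi-aware renderer may draw the enclosed letters in reverse order.
line = 'access_level = "\u202eresu\u202c"'


def reveal(text: str) -> str:
    """Replace every invisible format character (category Cf) with a visible tag."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            out.append(f"<U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}>")
        else:
            out.append(ch)
    return "".join(out)


print(reveal(line))
```

The stored string is six characters longer than the four letters a reader sees, which is the gap between what is displayed and what is compiled.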
```
pip install bidi-guard
```

```
# Scan current directory recursively
bidi-guard scan .
# Scan a specific file
bidi-guard scan src/auth.py
```

Finds all bidirectional Unicode characters and shows their exact position, context, and severity.
| Command | What it does |
|---|---|
| `scan` | Scan files for bidi characters, show findings with context |
| `ci` | CI mode: exits non-zero if any dangerous characters are found |
| `fix` | Remove all bidi characters from files (interactive by default) |
| `explain` | Show what each bidi character does with visual examples |
| `init` | Generate a `.bidi-guard.yml` config file for your project |
```yaml
name: BiDi Guard
on: [push, pull_request]
jobs:
  bidi-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install bidi-guard
      - run: bidi-guard ci .
```

Add to your `.pre-commit-config.yaml`:
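```yaml
repos:
  - repo: https://github.com/Moshe-ship/bidi-guard
    rev: v0.1.0
    hooks:
      - id: bidi-guard
```

bidi-guard flags 16 invisible Unicode characters: the 12 bidirectional control characters, plus the two zero-width joiners and the line and paragraph separators: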
| Code Point | Name | Abbreviation | Severity |
|---|---|---|---|
| U+200E | Left-to-Right Mark | LRM | Low |
| U+200F | Right-to-Left Mark | RLM | Low |
| U+061C | Arabic Letter Mark | ALM | Low |
| U+200D | Zero Width Joiner | ZWJ | Low |
| U+200C | Zero Width Non-Joiner | ZWNJ | Low |
| U+2066 | Left-to-Right Isolate | LRI | Medium |
| U+2067 | Right-to-Left Isolate | RLI | High |
| U+2068 | First Strong Isolate | FSI | High |
| U+2069 | Pop Directional Isolate | PDI | Medium |
| U+202A | Left-to-Right Embedding | LRE | Medium |
| U+202B | Right-to-Left Embedding | RLE | High |
| U+202C | Pop Directional Formatting | PDF | Medium |
| U+202D | Left-to-Right Override | LRO | High |
| U+202E | Right-to-Left Override | RLO | Critical |
| U+2028 | Line Separator | LS | Medium |
| U+2029 | Paragraph Separator | PS | Medium |
Characters marked High or Critical can actively reorder displayed text — these are the ones used in Trojan Source attacks.
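The table above can be turned into a small scanner. What follows is a minimal sketch, not bidi-guard's actual implementation: the `SEVERITY` map is copied from the table, while `Hit` and `scan_text` are hypothetical names chosen for illustration.

```python
from typing import NamedTuple

# Severity per code point, copied from the table above.
SEVERITY = {
    "\u200e": "Low", "\u200f": "Low", "\u061c": "Low",
    "\u200d": "Low", "\u200c": "Low",
    "\u2066": "Medium", "\u2067": "High", "\u2068": "High", "\u2069": "Medium",
    "\u202a": "Medium", "\u202b": "High", "\u202c": "Medium",
    "\u202d": "High", "\u202e": "Critical",
    "\u2028": "Medium", "\u2029": "Medium",
}


class Hit(NamedTuple):  # hypothetical record, not bidi-guard's Finding class
    line: int
    col: int
    code_point: str
    severity: str


def scan_text(text: str) -> list[Hit]:
    """Return one Hit per listed character, with 1-based line and column."""
    hits = []
    # split("\n") rather than splitlines(): splitlines() would also break on
    # U+2028 and U+2029 and silently drop the characters we want to report.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SEVERITY:
                hits.append(Hit(line_no, col, f"U+{ord(ch):04X}", SEVERITY[ch]))
    return hits


sample = 'ok = 1\nrole = "\u202eresu\u202c"\n'
for hit in scan_text(sample):
    if hit.severity in ("High", "Critical"):
        print(hit)
```

Filtering on `High` and `Critical` mirrors the distinction drawn above. A CI-style check could exit non-zero whenever that filtered list is non-empty.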
```python
from bidi_guard import scan_file, Finding

findings = scan_file("src/auth.py")
for f in findings:
    print(f"{f.char_name} at line {f.line}:{f.col} — severity: {f.severity}")
```

These characters are invisible. Literally invisible:
```
grep "RLO" file.py   # finds nothing
cat file.py          # looks completely normal
```

bidi-guard shows you what grep cannot: the exact character, its position, surrounding context, severity level, and what it does. It can also auto-fix by stripping them out.
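Searching for the name "RLO" fails because the file contains the code point itself, not its abbreviation. As a rough sketch of matching by code point and stripping the matches, assuming only Python's standard `re` module (`BIDI_RE` and `strip_bidi` are hypothetical names, not bidi-guard's `fix` implementation):

```python
import re

# Character class covering the 16 code points listed in the table above:
# U+061C, U+200C..U+200F, U+2028..U+202E, U+2066..U+2069.
BIDI_RE = re.compile("[\u061c\u200c-\u200f\u2028-\u202e\u2066-\u2069]")


def strip_bidi(text: str) -> str:
    """Remove every listed character and leave all other text untouched."""
    return BIDI_RE.sub("", text)


tainted = 'access_level = "\u202eresu\u202c"'

print("RLO" in tainted)               # False: the name never appears in the file
print(bool(BIDI_RE.search(tainted)))  # True: the code point does
print(strip_bidi(tainted))            # access_level = "resu"
```

Blind stripping also removes ZWJ and ZWNJ, which legitimate Persian text and emoji sequences rely on, so a real fixer would probably want per-file or per-character exceptions.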
MIT — Musa the Carpenter
Brought to you by the Saudi AI Community — for Arabs first, and the world at large.