bidi-guard

Scan code for invisible bidirectional Unicode characters

MIT License · Python 3.9+


Why This Exists

Invisible Unicode bidirectional control characters can make source code display differently from how it is compiled or interpreted. This is the Trojan Source attack, tracked as CVE-2021-42574.

An attacker embeds characters that reverse text direction. In a code review, you see:

access_level = "admin"

But the compiler sees:

access_level = "user"

The characters are invisible. Many editors won't show them, so your reviewer won't catch them. bidi-guard will.
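To make the "invisible" part concrete, here is a minimal standard-library sketch. It is not bidi-guard's code and not the exact payload from the example above; the helper name `reveal` is hypothetical. It shows that two strings can differ only by format characters, and that replacing those characters with visible tags exposes them.

```python
import unicodedata

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

clean = 'access_level = "user"'
tampered = f'access_level = "{RLO}user{PDF}"'


def reveal(text: str) -> str:
    """Replace invisible format characters (Unicode category Cf) with <U+XXXX> tags.

    Hypothetical helper for illustration only.
    """
    return "".join(
        f"<U+{ord(ch):04X}>" if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )


print(clean == tampered)             # the two strings are not equal
print(len(tampered) - len(clean))    # two extra code points
print(reveal(tampered))              # the hidden characters become visible
```

How `tampered` is drawn on screen depends on the terminal or editor, which is exactly why inspecting code points is more reliable than reading the rendered text.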

Install

pip install bidi-guard

Quick Start

# Scan current directory recursively
bidi-guard scan .

# Scan a specific file
bidi-guard scan src/auth.py

The scan reports every bidirectional control character it finds, with its exact position, surrounding context, and severity.

Commands

| Command | What it does |
| --- | --- |
| scan | Scan files for bidi characters, show findings with context |
| ci | CI mode: exits non-zero if any dangerous characters found |
| fix | Remove all bidi characters from files (interactive by default) |
| explain | Show what each bidi character does with visual examples |
| init | Generate a .bidi-guard.yml config file for your project |

CI Integration

GitHub Action

name: BiDi Guard
on: [push, pull_request]
jobs:
  bidi-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install bidi-guard
      - run: bidi-guard ci .
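For readers who want to see what a CI gate of this kind boils down to, the following is a dependency-free sketch, not bidi-guard's implementation. The function name `ci_exit_code` is hypothetical, and treating only the High and Critical characters as "dangerous" is an assumption; the README does not say which severities make `bidi-guard ci` fail.

```python
import sys
from pathlib import Path

# Assumption: "dangerous" means the characters rated High or Critical.
DANGEROUS = frozenset("\u2067\u2068\u202b\u202d\u202e")


def ci_exit_code(root: str) -> int:
    """Return 1 if any file under root contains a dangerous character, else 0."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the walk going over binary or non-UTF-8 files.
        text = path.read_text(encoding="utf-8", errors="replace")
        if DANGEROUS.intersection(text):
            print(f"{path}: dangerous bidi character found", file=sys.stderr)
            return 1
    return 0
```

A wrapper script would pass the result to `sys.exit()`, so the CI step fails whenever the function returns 1.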

Pre-commit

Add to your .pre-commit-config.yaml:

repos:
  - repo: https://github.com/Moshe-ship/bidi-guard
    rev: v0.1.0
    hooks:
      - id: bidi-guard

Detected Characters

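bidi-guard flags 16 invisible characters: the 12 Unicode bidirectional formatting characters, plus two join controls (ZWJ, ZWNJ) and two separators (LS, PS) that can also hide in source code: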

| Code Point | Name | Abbreviation | Severity |
| --- | --- | --- | --- |
| U+200E | Left-to-Right Mark | LRM | Low |
| U+200F | Right-to-Left Mark | RLM | Low |
| U+061C | Arabic Letter Mark | ALM | Low |
| U+200D | Zero Width Joiner | ZWJ | Low |
| U+200C | Zero Width Non-Joiner | ZWNJ | Low |
| U+2066 | Left-to-Right Isolate | LRI | Medium |
| U+2067 | Right-to-Left Isolate | RLI | High |
| U+2068 | First Strong Isolate | FSI | High |
| U+2069 | Pop Directional Isolate | PDI | Medium |
| U+202A | Left-to-Right Embedding | LRE | Medium |
| U+202B | Right-to-Left Embedding | RLE | High |
| U+202C | Pop Directional Formatting | PDF | Medium |
| U+202D | Left-to-Right Override | LRO | High |
| U+202E | Right-to-Left Override | RLO | Critical |
| U+2028 | Line Separator | LS | Medium |
| U+2029 | Paragraph Separator | PS | Medium |

Characters marked High or Critical can actively reorder displayed text — these are the ones used in Trojan Source attacks.
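The table above maps directly onto a lookup. The sketch below is an illustration using only the standard library, not bidi-guard's internals: `SEVERITY` and `find_bidi` are hypothetical names, and 1-based line and column numbers are an assumption about how positions might be reported.

```python
# Code point -> (abbreviation, severity), copied from the table above.
SEVERITY = {
    "\u200e": ("LRM", "Low"),
    "\u200f": ("RLM", "Low"),
    "\u061c": ("ALM", "Low"),
    "\u200d": ("ZWJ", "Low"),
    "\u200c": ("ZWNJ", "Low"),
    "\u2066": ("LRI", "Medium"),
    "\u2067": ("RLI", "High"),
    "\u2068": ("FSI", "High"),
    "\u2069": ("PDI", "Medium"),
    "\u202a": ("LRE", "Medium"),
    "\u202b": ("RLE", "High"),
    "\u202c": ("PDF", "Medium"),
    "\u202d": ("LRO", "High"),
    "\u202e": ("RLO", "Critical"),
    "\u2028": ("LS", "Medium"),
    "\u2029": ("PS", "Medium"),
}


def find_bidi(text):
    """Return (line, col, abbreviation, severity) tuples, 1-based (an assumption)."""
    findings = []
    # Split on "\n" only: str.splitlines() also splits on U+2028 and U+2029,
    # which would silently drop two of the characters being searched for.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SEVERITY:
                abbr, severity = SEVERITY[ch]
                findings.append((line_no, col, abbr, severity))
    return findings
```

Calling `find_bidi` on a file's decoded contents gives one tuple per suspicious character, in reading order.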

As a Library

from bidi_guard import scan_file, Finding

findings = scan_file("src/auth.py")
for f in findings:
    print(f"{f.char_name} at line {f.line}:{f.col} — severity: {f.severity}")
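A common next step is to keep only the findings above some severity. The sketch below is hedged: it assumes `severity` is, or prints as, one of the labels from the Detected Characters table (`Low`, `Medium`, `High`, `Critical`), which this README does not confirm. `at_least` is a hypothetical helper, and `DemoFinding` is a stand-in so the example runs without bidi-guard installed.

```python
from dataclasses import dataclass

# Assumed ordering, taken from the labels in the Detected Characters table.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def at_least(findings, threshold):
    """Keep findings whose severity is at or above threshold (hypothetical helper)."""
    floor = SEVERITY_ORDER.index(threshold.lower())
    return [
        f for f in findings
        if SEVERITY_ORDER.index(str(f.severity).lower()) >= floor
    ]


@dataclass
class DemoFinding:
    """Stand-in with the attributes used in the example above; not the real Finding."""
    char_name: str
    line: int
    col: int
    severity: str


demo = [
    DemoFinding("LRM", 3, 7, "Low"),
    DemoFinding("RLO", 10, 21, "Critical"),
]
for f in at_least(demo, "High"):
    print(f"{f.char_name} at line {f.line}:{f.col}")
```

If `severity` turns out to be an enum or an integer in the real library, only the comparison inside `at_least` needs to change.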

Why Not Just grep?

These characters are invisible. Literally invisible:

grep "RLO" file.py    # finds nothing
cat file.py            # looks completely normal

bidi-guard shows you what a plain-text grep does not: the exact character, its position, surrounding context, severity level, and what it does. It can also auto-fix by stripping them out.
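Searching by code point rather than by name is what makes detection work. As a rough sketch of the idea (not bidi-guard's `fix` implementation; `BIDI_PATTERN` and `strip_bidi` are hypothetical names), a single character class can cover all 16 code points from the Detected Characters table:

```python
import re

# U+061C, U+200C-U+200F, U+2028-U+202E, U+2066-U+2069: the 16 listed code points.
BIDI_PATTERN = re.compile("[\u061c\u200c-\u200f\u2028-\u202e\u2066-\u2069]")


def strip_bidi(text: str) -> str:
    """Remove every listed character from text (hypothetical helper)."""
    return BIDI_PATTERN.sub("", text)


sample = 'x = "\u202eadmin\u202c"'
print(BIDI_PATTERN.search(sample) is not None)  # the pattern matches the hidden RLO
print(strip_bidi(sample))                       # the cleaned line
```

Blind stripping is not always safe: ZWJ and ZWNJ are legitimate in some scripts and in emoji sequences, so string literals and comments containing such text may change meaning when those characters are removed.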

License

MIT — Musa the Carpenter


Brought to you by the Saudi AI Community — for Arabs first, and the world at large.
