marekkowalczyk/sanitize

sanitize

Zero-config filename sanitizer. Turn any string into a clean, portable filename -- lowercase, ASCII-only, no spaces, no surprises. Single binary, no dependencies, no configuration.

"Łódź — Recipe (Final).pdf"  →  lodz-recipe-final.pdf

Inspired by Brian P. Hogan's Small, Sharp Software Tools: does one thing well, works with text streams, composes with pipes, stays quiet.

Why another tool?

Most filename cleaners either require configuration (detox needs .detoxrc sequence files) or only work as libraries (python-slugify, sanitize-filename). General-purpose renamers like rename require writing Perl expressions for every invocation. sanitize is opinionated by design: zero flags needed for the common case, predictable output, safe defaults.

What it does

The intent is to be a failsafe: every Latin or Latin-adjacent character should produce a reasonable ASCII output -- never silently vanish and never cause an error. If a character has a conventional Latin transliteration, sanitize should handle it.

sanitize lowercases the input, strips diacritics, replaces non-alphanumeric characters with hyphens, collapses repeated hyphens, and trims the ends. Output is restricted to [a-z0-9-] for strings and [a-z0-9.-] for filenames. 190 special-case transliterations handle characters that don't NFD-decompose (ł→l, ß→ss, ø→o, æ→ae, œ→oe, №→no, €→eur, and many more). The complete table is published at references/transliterations.csv.

Every output is validated against a strict postcondition before being returned. If the pipeline produces any disallowed character, the tool returns an error rather than silently passing through an unsafe result.
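
A minimal sketch of what such a postcondition check could look like for plain strings. The function name validate mirrors the pipeline stage, but the body and the error wording are assumptions for illustration, not the tool's actual source:

```go
package main

import (
	"fmt"
	"strings"
)

// validate is a hypothetical postcondition check: it accepts only [a-z0-9-]
// and rejects leading, trailing, or consecutive hyphens.
func validate(s string) error {
	for _, r := range s {
		ok := (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-'
		if !ok {
			// Name the offending character and its code point.
			return fmt.Errorf("disallowed character %q (U+%04X)", r, r)
		}
	}
	if strings.HasPrefix(s, "-") || strings.HasSuffix(s, "-") {
		return fmt.Errorf("leading or trailing hyphen in %q", s)
	}
	if strings.Contains(s, "--") {
		return fmt.Errorf("consecutive hyphens in %q", s)
	}
	return nil
}

func main() {
	for _, s := range []string{"lodz-recipe-final", "Łódź", "a--b"} {
		fmt.Printf("%q: %v\n", s, validate(s))
	}
}
```

Only the first input passes; the other two produce a diagnostic instead of being returned.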

Installation

Pre-built binaries

Download from GitHub Releases -- binaries are available for Linux, macOS, and Windows (amd64 and arm64).

From source

go install github.com/marekkowalczyk/sanitize@latest

Or from a local checkout:

go install

Both install the sanitize binary to $GOPATH/bin (typically ~/go/bin).

To also use the san shortcut for file renaming:

ln -s ~/go/bin/sanitize /usr/local/bin/san

Usage

Sanitize text

sanitize "Hello, World!"              # hello-world
sanitize "Zażółć gęślą jaźń"          # zazolc-gesla-jazn
sanitize "Straße nach München"         # strasse-nach-munchen
sanitize foo bar baz                   # foo-bar-baz (multiple args joined)

Read from stdin

echo "Café Résumé" | sanitize          # cafe-resume
cat filenames.txt | sanitize           # one output per line

When no arguments are given and input is piped, sanitize reads stdin one line at a time and writes one sanitized line per input line. Blank lines and lines that sanitize to an empty string are skipped.

Null-delimited mode (-0)

For filenames that may contain newlines, use -0 for null-delimited I/O (like find -print0 / xargs -0):

find . -print0 | sanitize -0           # null-delimited input and output
find . -print0 | sanitize -0 | xargs -0 echo

Rename files (-f or san)

sanitize -f "My Document.PDF"         # renames to my-document.pdf
sanitize -f *.txt                      # rename multiple files (shell expands the glob)
san "My Document.PDF"                  # same as sanitize -f
san *.txt                              # same as sanitize -f *.txt

File rename mode splits the filename from its extension, sanitizes each part separately, and renames the file. It will not overwrite existing files. Renames are printed to stderr.
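
A rough sketch of the split-sanitize-rejoin step. The helper names are hypothetical, toyClean is a stand-in for the real pipeline, and the dotfile rule is taken from the behavioral notes later in this README:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// toyClean is a placeholder for the real sanitizer (assumption).
func toyClean(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), "-")
}

// sanitizeFilename splits name into stem and extension, cleans each part
// separately, and rejoins them with a dot.
func sanitizeFilename(name string, clean func(string) string) string {
	ext := filepath.Ext(name) // includes the leading dot, e.g. ".PDF"
	stem := strings.TrimSuffix(name, ext)
	if stem == "" {
		return name // dotfile such as ".gitignore": leave as-is
	}
	cleanExt := clean(strings.TrimPrefix(ext, "."))
	if cleanExt == "" {
		return clean(stem)
	}
	return clean(stem) + "." + cleanExt
}

func main() {
	for _, n := range []string{"My Document.PDF", ".gitignore", "README"} {
		fmt.Printf("%q -> %q\n", n, sanitizeFilename(n, toyClean))
	}
}
```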

When the binary is invoked as san (via symlink), file rename mode is enabled automatically without needing -f.
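
Detecting the invocation name is typically a one-line check on os.Args[0]. A minimal sketch, assuming the tool compares the base name against "san" (fileModeByName is a hypothetical helper):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// fileModeByName reports whether the program was started under the name "san".
func fileModeByName(argv0 string) bool {
	return filepath.Base(argv0) == "san"
}

func main() {
	fmt.Println(fileModeByName(os.Args[0]))
}
```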

Glob patterns (*.txt, IMG_*.jpg, etc.) are expanded by the shell before sanitize sees them -- this is standard Unix behavior and requires no special handling by the tool.

Recursive rename (-r)

sanitize -r ~/Downloads/          # recursively rename all files and dirs
sanitize -rn ~/Downloads/         # dry run: show what would be renamed
san -r ~/Downloads/               # same thing via san symlink

Recursive mode walks a directory tree depth-first, sanitizing all filenames and directory names. Deepest entries are renamed first so that parent renames don't invalidate child paths. The -r flag implies file mode (-f). Combines with -n for dry run. Handles SIGINT gracefully -- if you press Ctrl+C during a recursive rename, it stops cleanly between files rather than mid-rename.

Dry run (-n)

sanitize -n *.txt                      # show what would be renamed (-n implies -f)
sanitize -f -n *.txt                   # explicit -f also works
san -n *.txt                           # same thing via san
sanitize -fn *.txt                     # combined short flags also work

The -n flag implies file mode (-f), since dry-run only makes sense for renames.

Other flags

sanitize --version                     # print version
sanitize --help                        # print usage

Short flags can be combined: -fn is the same as -f -n. Long forms are also available: --file, --recursive, --dry-run, --null.

Transformation pipeline

input -> removeIllFormed -> removeAccents -> toLower -> replaceNonAlphaNum -> dedupHyp -> trimEnds -> validate -> output
  1. removeIllFormed -- replace ill-formed UTF-8 sequences
  2. removeAccents -- NFD decomposition + strip combining marks (unicode.Mn), plus special-case replacements for standalone characters that don't decompose (ł -> l, ß -> ss, № -> No, € -> EUR, etc.)
  3. toLower -- lowercase the entire string
  4. replaceNonAlphaNum -- replace anything outside unicode.Latin and digits with -
  5. dedupHyp -- collapse runs of consecutive hyphens into a single -
  6. trimEnds -- strip leading/trailing non-Latin, non-digit characters
  7. validate -- postcondition check: verify output contains only [a-z0-9-], no leading/trailing or consecutive hyphens. Returns an error if any disallowed character is present (names the offending character and its Unicode codepoint)
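
The stages that need only the standard library can be sketched as below. This is an illustration, not the tool's source: stage 2 (removeAccents) is omitted because NFD decomposition lives outside the standard library, replacing ill-formed bytes with a hyphen is an assumption, and digits are limited to ASCII 0-9 for simplicity:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isLatinOrDigit reports whether r is a Latin-script letter or an ASCII digit.
func isLatinOrDigit(r rune) bool {
	return unicode.Is(unicode.Latin, r) || (r >= '0' && r <= '9')
}

// replaceNonAlphaNum maps every other rune to a hyphen.
func replaceNonAlphaNum(s string) string {
	return strings.Map(func(r rune) rune {
		if isLatinOrDigit(r) {
			return r
		}
		return '-'
	}, s)
}

// dedupHyp collapses runs of hyphens into a single hyphen.
func dedupHyp(s string) string {
	for strings.Contains(s, "--") {
		s = strings.ReplaceAll(s, "--", "-")
	}
	return s
}

// trimEnds strips leading and trailing runes that are not Latin letters or digits.
func trimEnds(s string) string {
	return strings.TrimFunc(s, func(r rune) bool { return !isLatinOrDigit(r) })
}

// sketch chains stages 1 and 3 to 6 in pipeline order.
func sketch(s string) string {
	s = strings.ToValidUTF8(s, "-") // stage 1; the replacement is an assumption
	s = strings.ToLower(s)          // stage 3
	s = replaceNonAlphaNum(s)       // stage 4
	s = dedupHyp(s)                 // stage 5
	return trimEnds(s)              // stage 6
}

func main() {
	fmt.Println(sketch("Hello, World!"))  // hello-world
	fmt.Println(sketch("Hello你好World")) // hello-world
}
```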

Handling of non-ASCII characters

All non-ASCII characters are transformed to their ASCII equivalents where possible:

Kąt na łące żre źrebię   ->   kat-na-lace-zre-zrebie

This is achieved by Unicode NFD decomposition followed by removal of nonspacing marks (Unicode category Mn). For example, ą decomposes into a followed by COMBINING OGONEK (U+0328); removing the combining mark leaves a.
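
The mark-stripping half of this step can be shown with the standard library alone. The sketch below assumes its input is already in NFD form (the decomposition itself would typically come from golang.org/x/text/unicode/norm, which is not used here); stripMarks is a hypothetical name:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripMarks drops every rune in category Mn (Mark, Nonspacing).
// The input is assumed to be NFD-decomposed already.
func stripMarks(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.Is(unicode.Mn, r) {
			return -1 // a negative rune removes the character
		}
		return r
	}, s)
}

func main() {
	decomposed := "ka\u0328t" // "kąt" in NFD: a + COMBINING OGONEK (U+0328)
	fmt.Println(stripMarks(decomposed)) // kat
}
```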

Special cases

Some characters are standalone Latin letters that don't decompose into base + combining mark. These are handled via a specialCases table (190 entries, sourced from Unicode CLDR Latin-ASCII, AnyAscii, and Unidecode) with direct string replacement. The complete table with final (lowercased) outputs is at references/transliterations.csv (auto-generated from source; run GENERATE=1 go test -run TestGenerateTransliterations to regenerate). Examples:

Character   Replacement        Language/Use
ł/Ł         l/L                Polish barred L
ß/ẞ         ss/SS              German eszett + capital sharp S
đ/Đ         d/D                Croatian/Vietnamese barred D
ø/Ø         o/O                Danish/Norwegian slashed O
æ/Æ         ae/AE              Danish/Norwegian/Icelandic ligature
œ/Œ         oe/OE              French ligature
ħ/Ħ         h/H                Maltese barred H
ı           i                  Turkish dotless I
þ/Þ         th/Th              Icelandic thorn
ð/Ð         d/D                Icelandic/Faroese eth
ŋ/Ŋ         ng/Ng              Sami/African eng
ŧ/Ŧ         t/T                Sami barred T
ĳ/Ĳ         ij/IJ              Dutch IJ ligature
ŀ/Ŀ         l/L                Catalan middle-dot L
ĸ           k                  Greenlandic kra
ſ           s                  Historical long S
ə/Ə         e/E                Azerbaijani/African schwa
ɛ/Ɛ         e/E                African open E (Ewe, Akan)
ɔ/Ɔ         o/O                African open O (Akan, Ewe)
ɓ/Ɓ         b/B                African hooked B (Hausa, Fula)
ɗ/Ɗ         d/D                African hooked D (Hausa, Fula)
ƙ/Ƙ         k/K                African hooked K (Hausa)
ʃ/Ʃ         sh/Sh              African esh (Pan-Nigerian)
ʒ/Ʒ         zh/Zh              African ezh (Skolt Sami)
ǆ/ǉ/ǌ       dz/lj/nj           Croatian digraphs
ﬀ/ﬁ/ﬂ/ﬃ/ﬄ   ff/fi/fl/ffi/ffl   Typographic ligatures
...         ...                + 20 more African/historical entries
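
Direct string replacement of this kind maps naturally onto strings.NewReplacer. The sketch below uses a handful of entries copied from the table; the variable name and the choice of NewReplacer are assumptions about how such a table could be applied, not a description of the real specialCases implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// specialCases holds a few sample entries; the real table has 190.
var specialCases = strings.NewReplacer(
	"ł", "l", "Ł", "L",
	"ß", "ss",
	"ø", "o", "Ø", "O",
	"æ", "ae", "Æ", "AE",
	"œ", "oe", "Œ", "OE",
	"þ", "th", "Þ", "Th",
)

func main() {
	for _, s := range []string{"Straße", "smørrebrød", "Æther"} {
		// Lowercasing afterwards mirrors the toLower stage of the pipeline.
		fmt.Println(strings.ToLower(specialCases.Replace(s)))
	}
}
```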

Non-Latin scripts

Characters from non-Latin scripts (Chinese, Cyrillic, Arabic, etc.) are replaced with hyphens and then cleaned up by deduplication and trimming:

Hello你好World   ->   hello-world

DEVONthink integration

contrib/DEVONthink-Sanitize-Filenames.applescript sanitizes names of selected DEVONthink records, setting the Finder Comment field to the original filename. Note: the existing Finder Comment is overwritten.

Installing the script in DEVONthink

  1. Open DEVONthink
  2. Open the scripts folder via DEVONthink > Preferences > Scripts (in DEVONthink 3, the folder is ~/Library/Application Scripts/com.devon-technologies.think3/Menu)
  3. Copy or symlink the script into the DEVONthink scripts folder:
    cp contrib/DEVONthink-Sanitize-Filenames.applescript \
      ~/Library/Application\ Scripts/com.devon-technologies.think3/Menu/Sanitize\ Filenames.scpt
    Or compile and copy:
    osacompile -o ~/Library/Application\ Scripts/com.devon-technologies.think3/Menu/Sanitize\ Filenames.scpt \
      contrib/DEVONthink-Sanitize-Filenames.applescript
  4. The script appears in the Scripts menu inside DEVONthink
  5. Select one or more records, then run the script from the menu

The script requires sanitize to be installed at ~/go/bin/sanitize (the default go install location).

What the script does

For each selected record:

  1. Saves the original name to the record's Finder Comment field
  2. Runs sanitize on the name
  3. Sets the record name to the sanitized result

Migrating from san.sh

If you previously used san.sh as a bash wrapper for file renaming, the functionality is now built into the Go binary. To migrate:

  1. Build and test locally (without touching your installed tools):

    go build -o ./sanitize .
    ln -s ./sanitize ./san
    ./san "Test File.txt"              # verify it works
  2. Install the updated binary:

    go install                         # updates ~/go/bin/sanitize
  3. Replace the old san.sh (back up first):

    cp /usr/local/bin/san /usr/local/bin/san.sh.bak
    ln -sf ~/go/bin/sanitize /usr/local/bin/san
  4. Verify:

    which san                          # should show /usr/local/bin/san
    san --help                         # should show usage with -f mode

After migration, /usr/local/bin/san.sh.bak can be removed when you're confident everything works. The original san.sh is preserved in legacy/ for reference.

Behavioral differences from san.sh

  • Dotfiles (e.g., .gitignore) are preserved as-is (san.sh would strip the dot)
  • Already-clean files are skipped silently (san.sh would call mv anyway)
  • Full path support (san.sh only worked with bare filenames)
  • Case-only renames work on macOS (san.sh's mv -n would block them)

Building and releasing

Local build with version tag

go build -ldflags "-X main.version=1.0.0" .

Without -ldflags, the version defaults to dev.
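
The -X linker flag overwrites a package-level string variable at build time. A minimal sketch of the receiving side; the variable name follows the main.version path in the command above, while the printed format is an assumption:

```go
package main

import "fmt"

// version is replaced at link time by -ldflags "-X main.version=...".
// Without that flag it keeps its default value.
var version = "dev"

func main() {
	fmt.Println("sanitize", version)
}
```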

Releasing

Releases are automated via GoReleaser and GitHub Actions. To cut a release:

git tag v1.0.0
git push --tags

This triggers .github/workflows/release.yml, which builds cross-platform binaries (linux/darwin/windows, amd64/arm64) and publishes them as a GitHub Release.

Design philosophy

sanitize follows the Unix tool conventions described in Brian P. Hogan's Small, Sharp Software Tools:

  • Do one thing well -- sanitize strings for filenames, nothing else
  • Work with text streams -- reads stdin, writes to stdout, one entry per line
  • Use standard I/O -- output to stdout, diagnostics to stderr, meaningful exit codes
  • Be quiet -- no banners, progress messages, or decorative output
  • Be a filter -- sits in the middle of a pipeline: cat list.txt | sanitize | xargs ...
  • Support null delimiters -- -0 for filenames containing newlines
  • Dry run -- -n shows what would happen without doing it

POSIX compliance

Flag handling follows POSIX conventions:

Convention             Example
Short flags            -f, -r, -n, -0
Combined short flags   -fn equals -f -n, -rn equals -r -n
Flag implication       -n implies -f, -r implies -f
Long flags             --file, --recursive, --dry-run, --null, --version, --help
-- separator           sanitize -- -hello treats -hello as text, not a flag
Unknown flag           prints error to stderr, exits 2
--help                 prints usage to stderr, exits 0
--version              prints version to stdout, exits 0
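
The conventions in the table can be sketched as a small hand-written parser. Everything here is hypothetical (the options struct, parseArgs, implied); it omits --version and --help, and it assumes that the first non-flag argument ends flag parsing, which the README does not state:

```go
package main

import (
	"fmt"
	"strings"
)

type options struct {
	file, recursive, dryRun, null bool
}

// implied applies the rule that -n and -r both imply -f.
func implied(o options) options {
	if o.dryRun || o.recursive {
		o.file = true
	}
	return o
}

// parseArgs returns the parsed options and the remaining operands.
func parseArgs(args []string) (options, []string, error) {
	var o options
	for i, a := range args {
		switch {
		case a == "--": // everything after the separator is an operand
			return implied(o), args[i+1:], nil
		case strings.HasPrefix(a, "--"):
			switch a {
			case "--file":
				o.file = true
			case "--recursive":
				o.recursive = true
			case "--dry-run":
				o.dryRun = true
			case "--null":
				o.null = true
			default:
				return o, nil, fmt.Errorf("unknown flag %s", a)
			}
		case len(a) > 1 && a[0] == '-': // combined short flags such as -fn
			for _, c := range a[1:] {
				switch c {
				case 'f':
					o.file = true
				case 'r':
					o.recursive = true
				case 'n':
					o.dryRun = true
				case '0':
					o.null = true
				default:
					return o, nil, fmt.Errorf("unknown flag -%c", c)
				}
			}
		default: // first operand ends flag parsing (assumption)
			return implied(o), args[i:], nil
		}
	}
	return implied(o), nil, nil
}

func main() {
	o, rest, err := parseArgs([]string{"-rn", "Downloads"})
	fmt.Printf("%+v %q %v\n", o, rest, err)
}
```

A caller would map a non-nil error to exit code 2, matching the "Unknown flag" row.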

Exit codes:

Code   Meaning
0      Success
1      Runtime error (rename failed, target exists, etc.)
2      Usage error or postcondition failure (unknown flag, missing arguments, disallowed character in output)

The -f concession

The -f file rename mode is a pragmatic concession. Strictly speaking, a pure Unix tool would only transform text, and you'd compose it with mv:

for f in *.txt; do mv -n -- "$f" "$(sanitize -- "${f%.txt}").txt"; done

The -f flag bundles transform + rename into one operation because it's a common workflow that's error-prone to do by hand (splitting extensions, handling no-clobber, case-insensitive filesystems). The san symlink makes this even more convenient. This trades Unix purity for daily usability.

Caution

Different input strings can produce identical output. This is by design -- the tool is lossy.

Collision risk in file rename mode

Because the transformation is lossy, multiple files in the same directory can sanitize to the same name. For example, Café.txt, cafe!.txt, and CAFÉ.txt all become cafe.txt.

Current protection: Before each rename, the tool checks whether the target already exists (os.Stat + no-clobber). If it does, the rename is skipped with an error. This prevents data loss -- os.Rename on Unix silently overwrites, so this check is the sole safeguard.

Remaining risk -- partial renames: When renaming multiple files (-f *.txt) or recursively (-r), the first file that maps to a given target name is renamed, and every later file that maps to the same name is skipped. This leaves you in a half-renamed state: some files moved, others didn't. With -r on a deep directory tree this can be especially messy, as some directories may have been renamed while files inside sibling directories were blocked.

Mitigation: Always use -n (dry run) first on unfamiliar directories to check for collisions before committing to renames. See dev/BACKLOG.md for a planned pre-scan feature that would detect all collisions up front and abort before any renames happen.
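
Until such a pre-scan exists, the idea can be illustrated by grouping source names by their sanitized form. This is purely a sketch of the concept, not the planned feature's design; findCollisions is a hypothetical helper and toyClean stands in for the real sanitizer:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// toyClean is a placeholder for the real sanitizer (assumption).
func toyClean(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), "-")
}

// findCollisions groups names by their cleaned form and returns only the
// groups in which more than one source name maps to the same target.
func findCollisions(names []string, clean func(string) string) map[string][]string {
	groups := make(map[string][]string)
	for _, n := range names {
		key := clean(n)
		groups[key] = append(groups[key], n)
	}
	for key, srcs := range groups {
		if len(srcs) < 2 {
			delete(groups, key)
		}
	}
	return groups
}

func main() {
	names := []string{"My File.txt", "my  file.txt", "MY FILE.txt", "notes.txt"}
	collisions := findCollisions(names, toyClean)
	targets := make([]string, 0, len(collisions))
	for t := range collisions {
		targets = append(targets, t)
	}
	sort.Strings(targets)
	for _, t := range targets {
		fmt.Printf("%s <- %q\n", t, collisions[t])
	}
}
```

A non-empty result would be the signal to abort before any rename is attempted.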

Testing

go test -v

The test suite includes 400+ cases covering individual pipeline stages, postcondition validation, full integration, pipeline ordering, idempotency, file renaming, recursive directory renaming, dry run, null-delimited I/O, stdin processing, combined flags, context cancellation, CLI behavior, and an adversarial suite (sanitize_adversarial_test.go) with LLM-generated edge cases targeting Unicode normalization gotchas, unhandled Latin script boundaries, Go case-folding quirks, path traversal, accidental dotfile creation, and malicious payloads (null bytes, control characters, PUA codepoints, Cyrillic homoglyphs).

Benchmarks

go test -bench=. -benchmem -run='^$'

Benchmarks cover each pipeline stage and the full sanitize()/sanitizeFilename() functions.

Man page

A man page is included as sanitize.1. To install locally:

cp sanitize.1 /usr/local/share/man/man1/
man sanitize

The man page is also included in goreleaser archives.

Comparison with similar tools

Feature               sanitize                   detox                     rename                 slugify            python-slugify
Language              Go                         C                         Perl                   Bash/Python/Node   Python
Approach              Zero-config, opinionated   Configurable (.detoxrc)   Perl expressions       String-to-slug     text-unidecode
Config required       No                         Yes                       Yes (per invocation)   No                 No
File rename           Yes (-f, san)              Yes                       Yes                    Varies             Minimal CLI
Recursive             Yes (-r)                   Yes                       No                     No                 No
Dry run               Yes (-n)                   Yes                       Yes                    No                 No
Null-delimited I/O    Yes (-0)                   No                        No                     No                 No
Latin-only output     Yes                        No                        No                     No                 No
Diacritic handling    NFD + 190 special cases    Configurable sequences    Manual                 Basic              text-unidecode
Postcondition check   Yes                        No                        No                     No                 No
Dependencies          None (static binary)       C library                 Perl                   Varies             Python + pip

Also in the space: convmv (encoding conversion, not content), mmv (batch wildcard rename), vidir (interactive rename in $EDITOR), go-slugify / filenamify (libraries, no CLI).

How sanitize differs

The closest competitor is detox, which also cleans filenames, transliterates UTF-8, and has recursive + dry-run modes. detox is more configurable (sequence files), but sanitize is zero-config, restricts output to Latin script, and handles characters that do not decompose (Polish ł, German ß, Danish ø/æ, French œ, Croatian đ, Maltese ħ, Turkish ı) through a dedicated replacement table on top of NFD decomposition.

rename/prename is far more powerful but requires writing Perl expressions -- it's a general renamer, not a sanitizer. sanitize trades flexibility for zero-config simplicity.

slugify tools are the closest conceptual match, but are typically string-only transformers with no file operations, recursion, or null-delimited I/O.

What sanitize offers that others don't

  • Zero-config opinionated pipeline -- no regex, config files, or flags needed for the common case
  • Latin-script-only output -- unique among these tools; non-Latin characters (Chinese, Cyrillic, Arabic) are replaced with hyphens rather than passed through
  • Postcondition validation -- every output is verified against [a-z0-9-] before returning; failures produce a diagnostic error, not silent corruption
  • Special-case transliterations -- 190 entries covering standalone Latin characters, Roman numerals, super/subscript digits, vulgar fractions, letterlike symbols (№, ™, µ), currency symbols (€, £, ¥), common signs (©, ®, §, °, ×), and ASCII symbols with semantic meaning ($→usd, &→and, @→at, %→pct, +→plus)
  • Single static binary -- Go, no runtime dependencies, cross-platform builds via goreleaser

What others offer that sanitize doesn't

  • detox -- configurable transliteration tables and wipeup sequences
  • rename -- arbitrary transformation logic via Perl expressions
  • vidir -- interactive editing of filenames in your text editor
  • python-slugify -- broader transliteration coverage via text-unidecode (handles more scripts than NFD decomposition)