Zero-config filename sanitizer. Turn any string into a clean, portable filename -- lowercase, ASCII-only, no spaces, no surprises. Single binary, no dependencies, no configuration.
"Łódź — Recipe (Final).pdf" → lodz-recipe-final.pdf
Inspired by Brian P. Hogan's Small, Sharp Software Tools: does one thing well, works with text streams, composes with pipes, stays quiet.
Most filename cleaners either require configuration (detox needs .detoxrc sequence files) or only work as libraries (python-slugify, sanitize-filename). General-purpose renamers like rename require writing Perl expressions for every invocation. sanitize is opinionated by design: zero flags needed for the common case, predictable output, safe defaults.
The intent is to be a failsafe: every Latin or Latin-adjacent character should produce a reasonable ASCII output -- never silently vanish and never cause an error. If a character has a conventional Latin transliteration, sanitize should handle it.
Lowercases, strips diacritics, replaces non-alphanumeric characters with hyphens, deduplicates hyphens, and trims ends. Output is restricted to [a-z0-9-] for strings and [a-z0-9.-] for filenames. 190 special-case transliterations handle characters that don't NFD-decompose (ł→l, ß→ss, ø→o, æ→ae, œ→oe, №→no, €→eur, and many more). The complete table is published at references/transliterations.csv.
Every output is validated against a strict postcondition before being returned. If the pipeline produces any disallowed character, the tool returns an error rather than silently passing through an unsafe result.
Download from GitHub Releases -- binaries are available for Linux, macOS, and Windows (amd64 and arm64).
go install github.com/marekkowalczyk/sanitize@latest
Or from a local checkout:
go install
Both install the sanitize binary to $GOPATH/bin (typically ~/go/bin).
To also use the san shortcut for file renaming:
ln -s ~/go/bin/sanitize /usr/local/bin/san
sanitize "Hello, World!" # hello-world
sanitize "Zażółć gęślą jaźń" # zazolc-gesla-jazn
sanitize "Straße nach München" # strasse-nach-munchen
sanitize foo bar baz # foo-bar-baz (multiple args joined)
echo "Café Résumé" | sanitize # cafe-resume
cat filenames.txt | sanitize # one output per line
When no arguments are given and input is piped, sanitize reads stdin one line at a time and writes one sanitized line per input line. Blank lines and lines that sanitize to empty are skipped.
For filenames that may contain newlines, use -0 for null-delimited I/O (like find -print0 / xargs -0):
find . -print0 | sanitize -0 # null-delimited input and output
find . -print0 | sanitize -0 | xargs -0 echo
sanitize -f "My Document.PDF" # renames to my-document.pdf
sanitize -f *.txt # rename multiple files (shell expands the glob)
san "My Document.PDF" # same as sanitize -f
san *.txt # same as sanitize -f *.txt
File rename mode splits the filename from its extension, sanitizes each part separately, and renames the file. It will not overwrite existing files. Renames are printed to stderr.
When the binary is invoked as san (via symlink), file rename mode is enabled automatically without needing -f.
Glob patterns (*.txt, IMG_*.jpg, etc.) are expanded by the shell before sanitize sees them -- this is standard Unix behavior and requires no special handling by the tool.
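File rename mode, as described above, sanitizes the stem and the extension separately. A minimal sketch of that split follows. The names splitAndClean and toySlug are hypothetical, toySlug handles ASCII only (it does no transliteration), and the treatment of dotfiles here is an assumption about how such names could be left untouched.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// toySlug is a hypothetical ASCII-only stand-in for the real sanitizer:
// it lowercases, turns each run of other characters into one hyphen,
// and never emits a leading or trailing hyphen.
func toySlug(s string) string {
	var b strings.Builder
	pendingHyphen := false
	for _, r := range strings.ToLower(s) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			if pendingHyphen && b.Len() > 0 {
				b.WriteByte('-')
			}
			pendingHyphen = false
			b.WriteRune(r)
		} else {
			pendingHyphen = true
		}
	}
	return b.String()
}

// splitAndClean cleans the stem and the extension separately and rejoins
// them with a single dot.
func splitAndClean(name string) string {
	ext := filepath.Ext(name) // ".PDF", or "" when the name has no dot
	stem := strings.TrimSuffix(name, ext)
	if stem == "" {
		return name // assumption: a dotfile such as ".gitignore" is left as-is
	}
	cleanStem, cleanExt := toySlug(stem), toySlug(ext)
	if cleanExt == "" {
		return cleanStem
	}
	return cleanStem + "." + cleanExt
}

func main() {
	for _, name := range []string{"My Document.PDF", "README", ".gitignore"} {
		fmt.Printf("%s -> %s\n", name, splitAndClean(name))
	}
}
```

Splitting first is what keeps the dot before the extension from being turned into a hyphen along with the other punctuation.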
sanitize -r ~/Downloads/ # recursively rename all files and dirs
sanitize -rn ~/Downloads/ # dry run: show what would be renamed
san -r ~/Downloads/ # same thing via san symlink
Recursive mode walks a directory tree depth-first, sanitizing all filenames and directory names. Deepest entries are renamed first so that parent renames don't invalidate child paths. The -r flag implies file mode (-f). Combines with -n for dry run. Handles SIGINT gracefully -- if you press Ctrl+C during a recursive rename, it stops cleanly between files rather than mid-rename.
sanitize -n *.txt # show what would be renamed (-n implies -f)
sanitize -f -n *.txt # explicit -f also works
san -n *.txt # same thing via san
sanitize -fn *.txt # combined short flags also work
The -n flag implies file mode (-f), since dry-run only makes sense for renames.
sanitize --version # print version
sanitize --help # print usage
Short flags can be combined: -fn is the same as -f -n. Long forms are also available: --file, --recursive, --dry-run, --null.
input -> removeIllFormed -> removeAccents -> toLower -> replaceNonAlphaNum -> dedupHyp -> trimEnds -> validate -> output
- removeIllFormed -- replace ill-formed UTF-8 sequences
- removeAccents -- NFD decomposition + strip combining marks (unicode.Mn), plus special-case replacements for standalone characters that don't decompose (ł->l, ß->ss, №->No, €->EUR, etc.)
- toLower -- lowercase the entire string
- replaceNonAlphaNum -- replace anything outside unicode.Latin and digits with -
- dedupHyp -- collapse runs of -- into a single -
- trimEnds -- strip leading/trailing non-Latin, non-digit characters
- validate -- postcondition check: verify output contains only [a-z0-9-], no leading/trailing or consecutive hyphens. Returns an error if any disallowed character is present (names the offending character and its Unicode codepoint)
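The final validate stage can be pictured as a small check like the one below. This is a sketch written from the description above, not the project's source: the function body, the error wording, and the handling of the empty string are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// validate is a hypothetical postcondition check for string output:
// only [a-z0-9-], no leading, trailing, or consecutive hyphens.
func validate(s string) error {
	if s == "" {
		return nil // assumption: empty results are handled by the caller
	}
	for _, r := range s {
		ok := (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-'
		if !ok {
			// Name the offending character and its code point.
			return fmt.Errorf("disallowed character %q (U+%04X)", r, r)
		}
	}
	if strings.HasPrefix(s, "-") || strings.HasSuffix(s, "-") {
		return errors.New("leading or trailing hyphen")
	}
	if strings.Contains(s, "--") {
		return errors.New("consecutive hyphens")
	}
	return nil
}

func main() {
	for _, s := range []string{"hello-world", "héllo", "a--b", "-a"} {
		fmt.Printf("%q -> %v\n", s, validate(s))
	}
}
```

Running the check on every result means a bug in an earlier stage surfaces as an error instead of an unsafe name.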
All non-ASCII characters are transformed to their ASCII equivalents where possible:
Kąt na łące żre źrebię -> kat-na-lace-zre-zrebie
This is achieved by Unicode NFD decomposition followed by removal of Mark, Nonspacing characters. For example, ą is a combined with COMBINING OGONEK (U+0328) -- removing the combining mark leaves a.
Some characters are standalone Latin letters that don't decompose into base + combining mark. These are handled via a specialCases table (190 entries, sourced from Unicode CLDR Latin-ASCII, AnyAscii, and Unidecode) with direct string replacement. The complete table with final (lowercased) outputs is at references/transliterations.csv (auto-generated from source; run GENERATE=1 go test -run TestGenerateTransliterations to regenerate). Examples:
| Character | Replacement | Language/Use |
|---|---|---|
| ł/Ł | l/L | Polish barred L |
| ß/ẞ | ss/SS | German eszett + capital sharp S |
| đ/Đ | d/D | Croatian/Vietnamese barred D |
| ø/Ø | o/O | Danish/Norwegian slashed O |
| æ/Æ | ae/AE | Danish/Norwegian/Icelandic ligature |
| œ/Œ | oe/OE | French ligature |
| ħ/Ħ | h/H | Maltese barred H |
| ı | i | Turkish dotless I |
| þ/Þ | th/Th | Icelandic thorn |
| ð/Ð | d/D | Icelandic/Faroese eth |
| ŋ/Ŋ | ng/Ng | Sami/African eng |
| ŧ/Ŧ | t/T | Sami barred T |
| ĳ/Ĳ | ij/IJ | Dutch IJ ligature |
| ŀ/Ŀ | l/L | Catalan middle-dot L |
| ĸ | k | Greenlandic kra |
| ſ | s | Historical long S |
| ə/Ə | e/E | Azerbaijani/African schwa |
| ɛ/Ɛ | e/E | African open E (Ewe, Akan) |
| ɔ/Ɔ | o/O | African open O (Akan, Ewe) |
| ɓ/Ɓ | b/B | African hooked B (Hausa, Fula) |
| ɗ/Ɗ | d/D | African hooked D (Hausa, Fula) |
| ƙ/Ƙ | k/K | African hooked K (Hausa) |
| ʃ/Ʃ | sh/Sh | African esh (Pan-Nigerian) |
| ʒ/Ʒ | zh/Zh | African ezh (Skolt Sami) |
| ǆ/ǉ/ǌ | dz/lj/nj | Croatian digraphs |
| ﬀ/ﬁ/ﬂ/ﬃ/ﬄ | ff/fi/fl/ffi/ffl | Typographic ligatures |
| ... | ... | + 20 more African/historical entries |
Characters from non-Latin scripts (Chinese, Cyrillic, Arabic, etc.) are replaced with hyphens and then cleaned up by deduplication and trimming:
Hello你好World -> hello-world
contrib/DEVONthink-Sanitize-Filenames.applescript sanitizes names of selected DEVONthink records, setting the Finder Comment field to the original filename. Note: the existing Finder Comment is overwritten.
- Open DEVONthink
- Go to DEVONthink > Preferences > Scripts (or in DEVONthink 3, the Scripts folder is at ~/Library/Application Scripts/com.devon-technologies.think3/Menu)
- Copy or symlink the script into the DEVONthink scripts folder:
  cp contrib/DEVONthink-Sanitize-Filenames.applescript \
    ~/Library/Application\ Scripts/com.devon-technologies.think3/Menu/Sanitize\ Filenames.scpt
  Or compile and copy:
  osacompile -o ~/Library/Application\ Scripts/com.devon-technologies.think3/Menu/Sanitize\ Filenames.scpt \
    contrib/DEVONthink-Sanitize-Filenames.applescript
- The script appears in the Scripts menu inside DEVONthink
- Select one or more records, then run the script from the menu
The script requires sanitize to be installed at ~/go/bin/sanitize (the default go install location).
For each selected record:
- Saves the original name to the record's Finder Comment field
- Runs sanitize on the name
- Sets the record name to the sanitized result
If you previously used san.sh as a bash wrapper for file renaming, the functionality is now built into the Go binary. To migrate:
- Build and test locally (without touching your installed tools):
  go build -o ./sanitize .
  ln -s ./sanitize ./san
  ./san "Test File.txt" # verify it works
- Install the updated binary:
  go install # updates ~/go/bin/sanitize
- Replace the old san.sh (back up first):
  cp /usr/local/bin/san /usr/local/bin/san.sh.bak
  ln -sf ~/go/bin/sanitize /usr/local/bin/san
- Verify:
  which san # should show /usr/local/bin/san
  san --help # should show usage with -f mode
After migration, /usr/local/bin/san.sh.bak can be removed when you're confident everything works. The original san.sh is preserved in legacy/ for reference.
- Dotfiles (e.g., .gitignore) are preserved as-is (san.sh would strip the dot)
- Already-clean files are skipped silently (san.sh would call mv anyway)
- Full path support (san.sh only worked with bare filenames)
- Case-only renames work on macOS (san.sh's mv -n would block them)
go build -ldflags "-X main.version=1.0.0" .
Without -ldflags, the version defaults to dev.
Releases are automated via GoReleaser and GitHub Actions. To cut a release:
git tag v1.0.0
git push --tags
This triggers .github/workflows/release.yml, which builds cross-platform binaries (linux/darwin/windows, amd64/arm64) and publishes them as a GitHub Release.
sanitize follows the Unix tool conventions described in Brian P. Hogan's Small, Sharp Software Tools:
- Do one thing well -- sanitize strings for filenames, nothing else
- Work with text streams -- reads stdin, writes to stdout, one entry per line
- Use standard I/O -- output to stdout, diagnostics to stderr, meaningful exit codes
- Be quiet -- no banners, progress messages, or decorative output
- Be a filter -- sits in the middle of a pipeline: cat list.txt | sanitize | xargs ...
- Support null delimiters -- -0 for filenames containing newlines
- Dry run -- -n shows what would happen without doing it
Flag handling follows POSIX conventions, plus GNU-style long options:
| Convention | Example |
|---|---|
| Short flags | -f, -r, -n, -0 |
| Combined short flags | -fn equals -f -n, -rn equals -r -n |
| Flag implication | -n implies -f, -r implies -f |
| Long flags | --file, --recursive, --dry-run, --null, --version, --help |
| -- separator | sanitize -- -hello treats -hello as text, not a flag |
| Unknown flag | prints error to stderr, exits 2 |
| --help | prints usage to stderr, exits 0 |
| --version | prints version to stdout, exits 0 |
Exit codes:
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Runtime error (rename failed, target exists, etc.) |
| 2 | Usage error or postcondition failure (unknown flag, missing arguments, disallowed character in output) |
The -f file rename mode is a pragmatic concession. Strictly speaking, a pure Unix tool would only transform text, and you'd compose it with mv:
for f in *.txt; do mv "$f" "$(sanitize "$f")"; done
The -f flag bundles transform + rename into one operation because it's a common workflow that's error-prone to do by hand (splitting extensions, handling no-clobber, case-insensitive filesystems). The san symlink makes this even more convenient. This trades Unix purity for daily usability.
Different input strings can produce identical output. This is by design -- the tool is lossy.
Because the transformation is lossy, multiple files in the same directory can sanitize to the same name. For example, Café.txt, cafe!.txt, and CAFÉ.txt all become cafe.txt.
Current protection: Before each rename, the tool checks whether the target already exists (os.Stat + no-clobber). If it does, the rename is skipped with an error. This prevents data loss -- os.Rename on Unix silently overwrites, so this check is the sole safeguard.
Remaining risk -- partial renames: When renaming multiple files (-f *.txt) or recursively (-r), the first file in a colliding group is renamed and the later ones are skipped. This leaves the directory in a half-renamed state: some files were renamed, others were not. With -r on a deep directory tree this can be especially messy, because some directories may already have been renamed while files in sibling directories were blocked.
Mitigation: Always use -n (dry run) first on unfamiliar directories to check for collisions before committing to renames. See dev/BACKLOG.md for a planned pre-scan feature that would detect all collisions up front and abort before any renames happen.
go test -v
The test suite includes 400+ cases covering individual pipeline stages, postcondition validation, full integration, pipeline ordering, idempotency, file renaming, recursive directory renaming, dry run, null-delimited I/O, stdin processing, combined flags, context cancellation, CLI behavior, and an adversarial suite (sanitize_adversarial_test.go) with LLM-generated edge cases targeting Unicode normalization gotchas, unhandled Latin script boundaries, Go case-folding quirks, path traversal, accidental dotfile creation, and malicious payloads (null bytes, control characters, PUA codepoints, Cyrillic homoglyphs).
go test -bench=. -benchmem -run=^$
Benchmarks cover each pipeline stage and the full sanitize()/sanitizeFilename() functions.
A man page is included as sanitize.1. To install locally:
cp sanitize.1 /usr/local/share/man/man1/
man sanitize
The man page is also included in goreleaser archives.
| | sanitize | detox | rename | slugify | python-slugify |
|---|---|---|---|---|---|
| Language | Go | C | Perl | Bash/Python/Node | Python |
| Approach | Zero-config, opinionated | Configurable (.detoxrc) | Perl expressions | String-to-slug | text-unidecode |
| Config required | No | Yes | Yes (per invocation) | No | No |
| File rename | Yes (-f, san) | Yes | Yes | Varies | Minimal CLI |
| Recursive | Yes (-r) | Yes | No | No | No |
| Dry run | Yes (-n) | Yes | Yes | No | No |
| Null-delimited I/O | Yes (-0) | No | No | No | No |
| Latin-only output | Yes | No | No | No | No |
| Diacritic handling | NFD + 190 special cases | Configurable sequences | Manual | Basic | text-unidecode |
| Postcondition check | Yes | No | No | No | No |
| Dependencies | None (static binary) | C library | Perl | Varies | Python + pip |
Also in the space: convmv (encoding conversion, not content), mmv (batch wildcard rename), vidir (interactive rename in $EDITOR), go-slugify / filenamify (libraries, no CLI).
Closest competitor is detox, which also cleans filenames, transliterates UTF-8, and has recursive + dry-run modes. detox is more configurable (sequence files), but sanitize is zero-config, restricts output to Latin script, and handles special cases (Polish ł, German ß, Danish ø/æ, French œ, Croatian đ, Maltese ħ, Turkish ı) via NFD decomposition + a dedicated replacement table.
rename/prename is far more powerful but requires writing Perl expressions -- it's a general renamer, not a sanitizer. sanitize trades flexibility for zero-config simplicity.
slugify tools are the closest conceptual match, but are typically string-only transformers with no file operations, recursion, or null-delimited I/O.
- Zero-config opinionated pipeline -- no regex, config files, or flags needed for the common case
- Latin-script-only output -- unique among these tools; non-Latin characters (Chinese, Cyrillic, Arabic) are stripped rather than passed through
- Postcondition validation -- every output is verified against [a-z0-9-] before returning; failures produce a diagnostic error, not silent corruption
- Special-case transliterations -- 190 entries covering standalone Latin characters, Roman numerals, super/subscript digits, vulgar fractions, letterlike symbols (№, ™, µ), currency symbols (€, £, ¥), common signs (©, ®, §, °, ×), and ASCII symbols with semantic meaning ($→usd, &→and, @→at, %→pct, +→plus)
- Single static binary -- Go, no runtime dependencies, cross-platform builds via goreleaser
- detox -- configurable transliteration tables and wipeup sequences
- rename -- arbitrary transformation logic via Perl expressions
- vidir -- interactive editing of filenames in your text editor
- python-slugify -- broader transliteration coverage via text-unidecode (handles more scripts than NFD decomposition)