Framework-independent PHP library for language-aware text moderation.
VerbaGuard normalizes text, matches dictionary terms with deterministic exact matching, and returns explainable results --- with zero runtime Composer dependencies.
For project principles and long-term boundaries, see FOUNDATION.md.
VerbaGuard is a small PHP moderation engine that detects profanity and insults in user-generated text. It is designed for applications that need:
- predictable, testable behavior
- language-specific normalization (Turkish out of the box)
- explainable matches with byte-accurate spans for masking
- no framework lock-in
It is a reference implementation, not a hosted moderation service. You bring your own dictionaries and integrate via a simple PHP API.
- Zero runtime dependencies --- only PHP 8.2+ and
ext-mbstring - Language profiles --- plug in dictionaries and normalizers per language
- Turkish profile --- built-in curated Turkish dictionary (20 entries in v0.4) and Turkish character normalization
- Obfuscation resistance --- leetspeak, repeated letters, and
separator tricks (
s.i.k.t.i.r) - Exact matching --- no substring false positives from embedded terms
- Explainable output ---
ProfanityMatchobjects with term, category, severity, and byte span - Masking --- preserve non-matching UTF-8 text around detected spans
composer require verbaguard/verbaguardRequirements
Requirement Notes
PHP ^8.2 Required
ext-mbstring Required
ext-intl Optional; enables Unicode NFC normalization
use VerbaGuard\VerbaGuard;
$guard = VerbaGuard::turkish();
$guard->contains('hello'); // false
$guard->score('amk'); // 25
$guard->mask('bu bir amk test'); // bu bir *** test
$result = $guard->analyze('amk'); // AnalysisResult$result = $guard->analyze('SİKTİR');
$result->hasProfanity(); // true
$result->score(); // 50
$result->severity(); // high
$result->matches(); // ProfanityMatch[]
$result->masked('*'); // ******$guard->contains('s.i.k.t.i.r'); // true — separator spelled-chain
$guard->contains('4mk'); // true — leetspeak token
$guard->contains('malzeme'); // false — no embedded substring match
$guard->contains('n o r m a l'); // false — full chain is "normal"Method Description
turkish(): self Guard with Turkish profile
forLanguages(array $profiles): self Guard with custom profiles
analyze(string $text): AnalysisResult Full analysis
contains(string $text): bool Whether any match exists
mask(string $text, string $mask = '*'): string Mask matched spans
Method Description
hasProfanity(): bool Any matches found
score(): int Sum of match severity weights
severity(): string Highest match severity
matches(): array List of ProfanityMatch
masked(string $mask = '*'): string Masked original text
Method Description
original(): string Matched substring
normalized(): string Normalized lookup form
term(): string Dictionary canonical term
language(): string Profile code
category(): string e.g. profanity, insult
severity(): string low, medium, high
start(): int Byte offset in original UTF-8
length(): int Byte length in original UTF-8
Integrate through these symbols only:
VerbaGuard\VerbaGuardVerbaGuard\AnalysisResultVerbaGuard\ProfanityMatchVerbaGuard\SeverityVerbaGuard\Contracts\LanguageProfileVerbaGuard\Dictionary\DictionaryVerbaGuard\Dictionary\EntryVerbaGuard\Language\TurkishProfileVerbaGuard\Normalizer\Normalizer(extension interface)
Not part of the public contract. May change without notice:
VerbaGuard\Pipeline\PipelineVerbaGuard\Pipeline\MatcherVerbaGuard\Pipeline\TextSegmentsVerbaGuard\Pipeline\ScorerVerbaGuard\Pipeline\NormalizationPipeline- Concrete normalizers (
UnicodeNormalizer,TurkishNormalizer,LeetspeakNormalizer, etc.)
See docs/specification.md for full behavioral
details.
TurkishProfile is the built-in language profile for Turkish text
moderation.
use VerbaGuard\VerbaGuard;
$guard = VerbaGuard::turkish();It provides:
- Dictionary --- curated Turkish dictionary in
data/tr.php(20 entries in v0.4; intended as a reference implementation, not a production-complete lexicon) - Normalization --- Turkish lowercase and ASCII folding (
ş→s,ı→i,ğ→g, etc.)
Use VerbaGuard::forLanguages() with a custom profile when you need a
production dictionary.
A language profile bundles a dictionary and profile-specific normalizers.
v0.2 dictionary authoring: write only term, category, and
severity in dictionary rows. Do not author normalized --- it is
derived at build time via Dictionary::fromRows() and a normalizeKey
callable that must match the profile's runtime normalization chain.
use VerbaGuard\Contracts\LanguageProfile;
use VerbaGuard\Dictionary\Dictionary;
use VerbaGuard\Normalizer\Normalizer;
use VerbaGuard\Pipeline\NormalizationPipeline;
use VerbaGuard\VerbaGuard;
final class ExampleProfile implements LanguageProfile
{
public function code(): string
{
return 'ex';
}
public function dictionary(): Dictionary
{
$rows = [
[
'term' => 'badword',
'category' => 'profanity',
'severity' => 'medium',
],
];
$normalization = new NormalizationPipeline($this->normalizers());
return Dictionary::fromRows(
$rows,
static fn (string $term): string => $normalization->normalize($term),
);
}
public function normalizers(): array
{
return [
new class implements Normalizer {
public function normalize(string $text): string
{
return mb_strtolower($text, 'UTF-8');
}
},
];
}
}
$guard = VerbaGuard::forLanguages([new ExampleProfile()]);Multiple profiles can be passed; matches from all profiles are merged and deduplicated.
Global stages run in fixed order for every token and spelled chain:
Unicode NFC (when ext-intl available)
→ Language-specific normalizers (from profile)
→ Leetspeak map (4→a, @→a, 1→i, …)
→ Repeated letter collapse (aaa → a)
Normalization prepares text for exact dictionary lookup. It does not perform fuzzy or substring matching.
The matcher (v2.2, frozen) uses two deterministic paths:
Letter/digit runs are tokenized, normalized, and looked up with exact
equality. malzeme does not match mal.
Single-letter runs separated by punctuation or spaces form a chain. The
full concatenated chain is normalized and matched exactly. s i k t i r
matches siktir; n o r m a l does not match mal.
Policy: false positives are worse than false negatives. Matches use byte-accurate spans --- no approximate offset mapping.
As of v0.4.0, VerbaGuard includes:
- 20 curated Turkish dictionary entries
- 210 Turkish corpus cases
- 87 clean
- 55 profane
- 56 obfuscated
- 12 edge
- 0 false positives
- 0 false negatives
- 100% detection coverage across the curated corpus
The corpus and expansion workflow are documented in:
docs/dictionary-expansion-policy.mddocs/batch1-tr-lexicon-research.md
VerbaGuard prioritizes correctness and explainability over aggressive recall.
- Small, readable codebase over clever abstractions
- Deterministic behavior over heuristic tuning
- Framework independence over ecosystem coupling
- Curated dictionaries over megadictionary bundles
- Open, inspectable matches over black-box scores
See FOUNDATION.md for the full principles document.
- O(n) passes over input text
- No runtime Composer dependencies
- Minimal allocations in the hot path
- No reflection in matching
- Suitable for per-request moderation in web applications
Benchmarks are not published yet; performance work follows correctness and API stability.
- Reference dictionary ---
data/tr.phpcontains a small curated Turkish lexicon (20 entries in v0.4). It is intentionally conservative and should be expanded for production deployments. - Short terms --- very short entries like
maloraqmatch when they appear as standalone tokens in otherwise innocent sentences. ext-intloptional --- without it, Unicode NFC normalization is skipped.- No NLP context --- homonyms, sarcasm, and intent are not evaluated.
Contributions are welcome. Please read:
- CONTRIBUTING.md --- workflow and commit standards
- FOUNDATION.md --- project principles
- docs/specification.md --- behavioral contract
Run tests before opening a pull request:
composer install
composer testThis repository contains a minimal seed dictionary with explicit profanity and insults for automated testing only. The words are intentionally offensive and are included solely to verify detection, scoring, and masking behavior. Do not use them outside test contexts.
MIT. See LICENSE.