add check censored udf, tokenize udf by haileyok · Pull Request #121 · roostorg/osprey

haileyok · 2026-01-21T03:21:44Z

I originally wrote this without a translation table, but the generated regexes were so large that it seemed best to just create the table instead. The functionality of this PR is the same as the previous one, though with considerably less complexity. It adds two new string UDFs:

StringTokenize, which converts the given text into a list of individual tokens (split at whitespace or punctuation). Not strictly necessary for this PR, but useful for it, and this feels like a good time to add it.
CheckCensored, which builds a regex for a given input phrase, then checks a given input token against that regex.
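
As a rough illustration of the tokenizing step, here is a minimal sketch. It is not the code from this PR: the name `tokenize_sketch` and the token pattern are assumptions, and the real `StringTokenize` UDF may split differently.

```python
import re

# Hypothetical token pattern (an assumption, not the PR's): runs of word
# characters and straight apostrophes. Anything else, such as whitespace
# or punctuation, acts as a split point.
_TOKEN_RE = re.compile(r"[\w']+")


def tokenize_sketch(text: str) -> list[str]:
    """Split text into tokens at whitespace or punctuation."""
    return _TOKEN_RE.findall(text)
```

Note that `\w` includes the underscore, so in this sketch a token such as "c___a___t" stays in one piece and can be handed to a censored-term check as a single token.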

There's some existing functionality for lookalikes in string.py, specifically the StringClean UDF. The new UDF adds several things on top of that, all of which have also come up quite a bit in the wild:

Still matches terms that attempt to obfuscate with "separator" characters, e.g. using "c___a___t" to try to get around matching for "cat"
Handles zero-width spaces, which are often used for the same purpose
Handles a larger list of lookalike characters
Allows matching only when a given term is obfuscated, e.g. it's fine to use "cat" but not okay to use "<4t"
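
The behaviors listed above can be sketched roughly as follows. This is an illustrative sketch, not the PR's implementation: the `LOOKALIKES` table, the `SEPARATORS` class, and the names `build_pattern`, `check_censored_sketch`, and `must_be_censored` are all hypothetical.

```python
import re

# Hypothetical lookalike table for illustration only; a real table would
# cover far more characters.
LOOKALIKES = {
    "a": "a4@",
    "c": "c<(",
    "t": "t7+",
}

# Characters allowed between letters: any non-word character, underscores,
# and a few zero-width code points (listed explicitly for clarity).
SEPARATORS = r"[\W_\u200b\u200c\u200d]*"


def build_pattern(phrase: str) -> re.Pattern:
    """Build a regex with one character class per letter, joined by separators."""
    parts = []
    for ch in phrase.lower():
        variants = LOOKALIKES.get(ch, ch)
        parts.append("[" + "".join(re.escape(v) for v in variants) + "]")
    return re.compile(SEPARATORS.join(parts), re.IGNORECASE)


def check_censored_sketch(token: str, phrase: str, must_be_censored: bool = False) -> bool:
    """Return True if token matches phrase, optionally only when obfuscated."""
    if build_pattern(phrase).fullmatch(token) is None:
        return False
    if must_be_censored:
        # A plain, unobfuscated spelling does not count.
        return token.lower() != phrase.lower()
    return True
```

With this sketch, `check_censored_sketch("c___a___t", "cat")` and `check_censored_sketch("<4t", "cat", must_be_censored=True)` both match, while `check_censored_sketch("cat", "cat", must_be_censored=True)` does not.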

Using regex here likely still adds a bit of overhead compared with the current implementation, but in exchange we get more useful results and some additional info/control.
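
For comparison, here is one way a translation table can keep the regex small: normalize lookalike characters first with `str.translate`, so the pattern only needs plain letters plus separators. This is a sketch under assumptions, not the approach taken in this PR; `TABLE`, `normalize`, and `matches_after_normalizing` are hypothetical names, and the mapping is a tiny illustrative subset.

```python
import re

# Hypothetical one-to-one lookalike mapping; zero-width characters map to
# None so that translate() deletes them.
TABLE = str.maketrans({
    "4": "a", "@": "a",
    "<": "c", "(": "c",
    "7": "t", "+": "t",
    "\u200b": None, "\u200c": None, "\u200d": None,
})


def normalize(token: str) -> str:
    """Replace lookalikes with their base letter and lowercase the result."""
    return token.translate(TABLE).lower()


def matches_after_normalizing(token: str, phrase: str) -> bool:
    """Match the normalized token against the phrase, allowing separators."""
    pattern = r"[\W_]*".join(re.escape(ch) for ch in phrase.lower())
    return re.fullmatch(pattern, normalize(token)) is not None
```

One limitation of a strict one-to-one table is that a character that resembles more than one letter can only map to one of them, which is a case where per-letter character classes in a regex give more control.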

BinaryFiddler · 2026-01-21T20:33:16Z

osprey_worker/src/osprey/engine/stdlib/udfs/string.py

+
+
+def tokenize_text(s: str) -> list[str]:
+    s = s.replace("'", "'").replace('ʼ', "'")


Actually, what does the first replace do?

Ah, nice catch. There are two different apostrophes we want to replace with a "normal" apostrophe, and it looks like the first one might have gotten replaced with a normal apostrophe itself, maybe through copy/paste or something. Switched it to use unicode escapes so it's more obvious what it's doing.
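
A minimal sketch of what the escape-based version might look like. The exact code points used in the commit are not shown here, so both are assumptions: U+2019 is a guess at the first (lost) character, and U+02BC appears to be the character in the second replace of the diff above. The function name is hypothetical.

```python
def normalize_apostrophes(s: str) -> str:
    """Map curly and modifier apostrophes to a plain ASCII apostrophe."""
    # Assumed code points, written as escapes so that copy/paste cannot
    # silently turn them into a plain apostrophe:
    #   \u2019  RIGHT SINGLE QUOTATION MARK
    #   \u02bc  MODIFIER LETTER APOSTROPHE
    return s.replace("\u2019", "'").replace("\u02bc", "'")
```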

a5fab61

test 103d03f

add check censored udf, tokenize udf

03bdd59

haileyok requested review from a team, BinaryFiddler, EXBreder, ayubun, cmttt and jaredmiller13 as code owners January 21, 2026 03:21

fix typo

a83ff35

BinaryFiddler approved these changes Jan 21, 2026


BinaryFiddler reviewed Jan 21, 2026


haileyok added 3 commits January 21, 2026 12:40

add additional context to docstring

d05d27d

explicitly use unicode escapes in tokenize replace

a5fab61

more explicit curly apostrophe test cases

103d03f

haileyok mentioned this pull request Jan 24, 2026

add lists udf #112

Draft

haileyok marked this pull request as draft March 9, 2026 16:21

haileyok wants to merge 5 commits into main from hailey/add-censor-tokenize-udfs

2 participants


