add check censored udf, tokenize udf#121

Draft
haileyok wants to merge 5 commits into main from
hailey/add-censor-tokenize-udfs

Conversation

@haileyok
Collaborator

I originally wrote this without a translation table, but the generated regexes were so large that I decided it would be best to just create that table instead. The functionality of this PR is the same as the previous one, though with fairly reduced complexity. It adds two new string UDFs:

  • StringTokenize, which converts the given text into a list of individual tokens (split at whitespace or punctuation). Not strictly necessary for this PR, but useful for it, and this feels like a good time to add it
  • CheckCensored, which builds a regex for a given input phrase, then checks a given input token against that regex
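
For illustration, here's a minimal sketch of the kind of splitting StringTokenize describes. The name `tokenize_sketch` and the exact punctuation set are assumptions made for this example, not the code in this PR.

```python
import re

# Assumed delimiter set for this sketch: whitespace plus a few common
# punctuation marks. The real UDF may split on a different set.
_DELIMITERS = re.compile(r'[\s.,;:!?()\[\]{}"]+')


def tokenize_sketch(text: str) -> list[str]:
    """Split text at whitespace or punctuation and drop empty pieces."""
    return [piece for piece in _DELIMITERS.split(text) if piece]


if __name__ == "__main__":
    print(tokenize_sketch("Hello, world! It's fine."))
```

Apostrophes and underscores are deliberately left out of the delimiter set here, so contractions and tokens like "c___a___t" stay in one piece.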

There's some existing functionality for lookalikes in string.py, specifically the StringClean UDF. I've added a variety of additional things in this new UDF, though, all of which have come up quite a bit in the wild:

  • Still matches terms that attempt to obfuscate with "separator" characters, e.g. using "c___a___t" to try and get around matching for "cat"
  • Handles zero-width spaces, which are often used for the same purpose
  • Handles a larger list of lookalike characters
  • Allows for matching only when a given term is obfuscated, e.g. it's fine to use "cat" but not okay to use "<4t"
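
To make the points above concrete, here's a rough sketch of how a regex could be built from a translation table. Everything in it is hypothetical: the `LOOKALIKES` table, the separator set, the `obfuscated_only` flag, and the function names are assumptions for illustration and not the actual CheckCensored implementation.

```python
import re

# Hypothetical, tiny translation table: base letter -> accepted lookalikes.
LOOKALIKES = {
    "a": ["a", "4", "@"],
    "c": ["c", "<", "("],
    "t": ["t", "7", "+"],
}

# Assumed separators allowed between letters: whitespace, underscore, dot,
# hyphen, and zero-width characters (U+200B, U+200C, U+200D, U+FEFF).
SEPARATORS = r"[\s_.\-\u200b\u200c\u200d\ufeff]*"


def build_pattern(phrase: str) -> re.Pattern:
    """Build a regex that matches the phrase, its lookalikes, and separators."""
    parts = []
    for ch in phrase.lower():
        variants = LOOKALIKES.get(ch, [ch])
        parts.append("(?:" + "|".join(re.escape(v) for v in variants) + ")")
    return re.compile(SEPARATORS.join(parts), re.IGNORECASE)


def check_censored_sketch(token: str, phrase: str, obfuscated_only: bool = False) -> bool:
    """Return True if the whole token matches the phrase pattern."""
    if build_pattern(phrase).fullmatch(token) is None:
        return False
    # In obfuscated-only mode, the plain spelling of the phrase is allowed.
    if obfuscated_only and token.lower() == phrase.lower():
        return False
    return True
```

With this sketch, "cat", "c___a___t", and "<4t" all match the phrase "cat". With `obfuscated_only=True`, plain "cat" no longer matches but "<4t" still does.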

Using regex here likely still adds a bit of overhead compared to the current implementation, but it gives more useful results and some additional info/control.



def tokenize_text(s: str) -> list[str]:
    s = s.replace("'", "'").replace('ʼ', "'")
Collaborator

Actually, what does the first replace do?

Collaborator Author

@haileyok haileyok Jan 21, 2026

ah nice catch. there are two different apostrophes we want to replace with a "normal" apostrophe, and it looks like the first one might have gotten replaced with a normal apostrophe itself, maybe through copy/paste or something. switched it to just use unicode escapes so it's more obvious what it's doing
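
A sketch of what the escaped version could look like. U+02BC is the modifier letter apostrophe visible in the snippet above; U+2019 (right single quotation mark) is only a guess at what the first character was meant to be, and `normalize_apostrophes` is a hypothetical helper name.

```python
def normalize_apostrophes(s: str) -> str:
    """Replace apostrophe lookalikes with a plain ASCII apostrophe."""
    # U+2019 RIGHT SINGLE QUOTATION MARK (assumed; not confirmed by the PR)
    # U+02BC MODIFIER LETTER APOSTROPHE (present in the original snippet)
    return s.replace("\u2019", "'").replace("\u02bc", "'")
```

Written with escapes, the two characters can't be silently swapped for a plain apostrophe by an editor or a copy/paste.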

a5fab61

Collaborator

nice!

@haileyok haileyok mentioned this pull request Jan 24, 2026
@haileyok haileyok marked this pull request as draft March 9, 2026 16:21