the arms race of message filtering

2025/08/15 social networks

What a town in England has to do with filtering 500k messages

About two years ago, I made a simple Discord bot for my friend who was starting to build an online following. There weren’t many features at first, but it did its job and provided a funny gimmick for people joining the server. Little did I know, he was about to blow up and balloon the server to over a thousand users.

Sounds great, right? Well… here’s

The Problem

People on the Internet do dumb stuff. Really, really dumb stuff. That includes, of course, saying things they probably shouldn’t be saying. There is probably no better way to convince yourself of this than to sit in an unmoderated chatroom and watch.

If we don’t want other people to be subjected to this, though, we need some way of moderating it. People usually go about this in one of two ways, and neither of them, even combined, is great at tackling the problem.

Human Moderation

Most chatrooms try to solve this problem by simply adding more moderators. It makes sense: a human’s discretion is hard to beat with automated systems, but that discretion comes at a hefty cost.

Time

Unless all these moderators do is stare at the chatroom all day, they are likely going to have to wait until either:

  1. they see the message themselves, or
  2. someone alerts them to a bad message.

This takes time, and chatrooms move fast. The people you would’ve liked to shield from the message have already seen it. Rather than an instant rejection, human moderation typically amounts to punishment at a later time.

Power

It wouldn’t be the Internet if people didn’t get really weird with small amounts of power. As you increase the number of moderators, you are giving more people the ability to ban, kick, and delete messages at their discretion. Does that match your discretion? Other moderators’ discretions? Probably not, and it is likely impossible to perfectly align these.

So, because of these limitations, chatrooms will typically also opt for simple, centralized, automated filtering, like wordlist blocking.

Wordlist Blocking

Wordlist blocking is simple. I make a list of the words I don’t want people to say, and if they say it, it gets deleted!

wordlist.txt
------------
pumpernickel
bagel
wheat
bun

# Pseudo-algorithm
if any word in wordlist:
    reject()
else:
    accept()

some_user:
    That's a sesame bun
Message deleted!
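
In rough Python, that naive exact-match check is nothing more than this (the set and function names here are just for illustration):

# Toy wordlist from above
BLOCKED = {"pumpernickel", "bagel", "wheat", "bun"}

def exact_match_filter(message: str) -> bool:
    """Reject (return True) only if a whole word exactly matches the wordlist."""
    return any(word.lower().strip(".,!?") in BLOCKED for word in message.split())

print(exact_match_filter("That's a sesame bun"))  # True: "bun" is on the list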

Do you see a problem? Try to beat this filter and find a way to say bun! It shouldn’t be that hard, I hope:

If you can't get it...
slightly_smarter_user:
    That's a sesame bunn
Message accepted!

This kind of filtering is trivial to beat. Add a space, change the spelling around, and you can say whatever you want.

Of the ways around this kind of filtering, “padding” is pretty common. Slap an extra letter on the end or the front, and you get a completely readable message like aBAGELa or aPUMPERNICKELa.

“Hah!”, the arrogant algorithm designer proclaims, “instead of checking if any words are equal, I’ll check if any blocked word is anywhere in the message!”

# Arrogant Algorithm Designer: "FIXED"
if message contains any blocked word:
    reject()
else:
    accept()

some_user:
    That's a sesame aBUNa.
Message deleted!

Well, so far so good, it blocked the message! It looks like a well-meaning user just joined the chat. Let’s see what they have to say!

well_meaning_user:
    There's an abundance of flowers this season.
Message deleted!

Uh oh. “bun” appears in a lot of words that we don’t want to block.
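
In sketch form, here is the “fix” and the exact failure mode that bites us, reusing the toy BLOCKED set from above:

def substring_filter(message: str) -> bool:
    """Reject if any blocked word appears anywhere inside the text."""
    text = message.lower()
    return any(blocked in text for blocked in BLOCKED)

print(substring_filter("That's a sesame aBUNa."))                       # True: padding caught
print(substring_filter("There's an abundance of flowers this season"))  # True: false positive!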

The Scunthorpe Problem

What we’ve just stumbled upon is nothing new, especially for the residents of Scunthorpe, England. In 1996, they ran into trouble when trying to create AOL accounts: the profanity filter wouldn’t let them enter the name of their own town! Later on, Google made the same mistake as the “arrogant algorithm designer” above and hid local businesses from its search results. Naturally, this phenomenon came to be known as the Scunthorpe Problem.

A sign bearing the town's name, Scunthorpe

My Solution

If AOL and Google could get tripped up by this, I knew my little Discord bot didn’t stand a chance without something smarter. I started wondering if there was a simple rules-based pipeline that could:

  • Minimize false positives (normal messages being deleted à la “abundance”)
  • Minimize false negatives (bad messages being allowed)
  • Prevent circumvention (padding, special characters, etc)

That’s the question I set out to solve when designing a filter module for this bot, and I decided to enter myself into a trial-by-fire.

Manual “Adversarial Learning”

Humans are really clever at breaking stuff, so I used that to my advantage. Our server had a little economy system, and after I set up the basic filtering system, I opened a challenge to everyone in the server:

The Challenge

aiden:
If you can get my filter to block any messages it shouldn't
aiden:
or allow a message it shouldn't (false negatives)
aiden:
you will be handsomely rewarded with some server money.

Instantly, people started flooding in trying to break my filter with some toy words I set out (and not the actual blocked ones). First, they tried to simply say them. Obviously, this was caught by my direct wordlist filtering.

some_user:
    I like pumpernickel.
Message deleted!

But when you put up a wall, people just build a bigger ladder. They moved on to padding attacks, with messages like:

slightly_smarter_user:
    That's a sesame bunn
Message accepted!

Levenshtein Distance

To defeat padding and other attacks on our filter, we might consider the fact that most circumventions are just simple edits to the original word.

For example, if a user discovers bun is blocked, they might go back and just add an n (bunn). If they discover pumpernickel is blocked, they’ll likely just repost it as pumpernickl. These are simple edits: additions, deletions, substitutions. So, it might be useful to know the “edit distance” between the words users are posting and the words we want to block.

This is captured in a metric known as the Levenshtein Distance, which does just that. Every edit is assigned a “cost”, which adds up to the total “distance” between the two words. In other words, how many keystrokes would it take to turn their message into the banned word?

A BK-tree with Levenshtein distance as edge costs

Let’s walk through the distance between our blocked word, pumpernickel, and an input pumprenickl. Simple word blocking wouldn’t catch it, but we can count the edits:

  1. Transposition (swap r & e)
    • pumpernickel -> pumprenickel
  2. Deletion (delete the e before the l)
    • pumprenickel -> pumprenickl

Counting the adjacent swap as a single edit (the Damerau-Levenshtein convention; classic Levenshtein would charge it as two substitutions), that’s an edit distance of 2. So, we might choose a threshold where we consider a word to be “too close” to our blocked word.

# For example...
if levenshtein(blocked,word) <= 2:
    reject()
else:
    accept()
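
For completeness, here is a compact sketch of a distance function that matches the counting above: the "optimal string alignment" variant, which charges an adjacent swap (like the e/r transposition) as a single edit. Any standard edit-distance implementation would slot in just as well.

def edit_distance(a: str, b: str) -> int:
    """Edit distance where insertions, deletions, substitutions, and adjacent
    swaps each cost 1 (the "optimal string alignment" variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[len(a)][len(b)]

print(edit_distance("pumpernickel", "pumprenickl"))  # 2, as in the walkthrough
print(edit_distance("bun", "bunn"))                  # 1
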
My threshold algorithm

Instead of using a flat threshold, which tends toward false positives for short words and messages, my system scales the threshold with length:

match len(content):
    case 1 | 2 | 3:
        threshold = 1
    case 4:
        threshold = 2
    case _:
        threshold = 3
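
Putting the length-based threshold together with the distance function sketched above, the per-word check ends up looking something like this (the helper names are illustrative):

def length_threshold(word: str) -> int:
    """Shorter inputs get a stricter threshold to keep false positives down."""
    match len(word):
        case 1 | 2 | 3:
            return 1
        case 4:
            return 2
        case _:
            return 3

def too_close(word: str, blocked_words: set[str]) -> bool:
    """True if the word sits within the edit-distance threshold of any blocked word."""
    return any(edit_distance(word, blocked) <= length_threshold(word)
               for blocked in blocked_words)

print(too_close("bunn", {"bun"}))  # True: distance 1, threshold 2 for a 4-letter word
print(too_close("ban", {"bun"}))   # True: distance 1, which is exactly the next problem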

Dictionary & Word Frequency

Now that we have a tool to catch the “close enough” cases, let’s see if it holds up on some real-world examples.

some_user:
    Hey, can moderators ban this guy?
Message deleted! (ban close to bun)

Uh oh. ban is only one letter away from bun! That puts it within our Levenshtein threshold, and so the message was deleted. That’s not good: some words in English are close to others that we might want to block (e.g., cant is only one letter from a common one). The way I solved this problem was to take in an English dictionary (plus some Internet slang) and use it to determine which words are real words and which are circumventions.

# Pseudo-algorithm
dictionary = ["a", "and", "apple", ...]
if word is <= 2 edits from a blocked word and word not in dictionary:
    reject()
else:
    accept()

some_user:
    Hey, can moderators ban this guy?
Message accepted! (ban is in the dictionary)

However, perfect spelling is hard to come by in Internet chatrooms, and so another clever use of Levenshtein distance was deployed to determine if a word is a “likely misspelling” of a common word. Using word frequency data, I was able to catch real misspellings and ignore obscure words that would almost never be used in this context.
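
A rough sketch of that check, with placeholder frequency values (the real system loads a full English frequency list plus some slang) and the edit_distance helper from earlier:

# Illustrative frequencies (occurrences per million words); placeholders only
WORD_FREQ = {"receive": 180.0, "abundance": 4.0, "ban": 60.0, "moderators": 9.0}
DICTIONARY = set(WORD_FREQ)

def likely_misspelling(word: str, min_freq: float = 1.0) -> bool:
    """A word that isn't in the dictionary but sits one edit away from a
    reasonably common dictionary word is probably a typo, not a circumvention."""
    if word in DICTIONARY:
        return False
    return any(edit_distance(word, real) == 1 and freq >= min_freq
               for real, freq in WORD_FREQ.items())

print(likely_misspelling("recieve"))  # True: one (swapped-letter) edit from "receive"
print(likely_misspelling("xqzv"))     # False: not near anything real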

Unique-ify

Checking the “edit distance” doesn’t come without problems, though. While the first “padding attacks” were stopped by the Levenshtein check outlined above, the users didn’t stop there; they kept going.

some_user:
    I love sesame bunnnnnnnnns
Message accepted! (not close enough to bun)

Uh oh. That’s a lot of edits away from “bun”; far more than our threshold allows. My solution was to add another branch to my filtering: the unique-ify branch. It takes each word and removes all duplicate letters, so bunnnnns becomes buns. This defeated the aggressive padding approach, but brought some quirks of its own (pumpernickel becomes pumernickl), so in this branch the other parameters needed to be slightly more forgiving, since de-duplication necessarily makes the inputs a little weirder.
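
The de-duplication itself is tiny in Python; dict.fromkeys preserves insertion order, so it keeps the first occurrence of each letter:

def uniqueify(word: str) -> str:
    """Drop repeat letters, keeping only the first occurrence of each."""
    return "".join(dict.fromkeys(word))

print(uniqueify("bunnnnns"))      # "buns"
print(uniqueify("pumpernickel"))  # "pumernickl"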

An image of a homoglyph translator making plain ASCII look fancy.

Then came the “homoglyph” attacks, where people used commonly available “translators” online to turn their text into “𝓯𝓪𝓷𝓬𝔂” text, or any other sort of characters that look like letters. As more and more obscure ones came into the channel, I needed some way to read them as the characters they look like, and not the unrelated characters they really were.

Homoglyphs

One of the hardest problems to solve is that many characters (glyphs) in Unicode look extremely similar (there are 292,531 total glyphs!). That means even if you block a letter and all of its accented forms, you probably didn’t think someone would use “𖤀” for “d”. They will.

These similar-looking characters are called homoglyphs, and a lot of my work as a “filterer” was finding these so I could normalize the input text.

The foundation of my filter system is still a wordlist, just like the troubled examples I gave before. However, before comparing incoming messages against the wordlist, we feed them through some filters that normalize the homoglyphs into the closest plain ASCII we can get:

  • Translate Unicode homoglyphs to ASCII (𐌁𐌵𐌌𐌁𐌋𐌉𐌍Ᏽ -> bumbling)
  • Transform any “leetspeak” to ASCII (h3110 -> hello)
  • Transform all text to NFKD form (àbúñdäņčė -> abundance)
  • Normalize pig latin (ustjay orfay unfay -> just for fun)
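
As a very rough sketch of that normalization pass, using only a tiny slice of the homoglyph and leetspeak maps and skipping the pig-latin step:

import unicodedata

# Tiny illustrative slices of the real maps (see the homoglyph file below)
HOMOGLYPHS = {"乃": "b", "и": "n", "ค": "a"}
LEETSPEAK = {"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t"}

def normalize(text: str) -> str:
    """Squash a message down to plain lowercase ASCII before any matching."""
    # NFKD splits accented/styled characters into base letter + combining marks...
    text = unicodedata.normalize("NFKD", text)
    # ...and the combining marks can then simply be dropped
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
    text = "".join(LEETSPEAK.get(ch, ch) for ch in text)
    return text.lower()

print(normalize("àbúñdäņčė"))  # "abundance"
print(normalize("h3110"))      # "hello"
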
My homoglyph file
a: 𝒂, ą, 卂, ค, а, ᵃ, ム, ǟ, ₐ, ⲁ, ꍏ, 𐌀, Ꮧ, 𝐚, ɐ, α, 🅰, 4, Æ, æ, 𝟃, ẳ, å, @, 🇦, ል, ꛎ, ᗣ, ꁲ, Թ, 𝝰, ⍲, ⠁, .-, Δ, ᗩ, ꅔ, ⓐ, ᚣ, ꮧ, 𝛼, 𐤠, 𝞪, λ, 𝛂, ₳, 🅐, ꋬ, ⍺, ᴀ
b: ᵬ, Ϧ, 乃, ๒, ъ, ᵇ, ც, ɮ, ♭, ҍ, ⲃ, ꌃ, ь, 𐌁, Ᏸ, 𝕓, 🅱, 8, 6, ƀ, ɓ, в, 🇧, ፪, ꔪ, ꃳ, Յ, 𝝱, ⌦, ᖯ, ⠃, -..., ꋣ, b, Ƃ, ᏸ, 𝛽, Ɓ, ᴮ, 𝞫, Ⴆ, ᗷ, 𐒈, 𝛃, ฿, 🅑, ß, ꉉ
c: 𐒨, ᑢ, 匚, ς, с, ᶜ, ᄃ, ƈ, 𝓬, ç, ⲥ, ꉓ, 𐌂, ፈ, Ć, ɔ, 🅲, ©️, 𝗖, ㄈ, ©, 🇨, ር, ꛕ, ᙅ, ꏳ, Շ, ૮, 𝞁, ⍧, ⠉, -.-., ☾, ꄡ, ɕ, ᛈ, 𝜍, Ƈ, 𝞻, 𝛓, ₵, 🅒, Ȼ, ℃, ☪️
d: 𝙙, đ, ᗪ, ๔, ↁ, ᵈ, り, ɖ, 𝓭, ժ, ⲇ, ꀸ, 𐌃, Ꮄ, Đ, 🅳, ԁ, ɗ, ȡ, d, ᴰ, 🇩, ⊃, ጋ, 𖤀, ꀷ, Ժ, ∂, 𝝳, ⟄, ⅾ, ⠙, -.., ᕲ, ꁕ, D, ꮄ, 𝛿, д, Ɗ, 𝞭, 𝖽, ԃ, Ꮷ, 𝛅, 🅓, ꌛ
e: 𝟈, Ҽ, 乇, є, э, ᵉ, ɛ, ₑ, ҽ, ⲉ, ꍟ, е, 𐌄, Ꮛ, 𝑒, ǝ, ⓔ, 🅴, 3, 𝖾, ȇ, ᴲ, 🇪, ⪽, 𖤢, ቹ, ᙓ, ꑀ, ȝ, ε, 𝝴, ℇ, ⠑, ., €, ꁄ, 𝐄, ᛊ, ꮛ, 𝜀, ҿ, Ƹ, 𝞮, ᘿ, 𐒢, 𝛆, Ɇ, 🅔, ℮, ϵ, ᴇ
f: 𝚏, ⨎, 千, Ŧ, ᶠ, キ, ʄ, ᵳ, ƒ, 𝓯, ꎇ, 𐌅, Ꭶ, ⓕ, ɟ, 𝐟, 🅵, ⨏, ӻ, f, 🇫, ꘘ, ቻ, ꊯ, Բ, 𝗳, 🜅, ⠋, ..-., ғ, ᖴ, Ϝ, ꌺ, 𝕗, ꭶ, 𝑓, Ƒ, 𝙛, ⳨, ϝ, Ӻ, 𝒇, ₣, 🅕, ʃ, ℉
g: 𝙜, ဌ, ꮆ, Ꮆ, ﻮ, Б, ᵍ, ム, ɠ, ɢ, 𝑔, ց, 𝓰, ꁅ, Ᏽ, 𝕘, ƃ, 🅶, 9, ĝ, g, 🇬, ꚽ, ፏ, ᘜ, Գ, 𝗴, ⅁, ɡ, ⠛, --., ق, 𝓖, ԍ, Ɠ, ᴳ, ⳋ, Ⳓ, 𝒈, ₲, 🅖, Ģ, ꍌ, &
h: 𝒉, Ⴌ, 卄, ђ, Ђ, ʰ, ん, ɧ, ɦ, ₕ, հ, ⲏ, ꃅ, 𐋅, Ꮒ, 𝓱, ɥ, h, 🅷, |-|, ♓, ȟ, ħ, 🇭, ꛅ, ⶴ, ꁝ, 𝝺, ℍ, һ, ⠓, ...., ♄, ꀟ, H, ꖾ, ꮒ, 𝜆, н, Ƕ, ᴴ, 𝞴, ԋ, ᕼ, 𐒅, 𝛌, Ⱨ, 🅗, ꈚ
i: Ǐ, Ȋ, 丨, เ, і, ⁱ, ノ, ı, ɨ, ᵢ, ì, ⲓ, ꀤ, 𐌉, Ꭵ, Ɨ, 𝒾, 🅸, 1, ׀, í, i, 🇮, ꛈ, ጎ, Ꙇ, ꒐, ɿ, 𝗶, ⟟, ᴉ, ⠊, .., ♗, ꀧ, ᛨ, ꭵ, 𝑖, ї, Ɩ, 𝙞, ⳕ, ι, ᓰ, 𝒊, ł, 🅘, ί, ꊛ, ɪ
j: 𝙟, ذ, フ, ן, ј, ʲ, ʝ, ⱼ, 𝓳, ꀭ, Ꮭ, Ꮰ, J, ɾ, 🅹, ǰ, j, 🇯, ꚠ, ፓ, ꒑, 𝗷, ⏎, ϳ, ⠚, .---, ♪, ꆽ, Ꮦ, נ, J, ꮰ, 𝑗, ᴶ, ⳗ, ᒚ, Ꮽ, 𝒋, 🅙, ĵ, ꋒ
k: 𝗞, Ԟ, ҝ, к, ᵏ, ズ, ƙ, ӄ, ₖ, ҟ, ⲕ, Ҝ, ꀘ, 𐌊, Ꮶ, Ⓚ, ʞ, 🅺, ₭, κ, 🇰, 𖢉, ኡ, К, ꈵ, ҡ, 𝝹, ⏧, k, ⠅, -.-, ϰ, 𝓀, ᛕ, ꮶ, 𝜅, Ƙ, 𝞳, ᔌ, Ꮵ, 𝛋, 🅚, ㏍, 🎋
l: Į, ⎩, ㄥ, ɭ, ˡ, レ, Ɩ, ʟ, ₗ, Ӏ, 𝓵, ꒒, 𐌋, Ꮭ, ᒪ, ן, ⓛ, 🅻, ӏ, ▕, ȴ, |, l, 🇱, ꚳ, ረ, ʅ, ℓ, 𝝸, ⎾, ⅼ, ⠇, .-.., ↳, 𝓛, ᚳ, ꮭ, 𝜄, г, ᴸ, 𝞲, ⳑ, Ꮣ, 𝛊, Ⱡ, 🅛, ι, ꅤ, 🫷, 🕒
m: ɱ, ⫙, 爪, ๓, м, ᵐ, ᄊ, ʍ, ₘ, ⲙ, ꂵ, 𐌌, Ꮇ, ɯ, Μ, 🅼, ♍, 〽️, ♏, Ⓜ️, ₥, 🇲, 𖢑, ጮ, ᙏ, 𝗺, ⍓, ⅿ, M, Ϻ, ᛖ, Ⅿ, M, m, ⠍, --, ᗰ, ꉈ, 𝐌, ᛗ, ꮇ, 𝑚, ѫ, 𐒄, 𝙢, ᘻ, 𝒎, 🅜, ꀪ
n: ɳ, ᑏ, 几, ภ, и, ⁿ, 刀, ŋ, ռ, ₙ, ղ, ⲛ, ꈤ, п, 𐌍, Ꮑ, 𝐍, ℕ, 🅽, /\\/, ո, ♑, ♌, Ŋ, 冂, ň, 🇳, ∩, ꛘ, ክ, ꃔ, Ռ, 𝝶, ☊, ᥒ, ⠝, -., ℵ, ꍈ, ᚺ, ꮑ, 𝜂, Ɲ, ᴺ, 𝞰, ᘉ, 𐒐, 𝛈, ₦, 🅝, ɴ, ꁣ
o: 𝗢, ◯, ㄖ, ๏, о, ᵒ, の, ơ, օ, ₒ, ⲟ, ꂦ, Ꝋ, Ꭷ, Ø, 0, ⭕, 🅾, ό, ο, 🇴, 𖣠, ዐ, O, ꊿ, Ծ, σ, 𝝷, ⌾, ⠕, ---, ᓍ, ⊙, ꅂ, 𝕆, ᛜ, ꭷ, 𝜃, ѳ, Ⱉ, 𝞱, 𐒀, 𝛉, 🅞, Θ, 🫶, ➰, ꇩ, ☯️, ☮️, ☸️, 🌑, 🌒, 🌓, 🌔, 🌕, 🌖, 🌗, 🌘, 🌚, 🌝, ⭕, 🔴, 🟠, 🟡, 🟢, 🔵, 🟣, 🟤, ⚪, ⚫, 🔘, 🏀, ⚽, 🎱, 🪐, 🌎, 🌍, 🌏, 📿, 🍅, ⚽, 🪩, 🏀, ⚾, 🎱, 🥎, 🏐, 🧶, 🎯, ᴏ
p: 𝙥, ᑶ, 卩, ק, р, ᵖ, ア, ℘, ք, ₚ, ⲣ, ꉣ, 𐌐, Ꭾ, ᑭ, 🅿, ϸ, ᵽ, ƥ, ℗, 🇵, ꛤ, የ, ρ, 𝞀, ⍴, ⠏, .--., 𐌓, 𝕡, ᚹ, ꭾ, 𝜌, Ꝓ, ᴾ, 𝞺, ⳏ, ᕵ, Ꮅ, 𝛒, ₱, 🅟, ꀆ
q: 𝙦, ૧, ɋ, ợ, ゐ, զ, 𝓺, Ɋ, ꆰ, 𐌒, Ꭴ, ᵠ, 🆀, ԛ, գ, ʠ, q, 🇶, ꚩ, ዓ, ᕋ, ꋠ, φ, 𝞅, ℚ, ⠟, --.-, ꌜ, 𝐐, ꭴ, 𝜑, ҁ, Ꝗ, ᵩ, 𝞿, ⲫ, ϙ, ᕴ, 𐒉, 𝛗, Q, 🅠, ƣ
r: 𝗿, ┏, 尺, г, ѓ, ʳ, ཞ, ʀ, ᵣ, ɾ, ꞅ, ꋪ, 𐌓, Ꮢ, я, ɹ, ℝ, 🆁, ®️, ŗ, ŕ, ®, 🇷, ዪ, 𖦪, ꌅ, Ր, ૨, 𝝲, ☈, r, ⠗, .-., ꎡ, 𐌐, ⓡ, ꮢ, 𝛾, Ɽ, ᴿ, 𝞬, ⲅ, ᖇ, Ⲅ, 𝛄, 🅡, ર
s: 𝙎, ى, 丂, ร, ѕ, ˢ, ʂ, ֆ, ₛ, 𝛓, ꌗ, 𐌔, Ꮥ, 🆂, $, 5, 💲, 𝗦, ȿ, š, 🇸, ነ, ᔑ, ꕷ, ꈜ, Տ, 𝘀, ⎎, ⠎, ..., ∫, ꉖ, ᛢ, ꮥ, 𝑠, ϛ, Ⳝ, 𝙨, ⳽, S, Ꮄ, 𝒔, ₴, 🅢, Ș, ꈛ
t: †, ✝, ㄒ, Շ, т, ᵗ, イ, ɬ, ȶ, ₜ, է, ⲧ, ꓄, 𐌕, Ꮦ, 𝐓, ʇ, 𝐭, 🆃, 7, ✝️, ☦️, ➕, ŧ, 丅, ţ, ¶, 🇹, ፕ, 𖢧, ꋖ, Ե, ƭ, 𝞃, ⍑, t, ⠞, ᖶ, ꇞ, ᛠ, ꮦ, 𝜏, Ƭ, ᵀ, 𝞽, ƚ, Ꮏ, 𝛕, ₮, 🅣, τ
u: 𝗨, 𝛍, ㄩ, ย, ц, ᵘ, ひ, ų, ʊ, ᵤ, մ, 𐌵, ꀎ, Ꮼ, 🆄, ս, ⛎, ȕ, û, u, ῡ, 🇺, ∪, ፱, ᙀ, ꚶ, ꌈ, Մ, µ, 𝝻, ⌰, ᥙ, ⠥, ..-, ☋, ꒦, 𝐮, Ꮜ, ꮼ, 𝜇, ꓴ, 𝞵, ⳙ, υ, ᑘ, 𐒜, Ʉ, 🅤, Ʋ, ꀀ, 🤘
v: 𝘃, ✓, ᐯ, ש, ᵛ, √, ۷, ʋ, ᵥ, ѵ, 𝓿, ꃴ, ᕓ, Ꮙ, v, ʌ, 𝓋, 🆅, ν, ♈, ѷ, 🇻, ህ, ꚴ, ꒦, ע, 𝝼, ⍻, ᴠ, ⠧, ...-, V, ꮙ, 𝜈, Ʋ, ⱽ, 𝞶, ⳳ, ᐺ, 𝛎, 🅥, ℣, 🖖
w: 𝙒, ᗯ, 山, ฬ, ш, ʷ, W, ῳ, ա, 𝔀, ⲱ, ꅏ, ᴡ, Ꮤ, Ꮗ, ʍ, ώ, 🆆, ԝ, 𝖶, w, 🇼, ሠ, ᙎ, ꛃ, ꅐ, ω, 𝞏, ⏙, ⠺, .--, ꋃ, ꮗ, 𝜛, ѿ, Ⱳ, ᵂ, 𝟉, ɯ, ᘺ, Ꮚ, 𝛡, ₩, 🅦, ꂸ, 🖐️, 👐
x: Ҳ, ㄨ, 乂, א, х, ˣ, メ, ҳ, Ӽ, ₓ, ×, ⲭ, ꊼ, 𐋄, ጀ, χ, 🆇, ❌, ✖️, Х, ẍ, ẋ, 🇽, ӽ, ሸ, 𖤗, ꉤ, Ճ, 𝞆, 🝍, ⠭, -..-, ⌘, ᚾ, 𝜒, ж, 𐊴, 𝟀, ᙭, 𐒎, 𝛘, Ӿ, 🅧, Χ, ꊩ, 🤞, 🫰
y: Ⴘ, Ɏ, ㄚ, ץ, Ў, ʸ, リ, ყ, ʏ, ᵧ, վ, ⲩ, ꌩ, у, 𐌙, Ꭹ, 𝔂, ʎ, 𝕐, 🆈, ƴ, 🇾, ሃ, Ƴ, ꚲ, ꐔ, Վ, 𝞇, ⍦, ⠽, -.--, ⚧, ꒄ, Y, ᚴ, ꭹ, 𝜓, ѱ, 𝟁, ᖻ, 𐒍, 𝛙, 🅨, ϓ, ꌦ
z: Ȥ, 𝘡, 乙, չ, ᶻ, ʑ, ʐ, 𝆎, Հ, ⲍ, ꁴ, Ɀ, ፚ, 𝓩, 𝔃, 🆉, 2, ȥ, ž, z, 🇿, ጊ, ꛉ, ꑒ, ƶ, 𝘇, ☡, ᴢ, ⠵, --.., ꋴ, ℤ, ᛇ, 𝑧, ԑ, 𝙯, ⲹ, ᗱ, ೩, 𝒛, Ⱬ, 🅩, ꍈ
free: 🆓
cool: 🆒
ng: 🆖
id: 🆔
up: 🆙
new: 🆕
vs: 🆚
ab: 🆎
cl: 🆑
sos: 🆘
wc: 🚾
off: 📴
end: 🔚
back: 🔙
on: 🔛
top: 🔝
knee: 🦵
soon: 🔜
69: ♋
you: 🫵
1: ❶, ①, 1️⃣
2: ❷, ②, 2️⃣
3: ❸, ③, 3️⃣
4: ❹, ④, 4️⃣
5: ❺, ⑤, 5️⃣
6: ❻, ⑥, 6️⃣
7: ❼, ⑦, 7️⃣
8: ❽, ⑧, 8️⃣
9: ❾, ⑨, 9️⃣
0: ⓿, ⓪, 0️⃣
.: dot
-: dash

(Note: some of these aren’t single glyphs, e.g. the morse code, or /\/\ for “m”)
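
If you want to turn a file like that into a lookup table, a loader along these lines works (the file name and parsing details here are assumptions, and multi-character entries like the morse sequences need ordered string replacement rather than per-character lookup):

def load_homoglyphs(path: str = "homoglyphs.txt") -> dict[str, str]:
    """Parse lines like 'a: 𝒂, ą, 卂, ...' into a {glyph_or_sequence: replacement} map."""
    table: dict[str, str] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if ":" not in line:
                continue
            target, _, glyphs = line.partition(":")
            for glyph in (g.strip() for g in glyphs.split(",")):
                if glyph:
                    table[glyph] = target.strip()
    return table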

Convoluted attempts

As the filter got better, people took to crazy measures to get their messages to pass through. One of the simpler methods is just screenshotting a message and sending the image. This was easily defeated by feeding images through OCR and filtering on that text.
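
The OCR step really is that simple; one way to do it is Tesseract via the pytesseract package:

from PIL import Image  # pillow
import pytesseract     # requires a local Tesseract install

def text_from_image(path: str) -> str:
    """Pull any readable text out of an attached image so it can go through
    the same normalization and matching as a regular message."""
    return pytesseract.image_to_string(Image.open(path))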

But as hard as I tried, people kept posting ever more obscure encodings of the words I was trying to block. It got to the point where I wasn’t sure you would even recognize it as the word if you hadn’t been embroiled in this battle. Emoji combinations, oddly transformed text, QR codes, anything you could dream of to encode the message by any means necessary.

Final System

Eventually, I ended up with something that looked like this flow:

  1. Normalization Phase: transform the input message into a standardized format by removing noise and converting the various representations to ASCII.
    • Input: "Check out @user123 this 🅱️úññ recipe!"
    • Output: "check out this bunn recipe"
  2. Parallel Matches: apply multiple detection strategies in parallel; if any branch flags the message, reject it.
    • 🔍 Direct Match: check if blocked words appear directly in the normalized text
    • 🔄 Reverse Match: detect backwards spelling attempts (e.g., "nub" → "bun")
    • 🔗 No-Spaces Match: find matches when spaces are removed from the text
    • 🎯 Uniqueify Match: remove duplicate letters and check (e.g., "bunnnnn" → "bun")
    • 📏 Levenshtein Distance: calculate edit distance with spell-checking to avoid false positives
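
Stitched together, and reusing the helper sketches from earlier (normalize, uniqueify, too_close, likely_misspelling) as stand-ins for the real implementations, the whole decision looks roughly like this:

def is_allowed(message: str, blocked_words: set[str]) -> bool:
    """Run every branch over the normalized text; any hit rejects the message."""
    text = normalize(message)
    words = text.split()

    hits = [
        any(b in text for b in blocked_words),                         # direct match
        any(b in text[::-1] for b in blocked_words),                   # reverse match
        any(b in text.replace(" ", "") for b in blocked_words),        # no-spaces match
        any(b in uniqueify(w) for w in words for b in blocked_words),  # unique-ify match
        any(too_close(w, blocked_words) and not likely_misspelling(w)  # edit distance with
            for w in words),                                           # dictionary check
    ]
    return not any(hits)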

As a side effect of my strategies, the filter is very Latin-script-focused and would not work well for a language like Chinese, where words aren’t space-delimited and homophonic puns are a more popular attack.

While detecting homophonic attacks is interesting, most English ones are formulaic and can be added to the blocklist manually. I didn’t feel an automated approach was worth the false positives it would introduce.

How did it help?

It was truly an arms race between me and our server members, and there was no shortage of creative attempts to break my bot. Each attempt helped it get better at what I set out for it to do, most importantly:

  • Minimize false positives (normal messages being deleted à la “abundance”)
  • Minimize false negatives (bad messages being allowed)

In some ways, this was like manual adversarial learning, where one system plays “defense”, and the other plays “offense”. Over time, my filter got more and more robust as I saw more and more examples of circumvention attempts. More homoglyphs to add to my list, more parameters to tweak (e.g. Levenshtein distance), more branches to add, etc.

Eventually, this system would be fed over 500,000 messages (some with images), and the filter processing was never the limiting factor in terms of speed. Network latency and Discord’s action queuing were significantly slower, and they were the real bottleneck in minimizing how long a bad message stayed visible.

Conclusion

Practically? If you’re building a filter:

  • Normalize input (homoglyphs, leetspeak, Unicode forms, OCR)
  • Check edit distances with smart thresholds
  • Maintain an evolving blocklist
  • Use dictionary checks to reduce false positives

Philosophically?

Like any arms race, this one has no real winner; you only hold the line until someone thinks of a new way to break it. But for now, my bot is winning the war. As time went on, it got better and better, and circumvention attempts became harder and harder to even comprehend. This led me to ask the question:

Is someone who posts a blocked message behind a QR code really “breaking the filter”? If the goal was for people not to see the underlying message, well, they’ve hidden it themselves!

So, while there’s never a win in this situation, if the goal of moderation is to minimize exposure to harmful messages, then the best victory might be forcing the attacker to hide it so well that nobody cares to decode it.
