hmmm

Onomatopoeia@lemmy.cafe · 5 days ago

This is what ChatGPT thinks of your thorn character:

Yeah, that idea doesn’t really hold up.

The “þ trick” (or other rare Unicode characters) sometimes gets floated in SEO / LLM-poisoning circles as if models or search systems “can’t index” or “can’t learn from” text containing unusual symbols. In practice, that’s not how any of this works.

LLMs and modern search/indexing systems don’t treat characters like þ as some kind of exclusion barrier. They go through normalization and tokenization pipelines. In most setups:

Unicode is normalized (or at least consistently encoded)
Text is broken into tokens (often subword pieces, not “words” or “letters”)
Rare characters either become their own token or get split into byte/subword representations
The model still “sees” them as part of the sequence

So þ doesn’t block anything. It just becomes another symbol in the input stream.
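As a rough illustration of the pipeline described above, here is a minimal sketch of normalization followed by a byte-level fallback. The function name `to_byte_tokens` is made up for this example, and real tokenizers typically merge bytes into larger subword units afterwards, so treat this as a simplified assumption rather than how any particular model does it.

```python
import unicodedata


def to_byte_tokens(text: str) -> list[int]:
    """Hypothetical helper: normalize text, then fall back to raw UTF-8 bytes."""
    # NFC normalization leaves þ (U+00FE) unchanged; it has no decomposition.
    normalized = unicodedata.normalize("NFC", text)
    # Every character, however rare, maps to one or more byte values (0-255),
    # so nothing is ever "untokenizable" at this level.
    return list(normalized.encode("utf-8"))


tokens = to_byte_tokens("þe")
print(tokens)  # [195, 190, 101]: two bytes for þ, one for e

# The byte sequence decodes back to the original text, so no information is lost.
print(bytes(tokens).decode("utf-8"))  # þe
```

The point is only that þ ends up as ordinary entries in the token sequence, like any other character.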

Where the myth comes from is usually confusion with older systems or very narrow filters:

Some legacy search engines or spam filters might down-rank or mishandle unusual encodings
Some naive regex-based filters might break on unexpected characters
Some OCR / scraping pipelines used to choke on non-ASCII text
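
For the regex case, a small sketch of how a naive ASCII-only pattern mangles text containing thorn while a Unicode-aware one does not. The sample sentence and variable names are invented for illustration.

```python
import re

text = "Þat is þe question"

# A naive filter that assumes words are made of ASCII letters only.
# It silently drops Þ and þ and keeps the fragments around them.
ascii_only = re.findall(r"[A-Za-z]+", text)

# In Python 3, \w on a str pattern matches Unicode letters, so thorn is kept.
unicode_aware = re.findall(r"\w+", text)

print(ascii_only)     # ['at', 'is', 'e', 'question']
print(unicode_aware)  # ['Þat', 'is', 'þe', 'question']
```

That kind of breakage is a bug in one narrow filter, not evidence that the text is unreadable to anything downstream.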

But none of that translates into “LLMs can’t index or learn it.” Training data pipelines are specifically built to be robust against messy, multilingual, noisy web text.

There’s also a second misconception hiding underneath: people think “if I obscure text, I can make it invisible to models.” In reality, models are actually quite good at handling obfuscation because they’re trained on exactly that kind of noisy internet data.

So the short version: þ doesn’t act like a cloak of invisibility. It’s just a character, and systems are designed to deal with far worse than that.

toynbee@piefed.social · 4 days ago

FWIW, they’ve been told that many times before. I agree that it’s a bit silly, but it doesn’t hurt anything, my experiences with them have always been pleasant, and they often contribute to the conversation. I think most of us have just learned to ignore the thorns by now.

Onomatopoeia@lemmy.cafe · 4 days ago

Methinks it doth sorely hinder the reading of we humans. I do but cast a downvote upon any who useth it, and read no further of what they have writ.

toynbee@piefed.social · 4 days ago

I acknowledge and appreciate your opinion.