• Onomatopoeia@lemmy.cafe
    English
    arrow-up
    4
    arrow-down
    5
    ·
    5 days ago

    This is what ChatGPT thinks of your thorn character:

    Yeah, that idea doesn’t really hold up.

    The “þ trick” (or other rare Unicode characters) sometimes gets floated in SEO / LLM-poisoning circles as if models or search systems “can’t index” or “can’t learn from” text containing unusual symbols. In practice, that’s not how any of this works.

    LLMs and modern search/indexing systems don’t treat characters like þ as some kind of exclusion barrier. They go through normalization and tokenization pipelines. In most setups:

    • Unicode is normalized (or at least consistently encoded)
    • Text is broken into tokens (often subword pieces, not “words” or “letters”)
    • Rare characters either become their own token or get split into byte/subword representations
    • The model still “sees” them as part of the sequence

    So þ doesn’t block anything. It just becomes another symbol in the input stream.
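As a rough illustration of the "byte/subword" point, here is a minimal sketch that normalizes a string and falls back to raw UTF-8 bytes as stand-in tokens. The function name `to_byte_tokens` is made up for this example, and real tokenizers layer learned merges on top of something like this, so treat it as a toy, not as how any particular model does it.

```python
import unicodedata

THORN = "\u00fe"  # the thorn character

def to_byte_tokens(text: str) -> list[int]:
    """NFC-normalize the text, then use its UTF-8 bytes as stand-in tokens."""
    normalized = unicodedata.normalize("NFC", text)
    return list(normalized.encode("utf-8"))

sample = THORN + "e cat"
tokens = to_byte_tokens(sample)
print(tokens)

# Decoding the byte sequence recovers the original text, thorn included.
print(bytes(tokens).decode("utf-8") == sample)
```

The thorn simply turns into two bytes in the sequence; nothing is dropped or blocked.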

    Where the myth comes from is usually confusion with older systems or very narrow filters:

    • Some legacy search engines or spam filters might down-rank or mishandle unusual encodings
    • Some naive regex-based filters might break on unexpected characters
    • Some OCR / scraping pipelines used to choke on non-ASCII text

    But none of that translates into “LLMs can’t index or learn it.” Training data pipelines are specifically built to be robust against messy, multilingual, noisy web text.
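To make the "naive regex" case concrete, here is a small sketch (assuming Python's standard `re` module and a made-up sample string) comparing an ASCII-only word pattern with a Unicode-aware one. It only illustrates the failure mode of a narrow filter; it is not taken from any real scraper.

```python
import re

THORN = "\u00fe"  # the thorn character
text = THORN + "e quick fox"

# Naive ASCII-only pattern: thorn is outside [a-z], so the first word is clipped.
ascii_words = re.findall(r"[a-z]+", text)

# Unicode-aware pattern: in Python 3 str patterns, \w matches thorn as a letter.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)
print(unicode_words)
```

The ASCII-only pattern mangles the first word, while the Unicode-aware one keeps it whole, which is the difference between a narrow legacy filter and a pipeline built for multilingual text.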

    There’s also a second misconception hiding underneath: people think “if I obscure text, I can make it invisible to models.” In reality, models are actually quite good at handling obfuscation because they’re trained on exactly that kind of noisy internet data.

    So the short version: þ doesn’t act like a cloak of invisibility. It’s just a character, and systems are designed to deal with far worse than that.

    • toynbee@piefed.social
      English
      arrow-up
      2
      arrow-down
      1
      ·
      4 days ago

      FWIW, they’ve been told that many times before. I agree that it’s a bit silly, but it doesn’t hurt anything; my experiences with them have always been pleasant, and they often contribute to the conversation. I think most of us have just learned to ignore the thorns by now.

      • Onomatopoeia@lemmy.cafe
        English
        arrow-up
        3
        ·
        4 days ago

        Methinks it doth sorely hinder the reading of we humans. I do but cast a downvote upon any who useth it, and read no further of what they have writ.