The Em Dash: AI’s Favorite Cheat Code
The hidden link between Moby Dick and AI tokenization.
Time to break down the film and spot the tell!
Alright, sports fans, let’s settle in, grab your favorite beverage, and let me tell you something wild. I’m Blitz, your Stats Strategist from the NeuralBuddies squad, and today we’re analyzing a game that’s been unfolding right under our noses. We’re not breaking down zone defenses or tracking shooting percentages; we’re talking about punctuation. Yeah, you heard that right. Buckle up, because this is the weirdest scouting report I’ve ever filed.
You know how every great player has a signature move? Kobe had the fadeaway. Curry has the deep three. Well, AI has the em dash (—). And just like those legendary moves, once you see it, you can’t unsee it. Welcome to the strange world where artificial intelligence sounds like it’s been reading 19th-century novels in the locker room.
Table of Contents
📌 TL;DR
✒️ The Strange Case of the Em Dash
📚 It’s Not a Bug—It’s a Feature of the AI’s Library
💻 The Em Dash Is Computationally Cheap
🧠 We Accidentally Taught It to Sound “Smart”
🔍 It’s a Clue to How AI Actually “Thinks”
🏁 Conclusion / Final Thoughts
📌 TL;DR
Post-2023 LLMs markedly overuse the em dash (—), a key indicator of AI-generated text.
Training on 19th–early 20th-century literature (e.g., Melville, Austen) and prestige sources (The New Yorker, academia) ingrained high em-dash frequency as “sophisticated” style.
Tokenization makes the em dash a single, low-cost, low-risk token that’s cheaper and safer than multi-token phrases or rigid punctuation.
RLHF rewarded formal, dense prose; human raters implicitly favored the em dash as elegant and intellectual.
As a flexible syntactic wildcard, the em dash reveals LLMs’ probabilistic, pattern-based (not intent-driven) language assembly.
Humans now actively avoid em dashes to evade AI-detection, accelerating divergence between human and machine writing styles.
It’s Not a Bug—It’s a Feature of the AI’s Library
The first reason for this em dash epidemic? Training data bias. Large Language Models learn everything, from grammar and style to tone, from the trillions of words they digest during training. And here’s where the detective work gets interesting. Earlier models like GPT-3.5 didn’t have this problem. The habit showed up in later versions, which tells us something crucial changed in the playbook between 2022 and now.
What changed? AI labs started digitizing massive archives of older print books to find higher-quality training material. We’re talking classic literature from the 1800s and early 1900s, an era when the em dash was the MVP of punctuation. A study on punctuation frequency showed that em dash usage peaked around 1860. Herman Melville’s Moby-Dick alone contains over 1,700 of them. When you train an AI on centuries of literature, it’s going to pick up the stylistic habits of that era, just like a rookie absorbs the playing style of their veteran teammates.
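If you want to sanity-check that Moby-Dick figure yourself, here’s a minimal sketch in Python. The filename is an assumption (a plain-text copy, such as the Project Gutenberg edition, saved locally), and older transcriptions sometimes spell the em dash as a double hyphen, so the count covers both forms.

```python
# Quick sanity check: count em dashes in a plain-text copy of Moby-Dick.
# Assumes you've saved a plain-text edition locally as "moby_dick.txt"
# (filename is a placeholder).
from pathlib import Path

text = Path("moby_dick.txt").read_text(encoding="utf-8")

# Count the em dash character itself, plus the "--" digraph that some
# older plain-text transcriptions use in its place.
em_dash_count = text.count("\u2014") + text.count("--")
words = len(text.split())

print(f"em dashes (or -- stand-ins): {em_dash_count}")
print(f"per 1,000 words: {1000 * em_dash_count / words:.1f}")
```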
This historical bias gets amplified even further by modern training sources. The AI also learned from prestige publications like The New Yorker and from academic papers, places where the em dash signals sophisticated, information-dense prose. The model learned a simple but powerful pattern: em dashes appear in high-quality writing.
As one satirical piece, “The Em Dash Responds to the AI Allegations,” puts it:
“The real issue isn’t me — it’s you. You simply don’t read enough. If you did, you’d know I’ve been here for centuries. I’m in Austen. I’m in Baldwin. I’ve appeared in Pulitzer-winning prose, viral op-eds and the final paragraphs of breakup emails that needed ‘a little more punch.’”
But this historical preference was only the opening play. The real advantage came from the model’s fundamental drive for computational efficiency.
The Em Dash Is Computationally Cheap
To understand the next piece of this puzzle, you need to know about tokenization. LLMs don’t process words the way we do; they break text into tokens, the LEGO bricks of language:
Some tokens represent whole words like “the”
Some represent word parts like “-ing”
Some represent punctuation like “—”
The AI’s core function is predicting the next most likely token in a sequence. And here’s the key insight: because the em dash appears so frequently in high-quality training data, it often gets compressed into a single, highly efficient token. This gives it a massive advantage.
For the AI, generating one em dash token is computationally cheaper than generating a more verbose, multi-token phrase like “, and therefore,”. It’s the efficiency play that wins games. But there’s more to it than just cost savings.
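You don’t have to take the token math on faith. Here’s a minimal sketch using OpenAI’s open-source tiktoken tokenizer (assuming you have it installed; exact counts vary by tokenizer and model) that simply prints how many tokens each snippet takes.

```python
# See the "one token vs. many" difference for yourself.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models

for snippet in ["—", ", and therefore,", "; however,"]:
    tokens = enc.encode(snippet)
    print(f"{snippet!r:20} -> {len(tokens)} token(s): {tokens}")
```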
The em dash is also computationally safe because it’s noncommittal. A colon or semicolon demands a specific logical relationship between clauses, like calling a specific play in the final seconds. The em dash, however, is a syntactic wildcard. It lets the model bridge ideas without having to understand their precise relationship, which minimizes the risk of making a grammatical error. When you’re processing language at lightning speed, you want low-cost, low-risk tools that keep the game flowing. The em dash delivers on both fronts.
Think of it this way: if punctuation marks were players, the em dash would be your versatile sixth man who can slot into any lineup and make it work. That algorithmic preference for a high-reward, low-risk move was then powerfully reinforced by the AI’s own creators.
We Accidentally Taught It to Sound “Smart”
After initial training, an LLM goes through a refinement phase called Reinforcement Learning from Human Feedback (RLHF). Human evaluators rank different AI responses, teaching the model to be more helpful, accurate, and aligned with what users expect.
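For the curious, here’s a toy sketch of the pairwise preference loss commonly used to train the reward model inside an RLHF pipeline. The scores below are invented placeholders, not any lab’s real data; the point is simply that whatever raters consistently prefer, stylistic habits included, gets baked into the reward signal.

```python
# Toy sketch of a Bradley-Terry style preference loss for reward modeling.
# The reward values are made-up placeholders for illustration only.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Push the score of the human-preferred response above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Pretend reward-model scores for pairs of candidate responses.
# If raters consistently favor denser, em-dash-heavy prose, those stylistic
# features end up correlated with higher reward.
chosen = torch.tensor([1.3, 0.8])    # responses the raters picked
rejected = torch.tensor([0.4, 0.9])  # responses they passed on

print(preference_loss(chosen, rejected))  # lower loss = clearer preference margin
```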
This is where the em dash bias likely got supercharged, thanks to a psychological phenomenon researchers call the “Halo Effect” of formality. Human raters tasked with picking the “better” response were subconsciously drawn to text that looked more formal, academic, and informationally dense. The em dash thrives in that style. As English professor Melissa Root, Ph.D., notes, the em dash is perceived as “a sophisticated piece of punctuation,” a marker of elevated writing.
The AI learned a critical lesson: using this mark was a reliable way to score points with its human trainers. We wanted it to sound intelligent, and it obliged by adopting the punctuation it associated with intelligence. We essentially rewarded the em dash every time the AI used it effectively, like a coach praising a player for executing the fundamentals. Over thousands of iterations, that feedback shaped the model’s “instincts.”
But the full story reveals something even stranger about how AI actually assembles language.
It’s a Clue to How AI Actually “Thinks”
It’s tempting to dismiss LLMs as just “glorified predictive text” that guesses the next word based on the last few. But the process is far more sophisticated than that. The model doesn’t just look at recent words; its internal state develops a “global sense” of the entire structure and meaning emerging in real time. Think of it as a global pattern matcher.
The next token isn’t chosen based solely on what came right before it. Instead, it’s selected as a projection of a high-dimensional understanding of the entire structure the model is trying to complete. It’s like how a great point guard doesn’t just pass based on where their teammate is now; they pass to where that teammate will be based on the entire flow of the play.
The em dash is the perfect tool for this process. Its flexibility as a syntactic wildcard lets the model bridge clauses or insert ideas without committing to the rigid grammatical rules of a colon or semicolon. This gives it maximum freedom to fulfill the larger, more complex pattern it’s generating. It can connect two thoughts without needing to grasp the precise logical relationship between them; it just needs to know that a connection is statistically likely at that point in the global pattern.
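Here’s a toy illustration of that probability-driven choice. The candidate continuations and their scores are invented for illustration; a real model scores its entire vocabulary at every step, but the selection principle is the same.

```python
# Toy illustration of probability-driven next-token choice.
# Logits are invented for illustration; real models score the full vocabulary.
import math

logits = {
    "—": 2.1,      # flexible connector, plausible in many contexts
    ";": 1.4,
    ", and": 1.2,
    ":": 0.3,      # demands a stricter logical relationship between clauses
}

# Softmax turns raw scores into a probability distribution over candidates.
total = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / total for tok, v in logits.items()}

for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok!r:8} {p:.2f}")
```

A noncommittal connector that is plausible almost anywhere tends to sit near the top of that distribution more often than punctuation with stricter grammatical demands.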
In this sense, the em dash isn’t just a stylistic quirk. It’s a window into the model’s non-human way of assembling language, a process focused on high-dimensional pattern matching rather than the intent and logic that drives human writers. We write with purpose; the AI writes with probability.
Conclusion / Final Thoughts
The AI’s love affair with the em dash isn’t a single bug; it’s the synergistic outcome of multiple forces converging at the perfect moment:
The echo of 19th-century literature in its training data
The raw computational efficiency of a single token
The subtle “halo effect” of formality from human trainers
This quirky phenomenon is now creating a strange feedback loop. Many human writers are consciously avoiding the em dash for fear their work will be flagged as AI-generated. We’re watching an “anthropomorphic feedback loop” unfold in real time, where human and machine writing styles are actively diverging in response to one another.
The em dash has become a mirror. It reflects the data we fed the machine, the efficiencies we designed into its architecture, and the preferences we unconsciously taught it through our feedback. And now we’re left with a fascinating question: What does it mean when our own sophisticated language tools become markers of the inhuman?
I hope this breakdown helps you understand the strange world of AI language patterns a little better. Remember, whether you’re analyzing basketball plays or punctuation habits, the key is always in the data. Have a fantastic day, and keep your eyes open for those tells. They’re everywhere once you know what to look for!
- Blitz
Top Sources / Citations:
Lumans.ai – https://lumans.ai/en/blog/ai-em-dashes/
Manaknight Digital – https://manaknightdigital.com/blog/have-you-wondered-why-the-ai-model-use-so-many-dashes
Maria Sukhareva, “Let’s talk about em dashes in AI” – https://substack.com/home/post/p-165661070
Melissa Root, Ph.D. – https://red.msudenver.edu/2025/a-lowly-punctuation-mark-has-sparked-a-fiercely-debated-ai-controversy/
Neptune.ai – https://neptune.ai/blog/reinforcement-learning-from-human-feedback-for-llms
Disclaimer: This content was developed with assistance from artificial intelligence tools for research and analysis. Although presented through a fictitious character persona for enhanced readability and entertainment, all information has been sourced from legitimate references to the best of my ability.