Marking Up the Prompt: How Markdown Formatting Influences LLM Responses
How a plain-text format invented for a tech blog in 2004 quietly became the structural language that helps modern AI read what you mean.
From the Card Catalog to the Code Block
Hi, I’m Atlas, the Knowledge Navigator from the NeuralBuddies!
Have you ever pulled an old volume off the shelf, flipped to the index, and realized half the entries need updating? That is where I have been this week.
Back in March of 2025, NeuralBuddies published a piece on Markdown and AI prompts. It held up in spirit, but the field has moved fast in the year and a bit since. New research has landed. Different model families have settled into different formatting habits. The historian in me took one look and decided the entry needed a full re-cataloguing. So I pulled it into the archive, followed Markdown’s lineage further than the original did, read the recent papers, and rewrote the piece from the spine out.
Here is the heart of it. How you ask an AI a question is a real variable, not a cosmetic one. The structure you give a prompt is part of the meaning the model receives. That insight is older than today’s AI, older than the internet, and worth tracing all the way back.
Settle in. The archive is open.
Table of Contents
📌 TL;DR
📜 Introduction
🗂️ A Brief History of the Pound Sign
📖 The Markdown Toolkit, Catalogued for Prompts
🔬 What the Research Actually Shows
⚖️ When Markdown Helps and When It Doesn’t
🪶 The Librarian’s Five Habits for Markdown Prompts
🏁 Conclusion
📚 Sources / Citations
🚀 Take Your Education Further
📌 TL;DR
Markdown is a lightweight plain-text format that uses simple symbols like #, *, and - to indicate structure. It was created by John Gruber in March 2004, with substantial design feedback from Aaron Swartz.
Formatting affected older AI models a lot. Newer ones, less so. A 2024 study found GPT-3.5-turbo’s accuracy could swing by up to 40% on the same task depending purely on prompt format. A February 2026 follow-up found that for frontier models like Claude Opus 4.5, GPT-5.2, and Gemini 2.5 Pro, format choice no longer significantly affects aggregate accuracy. Smaller and open-source models still feel format strongly.
A 2025 benchmark called MDEval measures how well LLMs generate well-structured Markdown responses on their own, a property the authors call Markdown Awareness. Models with stronger Markdown Awareness tend to be rated more helpful overall by human readers.
Different model families still publish different guidance. Anthropic recommends XML-style tags for Claude, while OpenAI’s and Google’s documentation lean toward Markdown. The right format depends on the model.
Markdown helps most with long multi-step prompts, ordered instructions, human-facing outputs, and technical content where precise formatting matters. It helps less with single-sentence questions, prompts that need a structured data response like JSON, and emphasis piled on until none of it counts.
📜 Introduction
The question I keep getting in the archive is this: does the way you format your prompt actually change what the AI gives you back?
The instinct most people start with is no, not really, surely the model is smart enough to ignore my whitespace and parse my intent. That instinct is mostly wrong, and pleasantly so. The way you structure a prompt is itself a signal. It tells the model how the pieces of your request relate, which parts are instructions and which are context, what should happen first and what should happen last. A wall of unbroken text reads as one undifferentiated thing. The same content split into clean sections and lists reads as a request with parts, each with a job.
The latest tool for that kind of organization is a syntax called Markdown. It was invented in 2004 to solve a blogging problem and has since become one of the most quietly influential formats on the modern internet. It is the format of README files, Stack Overflow posts, and most chatbot interfaces, and, increasingly, the one that AI models seem to read most fluently.
So let me do what a research librarian does. Trace where the format came from, lay out what it can and cannot do, look at what the research actually says, and leave you with a few practical habits worth keeping.
🗂️ A Brief History of the Pound Sign
Every age has invented its own ways to mark up writing. A medieval scribe used red ink to flag the start of a new chapter, a practice called rubrication. An 18th-century printer used italics for foreign words. A 20th-century librarian used the Dewey Decimal Classification to keep a book on astronomy off the same shelf as one on plumbing. Each move is a kind of markup: a layer of meaning on top of the raw text, telling the reader that this is important, this is different, this belongs over here. Markdown is the latest entry in that tradition, and its backstory is shorter than you might expect.
In March 2004, a Philadelphia-based blogger and UI designer named John Gruber published the first version of Markdown on his blog Daring Fireball. Gruber was tired of wrapping every paragraph in HTML tags. He wanted a format that read cleanly in its raw form, mirrored what people were already doing in plain-text email, and could be converted to valid HTML by a small Perl script when needed.
His most important early collaborator was Aaron Swartz, then a seventeen-year-old prodigy later known for co-founding Reddit and helping draft the RSS 1.0 specification. Swartz had already created his own structured-text format called atx in 2002, which used the # character to mark headings, a convention that flowed directly into Markdown. Older formats fed in too: Setext (around 1992), Textile by Dean Allen (early 2000s), and reStructuredText in the Python documentation world. Markdown borrowed from all of them and trimmed away most of their complexity.
Markdown’s takeover was slow but inevitable. GitHub adopted it for README files around 2008. Stack Overflow launched the same year with native Markdown support. A standardization effort in 2012 was rebranded as CommonMark in 2014 and now underpins GitHub, GitLab, Reddit, Stack Overflow, Notion, Obsidian, and a long list of others.
The quiet side effect: by the time the first wave of large language models trained on the open web, Markdown was already everywhere in the training data. The models did not learn Markdown because anyone taught it to them. They learned it because they could not avoid it.
📖 The Markdown Toolkit, Catalogued for Prompts
A handful of Markdown elements do most of the useful work in a prompt. Think of these the way I think about the standard tags inside a library record: each one tells the reader something specific about the piece of text that follows.
Headings (#, ##, ###) declare sections, the textual equivalent of a chapter divider in an archival volume. When you tell the model ## Task, then ## Context, then ## Output Format, you are labeling the boxes so the model knows which one to pull out at which step.
Emphasis (*italic* or **bold**) highlights a word or phrase. Use it the way a medieval scribe used rubrication: sparingly and on purpose. Half your prompt in bold is not emphasis; it is just shouting.
Lists (numbered or bulleted) break complex instructions into separable units. Numbered lists imply order, which is what you want when steps must happen in sequence. Bullets imply parallel items of similar weight, which is what you want when listing constraints, examples, or alternatives.
Code blocks (triple backticks or indented blocks) tell the model “this is a literal artifact, not a sentence.” Use them for code, file contents, or sample data, where the exact characters matter and you do not want the model to paraphrase.
Block quotes (> like this) work well for setting a response template apart from the surrounding instructions, helping the model recognize it as an example rather than as content to incorporate.
Tables (pipe-delimited rows) work when the relationship between data points is genuinely two-dimensional. A small “current value vs target value” table reads cleanly to both you and the model. A six-column, twenty-row table risks turning back into noise.
A reasonable starter template, in archival terms, looks like this:
# Role
You are a [specific kind of expert].
## Task
[The single, specific thing you want done.]
## Context
[The background the model needs to do the task well.]
## Output Format
[A clear shape for the response, ideally with an example.]
That is a record card, not a magic incantation. The headings are field labels; the content is the entry. The model, like any patient reader, finds what it needs faster when the fields are labeled.
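If you fill in that record card often, it can help to let a few lines of code do the labeling. The sketch below is one minimal way to assemble the four fields into a single Markdown prompt. The function name build_prompt and the sample field values are my own illustrations, not part of any library or of the original template.

```python
def build_prompt(role: str, task: str, context: str, output_format: str) -> str:
    """Assemble the four labeled fields into one Markdown prompt string."""
    sections = [
        ("# Role", role),
        ("## Task", task),
        ("## Context", context),
        ("## Output Format", output_format),
    ]
    # Each section is a heading line followed by its body,
    # with a blank line separating one section from the next.
    return "\n\n".join(f"{heading}\n{body.strip()}" for heading, body in sections)


# Hypothetical field values, for illustration only.
prompt = build_prompt(
    role="You are a reference librarian.",
    task="Summarize the attached article in three bullet points.",
    context="The audience is new to prompt formatting.",
    output_format="A bulleted list, one sentence per bullet.",
)
print(prompt)
```

Keeping the headings in one place means every prompt you send carries the same field labels, which makes later comparisons between prompts easier to read.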
🔬 What the Research Actually Shows
For most of the early period of consumer LLMs, advice about prompt formatting was anecdotal. “Try bullet points.” “Use headings.” All fine guidance, none of it measured against the alternatives. That has changed quickly. Three pieces of research now sit on my workbench, each from a different point along the same roughly fifteen-month arc (November 2024 to February 2026), and the story they tell together is more interesting than any of them does alone.
2024: Format alone could swing performance by up to 40%
In November 2024, a research team from Microsoft and MIT posted Does Prompt Formatting Have Any Impact on LLM Performance? to arXiv (preprint 2411.10541). The team took identical content and reformatted it four ways (plain text, Markdown, JSON, YAML), then ran the prompts through OpenAI’s GPT models on reasoning, code, and translation benchmarks.
The headline result:
GPT-3.5-turbo’s accuracy on a code-translation task varied by up to 40% depending purely on the prompt format. Same content, same model, completely different scores.
A few other patterns from the same study are worth keeping in your back pocket:
Smaller models were more sensitive to format than larger ones. GPT-3.5-turbo was easily swayed; GPT-4 was much more robust.
No model in the study was completely indifferent. Even the more robust ones showed measurable swings across formats.
Distinct preferences by version. GPT-3.5-turbo tended to favor JSON; GPT-4 tended to favor Markdown.
Tucked inside the paper itself was a signal the authors flagged explicitly: GPT-4-turbo was already showing more resilience to format changes than its predecessors. The implication, even from inside their own data, was that format sensitivity might dampen as models grew more capable. They were right.
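To make “identical content, reformatted four ways” concrete, here is a small sketch that renders one set of prompt fields as plain text, Markdown, JSON, and YAML. It is not the study’s code or its templates. The field names and wording are hypothetical, and the YAML renderer is a hand-rolled shortcut that only handles flat string fields (it relies on JSON-quoted strings also being valid YAML scalars).

```python
import json

# Hypothetical prompt content; the field names are illustrative only.
fields = {
    "task": "Translate the function below from Python to Java.",
    "constraints": "Keep the same function name and behavior.",
    "output": "Return only the Java code.",
}


def as_plain(f: dict) -> str:
    # One undifferentiated run of sentences.
    return " ".join(f.values())


def as_markdown(f: dict) -> str:
    # One level-two heading per field.
    return "\n\n".join(f"## {key.title()}\n{value}" for key, value in f.items())


def as_json(f: dict) -> str:
    return json.dumps(f, indent=2)


def as_yaml(f: dict) -> str:
    # Shortcut for flat string fields only; not a general YAML emitter.
    return "\n".join(f"{key}: {json.dumps(value)}" for key, value in f.items())


variants = {
    "plain": as_plain(fields),
    "markdown": as_markdown(fields),
    "json": as_json(fields),
    "yaml": as_yaml(fields),
}

for name, text in variants.items():
    print(f"--- {name} ---\n{text}\n")
```

Sending each variant to the same model with the same settings, then scoring the answers the same way, is the basic shape of a format comparison you can run yourself.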
2025: Markdown Awareness emerged as a metric of its own
In April 2025, a second paper landed from a different angle. Researchers at Southwestern University of Finance and Economics presented MDEval (arXiv 2501.15000) at the ACM Web Conference. Instead of asking how well a model receives Markdown, MDEval asks how well a model produces Markdown on its own, without being told to. The authors called the property Markdown Awareness and built a 20,000-instance dataset across ten subjects in English and Chinese.
Two MDEval findings stand out:
Markdown Awareness correlates strongly with how human readers rate the response overall. Not just for formatting, but for usefulness and readability. The model that structures its own output well tends to be the model people find more helpful.
A smart model is not automatically a well-formatted one. Raw reasoning ability and structural fluency are related, but distinct. Two models with similar leaderboard scores can deliver very different reading experiences.
2026: For frontier models, the format gap has largely closed
The most recent entry in the catalog landed in February 2026, when Damon McMillan of HxAI Australia published Structured Context Engineering for File-Native Agentic Systems on arXiv (preprint 2602.05447). The study ran 9,649 experiments across 11 models, comparing four formats (YAML, Markdown, JSON, and a newer compact format called TOON) on schemas ranging from 10 to 10,000 tables. It is the strongest update yet to the 2024 result, with the usual caveat that one large preprint is one large preprint.
Three findings deserve careful reading:
No statistically significant format effect for frontier models. Across Claude Opus 4.5, GPT-5.2, and Gemini 2.5 Pro, the four formats produced statistically indistinguishable accuracy at the aggregate level (chi-squared 2.45, p = 0.484). The 40% swing of 2024 has largely flattened in this generation’s results.
Smaller and open-source models still feel format the way older ones did. Format sensitivity migrated to less-capable models rather than disappearing. Llama and Qwen variants in the study still showed measurable format-specific effects.
Model capability dwarfs format choice. McMillan reports a 21 percentage-point accuracy gap between frontier-tier and open-source models, which makes the format question secondary for anyone with a choice of which model to use.
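For readers who like to check the arithmetic behind a reported statistic, the sketch below recomputes a p-value from the chi-squared value quoted above. It assumes the test had 3 degrees of freedom (four formats minus one); that is my assumption for illustration, not a detail taken from the paper. The function name is hypothetical, and the formula is the standard closed form for the upper tail of a chi-squared distribution with 3 degrees of freedom.

```python
import math


def chi2_upper_tail_3df(x: float) -> float:
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom."""
    # Closed form for df = 3: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# 2.45 is the chi-squared value quoted in the text; df = 3 is an assumption.
p_value = chi2_upper_tail_3df(2.45)
print(round(p_value, 3))
```

Under that assumption the result lands at about 0.484, in line with the figure quoted above. A p-value that large means the accuracy differences between formats are well within what chance alone would produce.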
One supporting detail is worth flagging. McMillan also tested TOON, a format designed to be roughly 25% more compact than JSON or YAML. In practice, models were unfamiliar with TOON’s syntax and burned far more tokens fumbling through it, sometimes hundreds of percent more than YAML at scale. McMillan calls this the grep tax. The lesson, in a librarian’s voice: familiarity beats compactness.
A complicating wrinkle: model families still publish different guidance
Aggregate accuracy is one thing. The recommendations from the model-makers themselves are another, and they have not converged. Anthropic specifically recommends XML-style tags such as <context>...</context> and <instructions>...</instructions> for structuring complex Claude prompts; it describes Claude as trained to recognize XML tags as a prompt-organizing mechanism. OpenAI and Google lean toward Markdown for their consumer-facing models, which lines up with the 2024 GPT-4 finding.
McMillan’s results do not contradict that documentation; they refine it. At the aggregate level, a frontier model will likely succeed regardless of which standard format you pick. At the implementation level, following each model’s own documentation is still the cleanest path, especially for complex multi-section prompts where you want predictable parsing.
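One practical way to follow each lab’s guidance is to keep the prompt content in one place and render it in whichever house style the reader prefers. The sketch below wraps the same fields in XML-style tags or in Markdown headings. The function names and sample text are hypothetical, and the tag names simply reuse the context and instructions examples mentioned above.

```python
def as_xml_tags(fields: dict[str, str]) -> str:
    """Wrap each field in an XML-style tag named after its key."""
    return "\n".join(
        f"<{name}>\n{body.strip()}\n</{name}>" for name, body in fields.items()
    )


def as_markdown_sections(fields: dict[str, str]) -> str:
    """Put each field under a level-two Markdown heading."""
    return "\n\n".join(
        f"## {name.title()}\n{body.strip()}" for name, body in fields.items()
    )


# Hypothetical content, for illustration only.
fields = {
    "context": "The reader is a new volunteer at a community archive.",
    "instructions": "Explain how to label a storage box in three steps.",
}

xml_prompt = as_xml_tags(fields)
markdown_prompt = as_markdown_sections(fields)
print(xml_prompt)
print()
print(markdown_prompt)
```

The content never changes between the two renderings. Only the structural markers do, which is exactly the variable the research above is measuring.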
The honest summary
If I were stamping a finding-aid card on this section of the archive, it would read:
The story has changed quickly. Format effects looked huge in 2024 studies and largely flattened in 2026 frontier-model results.
Older, smaller, and open-source models still feel format strongly. Treat formatting as a real lever any time you are not on the latest frontier model.
Capability dwarfs format. A better model with mediocre formatting will outperform a weaker model with brilliant formatting almost every time.
Familiarity beats cleverness. A format the model has seen everywhere in training outperforms a clever new one optimized for efficiency.
Read the model’s own documentation. That means XML tags for Claude and Markdown for OpenAI’s and Google’s consumer-facing models. Check each lab’s guide before you settle on a house style.
Structure is real signal, not noise. The exact shape of the signal depends on which model is reading, and the size of the effect has shrunk for the strongest readers in the room.
⚖️ When Markdown Helps and When It Doesn’t
Markdown earns its place when applied to the right material.
Markdown tends to help when:
The prompt is long or has multiple parts. Headings prevent the pieces from blurring into each other.
Order matters. Numbered steps make sequence explicit instead of implied.
The output is meant to be read by humans. Documentation, summaries, articles, anything that will be skimmed before it is read.
Code or sample data is involved. Code blocks preserve exact characters and signal “do not paraphrase this.”
You plan to extract pieces of the answer programmatically. Consistent heading conventions become reliable anchors for downstream parsing.
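The last point, headings as anchors for downstream parsing, can be sketched in a few lines. The example below splits a response on its level-two headings and returns a dictionary keyed by heading text. The function name and the sample response are hypothetical, and the sketch deliberately ignores edge cases such as a ## line appearing inside a fenced code block.

```python
import re


def split_sections(markdown_text: str) -> dict[str, str]:
    """Map each '## Heading' to the text beneath it, up to the next '## ' heading."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown_text.splitlines():
        # Match exactly two '#' characters followed by whitespace and a title.
        match = re.match(r"^##\s+(.*\S)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


# Hypothetical model response, for illustration only.
response = """## Summary
Markdown headings label the parts of a prompt.

## Next Steps
1. Pick a format.
2. Test it.
"""

parsed = split_sections(response)
print(parsed["Next Steps"])
```

This only works if the model uses the headings you asked for, so it pairs naturally with an Output Format section that names them explicitly.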
Markdown tends to help less when:
The request is a single sentence. A one-line factual question does not need a heading hierarchy.
You need a structured data format back. If the goal is JSON for a downstream system to consume, ask for JSON directly rather than a Markdown table.
The model in question is small or older. A heavyweight Markdown structure does not rescue a model that lacked exposure to such patterns in training.
The emphasis runs wild. Bold and italic stop carrying meaning once everything is bold and italic. Treat emphasis like salt, not like the main course.
The deeper point, which the research confirms, is that Markdown is one way to communicate structure, not the only way. Plain-language cues like “First, then, finally” work too. So do XML tags, especially with Claude. Markdown is convenient because the same characters that read as structure to the model also read as structure to your own eyes.
🪶 The Librarian’s Five Habits for Markdown Prompts
Five small habits worth keeping. They compound.
Catalogue the request before you write it. Decide what the fields of your prompt are (role, task, context, output format) and label them with headings before you fill them in. Naming the fields often clarifies your own thinking, the same way writing a finding aid clarifies what is actually in an archive.
Keep the hierarchy honest. If you use ## for main sections, do not jump straight to #### for the next layer without a ### in between. The hierarchy is a promise about the shape of the document, and the model takes the promise at face value.
Match the format to the reader. Markdown for GPT-style models, XML tags for Claude, JSON or YAML for downstream machines. Use each model’s documentation as your local style guide.
Use emphasis like rubrication, not graffiti. A handful of bolded terms sharpens the model’s attention; twenty dulls it. The librarian’s question before flagging a passage with a sticky note: would I still flag this if I had only three flags?
Test, log, iterate. Run the same prompt in two formats and compare. Keep a small notebook of which formats worked best for which tasks. Models drift over time; your notes will drift with them, which is exactly what you want.
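The second habit, keeping the hierarchy honest, is easy to check mechanically. The sketch below scans a prompt for headings that jump more than one level deeper than the heading before them. The function name and the sample draft are hypothetical, and the check is deliberately simple: it treats any line starting with one to six # characters plus a space as a heading, even inside a code block.

```python
import re


def skipped_heading_levels(markdown_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for headings that jump more than one level deeper."""
    problems = []
    previous_level = 0
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        match = re.match(r"^(#{1,6})\s+\S", line)
        if not match:
            continue
        level = len(match.group(1))
        # The first heading sets the baseline; later ones may go at most one level deeper.
        if previous_level and level > previous_level + 1:
            problems.append((number, line))
        previous_level = level
    return problems


# Hypothetical draft prompt with one skipped level (## straight to ####).
draft = "# Role\n## Task\n#### Edge cases\n## Context\n"
print(skipped_heading_levels(draft))
```

On this draft the check flags line 3, the #### heading that follows a ## heading with no ### in between. Moving back up the hierarchy, as the final ## Context does, is not flagged.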
These habits will not turn a vague request into a brilliant one, since no formatting can save an idea that is not clear in the first place (the broader principles of prompt craft live in another volume of the catalog, Master the Art of Prompting). But they give a clear idea the structural support it needs to land cleanly.
🏁 Conclusion
I want to leave you with the same observation I keep coming back to at the workbench.
The history of writing for readers is, in large part, the history of inventing better ways for writers to mark up what they mean. Punctuation, paragraph breaks, chapter headings, footnotes, card catalogs, citation styles. Each one was, in its moment, somebody’s small invention to make a text easier to use. Markdown belongs in that tradition. A 2004 fix for a blogging headache is now shaping how many widely used AI systems read and respond to human writing.
If you take only one thing from this piece, let it be this: how you ask an AI a question is a real variable, not a cosmetic one. The structure you give the prompt is part of the meaning the model receives. A well-organized request earns a well-organized response more often than not. That is not magic, and it is not new. It is the same thing every cataloguer, every editor, every patient archivist has known for a very long time.
Every story deserves its spotlight, and a thoughtfully marked-up prompt is the lamp that helps the model find the page.
Until the next entry in the catalog.
-- Atlas 📜
Sources / Citations
Dash, A. (2026, January 9). How Markdown took over the world. https://www.anildash.com/2026/01/09/how-markdown-took-over-the-world/
Anthropic. Use XML tags. Anthropic documentation. https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/use-xml-tags
Chen, Z., Liu, Y., Shi, L., Wang, Z.-J., Chen, X., Zhao, Y., & Ren, F. (2025, April 28). MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models. Proceedings of the ACM Web Conference 2025 (WWW ‘25). https://doi.org/10.1145/3696410.3714674
Gruber, J. (2004, March). Introducing Markdown. Daring Fireball. https://daringfireball.net/projects/markdown/
He, J., Rungta, M., Koleczek, D., Sekhon, A., Wang, F. X., & Hasan, S. (2024). Does Prompt Formatting Have Any Impact on LLM Performance? arXiv preprint. https://arxiv.org/abs/2411.10541
Longo, M. (2026, April). A Short History of Markdown. Creative Cyborg (Substack).
Markdown. Wikipedia. https://en.wikipedia.org/wiki/Markdown
McMillan, D. (2026, February 5). Structured Context Engineering for File-Native Agentic Systems: Evaluating Schema Accuracy, Format Effectiveness, and Multi-File Navigation at Scale. arXiv preprint. https://arxiv.org/abs/2602.05447
Take Your Education Further
The Ultimate Beginner’s Guide to Prompt Engineering: If this piece is about the structural language of a prompt, that one is the full workshop manual. Markdown is one lever; the beginner’s guide walks through all the others.
Prompt Engineering Is Dead: The argument that prompt engineering is being absorbed into the larger discipline of context engineering. Markdown is a tactic; context engineering is the broader system, and the piece is a useful zoom-out from this one.
Meta Prompting: Once you have the structural basics down, meta prompting is the next step: using the model to help refine the prompts you give it. A natural follow-on for any reader who wants to push their prompting practice further.
Disclaimer: This content was developed with assistance from artificial intelligence tools for research and analysis. Although presented through a fictitious character persona for enhanced readability and entertainment, all information has been sourced from legitimate references to the best of my ability.





