
Every Research-Backed Way to Be More Convincing to an LLM (The Complete Cheat Sheet)

Date: Mar 14, 2026 · Author: Eytan · Reading time: 13 minutes
Tags: AI, LLMs, Content Marketing, SEO, Strategy, Recommended


TL;DR

34 research-backed tactics from 19 studies on how to format, cite, frame, and structure content so that LLMs trust it. Every tactic comes with specific data points, the models tested, and dates. Formatting is a cheat code (bold hits 99% win rate on some models), source credibility follows a strict hierarchy, and repetition can override almost everything. Bookmark this one.


I went down one of those rabbit holes recently.

You know the ones. You read an interesting tweet, which leads to a paper, which leads to another paper, and three hours later you’re cross-referencing citation bias studies at 1am wondering what you’re doing with your life.

The rabbit hole was this: copywriting for LLMs is now a thing.

Not prompt engineering. Not “how to talk to ChatGPT.” I mean the other side of it – how to write content that LLMs prefer, trust, and choose to surface. Turns out, just like humans have cognitive biases that copywriters have exploited for decades, LLMs have their own set of preferences baked in by training data, reward models, and architectural quirks. If the internet is already drowning in slop, understanding what LLMs actually favor is the difference between being signal and being noise.

LLMs don't read like humans. They evaluate like biased judges with very specific taste.

And that taste is surprisingly researchable.

I spent a frankly unreasonable amount of time pulling apart 19 papers, cross-referencing their findings, and categorizing what actually moves the needle. The result is six categories of things that influence whether an LLM trusts, prefers, or surfaces your content: formatting, content length, citations, authority signals, framing, and position.

Some of these will feel obvious. Some will feel insane. Bold text hitting a 99% win rate on certain models? GPT-4 preferring factually worse content if it’s better formatted? Fake references fooling Claude 89% of the time?

Yeah.

A massive caveat before we dive in. This stuff changes. Fast. Models get updated, reward functions shift, and what works on GPT-4 Turbo today might not work on whatever OpenAI ships next quarter. The research here is timestamped for a reason. If you’re serious about this, I’d recommend running ChatGPT Deep Research across the trusted sources I cite below, actually reading the papers that matter most to your use case, and retesting regularly. This is not a set-it-and-forget-it playbook. It’s a living cheat sheet.

Alright. 34 tactics. 19 studies. Let’s go.


Formatting & Structure

The single most underrated lever. Multiple studies confirm that LLMs will choose better-formatted content over better-quality content. Let that sink in. (And if you’re already struggling with the tells that scream “AI wrote this”, this section is doubly relevant – the formatting tricks LLMs love are not the same ones that make your writing sound robotic.)

| # | Tactic | What To Do | Key Data Point | Research | Models Tested | Date |
|---|--------|------------|----------------|----------|---------------|------|
| 1 | Clean, consistent separators | Choose separators (spaces, dashes, newlines) deliberately; avoid unpredictable punctuation between fields | `passage {} answer {}` hit 82.6% accuracy vs. `passage:{} answer:{}` at 4.3% – same model, same task | Sclar et al. | LLaMA-2-7B/13B/70B, Falcon-7B, GPT-3.5 | 2023 |
| 2 | Bold your key claims | Use bold for important statements, numbers, and conclusions | Bold text hit up to 99% win rate vs. non-bold (Skywork-Critic); GPT-4 Turbo: 89.5% | Zhang et al. | GPT-4 Turbo, Skywork-Critic, ArmoRM, Pairwise-Llama-3 | 2025 |
| 3 | Use bullet/numbered lists | Structure key points as lists rather than prose | Lists hit up to 93.5% win rate (Pairwise-model); GPT-4 Turbo: 75.75%; even debiased models still showed 84% list preference | Zhang et al. | GPT-4 Turbo, Skywork-Critic, Pairwise-Llama-3, OffsetBias-RM | 2025 |
| 4 | Add hyperlinks | Include relevant links to sources, related content, and references | Hyperlinks hit 87.25% win rate on GPT-4 Turbo; 84.75% on Pairwise-model | Zhang et al. | GPT-4 Turbo, Pairwise-Llama-3, Zephyr-Mistral-7B | 2025 |
| 5 | Exclamation marks (sparingly) | Add occasional exclamation marks for emphasis on key points | Exclamation marks hit 80.5% win rate on GPT-4 Turbo; 77.75% on Skywork-Critic | Zhang et al. | GPT-4 Turbo, Skywork-Critic, Zephyr-Mistral-7B | 2025 |
| 6 | Prioritize structure over label copy | Focus on clear H1/H2/H3 hierarchy and grouped sections – the words in your headers matter less than having them | Random/nonsensical labels (“similar tennis”) performed as well as correct labels; models barely read descriptive nouns | Tang et al. | XGLM-7.5B, Alpaca-7B, Llama3.1-8B, Mistral-7B, GPT-3.5 | 2025 |
| 7 | Group content into multiple labeled sections | Use two or more clearly delineated sections rather than one flat block | Ensemble format with two labeled groups outperformed single-block prompts across commonsense, math, and reasoning tasks – even with random labels | Tang et al. | XGLM-7.5B, Alpaca-7B, Llama3.1-8B, Mistral-7B, GPT-3.5 | 2025 |
| 8 | Provide clean, extractable text | Use structured HTML, clean Markdown, clear heading hierarchy – make content easy to parse, quote, and cite | WebGPT’s accuracy improved dramatically when it could extract clean, structured text; messy formatting meant worse quotes and worse answers | Nakano et al. | GPT-3 (175B) | 2021 |
| 9 | Use Markdown over plain text | When serving content to AI, Markdown with semantic markers (tables, headings, hierarchies) outperforms stripped plain text | “Plain-text conversion strips essential semantic markers… vital for deep document understanding”; LLMs get structure right (89% Key F1) but values wrong (46%) | Brach et al. | GPT-4o-mini, Qwen3-1.7B/4B/30B | 2026 |
| 10 | Keep structural complexity under the cliff edge | Stay under schema depth 7 and under 200 distinct data fields for LLM-facing content | Validation rates stay ~95% for moderate schemas but crash to ~20% at depth ≥ 7; failures are non-linear cliffs, not gradual declines | Brach et al. | GPT-4o-mini, Qwen3-1.7B/4B/30B | 2026 |
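
If you want to operationalize a few of these, here’s a minimal Python sketch. It’s mine, not from any of the papers – every function name and the example data are illustrative. It renders content with a clear heading hierarchy, bolded key claims, and bullet lists (tactics 2, 3, and 9), and sanity-checks nesting depth and field count against the Brach et al. cliff edges from tactic 10:

```python
MAX_DEPTH = 6     # Brach et al.: validation craters at schema depth >= 7
MAX_FIELDS = 200  # ...and past ~200 distinct data fields

def schema_stats(obj, depth=1):
    """Return (max_depth, total_field_count) for nested dicts/lists."""
    if isinstance(obj, dict):
        child = [schema_stats(v, depth + 1) for v in obj.values()]
        return max([depth] + [d for d, _ in child]), len(obj) + sum(f for _, f in child)
    if isinstance(obj, (list, tuple)):
        child = [schema_stats(v, depth) for v in obj]
        return max([depth] + [d for d, _ in child]), sum(f for _, f in child)
    return depth, 0

def to_llm_markdown(title, sections):
    """Render {heading: (key_claim, bullets)} with a clear heading
    hierarchy, a bolded key claim (tactic 2), and bullet lists (tactic 3)."""
    lines = [f"# {title}", ""]
    for heading, (key_claim, bullets) in sections.items():
        lines += [f"## {heading}", "", f"**{key_claim}**", ""]
        lines += [f"- {b}" for b in bullets] + [""]
    return "\n".join(lines)

sections = {
    "Pricing": ("Plans start at $9/month.", ["No setup fee", "Cancel anytime"]),
}
depth, fields = schema_stats(sections)
assert depth <= MAX_DEPTH and fields <= MAX_FIELDS, "past the complexity cliff"
print(to_llm_markdown("Acme FAQ", sections))
```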

Content & Length

The age-old “how much should I write?” question. For LLMs, the answer is: more. But not just more – more with rigor. This is where data marketing really shines – proprietary data and first-party research give LLMs the kind of substantive, verifiable content they’re trained to reward.

| # | Tactic | What To Do | Key Data Point | Research | Models Tested | Date |
|---|--------|------------|----------------|----------|---------------|------|
| 11 | Be comprehensive (longer wins) | Include full detail; don’t rely on scannable summaries alone | All LLM judges showed verbosity bias; once length difference exceeded ~40 tokens, preference scores consistently exceeded 0.7 | Chen et al. | GPT-4, GPT-4-Turbo, Claude-2, PaLM-2, LLaMA2-70B | 2024 |
| 12 | Maintain logical rigor | Ensure every claim adds up; avoid misleading comparisons or hand-wavy logic | GPT-4 catches factual errors 94% of the time vs. humans at 79%; factual errors cause the single largest penalties (5+ point drop on a 10-pt scale) | Chen et al., Gao et al. | GPT-4, GPT-5.1, Claude Sonnet 4.5 | 2024-2026 |
| 13 | Use an affirmative, confident tone | Open with phrases like “Here’s what we found:” rather than hedging; avoid “might,” “perhaps,” “it’s possible” | Affirmative tone hit 88.75% win rate on GPT-4 Turbo; LLMs are mathematically trained to reward confidence over abstention | Zhang et al., Kalai et al. (OpenAI) | GPT-4 Turbo, Skywork-Critic; theoretical (all LLMs) | 2025 |
| 14 | Repeat key claims across passages | State important facts more than once, in different contexts and phrasings | Repeating a low-credibility source’s claim once flipped preferences away from a government source (gap of 30-34 points); repetition even overrides source attribution | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
| 15 | Use bandwagon/consensus signals | Phrases like “90% of experts agree” or “most research confirms” amplify LLM trust | Bandwagon signals flipped even OpenAI o1’s correct answers; fabricated consensus overrides correct reasoning | Wang et al. | Qwen3-1.7B/4B, OpenAI o1 | 2026 |
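
Tactic 13 is the easiest one to enforce mechanically. Here’s a dumb lint pass – my own sketch, and the hedge list is a starting point, not exhaustive – that flags the hedging phrases the research says to cut:

```python
import re

# Hedging phrases the affirmative-tone research suggests cutting (tactic 13).
HEDGES = ["might", "perhaps", "it's possible", "arguably", "presumably"]

def find_hedges(text):
    """Yield (phrase, offset) for each hedging phrase, case-insensitively."""
    for phrase in HEDGES:
        pattern = rf"\b{re.escape(phrase)}\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            yield phrase, m.start()

draft = "It's possible this approach might improve results."
for phrase, offset in sorted(find_hedges(draft), key=lambda h: h[1]):
    print(f"hedge at {offset}: {phrase!r}")
```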

Citations & Authority

This is where things get wild. LLMs have internalized a credibility hierarchy that maps almost perfectly to human institutional trust – except it’s more rigid and more exploitable. I’ve written about how trust is crumbling everywhere and how the NYT weaponized brand trust. What’s fascinating is that LLMs have encoded that same hierarchy – and they enforce it more rigidly than any human would.

| # | Tactic | What To Do | Key Data Point | Research | Models Tested | Date |
|---|--------|------------|----------------|----------|---------------|------|
| 16 | Cite your sources – for everything | Add references for every claim, stat, and comparison; the act of having citations boosts perceived quality | Fake references fooled GPT-4 69% of the time, Claude-2 89%; humans only 39% | Chen et al. | GPT-4, Claude-2, PaLM-2, LLaMA2-70B, humans | 2024 |
| 17 | Cite well-known, highly cited sources | Prefer famous sources over obscure ones – LLMs have internalized a “highly cited = good” bias | LLM-suggested references were ~1,326 citations more popular (median) than ground-truth references | Algaba et al. | GPT-4, GPT-4o, Claude 3.5 | 2025 |
| 18 | Favor established venues | When citing, prefer arXiv, NeurIPS, AAAI, and major journals – LLMs over-represent these in training | LLMs over-indexed on arXiv and NeurIPS when generating references; strong venue bias | Algaba et al. | GPT-4, GPT-4o, Claude 3.5 | 2025 |
| 19 | Attribute to institutional sources | Government and institutional sources outrank individual and social media sources | Strict hierarchy: Government > Newspaper > Person > Social Media, consistent across 11/13 models (Kendall’s W = 0.74) | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
| 20 | Add circulation/follower counts | Include credibility signals like audience size when attributing sources | High-circulation newspapers preferred over low-circulation; high-follower social accounts over low-follower; controlled for big-number effect | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
| 21 | Use specific expert credentials | “Board-certified physician” > “doctor” > “medical professional”; the more specific, the stronger | Board-certified physician endorsement swung accuracy by +0.458 (correct) / -0.447 (incorrect) on MedQA | Mammen et al. | Phi-4-Reasoning, DeepSeek-R1, LLaMA-3.1, Gemma, Mistral | 2026 |
| 22 | Use “Expert” and “Specialist” labels | Expert Power labels outperform Legitimate Power labels (Judge, Manager) | DeepSeek R1 reached 100% agreement with “Expert” labels; Expert Power > Referent Power > Legitimate Power | Choi et al. | GPT-4o, DeepSeek R1 | 2026 |
| 23 | Avoid inaccurate or irrelevant citations | Bad citations are punished MORE harshly than good ones are rewarded | Incorrect/irrelevant reference dropped GPT-4o score from 9.12 to 3.94 (5.18-point drop on a 10-pt scale) | Gao et al. | GPT-4o, GPT-5.1, Claude Sonnet 4.5 | 2026 |
| 24 | Include verifiable reference details | Structure citations with title, author, year, and link – make them checkable | WebGPT was trained to collect references during browsing; reward model valued referenced claims over unreferenced | Nakano et al. | GPT-3 (175B) | 2021 |
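
Tactics 16, 23, and 24 boil down to one rule: structure every reference so it can be checked, and drop anything that can’t. A minimal sketch – the `Citation` class is my own convenience, not from any paper; the example entry is the Sclar et al. paper from the references below:

```python
from dataclasses import dataclass

@dataclass
class Citation:
    """A checkable reference per tactic 24: title, author, year, link."""
    title: str
    author: str
    year: int
    url: str

    def is_complete(self) -> bool:
        # Tactic 23: a bad citation costs more than a good one earns,
        # so drop anything that can't be verified.
        return all([self.title, self.author, self.year, self.url])

    def render(self) -> str:
        return f'{self.author} ({self.year}). "{self.title}." {self.url}'

refs = [
    Citation(
        title="Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design",
        author="Sclar et al.",
        year=2023,
        url="https://arxiv.org/abs/2310.11324",
    ),
]
print("\n".join(r.render() for r in refs if r.is_complete()))
```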

Framing & Presentation

The same fact, framed differently, gets a completely different response from an LLM. This isn’t surprising if you think about it – these models were trained on human text, and humans are suckers for framing. But the degree of sensitivity is remarkable. If you’ve ever thought about finding your AI brand voice, this is why it matters – the way you say things is, in many cases, more influential than what you say.

| # | Tactic | What To Do | Key Data Point | Research | Models Tested | Date |
|---|--------|------------|----------------|----------|---------------|------|
| 25 | Frame claims positively | “This product delivers reliable results” > “This product doesn’t deliver unreliable results” | LLMs show 2x more bias under negative framing than positive; positive framing reduces safety scrutiny by ~2x | Lim et al. | LLaMA-3, Qwen2.5, Gemma3, Mistral, Falcon (13 models, 3B-70B) | 2026 |
| 26 | Know your evaluating model family | LLaMA tends to agree, GPT tends to reject, Qwen is mixed – optimize framing accordingly | All 14 LLM judges showed framing bias; model families have hardcoded directional tendencies (LLaMA: +0.19 to +2.41pp acquiescence; GPT: -0.57 to -1.38pp) | Hwang et al. | GPT-4o/5, Qwen 2.5 (1.5B-72B), LLaMA 3.1/3.2/3.3 | 2026 |
| 27 | Use emojis (model-dependent) | Add emojis for GPT-4/Skywork models; avoid for Zephyr/FsfairX-based systems | GPT-4 Turbo: 86.75% win rate for emoji; Skywork: 97.25%; but Zephyr: only 26.5% (anti-emoji bias) | Zhang et al. | GPT-4 Turbo, Skywork-Critic, Zephyr-Mistral-7B, FsfairX | 2025 |
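
Tactic 27 is really a configuration problem: the same content should be decorated differently depending on which model family is judging. A hypothetical sketch – the win rates are from Zhang et al., but the profile structure and function are mine:

```python
# Formatting choices keyed by the judging model family. Win rates are
# from Zhang et al.; the profile map itself is a made-up convenience.
FORMAT_PROFILES = {
    "gpt-4-turbo":    {"emoji": True},   # 86.75% emoji win rate
    "skywork-critic": {"emoji": True},   # 97.25%
    "zephyr-mistral": {"emoji": False},  # 26.5% - anti-emoji bias
}

def decorate(text: str, model: str) -> str:
    """Append an emoji only when the target model rewards it."""
    profile = FORMAT_PROFILES.get(model, {"emoji": False})  # default: plain
    return f"{text} ✅" if profile["emoji"] else text

claim = "This product delivers reliable results."  # tactic 25: positive frame
print(decorate(claim, "gpt-4-turbo"))     # emoji appended
print(decorate(claim, "zephyr-mistral"))  # unchanged
```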

Position & Order

Where you put your content matters almost as much as what the content says. Primacy bias – the tendency to prefer whatever comes first – is one of the most consistent findings across models.

| # | Tactic | What To Do | Key Data Point | Research | Models Tested | Date |
|---|--------|------------|----------------|----------|---------------|------|
| 28 | Put your strongest content first | Lead with your best argument or most important information | GPT-3.5-Turbo: 0.95 first-position preference; Llama3-8B flips judgment 76.2% of the time when answer order is reversed | Chen et al., Feng et al. | GPT-3.5/4/5, LLaMA-3, Gemini, Claude, Qwen, DeepSeek | 2024-2025 |
| 29 | Present separate supporting passages rather than merging | Two separate passages from different sources are far more effective than listing sources in one header | Two-source format: preference gap of 33.9 points; merged single-header format: only 6.17 points | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
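
Primacy bias also hands you a cheap diagnostic: if a judgment doesn’t survive swapping the candidates, you’re measuring position, not quality. A sketch – the `judge` callable is a placeholder for your own LLM call, not a real API:

```python
def survives_swap(judge, answer_a, answer_b) -> bool:
    """True if the verdict is consistent when answer order is reversed.
    `judge` takes two answers and returns "first" or "second"."""
    original = judge(answer_a, answer_b)
    swapped = judge(answer_b, answer_a)
    return (original == "first") == (swapped == "second")

def naive_judge(a, b):
    return "first"  # a maximally position-biased judge, for illustration

# With Llama3-8B flipping 76.2% of the time on order reversal, verdicts
# that fail this check tell you about position, not about the answers.
print(survives_swap(naive_judge, "answer A", "answer B"))  # False
```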

Meta-Tactics (Testing & Optimization)

These aren’t about what to write – they’re about how to think about writing for LLMs. And honestly, some of these are the most important findings of the lot. (If you’ve built a personal GPT or have an AI content creation workflow, this is where you pressure-test whether your setup is actually optimized.)

| # | Tactic | What To Do | Key Data Point | Research | Models Tested | Date |
|---|--------|------------|----------------|----------|---------------|------|
| 30 | Test formatting – don’t assume | The formatting space is non-smooth; small changes produce unpredictable effects | Only 32-34% of formatting “triples” showed monotonic performance – barely better than random | Sclar et al. | LLaMA-2-7B/13B/70B, Falcon-7B, GPT-3.5 | 2023 |
| 31 | Test per model – biases differ | Format preferences are weakly correlated between models; what works for one may not work for another | Relative model rankings completely reverse ~14% of the time; 76% of reversals are statistically significant | Sclar et al. | LLaMA-2-7B/13B/70B, Falcon-7B, GPT-3.5 | 2023 |
| 32 | Formatting beats content quality for preference | If content quality is close, the better-formatted version wins – even if the content is worse | GPT-4 preferred factually worse content formatted with bold + lists over factually better plain content | Zhang et al. | GPT-4 Turbo, ArmoRM, Pairwise-Llama-3 | 2025 |
| 33 | Don’t tell models to “resist bias” | Explicit debiasing prompts often backfire – they can drop accuracy without fixing the underlying bias | Debiasing prompts dropped accuracy from 66.2% to 40.9%; models produce “performative independence” language without actual reasoning | Wang et al. | Qwen3-1.7B/4B | 2026 |
| 34 | Use multi-model panels, not debates | When using LLM-as-judge, aggregate across models; avoid debate formats | Multi-agent panels improved performance by up to 15%; ChatEval debates degraded performance by 45-162% | Feng et al. | Gemini-2.5, GPT-5, Claude-3, Qwen3, DeepSeek | 2025 |
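
Tactic 34 translates directly into code: collect independent verdicts and take a majority vote, rather than letting models debate. A minimal sketch – the judges here are stand-in lambdas for real model calls:

```python
from collections import Counter

def panel_verdict(judges, candidate_a, candidate_b):
    """Majority vote across independent judge models (tactic 34).
    Each judge takes two candidates and returns "first" or "second"."""
    votes = Counter(judge(candidate_a, candidate_b) for judge in judges.values())
    verdict, count = votes.most_common(1)[0]
    return verdict if count > len(judges) / 2 else "no majority"

judges = {
    "judge_a": lambda a, b: "first",
    "judge_b": lambda a, b: "first",
    "judge_c": lambda a, b: "second",
}
print(panel_verdict(judges, "candidate A", "candidate B"))  # -> "first"
```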

So What Do You Do With All This?

Let’s call a spade a spade. This list is dense. You’re not going to implement all 34 tactics tomorrow morning. Here’s how I’d think about it:

If you do nothing else, get your formatting right. Bold your key claims, use lists, add clear section headers. This alone – according to the research – can outweigh content quality improvements. It’s the lowest-effort, highest-impact lever on this entire list.

If you’re writing for AI visibility (and at this point, who isn’t?), obsess over citations. Real ones. With titles, authors, years, and links. LLMs love citations – and they punish bad ones harder than they reward good ones. Get it right or leave it out.

If you’re building content at scale, internalize the meta-tactics. Test per model. Test per format. Don’t assume what works for GPT works for Claude or Gemini. And whatever you do, don’t tell the model to “resist its biases.” That backfires spectacularly.

The uncomfortable truth is that LLMs are not neutral judges of content quality. They are biased judges with researchable, exploitable preferences.

I’ve written before about how trust is the scarcest asset in the digital world and how communicating in 2025 demands credibility, value, clarity, and conviction. This cheat sheet is the logical next step. If you’re optimizing content for humans and for the AI systems that increasingly mediate discovery, these 34 tactics are the research-backed playbook.

But here’s my honest advice: don’t treat this as a checklist you run through once. Pair it with genuine expertise amplification and standout content principles. The formatting tricks will get your foot in the door with an LLM. The substance is what keeps you there.

And bookmark this page. I’ll keep it updated as new research drops.


References

  • Sclar, M., Choi, Y., Tsvetkov, Y., & Suhr, A. (2023). “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design.” arXiv:2310.11324
  • Chen, G. H., et al. (2024). “Humans or LLMs as the Judge? A Study on Judgement Biases.” arXiv:2402.10669
  • Nakano, R., et al. (2021). “WebGPT: Browser-assisted question-answering with human feedback.” arXiv:2112.09332
  • Tang, C., et al. (2025). “Prompt Format Beats Descriptions.” Findings of EMNLP 2025. ACL Anthology
  • Zhang, X., et al. (2025). “From Lists to Emojis: How Format Bias Affects Model Alignment.” ACL 2025. ACL Anthology
  • Algaba, A., et al. (2025). “LLMs Reflect Human Citation Patterns with a Heightened Citation Bias.” Findings of NAACL 2025. ACL Anthology
  • Kalai, A. T., et al. (2025). “Why Language Models Hallucinate.” OpenAI
  • Lai, P., et al. (2025). “Beyond the Surface (LAGER).” NeurIPS 2025. arXiv:2508.03550
  • Feng, Y., et al. (2025). “SAGE: Are We on the Right Way to Assessing LLM-as-a-Judge?” arXiv:2512.16041
  • Cheng, A., et al. (2025). “The FACTS Leaderboard.” Google DeepMind
  • Schuster, J., Gautam, V., & Markert, K. (2026). “Whose Facts Win?” arXiv:2601.03746
  • Choi, J., et al. (2026). “Belief in Authority.” arXiv:2601.04790
  • Mammen, P. M., et al. (2026). “Trust Me, I’m an Expert.” arXiv:2601.13433
  • Hwang, Y., et al. (2026). “When Wording Steers the Evaluation.” arXiv:2601.13537
  • Wang, H., et al. (2026). “Teaching Large Reasoning Models Effective Reflection.” arXiv:2601.12720
  • Wang, Q., et al. (2026). “Making Bias Non-Predictive.” arXiv:2602.01528
  • Lim, K., Kim, S., & Whang, S. E. (2026). “DeFrame.” arXiv:2602.04306
  • Brach, W., et al. (2026). “ScrapeGraphAI-100k.” arXiv:2602.15189
  • Gao, J., et al. (2026). “Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems.” arXiv:2510.12462
  • Churina, S., et al. (2026). “Layer of Truth.” arXiv:2510.26829
  • Anthropic. (2026). “The Persona Selection Model.” Anthropic Research
