How AI engines decide what to cite — and what it means for your site

When someone asks Perplexity “what is the best CRM for small businesses” and Perplexity surfaces three sources, how did those three sites get picked? Not by ranking on Google. Not by domain authority in the traditional sense. By a combination of signals that are meaningfully different from classic SEO. This guide walks through what those signals are and what you can do about them on WordPress.

How RAG works (and why it matters for citation)

Most AI search engines use a technique called Retrieval-Augmented Generation (RAG). The process works roughly like this:

1. A user asks a question
2. The engine retrieves relevant content from the web (or a cached index)
3. The retrieved content is passed to a language model as context
4. The language model generates a response, drawing on the retrieved content
5. The engine attributes the generated content to the sources it used

The critical point is step 2: which content gets retrieved? That decision is made by a retrieval system, not a language model. The retrieval system has its own ranking criteria, and those criteria are where generative engine optimization (GEO) signals live.
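The five steps above can be sketched in a few lines of Python. This is a toy illustration, not any engine's actual implementation: the Page type, the word-overlap scoring, the example.com URLs, and the placeholder response are all assumptions made for the example. The point it shows is that the list of citations is fixed by the retrieval step, before a language model is involved.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words (toy tokenizer)."""
    return {w.strip(".,?!:;()").lower() for w in text.split()} - {""}


def retrieve(query: str, index: list[Page], k: int = 2) -> list[Page]:
    """Step 2: rank pages by word overlap with the query and keep the top k."""
    q = tokens(query)
    ranked = sorted(index, key=lambda p: len(q & tokens(p.text)), reverse=True)
    return ranked[:k]


def answer(query: str, index: list[Page]) -> dict:
    """Steps 3 to 5: build context from the retrieved pages, then attribute them."""
    sources = retrieve(query, index)
    context = "\n".join(p.text for p in sources)
    # A real engine would call a language model with `context` here (assumption:
    # this placeholder string stands in for the generated response).
    response = f"(model output grounded in {len(sources)} retrieved pages)"
    return {"response": response, "context": context, "citations": [p.url for p in sources]}


# Hypothetical index of three pages.
index = [
    Page("https://example.com/crm-small-business",
         "The best CRM for small businesses is one with simple pricing."),
    Page("https://example.com/crm-history",
         "A history of enterprise software since 1990."),
    Page("https://example.com/crm-pricing",
         "CRM pricing compared for small businesses and startups."),
]
result = answer("what is the best CRM for small businesses", index)
print(result["citations"])
```

The page that shares no words with the query never reaches the model, so it cannot be cited, however good it is.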

What retrieval systems look for

Semantic relevance

The most fundamental signal is semantic relevance: does this content actually answer the query? AI retrieval systems embed both the query and candidate documents into a shared vector space and compare them. Content that is topically precise and uses the same language as the likely query will score higher.

Practical implication: meta descriptions and introductions that lead with a direct answer to the most likely query improve semantic match. “The best CRM for small businesses in 2026 is [X], because...” is more retrievable than “In this guide, we explore CRM options for growing businesses.”
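To make the comparison concrete, here is a minimal sketch that scores the two example openings against the query. It assumes a bag-of-words count vector as a stand-in for an embedding; real engines use learned dense vectors, so the absolute numbers mean nothing and only the ordering is illustrative.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


query = "what is the best CRM for small businesses"
answer_first = "The best CRM for small businesses in 2026 is [X], because..."
generic = "In this guide, we explore CRM options for growing businesses."

q = embed(query)
score_answer_first = cosine(q, embed(answer_first))
score_generic = cosine(q, embed(generic))
print(f"answer-first: {score_answer_first:.2f}, generic: {score_generic:.2f}")
```

Under this toy measure the answer-first opening scores higher because it repeats most of the query's own wording.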

Structured data signals

Structured data — particularly JSON-LD schema — gives retrieval systems pre-chunked information that is directly usable in a generated response. FAQ schema is the clearest example: each question-answer pair is a ready-made citeable unit. Retrieval systems that understand schema markup can pull individual FAQ items and attribute them to the source page.
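As an illustration, the following sketch builds FAQPage JSON-LD from a list of question-answer pairs using the schema.org FAQPage, Question, and Answer types. The helper name faq_jsonld and the sample questions are hypothetical; on WordPress this markup would normally be emitted by a plugin or theme rather than a standalone script.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build a schema.org FAQPage JSON-LD document from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2)


# Placeholder content for illustration only.
faq_json = faq_jsonld([
    ("Example question one?", "Example answer one."),
    ("Example question two?", "Example answer two."),
])

# JSON-LD is embedded in the page inside a script tag of this type.
script_tag = f'<script type="application/ld+json">\n{faq_json}\n</script>'
print(script_tag)
```

Each entry in mainEntity is one self-contained question-answer unit, which is what makes it easy to lift into a generated response.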

Article schema adds provenance: who wrote this, when was it published, when was it last updated. Recency and authorship are signals in their own right — particularly for queries where freshness matters.
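A minimal sketch of Article JSON-LD carrying those provenance fields, using the schema.org properties headline, author, datePublished, and dateModified. The function name, the sample headline, author, and dates are hypothetical, and the check that the modified date is not earlier than the published date is a sanity rule added for the example rather than a schema.org requirement.

```python
import json
from datetime import date


def article_jsonld(headline: str, author_name: str, published: date, modified: date) -> str:
    """Build schema.org Article JSON-LD with authorship and date provenance."""
    if modified < published:
        # Sanity check added for this sketch: a page cannot be updated before it exists.
        raise ValueError("dateModified must not be earlier than datePublished")
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "datePublished": published.isoformat(),  # ISO 8601, e.g. 2025-01-15
        "dateModified": modified.isoformat(),
    }
    return json.dumps(data, indent=2)


# Placeholder values for illustration only.
article_json = article_jsonld(
    headline="Example post title",
    author_name="Example Author",
    published=date(2025, 1, 15),
    modified=date(2025, 6, 1),
)
print(article_json)
```

Keeping dateModified accurate matters more than keeping it recent: the value should change only when the content does.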

Content density and specificity

AI retrieval systems tend to favour content that is specific over content that is general. A post that answers “how to set up Rank Math with a multilingual WordPress site” is more likely to be retrieved for that query than a generic post about WordPress SEO, even if the generic post is longer and has more backlinks.

This is a meaningful inversion from traditional SEO, where topical breadth and high word count were often rewarded. For GEO, depth and specificity on a narrow question are more valuable than comprehensive coverage of a wide topic.

Crawlability

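This is basic but frequently overlooked: AI engines can only cite content they can access. Crawler roles differ between vendors, so check each vendor's documentation. OpenAI documents GPTBot as its training crawler and OAI-SearchBot as the crawler behind ChatGPT's search results, and Anthropic lists ClaudeBot alongside separate agents for search and user-initiated requests. Blocking one of these in robots.txt removes your content from the system that crawler feeds. Many WordPress security configurations or aggressive caching plugins inadvertently add these blocks.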

An explicit Allow: / directive for each major AI crawler is the first GEO fix to make on any WordPress site.
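Before changing anything, it helps to check what the current robots.txt actually permits. The sketch below uses Python's standard urllib.robotparser for that. The crawler list is an assumption: GPTBot and ClaudeBot come from the text above, PerplexityBot is added for illustration, and the list should be extended from each vendor's own documentation. The robots.txt content and the URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed list of user agents to audit; extend it from vendor documentation.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def crawler_access(robots_txt: str, url: str) -> dict[str, bool]:
    """Report, per AI crawler, whether the given robots.txt lets it fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_CRAWLERS}


# Hypothetical robots.txt that blocks one crawler and allows everything else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

access = crawler_access(robots_txt, "https://example.com/sample-post/")
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'BLOCKED'}")
```

This only checks robots.txt rules. Blocks applied at the firewall, CDN, or server level are invisible to it and need to be tested with real requests.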

LLM-readable content structure

Beyond structured data markup, the way your content is written affects retrievability. Content that follows a consistent structure — clear headings, short paragraphs, answer-first opening sentences — is easier for language models to chunk, parse, and attribute. Dense prose without clear structure is harder to quote accurately.
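To see why headings help, consider how a pipeline might split a page into chunks. The sketch below is one simple approach, assumed for illustration rather than taken from any specific engine: it starts a new chunk at every h2 or h3 using Python's standard html.parser. The class and function names and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class HeadingChunker(HTMLParser):
    """Split HTML into chunks, starting a new chunk at every h2 or h3."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[dict] = [{"heading": "", "text": ""}]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self.chunks.append({"heading": "", "text": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h2", "h3"):
            self._in_heading = False

    def handle_data(self, data):
        # Text inside a heading tag names the chunk; all other text is its body.
        key = "heading" if self._in_heading else "text"
        self.chunks[-1][key] += data


def chunk_by_heading(html: str) -> list[dict]:
    """Return heading-delimited chunks with whitespace normalised."""
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    chunks = [
        {"heading": " ".join(c["heading"].split()), "text": " ".join(c["text"].split())}
        for c in parser.chunks
    ]
    # Drop chunks that ended up with no body text.
    return [c for c in chunks if c["text"]]


html = """
<h2>Crawlability</h2>
<p>AI engines can only cite content they can access.</p>
<h2>Structured data</h2>
<p>FAQ schema gives each question-answer pair its own citeable unit.</p>
"""
chunks = chunk_by_heading(html)
for chunk in chunks:
    print(chunk["heading"], "->", chunk["text"])
```

A page with no headings would come back as a single undifferentiated chunk, which is harder to match to a narrow query and harder to quote precisely.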

llms.txt and llms-full.txt contribute here too: they give AI crawlers an explicitly structured content inventory, reducing the inference required to understand what's on a site and where the relevant content lives.
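A minimal sketch of generating an llms.txt file from a list of posts. It assumes the layout described in the llms.txt proposal as commonly documented: a top-level title, a short blockquote summary, then sections containing Markdown link lists. The function name, the "Posts" section title, and the sample site and URLs are hypothetical.

```python
def build_llms_txt(site_name: str, summary: str, posts: list[tuple[str, str, str]]) -> str:
    """Render an llms.txt document from (title, url, description) tuples."""
    lines = [
        f"# {site_name}",  # H1: site or project name
        "",
        f"> {summary}",    # blockquote: one-line summary of the site
        "",
        "## Posts",        # section heading (assumed name)
        "",
    ]
    lines += [f"- [{title}]({url}): {description}" for title, url, description in posts]
    return "\n".join(lines) + "\n"


# Placeholder content for illustration only.
llms_txt = build_llms_txt(
    site_name="Example Site",
    summary="Guides to running WordPress sites.",
    posts=[
        ("Example post one", "https://example.com/post-one/", "Short description of post one."),
        ("Example post two", "https://example.com/post-two/", "Short description of post two."),
    ],
)
print(llms_txt)
```

The output would be served from the site root as /llms.txt. How much weight any given crawler places on the file is not something this sketch can tell you.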

Different engines, different approaches

Perplexity

Perplexity is probably the most citation-forward of the major AI engines. It surfaces sources prominently, often pulls from meta descriptions verbatim, and uses a retrieval approach that rewards structured, specific content. FAQ schema and answer-first meta descriptions have a noticeable impact on Perplexity citation rates.

ChatGPT Search

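ChatGPT Search (the search capability built into ChatGPT) uses a combination of Bing's web index and OpenAI's own crawling. OpenAI documents separate crawlers for separate purposes: GPTBot collects content that may be used for training, while OAI-SearchBot gathers the pages surfaced in ChatGPT's search results. A site that blocks OAI-SearchBot should not expect to appear as a cited source in ChatGPT Search, and blocking GPTBot opts the site out of training use.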

Gemini

Google's Gemini integrates with Google Search, meaning many traditional SEO signals carry over. However, Google's AI Overviews (formerly SGE) use a distinct ranking layer on top of organic results. Structured data, particularly FAQ and HowTo schema, has been consistently linked to higher AI Overview inclusion rates.

Claude

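Anthropic documents ClaudeBot as its crawler for collecting training data, with separate agents (Claude-SearchBot and Claude-User) for search indexing and for pages fetched on a user's behalf in Claude.ai's web-connected mode. All of them respond to robots.txt directives, so each can be allowed or blocked independently.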

What this means practically

If you take one thing from this guide, it's this: AI engine citation is not a byproduct of good traditional SEO. It is a separate outcome that requires separate signals. A site that ranks well on Google is not automatically well-positioned for AI citation — and vice versa.

The signals that matter most:

Explicit AI crawler permissions in robots.txt
FAQ schema on content-rich pages
Answer-first meta descriptions
Article schema with accurate publication dates
llms.txt for site-level discoverability
Clear, specific content structure

For WordPress sites with substantial back-catalogues, implementing these signals at scale is the core challenge. Manual implementation post-by-post is feasible for small sites. For anything over 30–40 posts, automated tooling is the practical answer.