Can You Predict Which Webpages an LLM Will Cite? We Tested It.

The rules of digital visibility are changing fast. A few years ago, the goal was to rank on page one of Google. Today, millions of users skip the search results page entirely and ask ChatGPT, Gemini, Perplexity, or Claude directly. The model reads the web, synthesizes an answer, and cites a handful of sources.

Your content either makes that shortlist, or it doesn't.

At elelem, we asked a question that no one in the GEO (Generative Engine Optimization) space had answered rigorously: can we build a score that predicts, ahead of time, how likely a given webpage is to be cited in an LLM-generated response?

We built one. Then we validated it against 140,000 real LLM queries. Here is what we found.

The Challenge: Citation Is Not Just About Relevance

Before we get to the score, it helps to understand why predicting LLM citations is hard.

Traditional SEO is relatively legible. Google's ranking signals, while complex, are reasonably well-studied. You can measure backlinks, page speed, keyword density, and get a directional sense of where you stand.

LLM citation works differently. A model like ChatGPT or Gemini operates through a retrieval-augmented-generation pipeline. Documents are first retrieved as candidates, then the model selects a small subset to cite within a constrained context window. That selection is influenced by factors across at least four layers:

  • Document-level factors: how semantically relevant, factually dense, and structurally extractable the content is.
  • Domain and infrastructure factors: source authority, brand familiarity to the model, crawlability, and recency signals.
  • Retrieval system factors: index inclusion, chunking compatibility, and competition among retrieved candidates.
  • Generation and citation policy factors: cross-source consensus, citation slot scarcity, and model-specific diversity rules.
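
To make the pipeline shape concrete, here is a minimal sketch of the last two layers -- candidate retrieval followed by a scarce, diversity-constrained citation step. All names and rules here are illustrative, not a description of any provider's actual system.

```python
# Hypothetical sketch of an RAG citation pipeline: retrieve candidates,
# then fill a small number of citation slots. Names are illustrative.

def retrieve(query_vec, doc_vecs, k=8):
    """Stage 1: score every indexed document against the query."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [url for url, _ in scored[:k]]

def cite(candidates, domains, slots=3):
    """Stage 2: a scarcity + diversity rule -- few slots, one per domain."""
    cited, seen = [], set()
    for url in candidates:
        if domains[url] not in seen:
            cited.append(url)
            seen.add(domains[url])
        if len(cited) == slots:
            break
    return cited
```

Even in this toy version, a page can clear retrieval and still lose its citation slot to a same-domain competitor -- which is why document-level relevance alone cannot fully determine the outcome.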

There are over 100 variables that potentially influence whether your page gets cited. Most of them are either outside your control or impossible to measure directly.

The elelem Retrieval Score focuses on the layer where you have the most leverage: document-level factors.

What the Retrieval Score Measures

The score takes a query and a webpage as input and produces a single number estimating the likelihood of citation. It is built from four families of signals, each capturing a distinct dimension of how well a document serves a given query.

The four signal families are: semantic relevance, contextual relevance, lexical overlap, and query token positioning.

Each signal is computed independently and combined into a composite score. The weights assigned to each component are not arbitrary -- they are derived empirically from how strongly each signal correlates with actual citation outcomes in our dataset. The weighting scheme is updated as we collect more data.

Without disclosing the precise implementation, the core intuition is this: a webpage that is deeply aligned with the meaning of a query, answers it directly, uses the right vocabulary, and surfaces that vocabulary early in the document, is more likely to be cited than one that does not.
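
The shape of such a composite can be sketched as follows. The weights, the placeholder values for the two model-derived signals, and the signal functions themselves are all illustrative -- the article deliberately does not disclose elelem's actual formula.

```python
# Hypothetical composite of the four signal families. The weights below are
# placeholders; the real weights are fit empirically to citation outcomes.

def lexical_overlap(query, doc):
    """Share of query tokens that appear anywhere in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def token_positioning(query, doc, window=50):
    """Reward query tokens that surface early in the document."""
    head = set(doc.lower().split()[:window])
    q = set(query.lower().split())
    return len(q & head) / len(q) if q else 0.0

def composite_score(signals, weights):
    return sum(weights[name] * value for name, value in signals.items())

query = "llm citation prediction"
doc = "Can we predict LLM citation? We score citation prediction signals here."
signals = {
    "semantic": 0.8,     # placeholder: would come from an embedding model
    "contextual": 0.7,   # placeholder: would come from a cross-encoder
    "lexical": lexical_overlap(query, doc),
    "position": token_positioning(query, doc),
}
weights = {"semantic": 0.4, "contextual": 0.3, "lexical": 0.2, "position": 0.1}
score = composite_score(signals, weights)
```

The design choice worth noting is that each signal is bounded and computed independently, so a weighted sum stays interpretable: you can see which family is dragging a page's score down.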

The Validation Study

To test whether the score actually predicts citation outcomes, we ran a correlation study across 140,414 LLM requests, spanning four providers: ChatGPT, Gemini, Perplexity, and Claude.

[Figure: five scatter plots from the correlation study]

The target variable was citation share: the fraction of times a given URL was cited within a query group, normalized per provider to control for differences in how many citations each model tends to include per response.
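
The target variable can be sketched like this; the record schema is illustrative, with each tuple standing for one citation event extracted from a logged request.

```python
# Hypothetical computation of the citation-share target. Field names are
# illustrative; each event is one (provider, query group, cited URL) triple.
from collections import Counter

def citation_share(events):
    """Returns {(provider, query_group, url): share within that group}."""
    totals = Counter((p, g) for p, g, _ in events)  # citations per group
    counts = Counter(events)                        # citations per URL
    return {key: counts[key] / totals[(key[0], key[1])] for key in counts}

events = [
    ("chatgpt", "q1", "a.com"), ("chatgpt", "q1", "a.com"),
    ("chatgpt", "q1", "b.com"), ("gemini", "q1", "a.com"),
]
shares = citation_share(events)
# a.com receives 2 of 3 ChatGPT citations in q1; grouping by provider keeps
# a sparse-citing model from diluting a generous-citing model's distribution.
```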

Key results:

  • Every component of the retrieval score was a statistically significant predictor of citation share (p < 0.001 across all groups).
  • The composite retrieval score achieved Pearson correlations up to r = 0.27 and medium effect sizes under ChatGPT and Claude.
  • Correlations were consistently stronger when analyzed per-provider than in the aggregate, suggesting provider-specific citation behaviors that the score captures well within each context.
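
The per-provider analysis behind that last point can be sketched as a grouped Pearson correlation between score and citation share. The row schema is hypothetical; a production analysis would also compute p-values and confidence intervals.

```python
# Sketch of the per-provider check: Pearson r between retrieval score and
# citation share, computed separately for each provider. Schema illustrative.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def per_provider_r(rows):
    """rows: list of (provider, retrieval_score, citation_share) tuples."""
    by_provider = {}
    for p, s, c in rows:
        xs, ys = by_provider.setdefault(p, ([], []))
        xs.append(s)
        ys.append(c)
    return {p: pearson_r(xs, ys) for p, (xs, ys) in by_provider.items()}
```

Pooling all providers into one correlation would mix four different citation policies into a single cloud of points, which is consistent with the weaker aggregate correlations the study observed.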

Why aren't the correlations higher? Because they shouldn't be. As outlined above, citation outcomes are determined by both document-level factors and system-level factors we deliberately do not attempt to model. A score that claimed to perfectly predict citations would be lying. Moderate, consistent, statistically significant correlations across 140K observations and four independent providers are exactly the signal you want from a document-level optimization tool.

What This Means for Your Content Strategy

The practical implication is straightforward. If you are a brand trying to appear in LLM-generated answers, you now have a measurable, quantitative signal to optimize against -- rather than guessing.

elelem's GEO Platform surfaces this score directly in the Score Draft feature within the Optimize Content section. For each of your webpages and a given target query group, you get a retrieval score and actionable guidance on what is holding the score down and how to address it.

The score is not a guarantee of citation. But it is the most empirically grounded signal currently available for document-level GEO.

[Screenshot: the Score Draft interface, showing per-page scores and feedback]

What We Are Building Next

This is version one of the retrieval score. The current implementation operates at the document level. Upcoming improvements include:

  • Passage-level scoring: moving from whole-document analysis to chunk-level cross-encoder scoring, which is expected to capture finer-grained relevance signals.
  • Factual density and structural extractability metrics: quantifying how information-rich and machine-parseable a document is -- two factors the literature identifies as important but which are difficult to operationalize.
  • Ablation studies: systematically evaluating the incremental contribution of each component to understand interactions and guide future weight calibration.
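
An ablation study of the kind described above can be sketched generically: recompute the composite with one component removed at a time and measure the change in correlation with citation share. Everything here is illustrative, including the component names.

```python
# Hypothetical ablation loop: drop one component at a time and measure the
# change in correlation against the citation-share target.
def ablate(signal_rows, weights, targets, corr):
    """signal_rows: one {component: value} dict per (query, page) pair.
    corr: any correlation function, e.g. Pearson r. Names illustrative."""
    def score(row, w):
        return sum(w[k] * row[k] for k in w)
    full = corr([score(r, weights) for r in signal_rows], targets)
    deltas = {}
    for dropped in weights:
        w = {k: v for k, v in weights.items() if k != dropped}
        deltas[dropped] = full - corr([score(r, w) for r in signal_rows],
                                      targets)
    return deltas
```

A large positive delta means the dropped component was carrying predictive weight; a delta near zero (or negative) flags a component that adds little beyond the others -- exactly the kind of evidence needed to recalibrate the weighting scheme.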

We are also exploring the use of industry-specific corpora for the lexical overlap component, after observing that domain-general corpora produce weaker signals for specialized verticals.

If you are working on your brand's visibility in LLM-generated responses and want to see the Retrieval Score in action, get in touch with the elelem team.