Every AI Visibility Tool Is Lying to You

Arber Xhindoli · June 30, 2026 · 12 min read


A new category of software promises to tell you how visible your brand is inside ChatGPT, Claude, Gemini, Perplexity, and Google's AI answers. It gives you a dashboard with a number on it: mention rate, citation rate, share of voice, maybe even a rank.

You are number four in your category. You moved up two spots this week. Your competitor is shown at 31% visibility. You are shown at 17%.

That is the lie.

Not because there is no signal. There is signal. Not because the vendors are all bad actors. Most of them are trying to measure something genuinely important. The lie is the precision. They are selling a clean number for a system that is noisy, personalized, geographic, nondeterministic, and constantly changing underneath them.

And in the SaaS category, the mechanism is usually worse than the dashboard admits. If a tool claims to show you "what customers see" in ChatGPT or Claude, the only outside path is to automate the consumer frontend and scrape the answer. If it is not doing that, it is usually hitting the API, which is not the same product your customer uses. Either way, you are looking at a synthetic probe and pretending it represents reality.

It does not.

The frontend scrape problem

Scraping the ChatGPT or Claude frontend sounds persuasive at first. The vendor can say, truthfully, that it is not just calling a developer endpoint. It is opening the app, asking the question, and recording what the product returns.

That is closer to the surface a real user sees. It is still not the same thing as measuring the real user population.

A scrape is one account, or a pool of controlled accounts. It has one history state, one memory state, one subscription tier, one geography, one browser session, one abuse-detection posture, and one prompt phrasing. The answer can change if any of those change. A real buyer asking "best CRM for a seed-stage startup" is not the same instrument as a clean synthetic browser asking "best CRM software" from a datacenter IP.

Mass scraping makes the bias worse for structural reasons. At any meaningful volume, the work has to run from somewhere: cloud machines, proxy routes, managed browsers, headless sessions, or some other automation layer. That does not prove how any specific vendor operates. It means the measurement inherits infrastructure artifacts that real user demand does not have: concentrated IP patterns, repeated login behavior, unusual session rhythms, rate-limit pressure, and possible anti-abuse handling from the AI product itself. The operator then has to decide whether to use clean accounts with no memory, which are repeatable but unlike customers, or aged accounts with history, which are more realistic but no longer controlled. A benchmark account that asks thousands of category prompts also creates its own synthetic personalization trail. After a while, the account is not a neutral buyer; it is an account whose entire life is benchmark traffic.

This matters most for local and commercial prompts. "Best commercial roofing company near me" is not a universal question. "Best AEO agency in NYC" is not a universal question. The answer depends on where the user is, which retrieval system fires, what the product knows about the account, and what fresh sources it pulls at that moment.

Scraping one frontend answer is not measuring visibility. It is collecting one artifact from one lab condition.

The same prompt does not reliably produce the same answer

The simplest defense of an AI visibility rank is: we ask the model the same question every week and count whether you show up.

That assumes the same question has a stable answer. It often does not.

Even temperature-zero LLM calls are not perfectly stable in production. Thinking Machines Lab explained one technical reason: inference endpoints can vary because batching and kernel behavior vary under real production load. Their example showed identical temperature-zero requests producing multiple unique completions.

SparkToro and Gumshoe saw the marketing consequence when they had volunteers run repeated commercial prompts through ChatGPT, Claude, and Google's AI products. Their research found that the exact list of recommended brands is highly inconsistent across repeated runs.

This is the core measurement problem. If the next draw from the same system can name a different set of brands, then "you rank number four" is not a fact. It is one sample from a distribution.

An honest dashboard would show the distribution.
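As a rough sketch of what "showing the distribution" could look like, the snippet below turns repeated runs of one prompt into an appearance rate with a Wilson score interval. The function names and the sample data are illustrative assumptions, not any vendor's implementation.

```python
import math

def wilson_interval(hits: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (roughly 95% when z = 1.96)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = hits / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def appearance_summary(runs: list[list[str]], brand: str) -> dict:
    """Summarize how often `brand` appears across repeated runs of the same prompt."""
    hits = sum(1 for brands in runs if brand in brands)
    low, high = wilson_interval(hits, len(runs))
    return {"runs": len(runs), "hits": hits, "rate": hits / len(runs),
            "low": low, "high": high}

# Hypothetical data: brands named in each of 10 repeated runs of one prompt.
sample_runs = [["A", "B"], ["B", "C"], ["A"], ["C"], ["A", "C"],
               ["B"], ["A", "B", "C"], ["C"], ["A"], ["B", "C"]]
print(appearance_summary(sample_runs, "A"))
```

With only ten runs the interval is wide, which is the point: a single rank hides how little the sample actually pins down.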

The consumer app and the API are different products

Some tools avoid browser scraping and call provider APIs instead. That is cleaner operationally. It is easier to repeat, easier to audit, cheaper to run at scale, and less likely to break because a button moved in a web app.

But it has its own problem: the API is not the consumer app.

The consumer product may have memory, account personalization, model routing, web retrieval, location inference, shopping or local modules, citations, and product-specific presentation layers. The API gives you a programmable model call with whatever tools and parameters you explicitly enable. OpenAI's API docs, for example, require you to add tools such as web search when you want grounded retrieval. Google's Gemini API has its own grounding and search configuration.

That difference cuts both ways. A raw API call can understate what the app would know because it does not browse the same way. A browser scrape can overstate what a real population would see because it captures one personalized session and calls it representative.

The vendor has to pick a poison. The dishonest move is pretending there is no poison.

Prompt sets manufacture the score

AI visibility tools do not monitor the infinite long tail of real buyer questions. They monitor a prompt set.

That prompt set is decisive.

If I track "best AEO agency in NYC," "AI search optimization consultant," and "answer engine optimization audit," I get one picture of Canonry. If I track "SEO agency," "digital marketing firm," and "AI marketing software," I get another. Neither is automatically wrong. But the headline number depends on which questions were selected, how they were weighted, how often they were run, and how competitors were grouped. Profound's own prompt-design guide says its users generally track 100 to 1,000 prompts, with a couple hundred being typical. That is a sample, not the market.

The scoring formula matters just as much. One dashboard can score mention frequency. Another can weight citation position. Another can count source links. Another can blend sentiment. Digital Applied's AI share-of-voice framework gives a clean example, the same one this post abused in its intro: the same brand, on the same data, scores 20% mention-based share of voice, 16.8% position-weighted share of voice, and 31.4% citation-based share of voice. Same evidence, three headline numbers, three different competitive standings.
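To make the effect concrete, here is a minimal sketch that scores one small, made-up dataset three ways. The formulas (raw mention counts, 1/position weights, citation counts) are simple stand-ins chosen for illustration; they are not Digital Applied's definitions and will not reproduce its numbers.

```python
from collections import Counter

# Hypothetical observations: each answer lists brands in the order mentioned
# and the brands whose domains were cited as sources.
answers = [
    {"mentions": ["A", "B", "C"], "citations": ["B"]},
    {"mentions": ["B", "A"], "citations": ["A", "B"]},
    {"mentions": ["C", "B"], "citations": ["B"]},
]

def share(counts: Counter, brand: str) -> float:
    """Brand's portion of the total weight across all brands."""
    total = sum(counts.values())
    return counts[brand] / total if total else 0.0

def mention_share(answers: list[dict], brand: str) -> float:
    counts = Counter(b for a in answers for b in a["mentions"])
    return share(counts, brand)

def position_weighted_share(answers: list[dict], brand: str) -> float:
    # Assumed weighting: first mention counts 1, second 1/2, third 1/3, ...
    weights = Counter()
    for a in answers:
        for pos, b in enumerate(a["mentions"], start=1):
            weights[b] += 1 / pos
    return share(weights, brand)

def citation_share(answers: list[dict], brand: str) -> float:
    counts = Counter(b for a in answers for b in a["citations"])
    return share(counts, brand)

for score in (mention_share, position_weighted_share, citation_share):
    print(score.__name__, round(score(answers, "B"), 3))
```

Three formulas, one dataset, three different headline numbers for brand B. Which one a dashboard reports is a design decision, not a fact about the market.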

That is why practitioners are skeptical. In the same Digital Applied piece, Dan Taylor of SALT.agency criticizes vendors for measuring small, static prompt sets inside a contrived environment. Digiday reported the same operational problem from the buyer side: Paul Dyer, CEO of /prompt, said that if you give three tools the same prompts, you get three different answers.

Unless the tool shows the prompt list, the number of runs per prompt, the geography, the model, the account state, and the scoring formula, you are not looking at measurement. You are looking at a constructed metric.

Constructed metrics can be useful. They should be labeled as constructed.

Location breaks the leaderboard

Geography is the part most dashboards hand-wave away.

For local, regional, and service-area businesses, location is not metadata. It is part of the question. A user in Brooklyn, Austin, London, or rural Michigan can get materially different recommendations for the same words because the answer engine infers local intent.

That means a single global visibility rank is often meaningless. "Visible in ChatGPT" where? From which user location? With which local retrieval context? With which city or service-area phrase?

Frontend scraping makes this especially messy. A synthetic browser run from a cloud server does not naturally look like a buyer standing in the market you care about. You can try proxies. You can try account pools. You can try browser automation. Now you have a fragile measurement stack where your "truth" depends on whether the frontend accepted the location story your scraper was telling.

API-based measurement has a cleaner path here: pass explicit location context where the provider supports it, and run the same prompt across the geographies you actually care about. That still does not recreate every real user, but it turns location from an accidental artifact into a controlled variable.

That is the direction Canonry takes.

Why local execution matters for local SEO

This is where Canonry's local-first design changes the measurement problem.

Most hosted dashboards run probes from vendor infrastructure. For a national SaaS query, that may be tolerable. For a local client, it is often the wrong instrument. A plumber in Queens, a dentist in Austin, or a roofing contractor in Michigan is trying to influence answers that real buyers ask from phones and laptops inside the service area, not from a scraper cluster in another region.

Canonry can run on a machine in the market. An agency servicing local clients can run checks from its own office, from a technician's laptop, or from another machine whose IP, network environment, time zone, and geography are much closer to the target consumer. That does not magically remove nondeterminism, and it does not make API results identical to every consumer UI. It does remove a major source of measurement error: outsourced cloud geography pretending to be local demand.

For local SEO and local AEO, that is not a small detail. The closer the measurement environment is to the buyer's environment, the less you have to trust a proxy story. You can still pass explicit location context where providers support it. But when the test is running from a machine in the relevant market, the accidental signals line up with the intentional ones instead of fighting them.

That makes Canonry materially more accurate for operators serving local clients. If your customer is a Chicago HVAC company, a Brooklyn hospitality group, or a Michigan roofing contractor, you can run a repeatable prompt set from a local machine and compare it against the same prompt set from another geography. The difference is not noise to average away. It is the thing you are trying to measure.
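A minimal sketch of that comparison, assuming you already have hit counts for the same prompt set run from two places. The function name and the counts are hypothetical, and the normal-approximation interval is a rough tool that gets unreliable at small run counts.

```python
import math

def rate_difference(hits_a: int, runs_a: int, hits_b: int, runs_b: int,
                    z: float = 1.96) -> tuple[float, float, float]:
    """Difference in appearance rate (A minus B) with a normal-approximation interval."""
    if runs_a <= 0 or runs_b <= 0:
        raise ValueError("run counts must be positive")
    p_a = hits_a / runs_a
    p_b = hits_b / runs_b
    se = math.sqrt(p_a * (1 - p_a) / runs_a + p_b * (1 - p_b) / runs_b)
    diff = p_a - p_b
    return diff, diff - z * se, diff + z * se

# Hypothetical counts: the brand appeared in 18 of 30 runs from a local machine
# and in 6 of 30 runs of the same prompts from another geography.
diff, low, high = rate_difference(18, 30, 6, 30)
print(f"difference={diff:.2f}, interval=({low:.2f}, {high:.2f})")
```

If the interval sits clearly away from zero, the geographic gap is probably real rather than run-to-run noise. If it straddles zero, the honest answer is "run more samples."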

Model drift turns trend lines into fiction

Even if you solve sampling, personalization, API-vs-app differences, prompt selection, and geography, the instrument itself keeps changing.

The model behind a familiar product name can be updated, routed, rolled back, or silently adjusted. Retrieval systems change. Citation behavior changes. Product interfaces change. A week-over-week movement in your AI visibility dashboard can mean your content improved. It can also mean the model changed, the retrieval layer changed, or the product started answering the prompt differently.

This is not hypothetical. Chen, Zaharia, and Zou's paper "How is ChatGPT's behavior changing over time?" compared March 2023 and June 2023 versions of GPT-3.5 and GPT-4 and found large behavior changes across tasks under the same public model names. One concrete example: GPT-4's prime-number accuracy moved from 84% in March to 51% in June. Treat that as evidence that drift is real and measurable, not as a current estimate of today's model quality.

The same pattern appears in product behavior. In an April 29, 2025 post, OpenAI said it had rolled back the previous week's GPT-4o update in ChatGPT because the removed version was too flattering and agreeable. That is exactly the kind of product-level behavior change an outside visibility dashboard can observe only after it has already polluted the trend line.

From the outside, those effects are hard to separate. A dashboard can tell you that a number moved. It usually cannot prove why.

That does not make the number worthless. It makes causal claims dangerous.

What these tools can honestly tell you

The category is not useless. It just needs to stop pretending a weather forecast is a thermometer.

AI visibility monitoring can support useful conclusions:

  • We are invisible for the commercial prompts buyers actually ask.
  • We appear often on branded prompts but rarely on category prompts.
  • One competitor is cited much more frequently than we are.
  • Claude sees us and ChatGPT does not.
  • We show up in New York but not in Los Angeles.
  • A specific content or schema change appears to correlate with better citation frequency over repeated runs.

Those are directional, probabilistic findings. They are useful. They help teams prioritize work.

What the tools cannot honestly support is fake precision:

  • You are rank number four.
  • You moved up exactly two positions.
  • Your AI share of voice is 17%.
  • This week's lift was caused by last week's blog post.
  • This single screenshot is what your customers see.

Those claims collapse unless the tool shows its samples, its spread, and its method.

How Canonry measures it

Canonry does not try to pretend there is one canonical answer sitting inside ChatGPT waiting to be scraped.

We treat AI visibility as a distribution.

That means the unit of measurement is not "the answer." It is repeated observations across prompts, providers, competitors, and locations. Canonry uses provider APIs because they give us a controlled, repeatable measurement surface. Where the provider supports it, we pass geolocation context instead of hoping a browser scrape inherits the right location from a proxy. We record the prompt, provider, timestamp, configured location, cited domains, mentions, source evidence, and run history so the number can be audited later.
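For illustration only, an auditable record of one run could look something like the sketch below. The class name, field names, and JSON layout are assumptions made for this example; they are not Canonry's actual schema.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class Observation:
    """One provider response to one prompt (hypothetical record layout)."""
    prompt: str
    provider: str
    configured_location: Optional[str]   # None when no location context was passed
    mentions: tuple[str, ...]            # brands named in the answer
    cited_domains: tuple[str, ...]       # domains cited as sources
    raw_answer: str                      # kept so the result can be re-inspected
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def to_json_line(obs: Observation) -> str:
    """Serialize one observation as a single JSON line for an append-only log."""
    return json.dumps(asdict(obs), sort_keys=True)

obs = Observation(
    prompt="best AEO agency in NYC",
    provider="example-provider",         # placeholder, not a real provider id
    configured_location="New York, NY, US",
    mentions=("Brand A", "Brand B"),
    cited_domains=("example.com",),
    raw_answer="(full answer text would go here)",
)
print(to_json_line(obs))
```

Keeping the raw answer next to the derived fields is the design choice that matters here: any score computed later can be recomputed, or disputed, from the evidence.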

Is that exactly what every real user sees? No.

It is not a logged-in customer with years of ChatGPT history. It is not the exact consumer UI presentation layer. It is not the full long-term distribution of every possible buyer question. It is a controlled sample designed to answer a narrower question: under this prompt set, for this geography, against these competitors, across these providers, how often do we appear?

That is a more honest question. It is also a more useful one.

The downside: honest measurement costs more

There is a reason the cheap dashboard is tempting.

One scrape is cheap. One prompt run is cheap. A single API call with no repetitions and no geography is cheap. A polished line chart built from thin data can look just as confident as an expensive measurement system.

Canonry's approach costs more because it does more work:

  • It runs more than one sample when the question matters.
  • It compares multiple providers instead of collapsing the market into one model.
  • It tracks competitors, not just your own domain.
  • It passes location context where supported.
  • It keeps evidence so the result can be inspected instead of just summarized.
  • It treats prompt sets as configuration, not magic.

That burns tokens. Grounded calls can cost more than plain completions. Repeated runs multiply cost. Location-aware coverage multiplies cost again. If you want New York, Los Angeles, Chicago, London, and Toronto across 200 prompts and four providers, you are not buying a vibe. You are buying a measurement program.
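The arithmetic is worth writing down. The sketch below counts provider calls for the coverage described above; the number of repeated runs per combination is an assumption you would choose yourself, and per-call prices are left out because they vary by provider and by whether grounding is enabled.

```python
def total_calls(locations: int, prompts: int, providers: int, runs_per_cell: int) -> int:
    """Provider calls needed for one full measurement pass.

    A "cell" is one (location, prompt, provider) combination; runs_per_cell is
    how many repeated samples you take for each one.
    """
    for value in (locations, prompts, providers, runs_per_cell):
        if value < 0:
            raise ValueError("counts must be non-negative")
    return locations * prompts * providers * runs_per_cell

# Five cities, 200 prompts, four providers, as in the example above.
single_pass = total_calls(locations=5, prompts=200, providers=4, runs_per_cell=1)
repeated = total_calls(locations=5, prompts=200, providers=4, runs_per_cell=5)  # 5 runs is an assumption
print(single_pass, repeated)
```

A single pass is already 4,000 calls. Sampling each combination five times to estimate a spread makes it 20,000, every time the program runs.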

The alternative is cheaper because it is thinner.

The bar for any AI visibility dashboard

If you are buying a tool in this category, ask for the work behind the number.

Ask:

  1. Are you scraping the consumer frontend, calling the API, or both?
  2. If you scrape the frontend, whose account, location, memory state, and subscription tier are represented?
  3. If you call the API, which tools are enabled, and how do you handle web retrieval?
  4. How many runs per prompt produce the number?
  5. Do you report variance or confidence intervals?
  6. Is geography explicit, inferred, or ignored?
  7. Can I see the raw answers and source evidence?
  8. Can I see the prompt list and scoring formula?
  9. Can I separate model drift from my own content changes?

If the dashboard cannot answer those questions, the number is decoration.

The honest future of AI visibility measurement is not a leaderboard. It is a distribution with evidence attached.

That is less sexy than "you are number four."

It is also much closer to the truth.

Are AI visibility tools useless?

No. They can be useful for coarse directional measurement, especially when they report appearance frequency across many prompts and repeated runs. The problem is the false precision: fixed rankings, one-decimal share-of-voice numbers, and leaderboard movement presented without sample sizes, variance, or methodology.

Why is scraping ChatGPT or Claude inaccurate?

A scraped answer is one synthetic account, one prompt set, one location, one model state, and one moment in time. Real users see a distribution of answers shaped by nondeterministic inference, personalization, geography, account state, retrieval choices, and product changes. A scrape can be evidence, but it is not the population.

Is the API more accurate than scraping the frontend?

Not automatically. The API is not the consumer product, so API results can differ from what a person sees in ChatGPT or Claude. Canonry uses provider APIs because they give repeatability, explicit configuration, and controllable geography. That makes the measurement cleaner, not perfect.

Why does Canonry's approach cost more?

Honest measurement requires repeated provider calls across prompts, competitors, models, and geographies. Passing location context and running enough samples to estimate a distribution consumes more API tokens and search-grounding calls than taking a single browser screenshot or cheap one-off probe.

