Why your website is invisible to ChatGPT even if Google can see it
Five specific technical reasons a Google-indexed site fails the GPTBot crawl. None of them are obvious. All of them are fixable in under a sprint.
What does it actually mean for a site to be "invisible to ChatGPT"?
Three different failure modes live under the same surface complaint. Most operators conflate them, which is part of why the fix often takes longer than it should.
Mode A — the crawler can't reach the content. GPTBot requests the page; the site returns a 403, a redirect loop, or a robots.txt block. The page never enters the engine's training-data candidate pool.
Mode B — the crawler reaches but can't render. GPTBot fetches the URL successfully; the response is a JavaScript shell with no content. The engine sees an empty page. Your site exists from a Google-rendering perspective (Googlebot has a JS rendering pass) but doesn't exist from an OpenAI-extraction perspective (GPTBot doesn't run JavaScript at all).
Mode C — the crawler renders but can't extract. GPTBot fetches and reads the HTML. The HTML is intact. But the prose is structured in a way that gives the engine nothing to anchor on — no headings, no schema, no clear thesis sentences. The page exists in the engine's index but never gets cited because there's nothing extractable.
Each mode has different fixes. Most operators trying to "get ChatGPT to mention us" haven't yet diagnosed which mode they're failing on.
So why does Google index a site that GPTBot doesn't?
Five reasons, none obvious from a Google Search Console review.
Reason 1 — your robots.txt blocks the AI crawler explicitly
The most common reason and the cheapest fix. Many sites built between 2022 and 2024 added explicit Disallow rules for GPTBot, ClaudeBot, anthropic-ai, and CCBot — either as a privacy posture or as a hedge against training-data extraction. The rules then never got revisited.
What it looks like in robots.txt:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
The fix: open robots.txt, find the Disallow: / rules under each AI user-agent, decide consciously whether to keep or remove them. If you want the engines to cite you, the rules go. Five-minute fix, deployed immediately, takes effect on the next crawler refresh (7 to 14 days for most engines). The full crawler list and the unblock checklist live at what is an AI visibility audit.
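If you'd rather check the rules programmatically than eyeball the file, Python's standard-library robots.txt parser does it in a few lines. A minimal sketch — the agent list below covers the four crawlers named above, and the sample robots.txt is invented for illustration:

```python
# Check which AI crawlers a robots.txt blocks, using only the stdlib.
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "ClaudeBot", "anthropic-ai", "CCBot"]

def blocked_agents(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the AI user-agents that robots_txt disallows for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, url)]

robots = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_agents(robots))  # ['GPTBot']
```

Run it against your live file (fetch robots.txt, pass the text in) and the output is the exact list of engines you're turning away.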
Reason 2 — your CMS injects a wildcard block via headers
Less obvious. Some hosting platforms (especially headless or edge-deployed setups) inject a default X-Robots-Tag: noai, noimageai header on every response. The robots.txt looks clean, but the header tells the engines not to train on the content. Inspect the actual response with curl -I https://yourdomain.com/ and look for the header.
The fix: configure the platform to strip or override the header on production. Some platforms (Vercel, Netlify, Cloudflare Pages) have UI controls; others require a config-file change.
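The header check is easy to script once you have the response headers in hand. A sketch, assuming a plain dict of headers as curl -I would show them — the directive names are the ones discussed above, and real platforms may emit others:

```python
# Flag AI-blocking directives in an X-Robots-Tag response header.
def ai_blocking_directives(headers: dict[str, str]) -> list[str]:
    """Return any blocking directives found in X-Robots-Tag."""
    # Note: real header lookups should be case-insensitive; kept simple here.
    tag = headers.get("X-Robots-Tag", "")
    directives = {d.strip().lower() for d in tag.split(",")}
    return sorted(directives & {"noai", "noimageai", "noindex", "none"})

headers = {"Content-Type": "text/html", "X-Robots-Tag": "noai, noimageai"}
print(ai_blocking_directives(headers))  # ['noai', 'noimageai']
```

A non-empty result on a production page means the platform is overriding your clean robots.txt at the header layer.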
Reason 3 — your content lives behind JavaScript hydration
The Google failure mode that doesn't apply: Googlebot has a delayed JS-rendering pass. It will fetch your React shell, queue the page for rendering, and index the rendered content within minutes to hours. GPTBot has no such pass. It fetches the HTML response and reads what's there.
Sites built with client-side React, Vue, or Svelte without server-side rendering or static generation — and there are still many of them — return a near-empty HTML shell to GPTBot. The crawler extracts nothing.
The fix: deploy the site with server-side rendering (Next.js App Router, Remix, SvelteKit, etc.) or static generation (Astro, Eleventy, Hugo). The technology choice is downstream; the requirement is that the HTML response contains the readable content. If you can curl your homepage and read the prose in the response body, GPTBot can read it too. If curl returns <div id="root"></div> and a script tag, GPTBot returns nothing.
Reason 4 — your content is gated behind an interaction wall
Cookie banners, age gates, region-detection redirects, login walls. Each one breaks the crawler's path. GPTBot won't click "Accept all cookies". It won't dismiss your newsletter modal. It won't sign in to your gated content.
The fix is more nuanced than the crawler-block fix. Some interaction walls are legally required (cookie consent in EU jurisdictions, age verification for regulated content). The engineering question is whether the wall blocks crawlers specifically — GDPR-compliant cookie banners shouldn't block server-rendered crawler access at all. If yours does, you have a misconfigured banner, not a regulatory requirement.
For login walls, the question is "is this content commercially valuable for AI citation?" If yes, expose it (or a structured summary of it) on a public page. If no, leave the gate.
Reason 5 — your content is intact but the engines have no anchor to extract from
The hardest mode to diagnose because the site looks fine. HTML renders, content is readable, no crawler blocks. But the page is a wall of prose with no schema, no FAQ structure, no thesis-first paragraphs, no inline citations, no question-shape headings.
The engines extract more reliably from JSON-LD structured data than from prose. They extract more reliably from FAQ-shaped content than from narrative content. They extract more reliably from pages where the first sentence states the page's thesis than from pages that open with throat-clearing.
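A minimal sketch of the JSON-LD shape this refers to — an FAQPage block generated from question/answer pairs. The field names come from schema.org's FAQPage type; the content here is invented:

```python
# Emit a minimal schema.org FAQPage JSON-LD script tag.
import json

def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Serialize question/answer pairs as FAQPage JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": q,
                "acceptedAnswer": {"@type": "Answer", "text": a},
            }
            for q, a in pairs
        ],
    }
    return f'<script type="application/ld+json">{json.dumps(data)}</script>'

print(faq_jsonld([("What is mode C?", "Readable HTML with nothing extractable.")]))
```

Drop the output into the page head and the question/answer pairs become machine-readable anchors instead of prose the engine has to guess at.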
The fix is the longest of the five — it's a content-shape rewrite, not a five-minute config change. The full citability rubric lives at what is GEO. The audit dossier at /audit flags every page that fails the rubric and ranks them by traffic value.
How do you actually test which mode you're failing on?
Three commands, in order.
Test 1 — fetch with named user-agent.
curl -A "GPTBot" https://yourdomain.com/ -I
The response status tells you mode A. A 403 or a 301 to a block page means you have a crawler-block problem. One caveat: a 200 here doesn't fully clear mode A, because GPTBot checks robots.txt before it ever fetches the page — a clean status with a Disallow rule in robots.txt is still a block. If the status is 200 and robots.txt is clean, move to test 2.
Test 2 — read the response body.
curl -sA "GPTBot" https://yourdomain.com/ | grep -c "<p>"
If the count is in the dozens, the page rendered fine and you're not in mode B. If the count is 0 or 1, your content is hidden behind hydration and you have a rendering problem.
Test 3 — inspect the schema.
curl -sA "GPTBot" https://yourdomain.com/ | grep "application/ld+json"
Zero results means no JSON-LD schema is being emitted. Your page is in mode C — readable but not extractable.
Three commands. Each one diagnoses one mode. Operators usually find at least one mode broken; many find two or three.
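The three tests fold into one classifier. A sketch — the thresholds (at most one &lt;p&gt; tag, zero JSON-LD blocks) mirror the commands above and are heuristics, not hard rules:

```python
# Classify a page's failure mode from its response status and body.
def diagnose(status: int, body: str) -> str:
    """Return the failure mode ('A', 'B', 'C') or 'ok'."""
    if status != 200:
        return "A"                          # blocked before content is served
    if body.count("<p>") <= 1:
        return "B"                          # JS shell, nothing server-rendered
    if "application/ld+json" not in body:
        return "C"                          # readable but no schema anchor
    return "ok"

print(diagnose(403, ""))                     # A
print(diagnose(200, '<div id="root"></div>'))  # B
print(diagnose(200, "<p>one</p><p>two</p>"))   # C
```

Feed it the status and body from the curl fetches and it names the first broken mode; fix that, rerun, repeat.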
What's the order of operations to fix this?
The dossier ranks fixes by revenue impact. The general pattern:
Sprint 01 — modes A and B. Crawler unblock, header strip, rendering pipeline fix if needed. These are the cheap, high-confidence fixes. Most operators see citation lift within 30 days of these landing.
Sprint 02 — mode C, schema layer. Deploy Organization, FAQPage, Article with citation arrays, Person, Service, Review. Six types, deployed across surface pages. Compounds on the Sprint 01 unblock.
Sprint 03 — mode C, content-shape layer. Rewrite the highest-traffic pages against the citability rubric. Lead with thesis sentences. Add question-shape headings. Insert inline citations. Compounds on the schema layer from Sprint 02.
The sequencing isn't aesthetic. Each sprint's lift compounds the next sprint's lift. Skipping Sprint 01 and going straight to schema produces narrow lift. The full sprint plan with the 60-day cadence lives at the sample 14-day AI sprint plan.
What does the operator do this week?
Two concrete moves.
Move 1 — run the three test commands. Twenty minutes, total. The output tells you which mode you're failing on. If you fail multiple modes, sequence the fixes: A first, B second, C last.
Move 2 — request the audit if mode C looks broken. Modes A and B are usually fixable in-house once diagnosed. Mode C requires the schema-coverage scoring and the content-shape rubric, both of which take judgment. The /audit ships in 5 business days with the rubric applied to your specific surface pages.
Where to go from here
- What is GEO? /answers/what-is-geo.
- What schema matters? /answers/what-schema-matters-for-ai-visibility.
- The audit: /answers/what-is-an-ai-visibility-audit.
- Or just request the audit: /audit. The three-command test is the start; the dossier is the diagnosis.