Crawling and AI search: How to stay visible without giving it all away

For 20 years, most SEO professionals only cared about Googlebot. 

However, in the last few years, a host of new crawlers has emerged from AI platforms such as ChatGPT and Perplexity.

These crawlers serve a broader range of purposes.

They’re not just the first step toward indexing.

They can also ingest content for model training or perform retrieval-augmented generation (RAG) on a specific URL on demand.

Which raises the question: Should you allow all these bots to crawl your site?

What if your audience doesn’t use DeepSeek or You.com? Is the upside worth the cost of crawling and the loss of control over how your content is presented?

There is no single “correct” answer, but there is a clear framework for approaching it.

Let them eat chunks

Allowing most AI crawlers to access the majority of your content delivers a net benefit. 

However, any truly unique intellectual property should be protected behind a paywall or login to preserve its value.

This means for most content, you will be actively optimizing for AI crawling – enriching and “chunking” content to earn visibility.

Do this fully understanding that the vast majority of websites will experience traffic drops in the coming years.

But if you’ve filtered for AI-related traffic in GA4, you’ve probably already noticed that the traffic that remains is often significantly higher quality, as AI surfaces are strong pre-qualifiers of user intent.
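If you haven’t built that filter yet, one rough way to approximate it is a GA4 exploration or segment with a “session source” condition that matches known AI referrer domains by regex. The list below is illustrative, not exhaustive – these referrers change frequently, and some AI-driven visits arrive with no referrer at all:

  chatgpt\.com|chat\.openai\.com|perplexity\.ai|copilot\.microsoft\.com|gemini\.google\.com|claude\.ai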

Beyond traffic, AI surfaces also play a growing role in building brand salience. 

Prominent citations, especially the Top 3 in AI Mode or paragraph-linked mentions in ChatGPT, influence perception.

Optimizing for AI surfaces is, for many business models, the new path to visibility.

Dig deeper: Chunk, cite, clarify, build: A content framework for AI search

AI surfaces become the category page

AI surfaces increasingly act as the “first exposure” points in the user journey, making it essential that your brand shows up early.

They also increasingly function as category pages: 

  • Aggregating offers. 
  • Comparing competitors. 
  • Linking to the “best” ones.

Currently, users are converted on the brand’s behalf only in rare cases – although I expect this to increase significantly over time. But critically, these platforms still rely on the brand for fulfillment.

This isn’t new. It’s how Amazon and other marketplaces have worked for years.

And just like with those platforms, with AI, it’s not about owning every touchpoint. It’s about earning brand salience. 

Provide a great fulfillment experience and a high-quality product or service, so the next time the user comes in-market, they come to you directly, bypassing AI search altogether.

That is how you win market share.

What if you’re an aggregator?

What about websites that aggregate content from smaller providers – like real estate portals, job boards, or service marketplaces?

Should they be concerned that AI systems might bypass them entirely?

I don’t think so. 

Realistically, even with modern content management systems, small to medium enterprises often struggle to maintain a basic website, let alone navigate the complexities of distributing content to AI platforms.

I don’t see a world where thousands of small websites across myriad industries are all efficiently aggregated by AI platforms.

That’s where trustworthy aggregators still play an essential role.

They filter, vet, and standardize. AI systems need that.

Aggregators that provide more than just listings – for example, verified review data – will be even more resistant to AI disintermediation.

Still, AI systems will continue to favor established big brands, granting them enhanced visibility.

Media is the (partial) exception

The real existential risk is to pageview-monetized media. 

Traffic to commodity content is collapsing as answers are served on AI surfaces.

For publishers, or anyone who produces article content, the answer isn’t to block AI entirely. It’s to evolve.

  • Adopt smarter editorial strategies. 
  • Diversify revenue streams. 
  • Focus on winning prominent citations. 
  • Own share of voice – don’t just chase traffic.

Because if you block AI crawling entirely, you’re forfeiting visibility to a competitor.

The only exception? If you have non-replicable content, such as:

  • Highly specialized research.
  • Unique expert advice.
  • Valuable user-generated content, such as reviews at scale.

In such cases, it doesn’t have to be all or nothing – consider partial crawling.

Give bots a taste to earn citations, but don’t let them feast.

This lets your brand stay competitive while preserving your unique advantage.
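If you go that route, a rough robots.txt sketch looks like the following. The crawler tokens are real, but the directory paths are placeholders, and robots.txt is only advisory – genuinely unique IP still belongs behind a login or paywall:

  # Let AI crawlers reach public content, keep premium sections out
  # (paths are illustrative; adjust to your own site structure)
  User-agent: GPTBot
  User-agent: ClaudeBot
  User-agent: PerplexityBot
  Disallow: /premium-research/
  Disallow: /expert-reports/
  Allow: /

  # Google-Extended governs whether content is used for Gemini training;
  # it does not affect classic Google Search crawling
  User-agent: Google-Extended
  Disallow: /premium-research/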

If we agree that the goal is not just to allow AI crawling but to actively encourage it, the next question becomes: how do you optimize for it from an SEO perspective?

How to optimize for chunking

Being optimized for Googlebot is not enough. 

You now need to cater to a multitude of crawlers, not all of which have the same level of capabilities.

What’s more, indexing is no longer at the URL level.

Content is broken down into important components, which are stored in a vector database.

Think of each section of your content as a standalone snippet, and win AI citations by:

  • Keeping one self-contained idea per paragraph.
  • Keeping paragraphs to 1-4 sentences.
  • Using clear subheadings, marked up as H2 or H3.
  • Using proper entity names.
  • Targeting a high Flesch reading ease score, prioritizing clarity over cleverness.
  • Writing structured, accessible, semantic HTML.
  • Thinking multi-modal, ensuring crawlability of images and videos.
  • Avoiding JavaScript dependency, as not all crawlers can process it.
  • Using factually accurate, up-to-date information.

If AI crawlers can’t access and understand your content, they won’t cite it.
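To make that concrete, here’s a minimal sketch of a chunk-friendly section. The topic and file names are invented; what matters is the pattern – a descriptive H2, short self-contained paragraphs, explicit entity names instead of pronouns, and nothing that depends on client-side rendering:

  <section>
    <h2>Do AI crawlers render JavaScript?</h2>
    <p>Most AI crawlers do not execute JavaScript, so copy injected client-side may never be retrieved or cited.</p>
    <p>Serve the critical content in the initial HTML response, or use server-side rendering for key templates.</p>
    <img src="/images/rendering-comparison.png" alt="Server-side vs. client-side rendering for AI crawlers">
  </section>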

Dig deeper: Inside the AI-powered retrieval stack – and how to win in it



You don’t need to spoon-feed with llms.txt

Despite the buzz, llms.txt is not an official standard, it’s not widely adopted, and no major AI indexer respects it.

In practice, the file is rarely requested – sites that have implemented llms.txt typically see little to no bot activity on it.

Could that change? Maybe.

But until it’s adopted, don’t waste time implementing a file that bots aren’t checking.

Other technical SEO improvements, such as graph-based structured data and improving crawl speed, are far more likely to positively impact visibility on AI surfaces.
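For example, a graph-based implementation links entities together with @id references instead of scattering isolated blocks of markup across templates. A minimal sketch, with placeholder names and URLs:

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@graph": [
      {
        "@type": "Organization",
        "@id": "https://www.example.com/#organization",
        "name": "Example Co",
        "sameAs": ["https://www.linkedin.com/company/example-co"]
      },
      {
        "@type": "Article",
        "@id": "https://www.example.com/guide/#article",
        "headline": "A guide to AI crawling",
        "author": { "@id": "https://www.example.com/#organization" }
      }
    ]
  }
  </script>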

Focus on what matters for AI visibility now, not a hypothetical future that is unlikely to ever occur.

How to speed up crawling

I’ve covered:

  • How to measure and improve crawl efficacy.
  • How to optimize crawled pages for rapid indexing.

Many of these tactics for traditional search hold true for AI bots as well:

  • Fast, healthy server responses for all bots – trending below 600 milliseconds at most, ideally closer to 300.
  • For efficient crawling, ensure a clean and clear URL structure rather than relying on rel=canonical and other such hints. Where this is not possible, block no-SEO-value routes with robots.txt.
  • Gracefully handle pagination.
  • Real-time XML sitemaps submitted to Google Search Console (for Gemini) and Bing Webmaster Tools (for ChatGPT and Copilot).
  • Where possible, use Indexing APIs to submit fresh content.
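On the sitemap point, “real-time” simply means the <lastmod> value is updated the moment content genuinely changes – not on every deploy. A minimal entry, with a placeholder URL and date:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://www.example.com/new-listing</loc>
      <lastmod>2025-01-15T09:30:00+00:00</lastmod>
    </url>
  </urlset>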

These fundamentals become even more important in an AI world, where we can see Google proactively cleaning its index.

Rejecting large swaths of previously indexed URLs will, I suspect, improve the quality of “RAGable” content.

That said, measurement of crawling needs to move beyond easily accessible data, like the Crawl Stats report in Google Search Console, and focus more on log files, which offer clearer reporting on all the different types of AI crawlers.
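If you have server access, even a quick pass over raw access logs will show which AI crawlers are visiting. A rough sketch – the log path and the token list are assumptions you’ll need to adapt to your own stack:

  # Count requests per AI crawler user-agent token (illustrative list)
  grep -oEi 'GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot|CCBot|Bytespider|Amazonbot' \
    /var/log/nginx/access.log | sort | uniq -c | sort -rn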

CDNs like Cloudflare and AI visibility trackers now offer this reporting, making it more accessible than ever.

Requests by AI crawlers

Crawling offers value beyond website indexing

While Googlebot, Bingbot, and AI platforms receive the most attention, SEO tool crawlers also heavily visit many websites.

Before AI systems became prominent, I blocked most of them via .htaccess. They offered little value in return while exposing competitive insights.

Now, my view has changed. I allow them because they contribute to brand visibility in AI-generated content.

ChatGPT - Popular news websites in Australia

It’s one thing for me to claim my website is the most popular – it hits differently when ChatGPT or Gemini says it, backed by Semrush traffic data.

AI systems favor consensus. The more aligned signals they detect, the more likely your messaging is repeated. 

Allowing SEO crawlers to verify your market position, being featured on comparison sites, and being listed in directories all help reinforce your narrative – assuming you’re delivering real value.

In the AI era, it’s not about link building but rather citation management – curating crawlable off-site content that corroborates your brand positioning through external citations.

This adds weight. It builds trust.

Crawling is no longer only about website indexing. It’s about digital brand management.

So let the bots crawl. Feed them structured, useful, high-quality chunks.

AI search visibility isn’t just about traffic. It’s trust, positioning, and brand salience.

Dig deeper: Chunks, passages and micro-answer engine optimization wins in Google AI Mode