Which AI crawlers should I allow in robots.txt? GPTBot, PerplexityBot, etc.
Community discussion on which AI crawlers to allow or block. Real decisions from webmasters on GPTBot, PerplexityBot, and other AI crawler access for visibility.
I’m trying to figure out the right robots.txt configuration for AI crawlers and the information online is contradictory.
Some articles say block everything to “protect your content.” Others say allow everything for AI visibility. Most don’t even mention specific crawler names.
What I’m trying to understand:
- Which crawlers actually matter (GPTBot, PerplexityBot, ClaudeBot, Google-Extended, anything else?)
- Whether blocking does anything for content that’s already in training data
- What a sensible default configuration looks like for a typical business site
Currently our robots.txt is a mess with rules from 2019 that definitely don’t account for any of this.
Anyone who’s actually done this properly - what’s your setup?
I manage robots.txt for about 40 enterprise sites. Here’s the breakdown that actually matters:
Tier 1 - Must Configure:
GPTBot - OpenAI’s training crawler
ChatGPT-User - ChatGPT’s browsing mode
ClaudeBot - Anthropic’s crawler
Google-Extended - Google Gemini training
PerplexityBot - Perplexity’s index

Tier 2 - Worth Considering:

anthropic-ai - Secondary Anthropic crawler
OAI-SearchBot - OpenAI’s search indexer
CCBot - Common Crawl (used by many AI companies)

What we do:
User-agent: GPTBot
Allow: /blog/
Allow: /resources/
Disallow: /pricing/
Disallow: /admin/
User-agent: PerplexityBot
Allow: /
Key insight: PerplexityBot is the one I always allow fully because it actually cites your pages with links. Blocking it is shooting yourself in the foot for zero benefit.
This is exactly the framework I needed. Quick question - does blocking GPTBot actually remove content from ChatGPT? Or is it already in their training data?
We blocked it 6 months ago but our brand still shows up in ChatGPT responses.
Great question. Blocking GPTBot only affects future training data collection. Content already in their training set (pre-2024 for GPT-4) will still be there.
What it DOES affect: new crawls for future training runs - that’s GPTBot’s job. Live browsing and search retrieval are handled by ChatGPT-User and OAI-SearchBot, which are separate user agents you’d have to block separately.
So if you blocked GPTBot 6 months ago, ChatGPT still “knows” what it learned before - you’ve just stopped anything new from your site flowing into future training.
This is why I tell clients: blocking now doesn’t undo the past, it just limits future visibility.
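If you want to sanity-check which OpenAI bots a given file actually affects, here’s a minimal sketch using Python’s urllib.robotparser with a made-up robots.txt and URL. Its matching is a simplification of what the real crawlers do, but it shows that each user agent is matched against its own group:

# Each bot is matched against its own robots.txt group, so disallowing
# GPTBot says nothing about ChatGPT-User or OAI-SearchBot.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/blog/some-post"  # hypothetical URL, purely for illustration

for bot in ("GPTBot", "ChatGPT-User", "OAI-SearchBot"):
    print(bot, "blocked" if not parser.can_fetch(bot, url) else "allowed")

# GPTBot comes back blocked; the other two fall through to the catch-all
# * group and come back allowed - block them explicitly if that isn't what you want.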
We made a huge mistake blocking all AI crawlers last year based on “content protection” advice.
We’ve now reversed course and allow all major AI crawlers. The “protection” argument made no sense once we realized the content is public anyway, blocking doesn’t remove anything that’s already in training data, and staying invisible just means AI answers cite someone else instead of us.
The only exception is truly proprietary content behind authentication - and those pages were already disallowed.
Different perspective from a heavily regulated industry (healthcare tech).
We have legitimate reasons to control AI access to certain content, so instead of a blanket allow or block we built a tiered system: each content type gets the level of crawler access that makes sense for it.
The key is being intentional. “Block everything” and “allow everything” are both lazy approaches. Map your content, understand what each type should do for you, then configure accordingly.
Pro tip that took me way too long to figure out:
Test your robots.txt with actual crawler user-agents.
I thought I had everything configured correctly until I checked our server logs and saw that some AI crawlers weren’t matching our rules because I had typos in the user-agent names.
“GPT-Bot” is not the same as “GPTBot” - guess which one I had wrong for 3 months?
Use Google’s robots.txt tester or command line tools to verify each rule actually matches what you expect.
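If you’d rather script the check, one option is Python’s urllib.robotparser - its matching is simplified compared to the real crawlers, but it’s enough to catch a misspelled user-agent token. The robots.txt and URL below are made up for illustration:

# A typo'd group ("GPT-Bot") never matches the real token ("GPTBot"),
# so the crawler you meant to block silently falls through to the * rules.
from urllib.robotparser import RobotFileParser

BROKEN_ROBOTS = """\
User-agent: GPT-Bot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(BROKEN_ROBOTS.splitlines())

# Prints True: the misspelled rule doesn't bind, so GPTBot is allowed everywhere.
print(parser.can_fetch("GPTBot", "https://example.com/pricing/"))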
Here’s my standard recommendation for most businesses:
Allow by default, restrict strategically.
The businesses that benefit from blocking are rare edge cases - paywalled or truly proprietary content, or regulated industries like the healthcare example above.
For everyone else, the calculus is simple: AI visibility is a growing traffic source. Perplexity alone drives 200M+ monthly queries. Being invisible there is a strategic disadvantage.
My standard config for clients:
# Allow all AI crawlers to public content
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
Allow: /
# Restrict sensitive areas
Disallow: /admin/
Disallow: /internal/
Disallow: /api/
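One note on that layout, in case it bites anyone: consecutive User-agent lines share a single group, so per RFC 9309 (and Google’s parser, at least) the Allow and the three Disallow rules apply to all four crawlers, and the comment line doesn’t break the group. Parsers that follow the most-specific-match rule will let Disallow: /admin/ override Allow: / for those paths, which is the intent here - but simpler parsers can resolve the overlap differently, so it’s worth running the file through a tester as described above.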
One thing nobody mentions: monitoring what actually happens after you configure.
I set up alerts for AI bot traffic in our analytics and started watching which bots hit the site, which sections they crawl, and how often.
This data helps me understand which AI platforms are actually indexing our content. Combined with tools that track AI citations, I can see the full chain: robots.txt allow > AI crawling > AI citations.
Without this monitoring, you’re just guessing about impact.
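For anyone who wants to pull this straight from server logs rather than analytics, here’s a rough sketch of what that check could look like. The log path, the assumption that the user agent is the last quoted field, and the bot list are illustrative - adjust for your stack:

import re
from collections import Counter

# Bots to look for in the user-agent string. Google-Extended is left out on
# purpose: it's a robots.txt token, not a crawler, so it never appears in logs.
AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
           "anthropic-ai", "PerplexityBot", "CCBot"]

counts = Counter()
ua_pattern = re.compile(r'"([^"]*)"\s*$')  # last quoted field = user agent

with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = ua_pattern.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for bot in AI_BOTS:
            if bot.lower() in user_agent:
                counts[bot] += 1
                break

for bot, hits in counts.most_common():
    print(f"{bot}: {hits}")

Comparing these counts against the rules you think you set is the fastest way to spot a group that isn’t actually matching.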
Publisher perspective here. We run a news/analysis site with 10k+ articles.
What we learned the hard way: blocking AI crawlers hurt us in ways we didn’t expect.
The “protection” argument assumes AI is stealing your content. In reality, AI is citing and driving traffic to content it can access. Blocking just means you’re not part of that conversation.
We now allow all AI crawlers and use Am I Cited to monitor how we’re being cited. Our AI referral traffic is up 340% since we made the switch.
This thread has been incredibly helpful. Summary of what I’m implementing based on everyone’s feedback:
Immediate changes:
1. Replace our 2019-era rules with explicit groups for GPTBot, ClaudeBot, PerplexityBot, and Google-Extended
2. Allow public content, keep /admin/ and other sensitive paths disallowed
3. Double-check the exact user-agent spellings against server logs

Monitoring setup:
4. Add server log tracking for AI bot traffic
5. Set up Am I Cited to track actual citations
6. Review in 30 days to see impact
The key insight for me was that blocking doesn’t protect content already in training data - it just limits future visibility. And since AI search is growing rapidly, visibility matters more than “protection.”
Thanks everyone for the real-world configurations and experiences.