Which AI crawlers should I allow in robots.txt? GPTBot, PerplexityBot, etc.
Just audited a client’s site and discovered something interesting.
The discovery:
Their robots.txt has been blocking AI crawlers for 2+ years:
User-agent: *
Disallow: /private/
# This was added by security plugin in 2023
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
Impact: two-plus years of zero AI visibility, no citations, no AI-referred traffic, while the block sat unnoticed.
Now I'm questioning whether the blanket block ever made sense.
Questions for the community: which AI crawlers should we allow, which (if any) are worth blocking, and does llms.txt matter?
Looking for practical configurations, not just theory.
This is more common than people realize. Let me break down the crawlers:
AI Crawler Types:
| Crawler | Company | Purpose | Recommendation |
|---|---|---|---|
| GPTBot | OpenAI | Model training | Your choice |
| ChatGPT-User | OpenAI | Real-time search | Allow |
| ClaudeBot | Anthropic | Real-time citations | Allow |
| Claude-Web | Anthropic | Web browsing | Allow |
| PerplexityBot | Perplexity | Search index | Allow |
| Perplexity-User | Perplexity | User requests | Allow |
| Google-Extended | Google | Gemini/AI features | Allow |
The key distinction: training crawlers ingest your content to train future models, while search crawlers fetch it to answer live user queries and cite the source.
Most companies allow search crawlers (you want citations) and make a business decision on training crawlers.
Recommended robots.txt:
# Allow AI search crawlers
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
Allow: /
# Block training if desired (optional)
User-agent: GPTBot
Disallow: /
Sitemap: https://yoursite.com/sitemap.xml
Important addition: figure out whether the crawlers are actually being blocked or just not visiting yet.
How to check: grep your server logs for the crawler user agents, and test the response each user agent gets directly (sketch at the end of this reply).
What we found at one client:
robots.txt allowed GPTBot, but Cloudflare’s security rules were blocking it as “suspicious bot.”
Firewall configuration for AI bots:
If using Cloudflare: review your firewall/WAF event logs for the AI crawler user agents, and add rules that allow verified AI bots (or exempt them from bot-fighting features) so security settings don't override robots.txt.
robots.txt is necessary but not sufficient.
Check all layers of your stack.
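To tell "blocked" apart from "not visiting", here's a minimal sketch that tests for user-agent-based firewall blocks (yoursite.com is a placeholder and the UA strings are simplified; this only catches UA-based rules, since IP-based rules require log review):
# Compare the HTTP status a browser UA gets vs. UAs claiming to be AI crawlers.
# 200 for the browser but 403 for a bot UA points to a firewall/WAF rule,
# not robots.txt (robots.txt never returns errors; it only asks politely).
for ua in "Mozilla/5.0" "GPTBot/1.0" "ClaudeBot/1.0" "PerplexityBot/1.0"; do
  code=$(curl -s -o /dev/null -w "%{http_code}" -A "$ua" "https://yoursite.com/")
  echo "$ua -> HTTP $code"
done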
Let me explain llms.txt since you asked:
What is llms.txt:
A new standard (proposed 2024) that gives AI systems a structured overview of your site. Think of it as a table of contents specifically for language models.
Location: yoursite.com/llms.txt
Basic structure:
# Your Company Name
> Brief description of your company
## Core Pages
- [Home](https://yoursite.com/): Main entry point
- [Products](https://yoursite.com/products): Product catalog
- [Pricing](https://yoursite.com/pricing): Pricing information
## Resources
- [Blog](https://yoursite.com/blog): Industry insights
- [Documentation](https://yoursite.com/docs): Technical docs
- [FAQ](https://yoursite.com/faq): Common questions
## Support
- [Contact](https://yoursite.com/contact): Get in touch
Why it helps:
AI systems have limited context windows. They can’t crawl your entire site and understand it. llms.txt gives them a curated map.
The training vs search distinction deserves more attention.
The philosophical question:
Do you want your content training AI models?
Arguments for allowing training: models trained on your content are more likely to know your brand, products, and point of view when answering general questions.
Arguments against: training offers no attribution or referral traffic, and if your content has licensing value, you're giving it away.
What publishers are doing:
| Publisher Type | Training | Search |
|---|---|---|
| News sites | Block | Allow |
| SaaS companies | Allow | Allow |
| E-commerce | Varies | Allow |
| Agencies | Allow | Allow |
My recommendation:
Most B2B companies should allow both. The citation benefit outweighs the training concern.
If you’re a content publisher with licensing value, consider blocking training while allowing search.
Let me share actual results from unblocking AI crawlers:
Client A (SaaS):
Before: GPTBot blocked, 0 AI citations
After: GPTBot + all crawlers allowed
| Metric | Before | 30 days | 90 days |
|---|---|---|---|
| AI citations | 0 | 12 | 47 |
| AI-referred traffic | 0 | 0.8% | 2.3% |
| Brand searches | baseline | +8% | +22% |
Client B (E-commerce):
Before: all AI crawlers blocked
After: search crawlers allowed, training blocked
| Metric | Before | 30 days | 90 days |
|---|---|---|---|
| Product citations | 0 | 34 | 89 |
| AI-referred traffic | 0 | 1.2% | 3.1% |
| Product searches | baseline | +15% | +28% |
The timeline: in both cases the numbers built over the 30- and 90-day windows rather than appearing immediately.
Key insight: unblocking doesn't deliver instant results. It takes 4-8 weeks to see meaningful impact.
Security perspective on AI crawlers:
Legitimate concerns: server load from aggressive crawling, and fake bots spoofing AI user agents to scrape content.
How to mitigate:
Verify crawler identity: confirm that requests claiming to be GPTBot, ClaudeBot, etc. actually originate from the vendors' published IP ranges (sketch below).
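A minimal sketch of that check, assuming you've saved the vendor's published CIDR ranges (one per line) to a hypothetical gptbot-ranges.txt and installed the grepcidr utility:
# Pull unique IPs claiming to be GPTBot from the access log, then flag any
# that fall outside the published ranges.
grep "GPTBot" access.log | awk '{print $1}' | sort -u | while read -r ip; do
  if echo "$ip" | grepcidr -f gptbot-ranges.txt > /dev/null; then
    echo "$ip  OK (inside published ranges)"
  else
    echo "$ip  SUSPECT (outside published ranges)"
  fi
done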
Rate limiting (per crawler):
GPTBot: 100 requests/minute
ClaudeBot: 100 requests/minute
PerplexityBot: 100 requests/minute
Monitor for anomalies: sudden spikes in crawl volume, requests to disallowed paths, or "AI crawler" user agents arriving from unexpected IPs.
Official IP ranges:
Each AI company publishes its crawler IP ranges in its bot documentation.
Verify against these before whitelisting.
For WordPress users, here are common blockers I've seen:
Security plugins that block AI: many security/firewall plugins ship bot-blocking rules that catch AI crawlers by default.
How to check: review your security plugin's firewall and bot-blocking settings, then confirm the crawler user agents actually get 200 responses.
WordPress robots.txt:
WordPress generates robots.txt dynamically. To customize:
Option 1: Use Yoast SEO → Tools → File editor
Option 2: Create a physical robots.txt in the site root (overrides the virtual one)
Option 3: Use a plugin like "Robots.txt Editor"
Our standard WordPress configuration:
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
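Whichever option you choose, confirm what WordPress actually serves, since a stale physical file or a plugin can override your edits (yoursite.com as a placeholder):
# Fetch the live robots.txt and show the rule following each AI crawler match.
curl -s https://yoursite.com/robots.txt | grep -A1 -E "GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|Google-Extended"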
Good WordPress coverage. I'll add one thing: how to create llms.txt for WordPress.
Option 1: Static file
Create an llms.txt file and upload it to your site root (public_html/) so it's served at yoursite.com/llms.txt.
Option 2: Plugin approach
Several SEO plugins now support llms.txt generation; check your plugin's settings before adding custom code.
Option 3: Code snippet
// In functions.php: serve llms.txt dynamically
add_action('init', function () {
    // Compare only the path so query strings don't break the match
    $path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
    if ($path === '/llms.txt') {
        header('Content-Type: text/plain; charset=utf-8');
        echo "# Your Company Name\n";
        echo "> Brief description of your company\n";
        // ...echo the rest of your llms.txt content here
        exit;
    }
});
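Whichever option you pick, verify the file is reachable and served as plain text (yoursite.com as a placeholder):
# Check the Content-Type header, then preview the first lines of the file.
curl -sI https://yoursite.com/llms.txt | grep -i "content-type"
curl -s https://yoursite.com/llms.txt | head -n 5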
Best practice:
Keep llms.txt updated whenever you add or remove key pages, launch products, or restructure the site.
Static file is simplest but requires manual updates.
After you unblock, here’s how to monitor AI crawler activity:
What to track:
| Metric | Where to Find | What It Tells You |
|---|---|---|
| Crawl frequency | Server logs | How often bots visit |
| Pages crawled | Server logs | What content they index |
| Crawl errors | Server logs | Blocking issues |
| AI citations | Am I Cited | Whether crawling converts to visibility |
Server log analysis:
Look for these user-agent patterns: GPTBot, ClaudeBot, PerplexityBot, and Google-Extended.
Simple grep command:
grep -E "GPTBot|ClaudeBot|PerplexityBot|Google-Extended" access.log
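To turn that into a trend you can watch week over week, a small extension of the same idea (assuming the standard Apache/nginx combined log format, where field 4 holds the timestamp):
# Daily request counts per AI crawler.
for bot in GPTBot ClaudeBot PerplexityBot Google-Extended; do
  echo "== $bot =="
  grep "$bot" access.log | awk '{print substr($4, 2, 11)}' | sort | uniq -c
done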
What healthy activity looks like: regular visits (every few days at minimum), 200 responses, and coverage of your key pages rather than just the homepage.
Red flags: zero visits weeks after unblocking, or lots of 403/429 responses, which mean a firewall or rate limit is still in the way (quick status-code check below).
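For that status-code check, a one-liner sketch (again assuming combined log format, where field 9 is the HTTP status):
# Distribution of status codes returned to AI crawler requests.
grep -E "GPTBot|ClaudeBot|PerplexityBot|Google-Extended" access.log | awk '{print $9}' | sort | uniq -c | sort -rn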
This discussion gave me everything I needed. Here’s our implementation plan:
Updated robots.txt:
# Allow AI search crawlers (citations)
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
Allow: /
# Training crawler - allowing for now
User-agent: GPTBot
Allow: /
# Standard rules
User-agent: *
Disallow: /private/
Disallow: /admin/
Sitemap: https://clientsite.com/sitemap.xml
llms.txt implementation:
Created a structured overview of the client site: core pages, key resources, and support links.
Firewall updates:
Allowlisted the AI crawler user agents in Cloudflare so security rules don't override robots.txt.
Monitoring setup:
Weekly server-log checks for crawler activity, plus citation tracking.
Timeline expectations:
4-8 weeks before meaningful impact, per the case studies above.
Success metrics:
AI citations, AI-referred traffic, and branded search volume.
Thanks everyone for the technical details and real-world configurations.