
Which AI crawlers should I allow in robots.txt? GPTBot, PerplexityBot, etc.
Just audited a client’s site and discovered something interesting.
The discovery:
Their robots.txt has been blocking AI crawlers for 2+ years:
User-agent: *
Disallow: /private/
# This was added by security plugin in 2023
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
Impact:
Now I’m questioning:
Questions for the community:
Looking for practical configurations, not just theory.
This is more common than people realize. Let me break down the crawlers:
AI Crawler Types:
| Crawler | Company | Purpose | Recommendation |
|---|---|---|---|
| GPTBot | OpenAI | Model training | Your choice |
| ChatGPT-User | OpenAI | Real-time search | Allow |
| ClaudeBot | Anthropic | Real-time citations | Allow |
| Claude-Web | Anthropic | Web browsing | Allow |
| PerplexityBot | Perplexity | Search index | Allow |
| Perplexity-User | Perplexity | User requests | Allow |
| Google-Extended | Google | Gemini/AI features | Allow |
The key distinction: training crawlers vs. real-time search crawlers.
Most companies allow the search crawlers (you want citations) and make a business decision on the training crawlers.
Recommended robots.txt:
# Allow AI search crawlers
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
Allow: /
# Block training if desired (optional)
User-agent: GPTBot
Disallow: /
Sitemap: https://yoursite.com/sitemap.xml
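Once this is deployed, it's worth confirming the live file actually contains these rules. A quick check (assuming robots.txt is served at the standard path on your domain):
curl -s https://yoursite.com/robots.txt | grep -E "GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|Google-Extended"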
Important addition: verify whether the crawlers are actually being blocked, or simply not visiting.
How to check:
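A quick way to tell the difference is the status codes the crawlers receive. A sketch, assuming a standard combined-format access.log (adjust the path for your server):
# Status code distribution for AI crawler requests.
# Lots of 403/429/503 responses = blocked somewhere; no lines at all = not visiting.
grep -E "GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|Google-Extended" access.log | awk '{print $9}' | sort | uniq -c | sort -rn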
What we found at one client:
robots.txt allowed GPTBot, but Cloudflare’s security rules were blocking it as “suspicious bot.”
Firewall configuration for AI bots:
If using Cloudflare:
robots.txt is necessary but not sufficient.
Check all layers of your stack.
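One way to test the layers above robots.txt is to send a request that identifies itself as an AI crawler and see what comes back. A rough check only: the real user-agent strings are longer (check each vendor's docs), and a WAF that verifies source IPs may treat a genuine crawler differently.
# Fetch the homepage while identifying as GPTBot; a 403/503 here
# suggests a firewall or CDN rule is blocking on user agent.
curl -s -o /dev/null -w "%{http_code}\n" -A "GPTBot" https://yoursite.com/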
Let me explain llms.txt since you asked:
What is llms.txt:
A new standard (proposed 2024) that gives AI systems a structured overview of your site. Think of it as a table of contents specifically for language models.
Location: yoursite.com/llms.txt
Basic structure:
# Your Company Name
> Brief description of your company
## Core Pages
- [Home](https://yoursite.com/): Main entry point
- [Products](https://yoursite.com/products): Product catalog
- [Pricing](https://yoursite.com/pricing): Pricing information
## Resources
- [Blog](https://yoursite.com/blog): Industry insights
- [Documentation](https://yoursite.com/docs): Technical docs
- [FAQ](https://yoursite.com/faq): Common questions
## Support
- [Contact](https://yoursite.com/contact): Get in touch
Why it helps:
AI systems have limited context windows. They can’t crawl your entire site and understand it. llms.txt gives them a curated map.
Our results after implementation:
The training vs search distinction deserves more attention.
The philosophical question:
Do you want your content training AI models?
Arguments for allowing training:
Arguments against:
What publishers are doing:
| Publisher Type | Training | Search |
|---|---|---|
| News sites | Block | Allow |
| SaaS companies | Allow | Allow |
| E-commerce | Varies | Allow |
| Agencies | Allow | Allow |
My recommendation:
Most B2B companies should allow both. The citation benefit outweighs the training concern.
If you’re a content publisher with licensing value, consider blocking training while allowing search.
Let me share actual results from unblocking AI crawlers:
Client A (SaaS):
Before: GPTBot blocked, 0 AI citations
After: GPTBot + all crawlers allowed
| Metric | Before | 30 days | 90 days |
|---|---|---|---|
| AI citations | 0 | 12 | 47 |
| AI-referred traffic | 0 | 0.8% | 2.3% |
| Brand searches | baseline | +8% | +22% |
Client B (E-commerce):
Before: All AI blocked
After: Search crawlers allowed, training blocked
| Metric | Before | 30 days | 90 days |
|---|---|---|---|
| Product citations | 0 | 34 | 89 |
| AI-referred traffic | 0 | 1.2% | 3.1% |
| Product searches | baseline | +15% | +28% |
The timeline:
Key insight:
Unblocking doesn’t produce instant results; it takes 4-8 weeks to see meaningful impact.
Security perspective on AI crawlers:
Legitimate concerns:
How to mitigate:
Verify crawler identity:
Rate limiting (per crawler):
GPTBot: 100 requests/minute
ClaudeBot: 100 requests/minute
PerplexityBot: 100 requests/minute
Monitor for anomalies:
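For spotting spikes, a per-minute request count from the access log is usually enough. A sketch assuming a combined-format access.log, where the fourth field is the timestamp:
# Requests per minute from GPTBot, busiest minutes first.
grep "GPTBot" access.log | awk '{print substr($4, 2, 17)}' | sort | uniq -c | sort -rn | head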
Official IP ranges:
Each AI company publishes their crawler IPs:
Verify against these before whitelisting.
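To do that, first see which source IPs are actually claiming to be each crawler, then compare them against the ranges the vendor publishes. A sketch assuming a combined-format access.log:
# Source IPs of requests claiming to be GPTBot, most frequent first.
grep "GPTBot" access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head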
For WordPress users - common blockers I’ve seen:
Security plugins that block AI:
How to check:
WordPress robots.txt:
WordPress generates robots.txt dynamically. To customize:
Option 1: Use Yoast SEO → Tools → File editor
Option 2: Create a physical robots.txt in the site root (overrides the dynamic one)
Option 3: Use a plugin like “Robots.txt Editor”
Our standard WordPress configuration:
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
Good WordPress coverage. Adding: how to create llms.txt for WordPress.
Option 1: Static file
Create an llms.txt file and upload it to your web root (public_html/) so it’s served at yoursite.com/llms.txt.
Option 2: Plugin approach
Several plugins now support llms.txt generation:
Option 3: Code snippet
// In functions.php
add_action('init', function () {
    // Strip any query string, then serve /llms.txt as plain text.
    $path = strtok($_SERVER['REQUEST_URI'], '?');
    if ($path === '/llms.txt') {
        header('Content-Type: text/plain; charset=utf-8');
        echo "# Your Company Name\n";
        echo "> Brief description of your company\n";
        // ...output the rest of your llms.txt content here
        exit;
    }
});
Best practice:
Keep llms.txt updated when you:
Static file is simplest but requires manual updates.
After you unblock, here’s how to monitor AI crawler activity:
What to track:
| Metric | Where to Find | What It Tells You |
|---|---|---|
| Crawl frequency | Server logs | How often bots visit |
| Pages crawled | Server logs | What content they index |
| Crawl errors | Server logs | Blocking issues |
| AI citations | Am I Cited | Whether crawling converts to visibility |
Server log analysis:
Look for these user-agent patterns:
Simple grep command:
grep -E "GPTBot|ClaudeBot|PerplexityBot|Google-Extended" access.log
What healthy activity looks like:
Red flags:
This discussion gave me everything I needed. Here’s our implementation plan:
Updated robots.txt:
# Allow AI search crawlers (citations)
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
Allow: /
# Training crawler - allowing for now
User-agent: GPTBot
Allow: /
# Standard rules
User-agent: *
Disallow: /private/
Disallow: /admin/
Sitemap: https://clientsite.com/sitemap.xml
llms.txt implementation:
Created structured overview of client site with:
Firewall updates:
Monitoring setup:
Timeline expectations:
Success metrics:
Thanks everyone for the technical details and real-world configurations.