
Should I allow GPTBot and other AI crawlers? Just discovered my robots.txt has been blocking them

WebDev_Technical_Alex · Lead Developer at Marketing Agency · January 9, 2026
95 upvotes · 10 comments

Just audited a client’s site and discovered something interesting.

The discovery:

Their robots.txt has been blocking AI crawlers for 2+ years:

User-agent: *
Disallow: /private/

# This was added by security plugin in 2023
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

Impact:

  • Zero AI citations for the brand
  • Competitors appearing in AI answers
  • Client wondering why “AI SEO” wasn’t working

Now I’m questioning:

  1. Should we allow ALL AI crawlers?
  2. What’s the difference between training and search crawlers?
  3. Is there a recommended robots.txt configuration?
  4. What about this llms.txt thing I keep hearing about?

Questions for the community:

  1. What’s your robots.txt configuration for AI?
  2. Do you differentiate between crawler types?
  3. Have you implemented llms.txt?
  4. What results did you see after allowing AI crawlers?

Looking for practical configurations, not just theory.

10 Comments

TechnicalSEO_Expert_Sarah · Technical SEO Consultant · January 9, 2026

This is more common than people realize. Let me break down the crawlers:

AI Crawler Types:

| Crawler | Company | Purpose | Recommendation |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Model training | Your choice |
| ChatGPT-User | OpenAI | Real-time search | Allow |
| ClaudeBot | Anthropic | Real-time citations | Allow |
| Claude-Web | Anthropic | Web browsing | Allow |
| PerplexityBot | Perplexity | Search index | Allow |
| Perplexity-User | Perplexity | User requests | Allow |
| Google-Extended | Google | Gemini/AI features | Allow |

The key distinction:

  • Training crawlers (GPTBot): Your content trains AI models
  • Search crawlers (ChatGPT-User, PerplexityBot): Your content gets cited in responses

Most companies:

Allow search crawlers (you want citations) and make a business decision on training crawlers.

Recommended robots.txt:

# Allow AI search crawlers
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
Allow: /

# Block training if desired (optional)
User-agent: GPTBot
Disallow: /

Sitemap: https://yoursite.com/sitemap.xml
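
To sanity-check a configuration like this against your live site, you can ask Python's standard-library robot parser how each user-agent is treated. A minimal sketch (yoursite.com is a placeholder; swap in your domain and whichever crawler tokens you care about):

from urllib.robotparser import RobotFileParser

SITE = "https://yoursite.com"  # placeholder: replace with your domain
CRAWLERS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "Google-Extended"]

rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetch and parse the live robots.txt

for agent in CRAWLERS:
    verdict = "allowed" if rp.can_fetch(agent, f"{SITE}/") else "blocked"
    print(f"{agent}: {verdict} for {SITE}/")

Keep in mind this only reflects robots.txt; firewall and CDN rules can still block the same bots.
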
CrawlerMonitor_Mike · January 9, 2026
Replying to TechnicalSEO_Expert_Sarah

Important addition: verify the crawlers are actually being blocked vs just not visiting.

How to check:

  1. Server logs: Look for user-agent strings
  2. Firewall logs: Check if WAF is blocking
  3. CDN logs: Cloudflare/AWS may rate-limit

What we found at one client:

robots.txt allowed GPTBot, but Cloudflare’s security rules were blocking it as a “suspicious bot.”

Firewall configuration for AI bots:

If using Cloudflare:

  • Create firewall rule: Allow if User-Agent contains “GPTBot” OR “PerplexityBot” OR “ClaudeBot”
  • Whitelist official IP ranges (published by each company)

robots.txt is necessary but not sufficient.

Check all layers of your stack.
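
One quick way to probe the user-agent layer from your own machine is to request a page while sending a crawler-style User-Agent and compare status codes. A rough sketch in Python (the UA strings below are simplified probes, not the vendors' exact strings, and this won't reproduce IP-based or JS-challenge blocks):

import urllib.request
import urllib.error

URL = "https://yoursite.com/"  # placeholder: use a real page on your site

# Simplified probes -- NOT the vendors' exact user-agent strings, just the
# tokens that user-agent-based WAF rules typically match on.
TEST_AGENTS = {
    "baseline": "Mozilla/5.0 (compatible; ConfigCheck/1.0)",
    "GPTBot-like": "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "ClaudeBot-like": "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "PerplexityBot-like": "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
}

for label, ua in TEST_AGENTS.items():
    req = urllib.request.Request(URL, headers={"User-Agent": ua})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"{label}: HTTP {resp.status}")
    except urllib.error.HTTPError as err:
        # 403/429 here (while the baseline returns 200) usually points at a WAF rule
        print(f"{label}: HTTP {err.code}")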

LLMsExpert_Lisa · AI Integration Specialist · January 9, 2026

Let me explain llms.txt since you asked:

What is llms.txt:

A new standard (proposed 2024) that gives AI systems a structured overview of your site. Think of it as a table of contents specifically for language models.

Location: yoursite.com/llms.txt

Basic structure:

# Your Company Name

> Brief description of your company

## Core Pages

- [Home](https://yoursite.com/): Main entry point
- [Products](https://yoursite.com/products): Product catalog
- [Pricing](https://yoursite.com/pricing): Pricing information

## Resources

- [Blog](https://yoursite.com/blog): Industry insights
- [Documentation](https://yoursite.com/docs): Technical docs
- [FAQ](https://yoursite.com/faq): Common questions

## Support

- [Contact](https://yoursite.com/contact): Get in touch

Why it helps:

AI systems have limited context windows. They can’t crawl your entire site and understand it. llms.txt gives them a curated map.

Our results after implementation:

  • AI citations up 23% within 6 weeks
  • More accurate brand representation in AI answers
  • Faster indexing of new content by AI systems
ContentLicensing_Chris · January 8, 2026

The training vs search distinction deserves more attention.

The philosophical question:

Do you want your content training AI models?

Arguments for allowing training:

  • Better AI = better citations of your content
  • Industry thought leadership spreads through AI
  • Can’t opt out of past training anyway

Arguments against:

  • No compensation for content use
  • Competitors benefit from your content
  • Licensing concerns

What publishers are doing:

| Publisher Type | Training | Search |
| --- | --- | --- |
| News sites | Block | Allow |
| SaaS companies | Allow | Allow |
| E-commerce | Varies | Allow |
| Agencies | Allow | Allow |

My recommendation:

Most B2B companies should allow both. The citation benefit outweighs the training concern.

If you’re a content publisher with licensing value, consider blocking training while allowing search.

ResultsTracker_Tom · January 8, 2026

Let me share actual results from unblocking AI crawlers:

Client A (SaaS):

Before: GPTBot blocked, 0 AI citations
After: GPTBot + all crawlers allowed

| Metric | Before | 30 days | 90 days |
| --- | --- | --- | --- |
| AI citations | 0 | 12 | 47 |
| AI-referred traffic | 0 | 0.8% | 2.3% |
| Brand searches | baseline | +8% | +22% |

Client B (E-commerce):

Before: All AI crawlers blocked
After: Search crawlers allowed, training blocked

| Metric | Before | 30 days | 90 days |
| --- | --- | --- | --- |
| Product citations | 0 | 34 | 89 |
| AI-referred traffic | 0 | 1.2% | 3.1% |
| Product searches | baseline | +15% | +28% |

The timeline:

  • Week 1-2: Crawlers discover and index content
  • Week 3-4: Start appearing in AI answers
  • Month 2-3: Significant citation growth

Key insight:

Unblocking doesn’t produce instant results; it takes 4-8 weeks to see meaningful impact.

SecurityExpert_Rachel · DevSecOps Engineer · January 8, 2026

Security perspective on AI crawlers:

Legitimate concerns:

  1. Rate limiting - AI bots can be aggressive crawlers
  2. Content scraping - distinguishing AI bots from scrapers
  3. Attack surface - allowing more bots = more potential vectors

How to mitigate:

  1. Verify crawler identity:

    • Check user-agent string
    • Verify IP against published ranges
    • Use reverse DNS lookup
  2. Rate limiting (per crawler):

    GPTBot: 100 requests/minute
    ClaudeBot: 100 requests/minute
    PerplexityBot: 100 requests/minute
    
  3. Monitor for anomalies:

    • Sudden traffic spikes
    • Unusual crawl patterns
    • Requests to sensitive areas

Official IP ranges:

Each AI company publishes its crawler IP ranges.

Verify against these before whitelisting.
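
If you log the source IP with each hit, the check itself is a few lines. A minimal sketch with Python's ipaddress module -- the CIDRs below are RFC 5737 documentation ranges used purely as placeholders; substitute the ranges each vendor actually publishes:

import ipaddress

# Placeholder ranges only (RFC 5737 documentation blocks) -- replace these
# with the CIDR lists published by each AI company for its crawler.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}

def in_published_range(ip: str, crawler: str) -> bool:
    """True if the requesting IP falls inside the crawler's published ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES.get(crawler, []))

# Example: a request claiming to be GPTBot from 192.0.2.17
print(in_published_range("192.0.2.17", "GPTBot"))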

WordPressExpert_Jake · January 7, 2026

For WordPress users - common blockers I’ve seen:

Security plugins that block AI:

  • Wordfence (default settings may block)
  • Sucuri (bot blocking features)
  • All In One Security
  • iThemes Security

How to check:

  1. Wordfence: Firewall → Blocking → Advanced Blocking
  2. Sucuri: Firewall → Access Control → Bot List
  3. Check “blocked” logs for AI crawler user-agents

WordPress robots.txt:

WordPress generates robots.txt dynamically. To customize:

Option 1: Use Yoast SEO → Tools → File editor
Option 2: Create a physical robots.txt in the site root (overrides the dynamic one)
Option 3: Use a plugin like “Robots.txt Editor”

Our standard WordPress configuration:

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
TechnicalSEO_Expert_Sarah · January 7, 2026
Replying to WordPressExpert_Jake

Good WordPress coverage. Adding: how to create llms.txt for WordPress.

Option 1: Static file

Create an llms.txt file and upload it to your site root (public_html/).

Option 2: Plugin approach

Several plugins now support llms.txt generation:

  • AI Content Shield
  • RankMath (in recent versions)
  • Custom plugin using template

Option 3: Code snippet

// In functions.php
add_action('init', function () {
    // Compare only the path so query strings don't break the match
    $path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
    if ($path === '/llms.txt') {
        header('Content-Type: text/plain; charset=utf-8');
        echo "# Your Company Name\n\n";
        echo "> Brief description of your company\n";
        // ...echo the rest of your llms.txt sections here
        exit;
    }
});

Best practice:

Keep llms.txt updated when you:

  • Add major new content sections
  • Change site structure
  • Launch new products/services

Static file is simplest but requires manual updates.

MonitoringSetup_Maria · January 7, 2026

After you unblock, here’s how to monitor AI crawler activity:

What to track:

| Metric | Where to Find | What It Tells You |
| --- | --- | --- |
| Crawl frequency | Server logs | How often bots visit |
| Pages crawled | Server logs | What content they index |
| Crawl errors | Server logs | Blocking issues |
| AI citations | Am I Cited | Whether crawling converts to visibility |

Server log analysis:

Look for these user-agent patterns:

  • “GPTBot” - OpenAI
  • “ClaudeBot” - Anthropic
  • “PerplexityBot” - Perplexity
  • “Google-Extended” - Google AI

Simple grep command:

grep -E "GPTBot|ClaudeBot|PerplexityBot|Google-Extended" access.log
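
If you'd rather see per-day counts per bot than raw matching lines, a short script over the same log works. A minimal sketch assuming a common/combined log format with timestamps like [09/Jan/2026:10:15:32 +0000]; adjust the filename and date regex for your setup:

import re
from collections import Counter

BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended")
date_re = re.compile(r"\[(\d{2}/\w{3}/\d{4})")  # e.g. [09/Jan/2026
counts = Counter()

with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        for bot in BOTS:
            if bot in line:
                match = date_re.search(line)
                day = match.group(1) if match else "unknown"
                counts[(day, bot)] += 1
                break  # count each request once

for (day, bot), hits in sorted(counts.items()):
    print(f"{day}  {bot:16} {hits}")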

What healthy activity looks like:

  • Multiple AI bots crawling regularly
  • Coverage of important pages
  • No crawl errors on key content
  • Increasing citations over time

Red flags:

  • Zero AI crawler activity after unblocking
  • High error rates
  • Bots fetching only robots.txt (they can’t get past it)
WebDev_Technical_Alex (OP) · Lead Developer at Marketing Agency · January 6, 2026

This discussion gave me everything I needed. Here’s our implementation plan:

Updated robots.txt:

# Allow AI search crawlers (citations)
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
Allow: /

# Training crawler - allowing for now
User-agent: GPTBot
Allow: /

# Standard rules
User-agent: *
Disallow: /private/
Disallow: /admin/

Sitemap: https://clientsite.com/sitemap.xml

llms.txt implementation:

Created structured overview of client site with:

  • Core pages
  • Product/service categories
  • Resource sections
  • Contact information

Firewall updates:

  • Whitelisted official AI crawler IP ranges
  • Set appropriate rate limits
  • Added monitoring for crawler activity

Monitoring setup:

  • Server log parsing for AI crawler activity
  • Am I Cited for citation tracking
  • Weekly check on crawl patterns

Timeline expectations:

  • Week 1-2: Verify crawlers are accessing site
  • Week 3-4: Start seeing initial citations
  • Month 2-3: Full citation growth

Success metrics:

  • AI crawler visits (target: daily from each platform)
  • AI citations (target: 30+ in first 90 days)
  • AI-referred traffic (target: 2%+ of organic)

Thanks everyone for the technical details and real-world configurations.

Frequently Asked Questions

Are AI bots blocked by default?

No, AI bots are NOT blocked by default. They crawl your site unless explicitly disallowed in robots.txt. However, some older robots.txt files, security plugins, or firewalls may inadvertently block AI crawlers. Check your configuration to ensure GPTBot, ClaudeBot, PerplexityBot, and Google-Extended can access your content.

What's the difference between training crawlers and search crawlers?

Training crawlers (like GPTBot) collect data for AI model training, meaning your content may train future AI versions. Search crawlers (like PerplexityBot, ChatGPT-User) fetch content for real-time AI responses, meaning your content gets cited in answers. Many companies block training crawlers while allowing search crawlers.

What is llms.txt and should I implement it?

llms.txt is a new standard that provides AI systems with a structured overview of your site. It acts as a table of contents specifically for language models, helping them understand your site structure and find important content. It’s recommended for AI visibility, but unlike robots.txt it is optional.
