AI bots are hitting our site but we're not getting cited. How do I debug crawling issues?

WebDev_Marcus · Senior Web Developer · January 2, 2026

Confusing situation:

Our server logs show regular hits from GPTBot, PerplexityBot, and ClaudeBot. They’re getting 200 responses. So they’re definitely crawling our content.

But when I ask ChatGPT, Perplexity, or Claude questions that our content covers perfectly, we never get cited. Competitors with objectively worse content get cited instead.

What I’ve verified:

  • robots.txt allows all AI crawlers
  • Pages return 200 status
  • Content is server-rendered (no client-only JS)
  • Pages are fast (<2s load time)

What I’m trying to figure out:

  • How do I see what the crawlers actually see?
  • What could cause crawling success but citation failure?
  • Are there hidden technical issues I’m missing?

This is driving me crazy. The crawlers visit, but we’re invisible to AI responses.

9 Comments

CrawlerDebug_Expert Expert Technical SEO Consultant · January 2, 2026

Let me help debug this. Crawling ≠ citing. Here’s the diagnostic framework:

Step 1: Verify what crawlers actually see

Use curl with the AI user-agent:

curl -A "GPTBot" -si https://yoursite.com/page | head -100

Check:

  • Does the full content appear?
  • Are there any meta robots or X-Robots-Tag headers?
  • Is the content in the HTML, not requiring JS execution?
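
If you want to run this check across many pages, here is a rough Python sketch of the same idea. It assumes the bare "GPTBot" token as the user agent, like the curl command above. The helper names (fetch_as_bot, visible_text, missing_phrases) are made up for illustration. It only reads the raw HTML, so anything that needs JavaScript to appear will be reported as missing.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collects text present in the raw HTML, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a non-JS crawler could read from the markup."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def missing_phrases(html, phrases):
    """Return the phrases that do NOT appear in the raw HTML text."""
    text = visible_text(html).lower()
    return [p for p in phrases if p.lower() not in text]


def fetch_as_bot(url, user_agent="GPTBot"):
    """Fetch a page with a crawler-style User-Agent (assumed token, not the full UA string)."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Usage would look like missing_phrases(fetch_as_bot("https://yoursite.com/page"), ["your key sentence"]). An empty list means every phrase is present in the raw HTML.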

Step 2: Check for hidden blockers

Common issues:

  • noindex meta tag (blocks indexing)
  • X-Robots-Tag: noindex header
  • Canonical pointing elsewhere
  • Content loaded via JavaScript after page load
  • Login/paywall detection that serves different content to bots
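
A minimal sketch for scanning the first three blockers automatically, assuming you already have the response headers and the HTML in hand. find_blockers is a hypothetical helper, not part of any library, and it does not detect JavaScript-loaded content or bot-specific paywalls.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _HeadScanner(HTMLParser):
    """Records meta robots directives and the first canonical link."""

    def __init__(self):
        super().__init__()
        self.robots_meta = []
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.robots_meta.append(a.get("content", "").lower())
        elif tag == "link" and "canonical" in a.get("rel", "").lower().split():
            if self.canonical is None:
                self.canonical = a.get("href", "")


def find_blockers(page_url, headers, html):
    """Return a list of human-readable issues; an empty list means none found."""
    issues = []
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            issues.append(f"X-Robots-Tag header: {value}")

    scanner = _HeadScanner()
    scanner.feed(html)
    scanner.close()

    for content in scanner.robots_meta:
        directives = {d.strip() for d in content.split(",")}
        # "none" is shorthand for "noindex, nofollow"
        if directives & {"noindex", "none"}:
            issues.append(f"meta robots: {content}")

    if scanner.canonical:
        target = urljoin(page_url, scanner.canonical)
        # Ignore a trailing-slash difference; anything else counts as "elsewhere"
        if target.rstrip("/") != page_url.rstrip("/"):
            issues.append(f"canonical points elsewhere: {target}")
    return issues
```

Here headers can be any dict-like mapping of response headers, for example the headers object returned by urllib.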

Step 3: Content quality check

If crawling is fine, the issue is content:

  • Is it truly unique, or a variation of common content?
  • Is it structured for AI extraction?
  • Does it have authority signals (author, citations)?
  • Is it comprehensive enough to be THE source?

Most common issue I see:

Technical crawling is fine. Content just isn’t citation-worthy. Crawlers visit, but AI systems choose better sources.

The gap between “accessible” and “citable” is about content quality and structure, not just technical access.

WebDev_Marcus OP · January 2, 2026
Replying to CrawlerDebug_Expert
The curl test is helpful. I ran it and the content appears. No noindex tags. But you’re right - maybe the issue isn’t technical at all. How do I evaluate if content is “citation-worthy”?
CrawlerDebug_Expert Expert · January 2, 2026
Replying to WebDev_Marcus

Citation-worthiness checklist:

1. Uniqueness

  • Does your content say something competitors don’t?
  • Original data, research, or insights?
  • Or just repackaging common information?

2. Structure

  • Can AI extract a clean answer from your content?
  • Is there a TL;DR or direct answer?
  • Are sections clearly delineated?

3. Authority

  • Author with credentials?
  • Citations to sources?
  • Fresh/updated content?

4. Comprehensiveness

  • Does this fully answer the question?
  • Or does AI need to combine with other sources?

The hard truth:

Most content online is mediocre. AI has millions of options to cite. It picks the best ones.

If your content is:

  • Similar to 100 other sites
  • Structured like a narrative, not an answer
  • No clear authority signals
  • Not the most comprehensive source

…then it won’t get cited, regardless of technical access.

Compare your content to what IS getting cited. What do they have that you don’t?

LogAnalysis_Pro DevOps Engineer · January 1, 2026

Here’s how I analyze AI crawler behavior in logs:

Log analysis for AI crawlers:

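# Find all AI crawler hits (Google-Extended is a robots.txt token, not a user agent, so it never appears in logs)
grep -E "(GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot)" access.log

# Check status codes ($9 is the status field in the combined log format)
grep "GPTBot" access.log | awk '{print $9}' | sort | uniq -c

# See which pages they hit most ($7 is the request path)
grep "GPTBot" access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20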
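
If you would rather get one summary for all bots at once, here is a rough Python equivalent. It assumes the standard combined log format. The regex, the BOTS tuple, and the summarize name are my own illustration, so adjust them if your log format differs. Google-Extended is left out because it is a robots.txt control token rather than a user agent that shows up in logs.

```python
import re
from collections import Counter, defaultdict

# Substrings to look for in the User-Agent field (assumed list, extend as needed)
BOTS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")

# Combined log format:
# ip ident user [time] "METHOD path proto" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def summarize(lines):
    """Tally status codes and requested paths per AI crawler."""
    statuses = defaultdict(Counter)
    paths = defaultdict(Counter)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that are not in combined log format
        bot = next((b for b in BOTS if b in m["agent"]), None)
        if bot is None:
            continue  # not one of the crawlers we track
        statuses[bot][m["status"]] += 1
        paths[bot][m["path"]] += 1
    return statuses, paths
```

To run it on a real file, pass the open file object: with open("access.log") as f: statuses, paths = summarize(f). Then paths["GPTBot"].most_common(20) gives the most-crawled pages.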

What to look for:

  1. Status codes

    • 200: Good, they got the content
    • 301/302: Redirects - check they land somewhere useful
    • 403/404: Problems - fix immediately
    • 500: Server errors - investigate
  2. Crawl patterns

    • Which pages get crawled most?
    • Are your best pages being visited?
    • Any pages never crawled?
  3. Crawl frequency

    • GPTBot: Usually multiple times daily
    • PerplexityBot: Very frequent (real-time search)
    • If no hits in weeks, check robots.txt
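
For the robots.txt check, Python's standard urllib.robotparser can tell you which crawlers a given robots.txt would block for a URL. A minimal sketch: blocked_agents and the agent list are illustrative assumptions. As far as I know, robotparser does plain prefix matching and does not handle wildcards in paths, so treat the result as a first pass rather than what each crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Assumed list of crawler tokens to test; extend as needed
AI_AGENTS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")


def blocked_agents(robots_txt, url, agents=AI_AGENTS):
    """Return the agents that this robots.txt text disallows for the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]
```

Paste in the contents of your live robots.txt and a URL you expect to be crawled. An empty list means none of the listed crawlers are blocked for that URL.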

Common log issues:

  • CDN hiding real user agents
  • Load balancer stripping headers
  • Log rotation missing crawler hits

Make sure you’re seeing raw, unfiltered logs.

ContentQuality_Sarah · January 1, 2026

Since you’ve verified technical access, let me address the content side:

Why AI might crawl but not cite:

  1. Content is generic: “5 tips for better email marketing” - there are 10,000 of these. AI cites the best one, not all of them.

  2. No extractable answer: Narrative content without clear takeaways is hard for AI to quote.

  3. Outdated information: If your content says “2023 trends,” AI may prefer current sources.

  4. Weak authority signals: No author, no sources cited, no credentials displayed.

  5. Poor structure: AI needs clear sections it can parse. Flowing text is harder to extract.

Diagnostic test:

Ask yourself: If I were AI and had to cite ONE source for this topic, would I pick my content or a competitor’s?

Be honest. What does the competitor have that you don’t?

Usually it’s:

  • More comprehensive coverage
  • Better structure for extraction
  • Stronger authority signals
  • More current information

Improve those, and citations follow.

JSRendering_Dev · January 1, 2026

Technical deep-dive on JavaScript rendering:

Even if your main content is server-rendered, check for:

1. Lazy-loaded content sections: Important content below the fold might load after initial render.

<!-- This content might not appear to crawlers -->
<div data-lazy="true">Important content here</div>

2. Interactive elements that hide content: Tabs, accordions, and expandable sections might have content AI can’t access.

3. JavaScript-generated structured data: If your schema is injected via JS, crawlers might not see it.

Testing tool:

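Google retired its Mobile-Friendly Test in December 2023. The Rich Results Test (https://search.google.com/test/rich-results) and the URL Inspection tool in Search Console both show the rendered HTML.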

Compare what you see there vs. your actual page. Any differences might explain visibility issues.

Quick check:

View your page with JavaScript disabled. Whatever’s visible there is what crawlers definitely see. If key content is missing, that’s your problem.

SchemaDebug_Tom · December 31, 2025

Schema issues that prevent citations:

Even if content is visible, bad schema can hurt you:

  1. Invalid schema markup: Use Google’s Rich Results Test to validate. Invalid schema might be ignored entirely.

  2. Missing schema: No Organization, Article, or FAQ schema means AI has to guess about your content type.

  3. Conflicting schema: Multiple Organization schemas with different info. AI doesn’t know which to trust.

How to test:

# Fetch and check for schema
curl -s https://yoursite.com | grep -o 'application/ld+json' | wc -l

Then validate each schema block at: https://validator.schema.org/

Common schema errors:

  • Missing @context
  • Wrong @type
  • Invalid date formats
  • URL fields without http/https
  • Missing required properties
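
Before pasting blocks into the validator one by one, a quick local pass can catch the obvious problems. This is only a sketch under my own assumptions: check_jsonld is a hypothetical helper that flags unparseable JSON, a missing @context or @type, and Organization blocks with different names. It does not check required properties or date formats, so the validator is still the final word.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def check_jsonld(html):
    """Return (number of JSON-LD blocks, list of problems found)."""
    collector = _JsonLdCollector()
    collector.feed(html)
    collector.close()

    problems = []
    org_names = set()
    for i, raw in enumerate(collector.blocks, start=1):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"block {i}: invalid JSON ({exc.msg})")
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            if "@context" not in item:
                problems.append(f"block {i}: missing @context")
            # A top-level @graph container legitimately has no @type of its own
            if "@type" not in item and "@graph" not in item:
                problems.append(f"block {i}: missing @type")
            if item.get("@type") == "Organization" and "name" in item:
                org_names.add(item["name"])
    if len(org_names) > 1:
        problems.append(f"conflicting Organization names: {sorted(org_names)}")
    return len(collector.blocks), problems
```

Feed it the raw HTML from curl. If the block count is zero but you know the page has schema, the schema is probably being injected by JavaScript.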

Fix schema errors. AI systems parse schema to understand content. Invalid schema = unclear content.

WebDev_Marcus OP Senior Web Developer · December 30, 2025

This thread helped me realize: our issue isn’t technical.

What I tested:

  • curl with AI user-agents: content renders correctly
  • No noindex tags anywhere
  • Schema validates correctly
  • JavaScript doesn’t hide key content
  • Logs show regular crawler visits with 200s

What I found comparing to competitors who get cited:

Their content has:

  • Direct answer in first paragraph (ours buries the answer)
  • FAQ sections with schema (we have neither)
  • Author bios with credentials (ours are generic)
  • Comparison tables (we use narrative paragraphs)
  • Updated dates (ours haven’t been touched in 18 months)

My action plan:

  1. Stop debugging technical issues (they’re not the problem)
  2. Focus on content quality and structure
  3. Add FAQ sections with schema
  4. Restructure for direct answers
  5. Add author credentials
  6. Update stale content
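
For item 3, this is roughly how I plan to generate the FAQ markup. It is a sketch, not a finished implementation: faq_jsonld is my own helper name and the question and answer strings are placeholders. The FAQPage / Question / acceptedAnswer / Answer structure follows schema.org.

```python
import json


def faq_jsonld(pairs):
    """Build a FAQPage JSON-LD script tag from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    # Escape "<" so an answer containing "</script>" cannot close the tag early
    body = json.dumps(data, indent=2, ensure_ascii=False).replace("<", "\\u003c")
    return f'<script type="application/ld+json">\n{body}\n</script>'
```

The returned string goes into the server-rendered HTML, so crawlers that don't run JavaScript still see it. The visible FAQ text on the page should match the markup.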

Key insight:

Crawling working + not getting cited = content quality/structure problem, not technical problem.

I was debugging the wrong layer. Thanks everyone!

Frequently Asked Questions

How do I know if AI crawlers are accessing my site?
Check server logs for AI crawler user agents: GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot. (Google-Extended is a robots.txt control token only and does not appear in logs.) Look for 200 status codes confirming successful access. Most AI crawlers visit frequently; if you don’t see them, check that your robots.txt isn’t blocking them.
Why might AI crawlers access my content but not cite it?
Common reasons: content is too thin or generic to be citation-worthy, content structure makes extraction difficult, content lacks authority signals, content is outdated, or better sources exist on the topic. Crawling is just access - citation requires content that AI deems valuable enough to reference.
How do I test what AI crawlers actually see on my pages?
Use curl with AI user-agent headers to fetch your pages. Check if JavaScript-rendered content appears. View page source vs rendered page to see what crawlers get. Test that key content isn’t in lazy-loaded sections or behind JavaScript that crawlers can’t execute.
