Noindex, Robots.txt, Removal and Bot Blocking: What’s Happening These Days


Search engine de-indexing used to be straightforward. Add a noindex tag or block something in robots.txt, and search engines would comply within a few weeks. That was before AI crawlers showed up and changed the rules.

Today, managing what gets indexed involves not just Google and Bing, but dozens of AI bots training language models, scraping content for chatbot answers, and ignoring the polite requests that worked for decades. The tools haven’t changed much, but how they’re used—and what they can actually prevent—has shifted significantly.

Let’s move.

What De-Indexing Actually Means

De-indexing removes a page from a search engine’s index—the database of pages that can appear in search results. If a page isn’t indexed, it won’t rank or show up when someone searches.

This is different from deleting a page or putting it behind a login. A page can exist publicly on the web and still be de-indexed if search engines are instructed not to include it.

De-indexing typically happens in three ways:

  • A site owner adds a noindex directive
  • A page becomes unreachable or returns an error code
  • A search engine removes it due to policy violations or legal requests

Only the first method gives you direct control. Understanding when to use noindex versus other blocking methods matters because picking the wrong tool can delay removal or leave content partially visible.

Setting Up Proper Indexing Before You Block Anything

Before worrying about what to block, make sure the pages you want indexed aren’t accidentally blocked. This sounds obvious, but it’s one of the most common mistakes in SEO cleanup work.

Important directories must not be disallowed in robots.txt. If your site’s main content, product pages, or blog posts are blocked from crawling, Google can’t index them properly—even if you want them to rank.

Here’s a basic robots.txt file for a WordPress site that allows indexing of important content:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Allow: /wp-content/uploads/

Sitemap: https://theseospot.com/blog/sitemap_index.xml

This blocks backend directories and plugins from being crawled while allowing posts, pages, and media files. The sitemap line helps search engines find your important content faster.

Common directories that should NOT be blocked if you want them to rank:

  • /blog/
  • /products/
  • /services/
  • /category/
  • Any content meant for public discovery

Getting the basics right—allowing crawling of indexable content—makes everything else in this guide actually work.

Why Robots.txt Doesn’t Do What Most People Think

Robots.txt is one of the most misunderstood tools in SEO. People assume it controls what gets indexed. It doesn’t. It controls what gets crawled.

This screenshot from Google Search Console shows 96 pages that are indexed in Google despite being blocked by robots.txt. This isn’t a bug. This is how the system works.

Another site has over 502,000 pages blocked by robots.txt. Some of these are still appearing in Google’s index with partial information because Google found links to them from other sites, even though it couldn’t crawl the content.

In my view, and in how the web actually behaves these days, robots.txt is advisory, not a command. It’s often described as a “gentleman’s agreement” because complying with it is voluntary. Google and Bing usually respect it, but they’re not required to.

More importantly, if a URL is blocked by robots.txt but Google discovers it through external links, Google may still index that URL—they just won’t have information about what’s on the page. The result? A search listing that shows the URL with no description, which looks unprofessional and still exposes the page exists.

Robots.txt is useful for:

  • Managing crawl budget on large sites
  • Blocking low-value sections like admin panels or search result pages
  • Preventing duplicate content from being crawled
  • Controlling access to staging or development environments

Robots.txt is the wrong choice for:

  • Removing pages from search results
  • Hiding content you don’t want discovered
  • Urgent reputation cleanup

If the goal is removal from search results, noindex is the better option.

Noindex Is Stronger But Not Absolute

A noindex directive is an instruction in a page’s HTML that tells search engines not to index that page. Unlike robots.txt, which only controls crawling, noindex directly tells search engines “don’t include this in your index.”

There are two main ways to implement noindex:

Meta tag in HTML:

<meta name="robots" content="noindex, follow">

HTTP header (works for PDFs, images, non-HTML files):

X-Robots-Tag: noindex
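On Apache, for example, the header can be attached to every PDF with a few lines of configuration (assumes mod_headers is enabled):

```apache
# Send a noindex header for every PDF (requires mod_headers)
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>
```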

The “follow” part means search engines can still follow links on the page to discover other content. If you want to block link following too, use:

<meta name="robots" content="noindex, nofollow">

Noindex is a directive, not a request. When Google or Bing crawls a page with a noindex tag, they treat it as binding and remove the page from their index. This makes it far more reliable than robots.txt for de-indexing.

But there’s a catch: Search engines need to crawl the page to see the noindex directive. If you’ve blocked the page with robots.txt, they can’t crawl it to discover the noindex tag. This is why you should never combine robots.txt blocking with noindex—it creates a conflict that prevents removal.
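You can sanity-check this interaction offline with Python’s standard library. The sketch below is illustrative (the function name and sample inputs are mine, not a real tool): it models the fact that a robots.txt block hides the noindex tag from any compliant crawler.

```python
import re
from urllib.robotparser import RobotFileParser

def noindex_is_visible(robots_txt: str, path: str, page_html: str,
                       user_agent: str = "Googlebot") -> bool:
    """True only if the crawler may fetch the page AND the page carries a
    robots meta tag containing noindex that the crawler can actually read."""
    parser = RobotFileParser()
    parser.modified()  # mark robots.txt as "read" before parsing inline text
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, path):
        # Blocked by robots.txt: the crawler never sees the noindex tag.
        return False
    return bool(re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]*noindex', page_html, re.I))

# A robots.txt Disallow hides the noindex tag from compliant crawlers:
robots = "User-agent: *\nDisallow: /private/"
page = '<meta name="robots" content="noindex, follow">'
print(noindex_is_visible(robots, "/private/page.html", page))  # False
print(noindex_is_visible(robots, "/public/page.html", page))   # True
```

The first call returns False even though the page has a noindex tag—exactly the conflict described above.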

Selective Bot Blocking: The Coupon Sites Strategy

Not all noindex tags are created equal. You can target specific search engines while allowing others. This is done with targeted meta tags:

<meta name="googlebot" content="noindex">
<meta name="bingbot" content="index, follow">

This tells Google not to index the page while allowing Bing to index it normally.

This traffic graph shows what happened to major coupon and voucher code sites in the UK during Google’s June 2024 algorithm update. Sites like Telegraph vouchers, Independent discount codes, and Express voucher codes saw 70-90% traffic drops virtually overnight.

What many people don’t remember is that before this crackdown, some major publishers were running selective noindex strategies. Reuters, LA Times, and other news organizations operated coupon sections that used <meta name="googlebot" content="noindex"> to hide from Google while ranking freely in Bing.

The logic was simple: Google was penalizing affiliate coupon content, but Bing wasn’t. So these sites blocked Google specifically while capturing Bing traffic. It worked for a while—until Google caught on and started penalizing sites that employed these tactics.

Most of those coupon sections were eventually deleted entirely rather than risk contaminating the main site’s rankings.

Example of selective blocking:

<!-- Block Google, allow others -->
<meta name="googlebot" content="noindex, nofollow">
<meta name="bingbot" content="index, follow">
<meta name="yandex" content="noindex">

This level of control can be useful for legitimate reasons—like blocking low-quality translated content from certain search engines while allowing it in others—but it should be used carefully. Search engines don’t look kindly on manipulation tactics.

When Noindex Works Best

Noindex is ideal when:

  • You control the website
  • The content is outdated, low-value, or should be private
  • You want the page accessible to users but not searchable
  • You’re cleaning up duplicate content issues
  • You need reliable de-indexing that search engines must honor

Common use cases include old press releases, internal tools, staging environments, thin content pages, or anything that shouldn’t rank but needs to remain publicly accessible.

Noindex Limitations

Noindex doesn’t work if:

  • You don’t control the site
  • The page is blocked by robots.txt (search engines can’t see the tag)
  • The page returns a 404 or 410 error (it’s already gone)
  • You need instant removal (noindex requires re-crawling)

Processing time varies. High-authority pages on frequently crawled sites may be de-indexed within 3-7 days. Lower-priority pages on smaller sites can take 2-4 weeks or longer.

The 403 Forbidden Nuclear Option

When noindex and robots.txt aren’t strong enough, server-level blocking provides absolute control. A 403 Forbidden response tells browsers and bots that access is permanently denied.

This is different from a 404 (Not Found) or 410 (Gone). Those codes say “this content doesn’t exist.” A 403 says “this content exists, but you’re not allowed to see it.”

Apache .htaccess example:

<Files "private-content.html">
    # Apache 2.4+ syntax; older 2.2 configs used "Order Allow,Deny" / "Deny from all"
    Require all denied
</Files>

Nginx configuration:

location /private-section/ {
    return 403;
}

Block specific user agents at server level:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|PerplexityBot) [NC]
RewriteRule .* - [F,L]
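The nginx equivalent is a short sketch inside your server block, using the same bot list as the Apache rule above:

```nginx
# Return 403 to known AI crawler user agents (case-insensitive match)
if ($http_user_agent ~* (GPTBot|CCBot|PerplexityBot)) {
    return 403;
}
```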

403 responses are useful when:

  • Content contains sensitive information
  • You want absolute blocking regardless of who’s crawling
  • Noindex tags might be removed accidentally
  • You’re dealing with aggressive scrapers

The downside? 403 blocks everyone, including legitimate users. If the content needs to be accessible to some people (like logged-in users), you’ll need conditional logic rather than a blanket 403.
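With Apache 2.4, for example, that conditional logic can be as simple as denying everyone except an allow-listed address. The directory path and IP below are placeholders:

```apache
# Deny all visitors except one trusted IP (placeholder address)
<Directory "/var/www/site/private-section">
    Require ip 203.0.113.7
</Directory>
```

Real setups usually gate on authentication rather than IP, but the principle is the same: the 403 applies to everyone who doesn’t satisfy the Require condition.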

AI Bots Ignore Most Traditional Blocking Methods

This is where things get messy in 2026. Traditional search engines like Google and Bing generally respect robots.txt and always honor noindex. AI crawlers? Not so much.

Grok (xAI’s bot) confirmed in my own chat that it ignores robots.txt in practice, even though xAI’s official documentation claims the bot respects it.

Nick Kapur tweet from November 15, 2025:

“A number of websites with which I am affiliated keep constantly crashing because ever more malignant AI companies keep scraping them to train their LLMs and not only ignore the robots.txt file, but come up with ever more clever ways to circumvent blocks by disguising their bots as ordinary users.”

This is the reality content creators face. AI companies are in an arms race to gather training data, and many have decided that polite crawling rules don’t apply to them.

Common AI Bots and Their Behavior

GPTBot (OpenAI):

  • Official user agent: GPTBot
  • Claimed behavior: Respects robots.txt
  • Reality: Generally compliant when identified, but reports of undeclared crawlers

CCBot (Common Crawl – used by many AI companies):

  • Official user agent: CCBot
  • Claimed behavior: Respects robots.txt
  • Reality: Inconsistent; some reports of ignoring blocks

anthropic-ai / ClaudeBot (Anthropic):

  • Official user agents: anthropic-ai, Claude-Web
  • Claimed behavior: Respects robots.txt and crawl-delay
  • Reality: Better compliance than most, but heavy traffic volume

PerplexityBot:

  • Official user agent: PerplexityBot
  • Claimed behavior: Respects robots.txt
  • Reality: Cloudflare exposed them using stealth, undeclared crawlers to evade blocks

Google-Extended:

  • Official user agent: Google-Extended
  • Purpose: AI training (separate from regular Googlebot)
  • Behavior: Respects robots.txt blocks specific to Google-Extended

Other aggressive crawlers to watch:

  • Bytespider (ByteDance/TikTok) – Heavy crawler, often ignores crawl-delay
  • Amazonbot – Amazon’s crawler, can be aggressive
  • Meta-ExternalAgent – Facebook’s scraper
  • Applebot-Extended – Apple Intelligence training
  • Diffbot – Commercial web data extraction
  • SemrushBot – SEO tool, can generate heavy load
  • Others: Applebot, MJ12bot, Meta-webindexer

The common pattern? Official bots with declared user agents mostly respect robots.txt. But many AI companies run undeclared scrapers using generic browser user agents to avoid detection.
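If you still want to state your preference in robots.txt—advisory as it is—a minimal opt-out block for some of the declared user agents above looks like this:

```txt
# Advisory opt-out for AI training crawlers (compliance is voluntary)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /
```

This only affects bots that identify themselves and choose to comply; the undeclared scrapers described above ignore it entirely.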

Why AI Bots Bypass Traditional Controls

Robots.txt is easy to ignore. There’s no enforcement mechanism. It’s a text file making polite requests. AI companies training models on billions of web pages have decided the competitive advantage of more data outweighs the social contract.

Noindex tags don’t apply to AI training. A noindex tag tells search engines “don’t show this in search results.” It says nothing about using the content for training language models. Many AI companies argue that training isn’t indexing, so noindex doesn’t apply.

Crawl-delay is widely ignored. The Crawl-delay directive in robots.txt asks bots to wait between requests. Most AI crawlers ignore this completely, leading to sites being hammered with requests that can cause slowdowns or crashes.

What Actually Works Against AI Crawlers

1. Rate limiting at the server level (Cloudflare or other CDN)

2. 403 Forbidden responses for known bot user agents

3. JavaScript challenges that bots can’t complete

4. Behavioral analysis that identifies and blocks scraping patterns

5. Legal action (some companies have sued AI crawlers successfully)

But realistically, there’s no perfect solution yet. The best defense is layered protection that makes scraping expensive and difficult.

Cloudflare Bot Protection: The Most Effective Defense

Unlike robots.txt, which politely asks bots to behave, Cloudflare enforces blocking at the infrastructure level. This is the single most effective tool against aggressive AI crawlers in 2026.

Why Cloudflare Works Where Robots.txt Fails

Cloudflare sits between your server and incoming traffic. Requests never reach your server until Cloudflare decides they’re legitimate. This means:

  • Blocking happens before your server gets hit (no performance impact)
  • You can’t be overwhelmed with scraping requests that crash your site
  • Bots can’t easily bypass by changing user agents
  • JavaScript challenges verify visitors are actually using browsers

Cloudflare Bot Protection Features

Bot Fight Mode (Free Plan):

  • Blocks known bad bots automatically
  • Simple toggle, no configuration needed
  • Good for basic protection

Super Bot Fight Mode (Pro/Business Plans):

  • Definitely automated traffic: Block or challenge
  • Likely automated traffic: Choose how to handle
  • Verified bots: Allow good bots (Googlebot, etc.)
  • JavaScript detection: Challenge suspicious traffic

Custom Rules (All Plans):

(http.user_agent contains "GPTBot") or 
(http.user_agent contains "CCBot") or
(http.user_agent contains "PerplexityBot")

Action: Block

Rate Limiting (Pro+ Plans):

  • Limit requests per IP address
  • Block IPs exceeding thresholds
  • Protect specific endpoints

Challenge Pages:

  • JavaScript challenges (most effective)
  • CAPTCHA challenges (backup option)
  • Managed challenges (automatic decision)

Setting Up Basic Bot Blocking in Cloudflare

  1. Enable Bot Fight Mode (Security > Bots)
  2. Create a custom firewall rule for known AI bots:
    • Security > WAF > Custom rules
    • Rule: (http.user_agent contains "GPTBot")
    • Action: Block
  3. Set up rate limiting for suspicious patterns:
    • Security > WAF > Rate limiting rules
    • When: Requests exceed threshold
    • Action: Block for 1 hour
  4. Monitor the Firewall Events log to see what’s being blocked

This layered approach stops most aggressive AI crawlers while allowing legitimate traffic through. It’s not perfect—bots using residential proxies and real browser user agents can still get through—but it dramatically reduces the problem.

Why This Matters More Than Ever

Nick Kapur’s tweet highlighted a real problem: AI companies crashing affiliate sites by scraping them so aggressively that servers can’t handle the load. This isn’t theoretical. Sites go down, lose revenue, and face increased hosting costs because bots ignore crawl-delay and hammer servers.

Cloudflare fixes this. Even if bots ignore your robots.txt, Cloudflare enforces limits that keep your site running. That’s why it’s become essential infrastructure rather than optional protection.

Google’s Removal Tool and What It Can’t Do

For content you control, Google Search Console offers removal tools that work faster than waiting for noindex to be processed.

The Removals tool in Search Console allows you to request urgent removal of content from Google Search. This is different from noindex—it’s a direct request to Google to hide specific URLs.

Types of Removal Requests

Temporary Removals:

  • Removes URLs from search for ~6 months
  • Use when you need immediate delisting while implementing permanent fixes
  • No impact on site’s overall standing
  • Requires site verification in Search Console

Outdated Content:

  • For pages that no longer exist or have changed significantly
  • Users can submit without owning the site
  • Takes 24-48 hours typically

SafeSearch Filtering:

  • Removes adult content from SafeSearch results
  • Useful for reputation cleanup involving inappropriate images

The Process

  1. Go to Search Console > Indexing > Removals
  2. Click “New Request”
  3. Enter the URL you want removed
  4. Select removal type (temporary removal, clear cached URL, etc.)
  5. Submit and wait for processing (usually 24-48 hours)

Google evaluates the request and either approves or denies it. If approved, the URL disappears from search results almost immediately.

What Removal Requests Can and Cannot Do

Removals can:

  • Remove specific URLs from Google Search
  • Hide pages quickly while you implement permanent solutions
  • Limit exposure during reputation crises
  • Work faster than waiting for crawl cycles

Removals cannot:

  • Delete content from the website itself
  • Remove content from other search engines (Bing, DuckDuckGo, etc.)
  • Guarantee permanent removal if policies change
  • Override legitimate news or public interest content
  • Remove content from AI training data or chatbot citations

The AI Citation Problem

This is the gap that exists in 2026: There’s no removal option for AI citations. You can remove a page from Google Search, but you can’t remove it from:

  • ChatGPT’s training data
  • Claude’s knowledge base
  • Perplexity’s answer citations
  • Google’s AI Overview answers (at least not immediately)

Once AI models are trained on content, that information persists in the model. Even if the source page is deleted, de-indexed, and removed from search, AI chatbots may still reference information from it.

The only real solution is preventing AI crawlers from accessing content in the first place—which brings us back to Cloudflare-level blocking rather than noindex tags.

For anyone dealing with negative content, this creates a new challenge. The de-indexation and removal basics above still apply for search engines, but AI citations require different strategies.

What Actually Works: A Layered Approach for 2026

No single method solves every de-indexing problem. Effective cleanup uses multiple tools strategically based on what you control and what you’re trying to accomplish.

Step-by-Step Strategy

1. Audit what’s ranking

  • Identify pages you own vs third-party content
  • Check which are indexed in Google (use site:yourdomain.com search)
  • Determine if content is news, expired, or low-quality

2. Fix owned content first

  • Add noindex to pages you want de-indexed
  • Remove robots.txt blocks that prevent crawling
  • Ensure important pages aren’t accidentally blocked

3. Use removal tools for urgent situations

  • Submit temporary removal requests in Search Console
  • This gives immediate delisting while permanent changes take effect
  • Monitor status and resubmit if needed

4. Block AI crawlers at infrastructure level

  • Enable Cloudflare Bot Fight Mode or equivalent
  • Create custom rules for known AI bots
  • Set up rate limiting to prevent overload

5. Request removals for third-party content

  • Use Google’s “Remove outdated content” tool
  • File personal information removal requests if applicable
  • Contact site owners directly for takedowns when appropriate

6. Suppress what can’t be removed

  • Create positive content that ranks above negative results
  • Build authority pages that push unwanted results down
  • Consistent publishing over time beats trying to remove everything

Realistic Timelines

Noindex processing: 3-14 days depending on crawl frequency and site authority

Robots.txt changes: Effective immediately for new crawls, but they don’t remove URLs that are already indexed

Google removal requests: 24-48 hours for temporary removals; varies for permanent

AI crawler blocking via Cloudflare: Immediate effect once rules are active

AI training data removal: Effectively impossible once models are trained

Content suppression: 3-6 months to see significant ranking changes

The fastest results come from Google’s removal tool combined with noindex. But for long-term cleanup, building positive content that outranks negative results is often more effective than trying to remove everything.

Common Mistakes That Waste Time

  • Blocking pages with robots.txt when you want them removed from search. This prevents search engines from seeing noindex tags and can actually delay removal.
  • Adding noindex to pages already blocked by robots.txt. The noindex tag can’t be read if crawlers can’t access the page.
  • Filing removal requests without checking eligibility. Google rejects requests that don’t meet policy requirements. Read the guidelines first.
  • Expecting instant results without follow-up. Noindex requires re-crawling. If your page isn’t crawled frequently, it takes longer.
  • Ignoring AI crawlers. In 2026, de-indexing from Google Search isn’t enough. AI bots need separate handling.

Small technical mistakes like these can add weeks to cleanup timelines. Getting the sequence right—remove robots.txt blocks, add noindex, submit removal request, implement Cloudflare blocking—ensures each step works properly.

FAQ: Common De-Indexing Questions

Is de-indexing permanent?

Not always. A page can be re-indexed if the noindex tag is removed, the page changes significantly, or Google’s policies change. Monitoring is essential to catch re-indexing early.

Can I de-index content I don’t own?

Not directly. For third-party content, you must rely on Google’s removal request tools (outdated content, personal information removal) or contact the site owner for takedowns. In some cases, legal action may be necessary.

How long does noindex take to work?

Usually 3-14 days, depending on how often Google crawls the page. High-authority sites with frequent crawling see faster results. Lower-priority pages may take a month or more.

Why are my pages indexed even though they’re blocked by robots.txt?

Because robots.txt controls crawling, not indexing. If Google finds links to your blocked URLs from other sites, it may index them without crawling the content. Use noindex instead.

Do AI chatbots respect noindex tags?

Not for training purposes. Noindex tells search engines not to show pages in search results. AI companies argue training models isn’t “indexing,” so noindex doesn’t apply. Most AI crawlers only respect server-level blocking (403 Forbidden) or infrastructure-level filtering (Cloudflare).

Should I hire help for this?

If the content involves legal risk, active news coverage, or appears across multiple platforms, professional guidance helps avoid mistakes and wasted effort. SEO consultants or reputation management firms can handle complex cases more efficiently.

Can I block Google but allow Bing?

Yes, using targeted meta tags like <meta name="googlebot" content="noindex"> while leaving Bing unrestricted. But be cautious—search engines penalize manipulation tactics. This should only be used for legitimate reasons, not to game different search engines.

Final Thoughts

De-indexing in 2026 requires understanding multiple systems that don’t always cooperate. Robots.txt and noindex still work for traditional search engines. AI crawlers need different approaches. Urgent situations call for removal requests. High-traffic sites benefit from Cloudflare’s infrastructure-level blocking.

The key is knowing which tool solves which problem. If you control the site and want reliable de-indexing, use noindex. If you’re dealing with aggressive AI scrapers, implement Cloudflare bot protection. If you need immediate removal from Google, use Search Console’s removal tool.

For comprehensive de-indexation and removal basics, combining these methods strategically produces better results than relying on any single approach. Focus on control first (what you can change directly), policy second (removal requests for third-party content), and suppression last (building positive content when removal isn’t possible).

That clarity turns a frustrating problem into a manageable plan.
