How to Deindex Pages from Google

Googlebot is good at its job. It masterfully navigates the web’s nooks and crannies to crawl webpage after webpage. Sometimes it’s a little too good at its job, and it crawls and indexes pages that we never meant for it to access in the first place, let alone surface in the SERPs. It happens to the best of us, so here’s a guide on how to deindex pages from Google.

Click below to jump to the section that matches your scenario:

How to Deindex a Single Page

Example Scenario: Let’s say you have a page that snuck its way into your XML sitemap and got subsequently crawled and indexed. This page isn’t doing anyone any good by being accessed by search engines or users from the search results, so we’d like to remove it from Google’s index.

  • Apply a robots noindex meta tag within the <head> of the single page

<meta name="robots" content="noindex">

  • Do NOT block the page via the robots.txt file
    • Google needs to be allowed to crawl the noindex tag in order to receive the directive that it should be removed from the index. If Google can’t see that the page shouldn’t be indexed, it won’t remove the page.
  • Remove the page from the XML sitemap
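Once the tag is in place, it can help to verify it programmatically rather than eyeballing the page source. The helper below is a sketch (not from the article) that uses only Python's standard library to confirm a page's HTML carries a robots noindex meta tag; in practice you'd feed it the HTML you fetch from your own URL:

```python
# Sketch: check whether a page's HTML contains a robots noindex meta tag.
# Hypothetical helper, standard library only.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of any <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.robots_directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots_directives.append((attrs.get("content") or "").lower())

def has_noindex(html: str) -> bool:
    """Return True if any robots meta tag on the page includes noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in d for d in parser.robots_directives)

sample = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
print(has_noindex(sample))  # True
```

This only inspects the HTML; it won't catch a noindex sent via the X-Robots-Tag HTTP header, which you'd check separately in the response headers.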

How to Deindex a Site Section

Example Scenario: Perhaps you’ve got an ecommerce site that has product category pages with filters. When filters are applied, the URL changes, but the content is merely narrowed or refined to match the filter parameters, and you do not want these pages to be indexed.

Deindexing Filtered Content via URL Parameters

Example Scenario: The site section you’d like to have deindexed duplicates content from other, more authoritative pages on your site, but it’s necessary for the site’s information architecture and user navigation. It shouldn’t, however, be indexed by search engines or reachable from the search results. On top of that, the redundant section is being crawled frequently, eating up crawl budget that would be better spent on other site pages.

  • Apply a robots noindex meta tag within the <head> of every page within the section

<meta name="robots" content="noindex">

  • Search for the site section’s pages in Google Search using the site: search format and see if any results are returned within that folder, e.g. site:example.com/test-section/
  • If results are returned:
    • Do NOT block the section via the robots.txt file yet
      • Google needs to be allowed to crawl the noindex tag in order to receive the directive that this section should be removed from the index. If Google can’t see that the section shouldn’t be indexed, it won’t remove the section.
    • Check back at a later date and perform the site: search again until no results are returned.
  • If results are NOT returned:
    • Add a Disallow line to the site’s robots.txt file that blocks ONLY that section from being crawled
      • For example, add the following line to the https://www.example.com/robots.txt page:
        • Disallow: /test-section/
  • Test the robots.txt disallow line in Google Search Console’s robots.txt Tester feature.
    • Make sure that ONLY that section is blocked
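That "only this section is blocked" check can also be approximated locally with Python's standard-library urllib.robotparser, a rough stand-in for the Search Console tester (it follows the Robots Exclusion Protocol, though not every Google-specific nuance). The domain and folder below are the article's examples:

```python
# Sketch: sanity-check a Disallow rule locally before relying on it.
# Domain and folder are the article's examples, not real URLs.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /test-section/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The blocked section should be disallowed...
print(rp.can_fetch("Googlebot", "https://www.example.com/test-section/page/"))  # False
# ...while the rest of the site stays crawlable.
print(rp.can_fetch("Googlebot", "https://www.example.com/other-page/"))  # True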

How to Deindex a Staging Site or Test Site

Example Scenario: There is a staging site to test out website changes before pushing them to the live site. In an effort to make it resemble the live site as closely as possible, it has all the bells and whistles that make it attractive for Google to crawl and index.

Step 1: Get the Site Removed from Google’s Index

  • Apply a robots noindex meta tag within the <head> of every page of the staging site

<meta name="robots" content="noindex">


Step 2: Prevent Google from Crawling it in the Future

We actually don’t want search engines to access a staging site at all because of duplicate content implications. We do need to make sure that Google can crawl the noindex tags before we block it from being crawled entirely. If Google can’t see that the pages shouldn’t be indexed, it won’t remove the pages.

  • Search for your staging site in Google Search using the site: search format (e.g. site:staging.example.com) and see if there are any results returned:
  • If results are returned:
    • Check back at a later date and perform the site: search again until no results are returned.
  • If results are NOT returned:
    • Block the entire staging site from being crawled via its robots.txt file
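Once the site: search comes back empty, the staging site can be blocked outright. A minimal robots.txt that disallows all crawlers might look like the fragment below; note that the User-agent line is required for the rule to apply:

```
User-agent: *
Disallow: /
```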

Troubleshooting

How to See if a Site is Indexed

The simplest way to check this is by performing a site: search in Google.

  • Search for your site in Google Search using the following format, without a space after the colon, and see if there are any results returned, e.g. site:example.com

How to See if a Page is Indexed

  • Perform a site: search for the individual page in Google, e.g. site:example.com/your-page/

Google Search Console Remove URLs Tool

  • If you have an urgent need to remove a page or set of pages from the index, this is a good option; however, it’s important to note that this is a temporary method and is only valid for 90 days.
    • To use it, enter the specific URL or URL prefix into the tool in Google Search Console and choose one of the following options:
  • Clear URL from cache and temporarily remove from Search
    • This option is the most relevant to deindexing, as it will clear the cached copy and block the specific URL from appearing in the search results for 90 days.
  • Clear URL from cache only
    • This option is used to remove the page and snippet from Google’s cache for 90 days but will keep it in the search results.
  • Clear cache and temporarily hide all URLs beginning with…
    • This option will remove all pages whose URLs begin with that specific folder path from the search results for 90 days.

Example Scenario: “I don’t think Google has crawled my single page yet that I want deindexed!”

  • Make sure that the page is not blocked by the robots.txt file
    • Use the Google Search Console robots.txt Tester feature to paste in your URL and see whether it’s been BLOCKED or ALLOWED:
  • While you’re in Google Search Console, give the bots a little extra nudge to crawl the newly added noindex tag by using the Fetch as Google feature
  • Don’t click “Request Indexing”; a plain fetch is enough, and you don’t want to ask Google to index a page you’re trying to remove.

Log Files, Crawling, & Indexation

  • Log files can be a useful method for understanding what pages search engines are accessing in order to see how effective your methods of allowing or blocking the crawl actually are.
    • From our experience with log file analysis, we’ve found that, while by no means perfect, Google follows directives the most faithfully of all the search engines.
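As a sketch of that kind of analysis, the snippet below counts Googlebot requests per URL path from combined-format access log lines. The regex and the sample lines are illustrative assumptions, not a universal log parser; real logs vary by server configuration:

```python
# Sketch: count Googlebot requests per path from combined-format access logs
# to see which sections are consuming crawl budget. Sample lines are made up.
import re
from collections import Counter

LOG_LINE = re.compile(
    r'"(?:GET|POST) (?P<path>\S+) HTTP/[\d.]+" \d{3} \d+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(log_lines):
    """Return a Counter of path -> number of Googlebot requests."""
    hits = Counter()
    for line in log_lines:
        m = LOG_LINE.search(line)
        if m and "Googlebot" in m.group("agent"):
            hits[m.group("path")] += 1
    return hits

sample = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /test-section/a HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Oct/2023:13:55:40 +0000] "GET /products/ HTTP/1.1" 200 8192 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [10/Oct/2023:13:56:01 +0000] "GET /products/ HTTP/1.1" 200 8192 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
print(googlebot_hits(sample))  # one Googlebot hit each for /test-section/a and /products/
```

Note that anyone can spoof a Googlebot user agent string; for a rigorous audit you'd also verify the requesting IPs via reverse DNS.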

Best Practices & Tips

Staging Site

  • Don’t allow staging or test sites to be crawled in the first place.
    • Prior to making a test site live, add a blanket disallow directive to the robots.txt file
      • Add the following lines at the bottom of the robots.txt page (the User-agent line is required for the rule to apply):
        • User-agent: *
        • Disallow: /

Deindex & Disallow Crawls with Caution

  • Don’t accidentally deindex or block your entire site or key pages from being crawled.
    • Use internal checks and balances when adding in any meta robots or robots.txt directives.
      • For example, use a crawling tool like Screaming Frog to check the site’s meta robots directives to ensure they are only on the pages you want deindexed and not applied across the whole site or on key pages.
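A lightweight version of that internal check can be scripted. Assuming you have already fetched the HTML of your key pages (the page map, URLs, and regex below are all illustrative, and the regex assumes the name attribute comes before content), this sketch flags any must-stay-indexed URL that carries a noindex directive:

```python
# Sketch: guard against accidentally noindexing key pages. Given a map of
# URL -> fetched HTML, flag any page on the "must stay indexed" list that
# carries a robots noindex meta tag. Pages and pattern are illustrative.
import re

NOINDEX_META = re.compile(
    r'<meta[^>]+name=["\']robots["\'][^>]+content=["\'][^"\']*noindex',
    re.IGNORECASE,
)

def find_accidental_noindex(pages: dict, must_index: set) -> list:
    """Return the key URLs whose HTML contains a robots noindex meta tag."""
    return [url for url in must_index if NOINDEX_META.search(pages.get(url, ""))]

pages = {
    "/": '<head><meta name="robots" content="noindex"></head>',   # oops!
    "/products/": '<head><title>Products</title></head>',
}
print(find_accidental_noindex(pages, {"/", "/products/"}))  # ['/']
```

Running a check like this in a deploy pipeline catches a site-wide noindex before it ships; a crawler such as Screaming Frog covers the same ground interactively.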

Prevention vs. Damage Control

Ultimately, the best kind of indexation damage control is to prevent it from happening in the first place. Google has an exceptionally long memory, so once pages are crawled, Google has a hard time forgetting about them. That said, websites are dynamic entities with lots of involved parties and, well, stuff happens. Sometimes fixing the damage is a necessary option and, thankfully, an available one. Search engines want to understand your website, and by proactively pointing them in the right direction, you make the web a better place for one and all.
