It ain’t easy being a search engine bot: crawling all day and night; trying to organize, categorize, and understand the quality of each page on the internet; analyzing language, context, and themes. But a bot don’t stop! “Please sir, I want some more” – this is the voice of a bot, craving more content from your site! At some point, we need to stop and ask ourselves: what if we have a page on our site, and a bot isn’t able to crawl it? This is the plight of an orphaned page. So, let’s find them and fix them!
What is an Orphan Page?
An orphan page is simply a page with no internal links pointing to it. The name itself defines it: a page without a parent.
Orphan pages commonly occur through human error; a missing or faulty link can make a page unreachable for a search engine to crawl, which is undoubtedly problematic for SEO. For example, during a website redesign you could unknowingly remove the links to an old page while the page itself lives on. Under the right circumstances, an orphan page can be perfectly fine: these pages are commonly created for specific PPC or email campaigns and are purposefully not linked outside of the campaign. No matter how it happens, it’s important to maintain a good understanding of how your website is internally connected.
Are Orphan Pages Bad?
Where there is light, there is always dark, you know, to counterbalance the universe. While orphan pages themselves are not inherently bad, they can be mistaken for doorway pages.
Doorway pages all have very similar content, tweaked slightly to target variations of a keyword. This is commonly seen where a company wants to target hundreds of different cities with individually targeted pages, with only the city name and state changing from page to page (e.g., ‘Best Home Security in City, State’).
To quote the Google Quality Guidelines:
“Doorways are sites or pages created to rank highly for specific search queries. They are bad for users because they can lead to multiple similar pages in user search results, where each result ends up taking the user to essentially the same destination. They can also lead users to intermediate pages that are not as useful as the final destination.”
In the spirit of providing quality search results, Google does not want to see pages that are built purely to focus on a myriad of small keyword variations.
If an orphan page contains content that is overly targeted or very similar to other pages, and it is submitted in an XML sitemap without a noindex meta tag, it runs the risk of being mistaken for a doorway page. The page is now on Google’s radar as something you requested to be indexed, yet it isn’t part of your site architecture. To Google, that’s a red flag, and red flags can lead to penalization, whether algorithmic or manual.
Since doorway pages can lead to a Google penalty, and orphan pages risk being mistaken for them, orphan pages should be avoided whenever possible. However, some pages may be orphaned purposefully, in order to serve as landing pages for PPC or email campaigns.
PPC and email landing pages are campaign-specific by design: marketers don’t want anyone navigating to them outside the context of the campaign. These pages often feature a stripped-down design that drives users toward one specific goal, and so may not include the full link architecture of the site. This is a totally valid reason to have them; you should simply ensure that a noindex meta tag is applied in the page’s <head>:
<meta name="robots" content="noindex">
Always remember that a user or search engine should be able to navigate to every page on the site. Any page that falls outside that scope should carry a clear directive that you don’t want it indexed.
In addition to the doorway page risk, orphaned pages don’t receive much in the way of internal link equity – so if you’re creating a page that you hope to see rank well organically in search engines, it’s important that it not be orphaned for both discoverability and authority reasons.
How Do I Find Orphan Pages?
Ultimately, you will always need to compare two sets of URL data in order to find orphan pages. If it helps, you can go full-on with the analogy here by thinking of these two URL data sets as the two parents:
- URL Data Set 1: All of the page URLs ever created for your website.
- URL Data Set 2: All of the page URLs that can actually be crawled.
The discrepancy between the two URL data sets should uncover all of the orphaned pages on your website.
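In code terms, finding orphan pages is just a set difference between those two lists. Here is a toy sketch in Python, with made-up URLs purely to illustrate the idea:

```python
# Everything that exists, minus everything a crawler can reach,
# leaves the orphans. (Hypothetical URLs for illustration.)
all_pages = {"/", "/about/", "/services/", "/old-campaign-page/"}
crawlable_pages = {"/", "/about/", "/services/"}

orphans = all_pages - crawlable_pages  # set difference
print(orphans)  # {'/old-campaign-page/'}
```

The rest of this process is really about building those two sets: log files or WordPress for the first, Screaming Frog for the second.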
Find All Your Pages Using Log Files
The easiest way to get your log files is to log into your cPanel and look for an option called Raw Access Logs. If you aren’t able to find it, you may need to contact your hosting provider and ask them to provide the log files for your site.
Raw Access Logs let you see the visits to your website without graphs, charts, or other graphics. You can use the Raw Access Logs menu to download a zipped version of the server’s access log for your site. This can be very useful when you want to quickly see who has visited your site.
Raw logs may only contain a few hours’ worth of data because they are discarded after the system processes them. However, if archiving is enabled, the system archives the raw log data before the system discards it. So go ahead and ensure that you are archiving!
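Once you have a log downloaded and unzipped, you will need to pull the requested URLs out of it. Here is a rough Python sketch, assuming a file named access.log in the common Apache/NGINX combined log format (field layouts vary by server, so adjust the pattern to match yours):

```python
import re

# Matches the request portion of a combined-format log line,
# e.g. '"GET /about/ HTTP/1.1"' -> captures "/about/"
request_pattern = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as log_file:
    for line in log_file:
        match = request_pattern.search(line)
        if match:
            paths.add(match.group(1))

print(f"Found {len(paths)} unique requested paths")
```

From here you can write the paths out to a .csv for the comparison step later on.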
Once you have your log file ready to go, we now need to gather the other data set of pages that can be crawled by Google, using Screaming Frog. Alternatively, you can:
Find All Your Pages in WordPress
A useful WordPress plugin, quite aptly named Export All URLs, can help you export all of the pages, posts, and custom post types in your content management system. By exporting all of these pages from your WordPress site, you will be able to compare them against the list of pages found when you crawled your site. Any outliers are pages that a site crawl could not find.
From here you can evaluate whether those pages should be part of your site, and incorporate them back in by linking to each orphan page from a page that you know has been crawled and is accessible to a bot.
Please back up your database before installing and activating any plugins.
- Install and activate the Export All URLs plugin.
- Select all types (pages, posts, and custom post types)
- Select all additional data (URL, Titles, Categories)
- Post Status: Published
- Export Type: .csv
Once you have all your WordPress pages, we now need to gather the other set of URLs that can be found when crawled, using Screaming Frog SEO Spider.
Crawl Your Pages with Screaming Frog SEO Spider
Screaming Frog is a fantastic tool that we hold in high regard here at UpBuild. Using the Screaming Frog SEO Spider, we can crawl our website as Googlebot would, and export a list of all the URLs that were found.
- Once you have Screaming Frog ready, first ensure that your crawl Mode is set to the default ‘Spider’.
- Then make sure that under Configuration > Spider, ‘Check External Links’ is unchecked, to avoid unnecessary external site crawling.
- Now you can type in your website URL, and click Start.
- Once the crawl is complete, simply:
- a. Navigate to the Internal tab.
- b. Filter by HTML.
- c. Click Export.
- d. Save in .csv format.
Uncovering Orphan Pages
Now we should have our two sets of URL data, both in .csv format:
- All of the page URLs ever created for your website from log files or WordPress.
- All of the page URLs that can actually be crawled from Screaming Frog.
All we need to do now is compare the URL data from the two .csv files, and find the URLs that were not crawlable.
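If you are comfortable with a few lines of code, you can do this comparison in Python instead of a spreadsheet. Here is a minimal sketch, assuming hypothetical file names all-pages.csv (from WordPress or your log files) and crawled-pages.csv (from Screaming Frog), each with a header row and the URL in the first column; adjust the names and column positions to match your actual exports:

```python
import csv

def load_urls(path):
    """Read the first column of a .csv into a set, skipping the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header
        # Strip whitespace and trailing slashes so equivalent URLs match.
        return {row[0].strip().rstrip("/") for row in reader if row and row[0].strip()}

all_pages = load_urls("all-pages.csv")          # every page that exists
crawled_pages = load_urls("crawled-pages.csv")  # every page the spider found

orphans = sorted(all_pages - crawled_pages)
print(f"{len(orphans)} potential orphan pages:")
for url in orphans:
    print(url)
```

If you would rather not script it, the spreadsheet methods below get you to the same place.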
Uncover Orphan Pages for WordPress Using Spreadsheets
If you gathered your data from WordPress you can use a spreadsheet to uncover the discrepancies:
- Open a new spreadsheet in Microsoft Excel, Google Sheets, or your spreadsheet tool of choice.
- Place all the URLs from the WordPress .csv into column A.
- Place all the URLs from the Screaming Frog .csv into column B.
Should there be any orphan pages, you will notice that one column (most likely column A) contains more URLs.
Next, we check which URLs in column A also appear in column B; any URL that does not is an orphan:
Using Microsoft Excel:
Select columns A and B, then on the Home tab choose Conditional Formatting > Highlight Cells Rules > Duplicate Values. URLs that appear in both columns will be highlighted; any URL in column A that is left unhighlighted was not found by the crawl. Those unhighlighted URLs are your orphan pages!
Using Google Sheets:
With Google Sheets we can use a =VLOOKUP formula to point out the URLs that were not found with our crawl, which in this case would be our orphan URLs.
- Open up a new Google Sheet.
- Place the data URLs from your WordPress .csv in column A.
- Note: Keep the WordPress URLs in column A and the crawled URLs in column B; the formula below searches column B for each URL listed in column A.
- Place the data URLs from your Screaming Frog .csv in column B.
- In cell C1, simply type the following:
- =VLOOKUP(A1,B:B,1,0)
- Then drag cell C1 down for as many rows as there are URLs in column A.
- Wherever the formula returns #N/A, that indicates an orphan page (one that was found in your WordPress .csv, but not your Screaming Frog .csv).
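If you find VLOOKUP unintuitive, a COUNTIF formula gets you the same answer and works in both Google Sheets and Excel; it counts how many times the URL in A1 appears anywhere in column B, and a count of zero means the crawl never found it:

=IF(COUNTIF(B:B,A1)=0,"Orphan","Found in crawl")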
Skip ahead to learn what to do when you find orphan pages, or:
Uncover Orphan Pages Using Screaming Frog Log File Analyser
If you decided to analyze a log file instead, you can use the Screaming Frog Log File Analyser to uncover your orphan pages.
The program is very easy to use: it lets us import the two data sets that we need to analyze, referred to simply as Log File and URL Data (the URL Data being our Screaming Frog SEO Spider .csv).
- Import Log File.
- Import URL Data (Screaming Frog SEO Spider)
- Navigate to the URLs tab:
- Change your view to Not in URL Data. This will show you all of the URLs that were found in the Log File, but not in the crawled data.
What to Do When You Find Orphan Pages
Once you have your list of orphan pages, all you need to do is determine the value that each orphan page holds:
- If you want to keep a page, then adopt it!
- Internally link to your orphan page from a page that you know is already accessible to users and bots. Think about your users: where would this orphan page naturally fit and provide value?
- Ensure that your new page is added to both your HTML sitemap and your XML sitemap.
- If you don’t want to keep a page, then remove it and 301 redirect!
- If an orphan page has thin content, duplicate content, or no value then you can simply remove the page altogether.
- Note: Be mindful to 301 redirect the removed orphan URL to the next most relevant page, as the old URL could still be reachable from an external source.
- If you want to keep a page orphaned, then noindex it!
- Understandably, you may have pages that you simply don’t want to be part of the user journey. In that case, ensure that the page has a clear noindex meta tag.
Verification
Once you have decided on and implemented one of the three options above for each of your orphan pages, run through the entire process again. This time, when comparing the two URL data sets, you should be able to confirm that every remaining outlier carries a noindex meta tag. If any pages still lack this directive, simply choose an option from above until every page has been given a permanent home.
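If you want to script that final check, here is a rough Python sketch; the URLs are hypothetical placeholders, and it requires the third-party requests package. It fetches each deliberately orphaned URL and looks for a noindex directive in either the page source or the X-Robots-Tag HTTP header:

```python
import requests

# Hypothetical list of the pages you chose to keep orphaned.
remaining_orphans = [
    "https://www.example.com/ppc-landing-page/",
    "https://www.example.com/email-campaign-offer/",
]

for url in remaining_orphans:
    response = requests.get(url, timeout=10)
    # Simple substring checks; an HTML parser would be more robust.
    in_meta = "noindex" in response.text.lower()
    in_header = "noindex" in response.headers.get("X-Robots-Tag", "").lower()
    status = "noindex found" if (in_meta or in_header) else "NEEDS A NOINDEX TAG"
    print(f"{url}: {status}")
```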
Lastly, you will want to make sure that all your hard work is on Google’s radar.
Note: If you are using the Yoast SEO plugin to manage your XML sitemap, then your newly adopted pages should automatically be included, and the next few steps should not be necessary. However, if you still don’t have Google Search Console set up for your WordPress site, Yoast provides specific instructions on how to achieve that when using their plugin.
For those of you who don’t use the Yoast SEO plugin, go ahead and open up the Screaming Frog SEO Spider one more time, so we can crawl the website again and create a shiny new XML sitemap.
- Type in your website URL, and click Start. Once your crawl reaches 100%, simply choose Sitemaps from the menu then Create XML Sitemap.
- This will open up a number of sitemap configuration options. However, since the default XML sitemap export settings do exactly what we want, including only HTML pages from the ‘Internal’ tab that returned a ‘200’ OK response, you can go ahead and click Next.
- Once you have saved and downloaded the XML sitemap, you can submit it to Google Search Console. This will help you with tracking the indexation of your pages and is a direct way to let Google know that you have pages you want to be indexed. That’s it!
In Summary: Adopt, Don’t Block
Googlebot crawls the pages on our sites by following the internal links we have created. Always be mindful that if a user can’t access a page, it’s likely that Google can’t either. Auditing for orphan pages can help you uncover valuable pages, and maybe even avoid penalization.
Think of your website like a tree. It can be easy to assume that all pages are accessible, but if you cut off a branch, you may find that you have removed more limbs than you intended. So whenever your website undergoes any sort of architectural change, be sure to audit for orphan pages. Adopt, don’t block.