Robots.txt File – It Doesn’t Do Everything
Yesterday we received a support email from an ecommerce client that asked the following question:
Even though my robots.txt file has my cart and customer registration pages disallowed, Google still has the links to view my cart and to log into customer registration indexed. Why?
It was a good question, and answering it taught all of us at LexiConn something new.
Robots.txt file – What it does
One of our past blog posts, All About Robots.txt Files, has a good explanation of how best to use this file on your website. In a nutshell, this file tells search engines what content *NOT* to crawl and catalog on your website. If you have pages that contain information you don’t want people finding in Google (e.g. the shopping cart page or the customer login area), you put their URL paths in this file.
All the major search engines will check for a robots.txt file before spidering your site. They will not crawl the URLs you list in this file, although that is not quite the same thing as keeping those URLs out of the search results.
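To see what a Disallow rule actually controls, here is a minimal Python sketch using the standard library’s urllib.robotparser module. The robots.txt rules, the example.com URLs, and the is_crawlable helper are all hypothetical and made up for illustration; they are not our client’s actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; these paths are assumptions,
# not the real URLs from the support request.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart
Disallow: /customer/register
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/cart",
        "https://www.example.com/products/widget",
    ):
        print(url, "crawlable:", is_crawlable(ROBOTS_TXT, url))
```

Keep in mind that this only answers whether a well-behaved crawler may fetch a URL. It says nothing about whether that URL can show up in search results.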
What a robots.txt file does not do
In the support request our client sent us yesterday, we found that the view cart and customer login pages were listed in Google (on about page 16 of the search results). No snippet of information was included, but the links themselves were there.
It appeared to the client that the robots.txt file was not being honored by Google.
In actuality, a robots.txt file is not a 100% foolproof way to exclude a page or directory from Google. Google states this in their Webmaster Central help articles:
Note that in general, even if a URL is disallowed by robots.txt we may still index the page if we find its URL on another site.
So if Google finds a link on another website or domain that points to content disallowed in your robots.txt, the URL itself can still be included in Google’s index, even though the page is never crawled.
Bummer. But there are two easy ways to get it removed and not re-indexed.
Two ways to remove a link from Google
1. The noindex meta tag
Adding a simple meta tag to the <head> section of any page you do not want indexed tells Google (and Bing) not to list the page at all. From Google’s help docs:
To entirely prevent a page’s contents from being listed in the Google web index even if other sites link to it, use a noindex meta tag. As long as Googlebot fetches the page, it will see the noindex meta tag and prevent that page from showing up in the web index.
Here is an example of this code:
<meta name="robots" content="noindex">
But there’s a catch…
Also, if you’ve used your robots.txt file to block this page, we won’t be able to see the tag either.
If the above scenario applies to your situation, then you’ll have to go with the second option.
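The catch is easier to see in code. Below is a rough Python sketch, assuming a crawler that checks robots.txt before fetching and only reads meta tags from pages it has actually fetched. The RobotsMetaFinder class, the noindex_is_effective helper, and the example URL are hypothetical names for illustration, not anything Google publishes.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def noindex_is_effective(robots_txt, url, html, user_agent="*"):
    """Return True only if a crawler could fetch the page AND find a noindex tag."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        # Blocked by robots.txt: the page is never fetched, so the tag is never seen.
        return False
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex"></head></html>'
    url = "https://www.example.com/cart"  # hypothetical URL
    print(noindex_is_effective("User-agent: *\nDisallow: /cart\n", url, page))
    print(noindex_is_effective("User-agent: *\nDisallow: /admin\n", url, page))
```

In other words, the noindex tag and a robots.txt Disallow rule for the same page work against each other: the tag only helps on pages the crawler is allowed to fetch.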
2. Request removal directly from Google Webmaster Tools
Google provides a simple way to tell them to remove a page from their search results. Here are the steps:
- Verify your ownership of the site in Webmaster Tools.
- On the Webmaster Tools home page, click the site you want.
- On the Dashboard, click Site configuration in the left-hand navigation.
- Click Crawler access, and then click Remove URL.
- Click New removal request.
- Type the URL of the page you want removed from search results (not the Google search results URL or cached page URL), and then click Continue. Note that the URL is case-sensitive: you will need to submit the URL using exactly the same characters and the same capitalization that the site uses.
- Click Remove page from search results and cache.
- Select the checkbox to confirm that you have completed the requirements listed in this article, and then click Submit Request.
Google will then drop this URL entirely. It may take a while for this to happen, but it will be removed.
…
In our client’s case, since the URL was already disallowed in their robots.txt file, requesting removal via Google’s Webmaster Tools was the correct solution.
Hopefully this helps shed some light on how search engines interact with a robots.txt file, a noindex meta tag, and a manual way to get a URL removed entirely. If I missed anything, let me know in the comments.