Thoughts of DS: Pages prohibited by robots.txt get indexed in Search Engines

My friends from the SEO world ask me the following question from time to time:

Why is my URL showing up in Google when I blocked it in robots.txt? It seems that Google crawls the disallowed URLs.

Let's take a case from a popular B2B portal, http://dir.indiamart.com.

Now, let's see what happens when we search for a specific URL from the disallowed /cgi directory.

Google has 360 pages from that "disallowed" directory.

How could this happen? The first thing to note is that Google abides by your robots.txt instructions: it does not index the text of those pages. However, the URL is still displayed because Google found a link to it somewhere else, such as:

<a href="http://dir.indiamart.com/cgi/compcatsearch.mp?ss=Painting">Painting Manufacturers & Suppliers - Business Directory, B2B...</a>

Google hasn't crawled these URLs, so each one appears as a bare URL rather than a traditional listing.

Also, because it found the link with the anchor text "Painting Manufacturers & Suppliers - Business Directory", it associated the listing with that text.
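To make the anchor-text point concrete, here is a minimal sketch of how a crawler might pick up a URL and its anchor text from a page it is allowed to read, without ever fetching the linked page. It uses only Python's standard html.parser module. The class and function names (AnchorTextCollector, collect_anchor_text) are made up for illustration, and this is not a description of how Google actually does it.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an <a href="..."> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_anchor_text(html):
    """Return the list of (href, anchor text) pairs found in html."""
    parser = AnchorTextCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Feeding it the HTML of a third-party page yields the disallowed URL together with the text that described it, which is all a search engine needs to build a listing without a snippet.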

In addition, Google can show a page description below the URL. Again, this is not a violation of robots.txt rules: it appears because Google found an entry for your disallowed page or site in a recognized resource such as the Open Directory Project. The description comes from that resource rather than from your page content.

The robots.txt file tells the engines not to crawl the given URL, but it still lets them keep the URL in the index and display it in results (see the snapshot above; notice that the listing has no snippet).
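The crawl side of this rule can be checked locally with Python's standard urllib.robotparser module. The sketch below is only an illustration: the robots.txt content is an assumed example with a single Disallow rule for /cgi/, not the real file from the site, and is_crawl_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules, not the site's actual robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi/
"""


def is_crawl_allowed(robots_txt, url, user_agent="*"):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    blocked = "http://dir.indiamart.com/cgi/compcatsearch.mp?ss=Painting"
    print(is_crawl_allowed(ROBOTS_TXT, blocked))
    print(is_crawl_allowed(ROBOTS_TXT, "http://dir.indiamart.com/"))
```

Note that the answer only says whether a crawler may fetch the URL. Nothing in robots.txt says whether the URL itself may be listed in results.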

This becomes a problem when these pages accumulate links. The pages can then accumulate link juice (ranking power) and other query-independent ranking metrics (such as popularity and trust), but they can't pass these benefits on to any other pages, since the links on them never get crawled.

This is further illustrated by the SEOmoz cartoon below (courtesy: Robots.txt and Meta Robots):

This means that, to exclude individual pages from search engine indices, the noindex meta tag <meta name="robots" content="noindex, follow"> is actually superior to robots.txt. The tag only works if the crawler can fetch the page, so the page must not also be disallowed in robots.txt.

Blocking with meta noindex tells engines that they may visit the page but must not display the URL in results.
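If you want to verify that a page really carries the tag, a small check like the following may help. It is a rough sketch using Python's standard html.parser module. RobotsMetaParser and has_noindex are hypothetical names, and it only looks at the generic "robots" meta tag, not at crawler-specific tags or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # "noindex, follow" -> {"noindex", "follow"}
            self.directives.update(
                part.strip().lower()
                for part in content.split(",")
                if part.strip()
            )


def has_noindex(html):
    """Return True if the page's robots meta tag contains noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "noindex" in parser.directives
```

Run it on the HTML source of the page you want kept out of the index. If it returns False, the engines have no instruction to drop the URL.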

In a Webmaster Help video titled "Uncrawled URLs in search results", Matt Cutts explains why a page that is disallowed in robots.txt may still appear in Google's search results.

The SitePoint article "Why Pages Disallowed in robots.txt Still Appear in Google" may also be worth reading in this regard.

I did all this research for my own purposes, but thought of sharing it in case it helps others.

1 Jan 2012
