Thoughts of DS: Understanding Google Crawling & Indexing

Pierre Far (Webmaster Trends Analyst at Google) spoke on "Understanding Google Crawling & Indexing" at the Think Visibility SEO conference at Alea Casino (Leeds) on 3rd March 2012.

I have tried to sum up the points he touched on in his presentation (collected from various blogs and tweets), and I have added my own interpretation.

Google discovers URLs by crawling and following links, from sitemaps, and through the Add URL feature.

There are always more URLs than Google can fetch, so they try to get as many as possible without destroying your website. To do this they use a relaxed crawl rate.

Google increases the crawl rate slowly and watches whether response time goes up. If your site can’t handle the crawler, Google will not crawl much of your site.
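To make the idea concrete, here is a minimal sketch of a "speed up slowly, back off when the server slows down" loop. This is my own illustration, not Google's actual algorithm: the function names, the thresholds and the step sizes are all assumptions.

```python
def next_crawl_rate(current_rate, baseline_ms, observed_ms,
                    step=0.5, backoff=0.5, tolerance=1.5,
                    min_rate=0.1, max_rate=10.0):
    """Return the next requests-per-second rate for one server.

    Every parameter name and default here is an illustrative assumption.
    """
    if observed_ms > baseline_ms * tolerance:
        # Responses slowed down noticeably: halve the rate.
        new_rate = current_rate * backoff
    else:
        # The server is coping: probe a little faster.
        new_rate = current_rate + step
    # Keep the rate inside sane bounds.
    return max(min_rate, min(max_rate, new_rate))


def simulate(response_times_ms, start_rate=1.0, baseline_ms=200):
    """Replay observed response times and return the rate chosen after each."""
    rates = []
    rate = start_rate
    for observed in response_times_ms:
        rate = next_crawl_rate(rate, baseline_ms, observed)
        rates.append(rate)
    return rates


if __name__ == "__main__":
    # The fourth response is slow, so the rate drops before climbing again.
    print(simulate([200, 210, 220, 900, 250]))
```

The point of the sketch is simply that a slow server earns itself a lower crawl rate, which is why response time matters for how much of your site gets crawled.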

Google reportedly checks robots.txt only about once per day, to help keep the load off your server. Having a +1 button on your site can reportedly override robots.txt. I have not verified either claim, but both points are interesting to me.
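If robots.txt really is refreshed only about once a day, a change you make to it may take up to a day to be noticed. A rough sketch of that kind of caching, using Python's standard urllib.robotparser, is below. The CachedRobots class, the fetch_lines callable and the one-day interval are my own assumptions for illustration, not a description of Googlebot internals.

```python
import time
from urllib.robotparser import RobotFileParser

# Assumed refresh interval, based on the "about once per day" remark.
ONE_DAY_SECONDS = 24 * 60 * 60


class CachedRobots:
    """Hypothetical helper: re-read robots.txt only when the cache is stale."""

    def __init__(self, fetch_lines, ttl=ONE_DAY_SECONDS, clock=time.time):
        self._fetch_lines = fetch_lines  # callable returning robots.txt lines
        self._ttl = ttl
        self._clock = clock
        self._parser = None
        self._fetched_at = None
        self.fetch_count = 0  # how many times robots.txt was actually fetched

    def _refresh_if_stale(self):
        now = self._clock()
        if self._parser is None or now - self._fetched_at >= self._ttl:
            parser = RobotFileParser()
            parser.parse(self._fetch_lines())
            self._parser = parser
            self._fetched_at = now
            self.fetch_count += 1

    def allowed(self, user_agent, url):
        """Answer from the cached rules, refreshing them at most once per ttl."""
        self._refresh_if_stale()
        return self._parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = ["User-agent: *", "Disallow: /private/"]
    robots = CachedRobots(lambda: rules)
    print(robots.allowed("Googlebot", "http://example.com/private/page"))
    print(robots.allowed("Googlebot", "http://example.com/index.html"))
    print(robots.fetch_count)
```

Both lookups in the usage example are answered from a single fetch of the rules, which is the load-saving effect described in the talk.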

Google sets a conservative crawl rate per server, so too many domains or URLs on one server will reduce the crawl rate per URL. If you use shared hosting, this could easily be a problem for you. If you do not know how many other websites share your IP address, you may be surprised. You can easily check by putting your domain or IP address into Majestic’s neighborhood checker, which shows how many other websites are hosted on the same IP address. If another site on your shared IP has a large number of URLs, you could be losing crawl opportunities just because a big site that isn’t connected to you in any way sits on the same IP. You can’t really go complaining to Google about this. You bought the cheap hosting, and this is one of the sacrifices you made.

Google crawls more pages than those in your sitemap, but the sitemap does help them decide which pages are more popular.

If a CMS produces heavy duplication, Google can tell at this stage, and this is how it is able to notify you of duplicates in Google Webmaster Tools (GWMT). This is interesting because it is more efficient to realize a site has duplicate URLs at this point than after Google has had to analyze all the data and deduplicate on your behalf. Google then picks URLs in a chosen order. One important factor in choosing one page over another is the change rate of the page content.

Googlebot can be blocked from accessing your server, so you need to make sure your host has no issues, or Google will think your site is down. Both the biggest and the smallest ISPs can block Googlebot at the ISP level. ISPs need to protect their bandwidth, so the fact that you want Google to visit your site does not necessarily mean it will. Firewalls at the ISP may block bots before they even see your home page, or (more likely) start throttling the traffic. So if your pages are taking a long time to get indexed, this may be a factor.

Strong recommendation: set up email notifications (email forwarding) in Webmaster Tools as a priority. This is very important so you don’t miss any error messages.

Make sure your 404 page delivers a 404 status, or it will get indexed, which happens a lot. Soft error pages create an issue, so Google tries hard to detect them. If it can’t, it ends up spending a crawl slot on the soft error (maybe at the expense of crawling another URL). If you don’t know what a soft error is, it is an error page that returns a 200 response instead of a 404 response. You can use the Firefox add-on Live HTTP Headers to check this.
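If you would rather script the check than use a browser add-on, something like the sketch below should do. It requests a URL that you know does not exist and looks at the status code. The function names and the example URL are my own placeholders.

```python
import urllib.error
import urllib.request


def classify_probe(status):
    """Classify the status returned for a URL that should NOT exist."""
    if status in (404, 410):
        return "hard error (good)"
    if status == 200:
        # The server said "OK" for a missing page: a soft error.
        return "soft error (fix this)"
    return "other"


def fetch_status(url, timeout=10):
    """Return the final HTTP status for url without raising on 4xx/5xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


if __name__ == "__main__":
    # Placeholder URL: use a made-up path on your own site.
    probe = "http://example.com/this-page-should-not-exist-12345"
    print(classify_probe(fetch_status(probe)))
```

Note that urlopen follows redirects, so a site that redirects missing pages to its home page will also show up as a soft error here, because the final response is a 200.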

Google has to pick the best URL and title for your content. It can change the title to better match the query, and it then generates a snippet and sitelinks. Adapting the title to the query improves the CTR. It’s as if you were writing a different title for each query.

If your server spikes with 500 errors, Googlebot backs off. Firewalls and the like can also block the bot. After a few days, this can create a state in Google that says the site is dead. If Googlebot gets a 503 error on robots.txt, it stops crawling. Be careful: if only part of your site is offline, do not serve a 503 on robots.txt.
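On the server side, the rule boils down to "503 only for the part that is actually down, never for robots.txt". Here is a small sketch of that decision. The maintenance_status function and the path prefixes are hypothetical, and you would wire the same logic into whatever server or CMS you actually run.

```python
def maintenance_status(path, offline_prefixes):
    """Pick an HTTP status for a request while part of the site is offline.

    offline_prefixes is a hypothetical list of path prefixes under maintenance.
    """
    if path == "/robots.txt":
        # Never let partial downtime leak onto robots.txt,
        # or the crawler may stop crawling the whole site.
        return 200
    if any(path.startswith(prefix) for prefix in offline_prefixes):
        # Only the affected section reports "temporarily unavailable".
        return 503
    return 200


if __name__ == "__main__":
    offline = ["/shop/"]
    for path in ("/robots.txt", "/shop/cart", "/blog/latest"):
        print(path, maintenance_status(path, offline))
```

The robots.txt check comes first on purpose, so that even a catch-all maintenance prefix such as "/" cannot put a 503 on it.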

Googlebot is getting better and better at seeing JavaScript / AJAX driven sites and pages.

To display a result, Google needs to:
Pick a URL.
Pick a title: usually the title tag, but Google sometimes changes the title based on the user’s query. This is a win-win for everyone.
Generate a snippet: Google will build one from content on the page, but Pierre strongly recommends using Rich Snippets.
Generate sitelinks: whether these appear depends on the query and the result. If you see a bad sitelink (wrong link), check for a canonicalisation issue.

Pierre pointed out that all this is in the Google Webmaster Documentation - http://support.google.com/webmasters/?hl=en

11 Mar 2012
