Search and ye shall (maybe) find

If people can’t find your website using Google, Windows Live or Yahoo!, your web presence is effectively useless. Regardless of how much you spend on your website and how much you promote it, virtually all sites rely on the mainstream search engines to drive visitors and ensure their success.

During routine monitoring of the Intergen website, at www.intergen.co.nz, we noticed a sudden drop in traffic. Despite having several hundred pages that typically generate hundreds of page views a day, we saw a significant decline in page views, and our most trafficked pages were now pages that had previously not been especially popular.

Were our assumptions about what people look at on our site wrong, or was there some other factor skewing results? What we discovered was less than obvious; here’s how we approached the problem, and how it was solved.

First, we used Google’s Webmaster Tools to help us understand what parts of our site were in the index, and make sure that Google was indexing everything – with more than 65% of the global search market, Google is a key search engine to get right.

We created and uploaded sitemap.xml and robots.txt files to help instruct Google (and the other engines) what pages were on the site, and how much of the site the engines should index. We got mixed results: the pages were being found by Google, indicating the sitemap and robots.txt files were working, but for some reason Google was unable to index the content of the site.
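
For readers who haven’t set these files up before, both are small. The snippets below are illustrative sketches only – the URLs, date and change frequency are placeholders, not our actual files. A robots.txt in the site root that allows crawling and points the engines at the sitemap looks like this:

User-agent: *
Disallow:
Sitemap: http://www.intergen.co.nz/sitemap.xml

and a minimal sitemap.xml listing the pages you want indexed looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.intergen.co.nz/Hotlinks/Twilights/</loc>
    <lastmod>2009-01-12</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>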

Some under-the-hood detective work by our technical team was required. By applying some troubleshooting smarts, we found that the Googlebot indexer was receiving Server 500 errors when indexing the Intergen website, indicating an “unexpected condition which prevented it from fulfilling the request.” Interestingly, while Yahoo behaved similarly to Google, the Microsoft search robot was not receiving errors; our entire website was being accessed successfully by users; and many pages on the site were still being indexed by Google. The issue was ensuring the whole site could be indexed by Google, the pre-eminent search engine.
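
If you want to run this kind of check yourself, the sketch below shows one reasonable approach (our assumption of how you might do it, not the exact script we used): it requests a page while presenting Googlebot’s useragent string and reports the HTTP status code that comes back. The URL is just an example page.

using System;
using System.Net;

class GooglebotCheck
{
    static void Main()
    {
        // Example page only - substitute any URL you want to test.
        string url = "http://www.intergen.co.nz/Hotlinks/Twilights/";

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

        // Present the same useragent string Googlebot sends.
        request.UserAgent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

        try
        {
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                Console.WriteLine("{0} -> {1}", url, (int)response.StatusCode);
            }
        }
        catch (WebException ex)
        {
            // Server errors (such as the 500s the Googlebot was receiving) land here.
            HttpWebResponse error = ex.Response as HttpWebResponse;
            Console.WriteLine("{0} -> {1}", url,
                error != null ? ((int)error.StatusCode).ToString() : ex.Status.ToString());
        }
    }
}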

Browser Detection Battles

Our website is hosted on the popular EPiServer Content Management platform and runs on .NET 2.0, something we updated recently and which we believe may have, at least in part, led to this problem. We use URL rewriting to make a large number of our links appear friendlier to end users, such as http://www.intergen.co.nz/Hotlinks/Twilights/.

Some research ensued, and we discovered that since March 2006 Googlebot has identified itself as a Mozilla/5.0-compatible browser and, as a result, breaks the browser detection pattern in ASP.NET 2.0.

Previously, Googlebot presented itself as the GoogleBot/2.1 useragent and was treated as a default browser by the browser detection process; it fell into the default browser bucket, with reasonable associated capabilities.

Subsequently, Google made some enhancements so that its agent would be accepted by more web servers; the Googlebot now identifies itself to web servers as a Mozilla/5.0 useragent – and this caused some problems. The detection pattern finds no generic Mozilla/5.0 definition and instead uses the closest match, Mozilla/1.0, which has very low browser capabilities. Using Mozilla/1.0 browser settings meant that the Googlebot robot was being served at a capability level incompatible with our site and, in particular, could not handle the URL rewriting we were using.
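
One way to observe this fallback (a hypothetical diagnostic of ours, not part of EPiServer or of the eventual fix) is to drop a simple handler into the site that reports which browser definition ASP.NET matched for the current request, and then call it while presenting the Googlebot useragent – for example with Firefox’s user agent switcher, as mentioned below.

<%@ WebHandler Language="C#" Class="BrowserCapsReport" %>

using System.Web;

// Hypothetical diagnostic handler: reports the browser capabilities ASP.NET
// resolved for the incoming useragent. With the Mozilla/1.0 fallback in play,
// a Googlebot useragent shows up here with very low capabilities.
public class BrowserCapsReport : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        HttpBrowserCapabilities caps = context.Request.Browser;
        context.Response.ContentType = "text/plain";
        context.Response.Write("User agent: " + context.Request.UserAgent + "\n");
        context.Response.Write("Browser:    " + caps.Browser + "\n");
        context.Response.Write("Version:    " + caps.Version + "\n");
        context.Response.Write("TagWriter:  " + caps.TagWriter + "\n");
        context.Response.Write("ECMAScript: " + caps.EcmaScriptVersion + "\n");
    }

    public bool IsReusable
    {
        get { return true; }
    }
}

With the fallback in effect, a Googlebot useragent reports a low version and limited capabilities here; after the fix described below, it should report version 5 and the richer capability set.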

The upshot: some ASP.NET 2.0 websites, including EPiServer sites with friendly URL rewriting, don’t get crawled correctly by Google (as indicated by the large number of Server 500 error pages in the logs). As a result, rankings drop within the Google search engine because the Googlebot can’t access the pages correctly.

This issue doesn’t just apply to EPiServer sites using URL rewriting; it applies more widely to all .NET 2.0 websites that rely on functionality that is not compatible with a Mozilla/1.0 browser. It may or may not also occur in ASP.NET 3.x, depending on whether the browser detection functions know how to handle Mozilla/5.0 (to test compatibility, use Firefox’s user agent switching capabilities, as in the diagnostic sketch above).

As for the other search engines and their robots: Yahoo’s Slurp was always identified as Mozilla/1.0, so the problem simply went unnoticed there, while the Microsoft search robot was, unsurprisingly, always detected correctly by Microsoft’s own .NET 2.0 browser detection process.

So is this an issue to worry about? We believe it is. While many people aren’t concerned about their search rankings, believing them to be either a technical mystery or of no relevance to their business, the reality is that without the help of Google and the other search engines, your website is not being found by as many people as it could be. For those of us who have invested significant time, effort and funds into our sites, we want to maximise the return on that investment.

The Resolution

Thankfully, there is a resolution to this issue that can be applied immediately, is simple to implement, is non-destructive, and doesn’t require your code to be recompiled.

Broadly, we want to add a generic Mozilla/5.0 useragent definition to the detection process. This should be treated as an interim fix, however, until the browser detection pattern correctly recognises a Mozilla/5.0 useragent. (That said, the issue has existed since March 2006, so it’s debatable whether a permanent fix will ever appear.)

To fix this issue, add an “/App_Browsers” directory in the root of your site, and add a “genericmozilla5.browser” file to this directory that contains the following:

<browsers>
  <browser id="GenericMozilla5" parentID="Mozilla">
    <identification>
      <userAgent match="Mozilla/5\.(?'minor'\d+).*[C|c]ompatible; ?(?'browser'.+); ?\+?(http://.+)\)" />
    </identification>
    <capabilities>
      <capability name="majorversion" value="5" />
      <capability name="minorversion" value="${minor}" />
      <capability name="browser" value="${browser}" />
      <capability name="Version" value="5.${minor}" />
      <capability name="activexcontrols" value="true" />
      <capability name="backgroundsounds" value="true" />
      <capability name="cookies" value="true" />
      <capability name="css1" value="true" />
      <capability name="css2" value="true" />
      <capability name="ecmascriptversion" value="1.2" />
      <capability name="frames" value="true" />
      <capability name="javaapplets" value="true" />
      <capability name="javascript" value="true" />
      <capability name="jscriptversion" value="5.0" />
      <capability name="supportsCallback" value="true" />
      <capability name="supportsFileUpload" value="true" />
      <capability name="supportsMultilineTextBoxDisplay" value="true" />
      <capability name="supportsMaintainScrollPositionOnPostback" value="true" />
      <capability name="supportsVCard" value="true" />
      <capability name="supportsXmlHttp" value="true" />
      <capability name="tables" value="true" />
      <capability name="vbscript" value="true" />
      <capability name="w3cdomversion" value="1.0" />
      <capability name="xml" value="true" />
      <capability name="tagwriter" value="System.Web.UI.HtmlTextWriter" />
    </capabilities>
  </browser>
</browsers>
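
ASP.NET picks up .browser files placed in the App_Browsers folder automatically, which is why no recompilation of your own code is needed. Once the file is deployed, requesting a page while presenting the Googlebot useragent (as in the earlier sketches) should show the request matching the new GenericMozilla5 definition, with the higher capabilities listed above, rather than falling back to Mozilla/1.0.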

Within a day of applying the fix, we were able to confirm through Google that the links within our site were being indexed correctly, and Yahoo was also able to index the site once again.

Posted by: Ashley Petherick | 12 January 2009

Tags: Search, Google

