Doing battle with Mambo and Google Sitemaps
I think I’ve just managed to do something that has been baffling me for 2 or 3 days now: getting Google Sitemaps to verify a Mambo-based web site that I’m building for a friend. The problem was that we were using the SEF (search engine friendly) component in Mambo to deliver nice, static-looking URLs in preference to the dynamic type that Mambo produces. An example:
http://www.theopenconsultinggroup.com/content/view/29/51/ -
This is a page of content from the site after SEF has done its business.
http://www.theopenconsultinggroup.com/index.php?option=com_content&task=view&id=29&Itemid=51 -
This is the same page as published by Mambo without SEF enabled.
As you can see, there’s a big difference, and Google notices the difference. From my reading, the former URL is far preferable to GoogleBot since it’s not plagued by question marks and ampersands, so that’s what we’ll provide. Enable SEF in the Global Configuration, rename htaccess.txt to .htaccess and all URLs are rewritten nicely.
Is that the end of it? Not really. The home page for the OCG is pretty static, with a JavaScript menu as the top menu. This is generated from the database dynamically and it works well - as you add pages, they appear without having to change anything as long as you assign them a place in the menu. The problem with this is that Google (and, it seems, most search engines) don’t like JavaScript links - they prefer a nice standard text link. I could change the system, and might still do, to a CSS-based template, but in the meantime I thought I’d suggest a sitemap to Google to see if it would trawl it.
Having signed in to Google Sitemaps with my regular Google account, I added the Open Consulting Group as a project and used the filename supplied to create an empty file in the root of the web site. Then I tried to verify my access with Google Sitemaps. Nothing. I left it for a few days to see if anything would happen, but the GoogleBot just wasn’t verifying. Crapola.
Finally tonight I had a play with the .htaccess file in the Mambo root directory. I commented out the rewriting conditions and rules temporarily and ran a verify with Google again, and instantly we were verified - still not being crawled (yet) by the bot, but at least we’re in the picture. I uncommented the conditions and rules after verification and we’re back to where we were. The problem was with the SEF feature of always returning some content, even if the URL you requested is invalid. As an example, here’s the bot looking for my unique file to verify that it’s my web site…
66.249.72.45 - - [03/Jan/2006:21:41:30 +0100] "HEAD /google66b6xxxxxxxxx51x.html HTTP/1.1" 200 0 www.theopenconsultinggroup.com "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
So far, that’s good. It sees that the file is there. Now comes trouble.
66.249.72.45 - - [03/Jan/2006:21:41:32 +0100] "HEAD /GOOGLE404probeddb8db1a52266114.html HTTP/1.1" 200 0 www.theopenconsultinggroup.com "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
Two seconds later it tried to retrieve what must be an invalid URL from the site, but it got a status of 200 - OK. The SEF function is rewriting the request to the homepage and Apache is serving something up - this is a no-no in terms of the verification process. Google knows that there isn’t a web page with that name, so it expects a 404; the fact that a page is being returned anyway (due to SEF) stopped the verification.
I disabled the rewriting by commenting out the following three lines in the .htaccess file:
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*) index.php
These state that “for filenames or directories that don’t exist, just go to index.php” (in a nutshell). Commenting these out, then re-verifying, ensured that the Sitemaps verification went through (for now).
So therein lies the problem. How do I get Google to verify my right to have a sitemap on the OCG homepage, yet still be able to run SEF within Mambo? I suspect I need a little mod_rewrite voodoo; I just need to figure it out or invoke the Lazyweb.
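For what it’s worth, here is an untested sketch of the kind of voodoo I have in mind. It adds one extra condition in front of the three lines above, so that requests for Google’s probe file are left alone and fall through to Apache’s normal 404. It assumes the probe name always starts with GOOGLE404probe, as it does in the log line above; that prefix is a guess based on a single request, not something I have seen documented.

```apache
# Assumption: Google's "does this site return real 404s?" probe always
# starts with /GOOGLE404probe, as seen in the access log.
# Skip the SEF rewrite for those requests so Apache answers 404.
RewriteCond %{REQUEST_URI} !^/GOOGLE404probe
# Unchanged from the stock rules: only rewrite when the request is
# neither an existing file nor an existing directory.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*) index.php
```

The verification file itself needs no special treatment, because it exists on disk and the !-f condition already keeps it out of the rewrite. Every other invalid URL would still be sent to index.php, so this only papers over the verification check rather than making SEF return proper 404s.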