Hello everyone out there,
here’s a quick one.
I noticed that Google wouldn’t index pages recently published by our OpenCms installation. We had supplied a proper sitemap file (auto-generated by OpenCms), but Google’s Webmaster Tools prominently reported an error concerning robots.txt: “Google couldn’t crawl your site because we were unable to access the robots.txt file.”
“Fetch as Google”, also part of Webmaster Tools, displayed the page properly, and the httpd access log clearly showed that Googlebot was requesting robots.txt regularly.
But: this is an OpenCms installation. We don’t want OpenCms to render purely static content dynamically on every request, and robots.txt clearly is a static page. So we had activated static export for that file, too, which turned out to be the root cause of all this trouble.
OpenCms responds to requests for such pages with an HTTP 302 redirect (“moved temporarily”), pointing the requester to something like http://your.site/export/sites/your.site/robots.txt. Google follows that redirect, as the logs and “Fetch as Google” prove.
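If you want to see this for yourself, a quick check along these lines will show the redirect (a minimal sketch, using Python’s requests library; the URL is a placeholder for your own site):

```python
import requests

# Placeholder URL -- replace with your own site's robots.txt.
url = "http://your.site/robots.txt"

# Don't follow redirects, so we see exactly what OpenCms answers.
response = requests.get(url, allow_redirects=False)

print(response.status_code)               # 302 while static export is active
print(response.headers.get("Location"))   # the /export/... path OpenCms points to
```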
Unfortunately, Google treats this as an inaccessible page, which, in the case of robots.txt, blocks all crawling of the site. (Google’s decision itself, not to crawl a site if the owner’s intent cannot be clearly determined, is correct in my opinion. But we’re delivering a syntactically correct file via an unambiguous redirect; in my opinion, Google should accept that file, too.)
I’ve since changed the “export” property of our robots.txt to “false”, and now everything is back in order. At least from Google’s point of view.
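After the change, the same kind of check (again with a placeholder URL) should show the file being served directly with a plain 200 response:

```python
import requests

url = "http://your.site/robots.txt"  # placeholder URL

response = requests.get(url, allow_redirects=False)

# With the "export" property set to "false", OpenCms serves the file
# itself instead of redirecting to the /export/ path.
assert response.status_code == 200
print(response.text)  # the robots.txt contents
```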