I installed ht://Dig 3.2.0b5 on one of my Solaris 10 servers.
When I ran htdig on the server, it did not appear to be indexing
my website. I used
/usr/local/bin/rundig -s -c
/usr/local/conf/htdig_support.conf to see statistics on what
it was doing. It was opening only one connection and making just two HTTP
requests rather than indexing the whole site.
# /usr/local/bin/rundig -s -c /usr/local/conf/htdig_support.conf
htdig: Run complete
htdig: 1 server seen:
htdig: www.example.com:80 0 documents
HTTP statistics
===============
Persistent connections : Yes
HEAD call before GET : Yes
Connections opened : 1
Connections closed : 1
Changes of server : 0
HTTP Requests : 2
HTTP KBytes requested : 0.460938
HTTP Average request time : 0 secs
HTTP Average speed : Inf KBytes/secs
htpurge: Database is empty!
So I tried running htdig instead of rundig and used the
-vvv option to see more details.
# /usr/local/bin/htdig -vvv -c /usr/local/conf/htdig_support.conf
ht://dig Start Time: Sun Oct 22 19:56:19 2006
1:1:http://www.example.com/
New server: www.example.com, 80
- Persistent connections: enabled
- HEAD before GET: enabled
- Timeout: 30
- Connection space: 0
- Max Documents: -1
- TCP retries: 1
- TCP wait time: 5
- Accept-Language:
Trying to retrieve robots.txt file
Making HTTP request on http://www.example.com/robots.txt
Header line: HTTP/1.1 200 OK
Header line: Date: Sun, 22 Oct 2006 23:56:19 GMT
Header line: Server: Apache/2.0.55 (Unix) DAV/2
Header line: Last-Modified: Fri, 31 Mar 2006 17:59:25 GMT
Header line: ETag: "1e9d-1a-3389b940"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 26
Header line: Content-Type: text/plain
Request time: 0 secs
Header line: HTTP/1.1 200 OK
Header line: Date: Sun, 22 Oct 2006 23:56:19 GMT
Header line: Server: Apache/2.0.55 (Unix) DAV/2
Header line: Last-Modified: Fri, 31 Mar 2006 17:59:25 GMT
Header line: ETag: "1e9d-1a-3389b940"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 26
Header line: Content-Type: text/plain
Request time: 0 secs
Parsing robots.txt file using myname = htdig
Robots.txt line: User-agent: *
Found 'user-agent' line: *
Robots.txt line: Disallow:
Found 'disallow' line:
Pattern: pushed
Rejected: forbidden by server robots.txt!
pick: www.example.com, # servers = 1
> www.example.com supports HTTP persistent connections (infinite)
ht://dig End Time: Sun Oct 22 19:56:19 2006
It was apparently finding the word "disallow" in the robots.txt file and then stopping. The contents of robots.txt are shown below:
User-agent: *
Disallow:
Though the word "disallow" was there, there was nothing specified after it.
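By the usual robots.txt convention, a "Disallow:" line with nothing after it means that nothing is disallowed. As a rough cross-check that does not involve ht://Dig at all, the sketch below feeds the same two lines to the robots.txt parser in Python's standard library. The helper name `allowed` and the constants are mine, chosen only for illustration, and Python's parser is a different implementation from the one in ht://Dig, so this shows only what behavior I expected, not why htdig behaved differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if Python's stdlib parser lets `agent` fetch `url`.

    Hypothetical helper for illustration; this is not ht://Dig's parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The robots.txt from my server: an empty Disallow value.
ORIGINAL = "User-agent: *\nDisallow:\n"
# For contrast, a file that really does forbid everything.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

empty_ok = allowed(ORIGINAL, "htdig", "http://www.example.com/")
slash_ok = allowed(BLOCK_ALL, "htdig", "http://www.example.com/")

print("empty Disallow permits /:", empty_ok)
print("'Disallow: /' permits /:", slash_ok)
```

With this parser, only the "Disallow: /" file blocks the top-level page, which is the behavior I expected from htdig as well.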
I removed that line and tried
htdig -vvv again, but again
I saw a message indicating that htdig was stopping because of robots.txt:
Parsing robots.txt file using myname = htdig
Robots.txt line: User-agent: *
Found 'user-agent' line: *
Pattern: pushed
Rejected: forbidden by server robots.txt!
I then changed the
User-agent: * line in robots.txt to
User-agent: htdig to see if that would help. It didn't.
However, when I changed the line back to
User-agent: * and
put the Disallow line back in, this time with a directory specified after it,
htdig indexed my site when run.
I next tried reverting to the original robots.txt with the addition
of lines just for htdig:
User-agent: *
Disallow:
# ht://Dig 3.2.0b5 is failing when there is nothing specified after "Disallow"
User-agent: htdig
Disallow: /abcde12345
That too allowed htdig to work. Since it is unlikely I will ever use abcde12345 in a file or directory name, I left robots.txt with those lines. I could have just put the "/abcde12345" after the first "Disallow", though, since I'm not really worried that other robots might not index a directory in which I used that name.
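To convince myself that the workaround file does not shut out anything I care about, the same kind of check can be run against it. This is only a sketch using Python's standard-library robots.txt parser, not ht://Dig's own code. The line layout of the file below, the bot name "SomeOtherBot", and the page.html path are assumptions made up for the test.

```python
from urllib.robotparser import RobotFileParser

# The workaround robots.txt, with the comment on its own line (assumed layout).
WORKAROUND = """\
User-agent: *
Disallow:
# ht://Dig 3.2.0b5 is failing when there is nothing specified after "Disallow"
User-agent: htdig
Disallow: /abcde12345
"""

parser = RobotFileParser()
parser.parse(WORKAROUND.splitlines())

# htdig may fetch the home page, but not the dummy directory.
home = parser.can_fetch("htdig", "http://www.example.com/")
dummy = parser.can_fetch("htdig", "http://www.example.com/abcde12345/page.html")
# Any other robot falls under the "*" record, where nothing is disallowed.
other = parser.can_fetch("SomeOtherBot", "http://www.example.com/abcde12345/page.html")

print("htdig may fetch /:", home)
print("htdig may fetch /abcde12345/page.html:", dummy)
print("SomeOtherBot may fetch /abcde12345/page.html:", other)
```

Under this parser the htdig record only blocks the dummy path, and other robots are unaffected, which matches what I wanted from the workaround.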
When I searched the web with Google, I found others who had experienced similar problems with htdig indexes of their websites. I have ht://Dig installed on another server with the same robots.txt file - in fact I copied the robots.txt file from that other server - without any problem. The other system is a Red Hat Linux 9 system with the following htdig RPM installed.
Version : 3.2.0 Vendor: Red Hat, Inc.
Release : 16.20021103
I don't know why the problem occurred on the Solaris server with ht://Dig 3.2.0b5, since I should be able to have just "Disallow:" on the line in robots.txt. But since I have a workaround and need to move on to other problems, I will just leave the workaround in place.
Created: Sunday October 22, 2006 10:04 PM