I installed ht://Dig 3.2.0b5 on one of my Solaris 10 servers.
When I ran htdig on the server, it did not appear to be indexing
my website. I used /usr/local/bin/rundig -s -c /usr/local/conf/htdig_support.conf
to see statistics on what it was doing. It was opening only one connection and
making just two HTTP requests rather than indexing the whole site.
# /usr/local/bin/rundig -s -c /usr/local/conf/htdig_support.conf
htdig: Run complete
htdig: 1 server seen:
htdig: www.example.com:80 0 documents
HTTP statistics
===============
Persistent connections : Yes
HEAD call before GET : Yes
Connections opened : 1
Connections closed : 1
Changes of server : 0
HTTP Requests : 2
HTTP KBytes requested : 0.460938
HTTP Average request time : 0 secs
HTTP Average speed : Inf KBytes/secs
htpurge: Database is empty!
So I tried running htdig instead of rundig and used the -vvv option to see more details.
# /usr/local/bin/htdig -vvv -c /usr/local/conf/htdig_support.conf
ht://dig Start Time: Sun Oct 22 19:56:19 2006
1:1:http://www.example.com/
New server: www.example.com, 80
- Persistent connections: enabled
- HEAD before GET: enabled
- Timeout: 30
- Connection space: 0
- Max Documents: -1
- TCP retries: 1
- TCP wait time: 5
- Accept-Language:
Trying to retrieve robots.txt file
Making HTTP request on http://www.example.com/robots.txt
Header line: HTTP/1.1 200 OK
Header line: Date: Sun, 22 Oct 2006 23:56:19 GMT
Header line: Server: Apache/2.0.55 (Unix) DAV/2
Header line: Last-Modified: Fri, 31 Mar 2006 17:59:25 GMT
Header line: ETag: "1e9d-1a-3389b940"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 26
Header line: Content-Type: text/plain
Request time: 0 secs
Header line: HTTP/1.1 200 OK
Header line: Date: Sun, 22 Oct 2006 23:56:19 GMT
Header line: Server: Apache/2.0.55 (Unix) DAV/2
Header line: Last-Modified: Fri, 31 Mar 2006 17:59:25 GMT
Header line: ETag: "1e9d-1a-3389b940"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 26
Header line: Content-Type: text/plain
Request time: 0 secs
Parsing robots.txt file using myname = htdig
Robots.txt line: User-agent: *
Found 'user-agent' line: *
Robots.txt line: Disallow:
Found 'disallow' line:
Pattern:
pushed
Rejected: forbidden by server robots.txt!
pick: www.example.com, # servers = 1
> www.example.com supports HTTP persistent connections (infinite)
ht://dig End Time: Sun Oct 22 19:56:19 2006
It was apparently finding the word "disallow" in the robots.txt file and then stopping. The contents of robots.txt are shown below:
User-agent: *
Disallow:
Though the word "Disallow" was there, nothing was specified after it, and an empty Disallow value in the robots exclusion standard means nothing is disallowed, so the whole site should have been crawlable.
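As a quick cross-check of that interpretation, here is a small sketch using Python's urllib.robotparser (a standards-conforming parser, not htdig's own code), run on any machine with a current Python 3; the hostname is just the placeholder used above:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",   # empty value: nothing is disallowed
])

# Both should print True, i.e. a conforming crawler may fetch anything on the site.
print(rp.can_fetch("htdig", "http://www.example.com/"))
print(rp.can_fetch("htdig", "http://www.example.com/main/index.html"))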
I removed that line and ran htdig -vvv again, but I still saw a message
indicating that htdig was stopping because robots.txt was disallowing it.
Parsing robots.txt file using myname = htdig
Robots.txt line: User-agent: *
Found 'user-agent' line: *
Pattern:
pushed
Rejected: forbidden by server robots.txt!
I then changed the User-agent: * line in robots.txt to User-agent: htdig to see if
that would help. It didn't. However, when I changed the line back to User-agent: *
and put the Disallow line back in, but this time specified a directory to disallow,
htdig indexed my site when run. E.g., when I used Disallow: /main, it worked. I next
tried reverting to the original robots.txt with the addition of lines just for htdig:
User-agent: *
Disallow:
# ht://Dig 3.2.0b5 is failing when there is nothing specified after "Disallow"
User-agent: htdig
Disallow: /abcde12345
That too allowed htdig to work. Since it is unlikely I will ever use abcde12345 in a file or directory name, I left robots.txt with those lines. I could have just put "/abcde12345" after the first "Disallow", though, since I'm not really worried that other robots might skip a directory with that name.
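As a sanity check on the workaround (again with Python's urllib.robotparser rather than htdig's own parser, and with the comment line omitted), those rules should let htdig fetch everything except the dummy /abcde12345 path while leaving other robots unrestricted:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
    "User-agent: htdig",
    "Disallow: /abcde12345",
])

# htdig may fetch normal pages but not the dummy path ...
print(rp.can_fetch("htdig", "http://www.example.com/main/index.html"))    # True
print(rp.can_fetch("htdig", "http://www.example.com/abcde12345/x.html"))  # False
# ... and any other robot may fetch everything.
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/abcde12345/x.html"))  # True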
When I searched the web with Google, I found others who had experienced similar problems with htdig indexing their websites. I have ht://Dig installed on another server with the same robots.txt file - in fact, I copied the robots.txt file from that other server - and it indexes that site without any problem. The other system is a Red Hat Linux 9 system with the following htdig RPM installed:
Version : 3.2.0 Vendor: Red Hat, Inc.
Release : 16.20021103
I don't know why the problem occurred on the Solaris server with ht://Dig 3.2.0b5, since an empty "Disallow:" line in robots.txt should be valid, but since I have a workaround and need to move on to other problems, I will just leave the workaround in place.
Created: Sunday October 22, 2006 10:04 PM