htdig Not Indexing Site

I installed ht://Dig 3.2.0b5 on one of my Solaris 10 servers. When I ran htdig on the server, it did not appear to be indexing my website. I used /usr/local/bin/rundig -s -c /usr/local/conf/htdig_support.conf to see statistics on what it was doing. It was opening only one connection and making just two HTTP requests rather than indexing the whole site.


# /usr/local/bin/rundig -s -c /usr/local/conf/htdig_support.conf
htdig: Run complete
htdig: 1 server seen:
htdig:     www.example.com:80 0 documents

HTTP statistics
===============
 Persistent connections    : Yes
 HEAD call before GET      : Yes
 Connections opened        : 1
 Connections closed        : 1
 Changes of server         : 0
 HTTP Requests             : 2
 HTTP KBytes requested     : 0.460938
 HTTP Average request time : 0 secs
 HTTP Average speed        : Inf KBytes/secs

htpurge: Database is empty!

So I tried running htdig instead of rundig and used the -vvv option to see more details.


# /usr/local/bin/htdig -vvv -c /usr/local/conf/htdig_support.conf
ht://dig Start Time: Sun Oct 22 19:56:19 2006
        1:1:http://www.example.com/
New server: www.example.com, 80
 - Persistent connections: enabled
 - HEAD before GET: enabled
 - Timeout: 30
 - Connection space: 0
 - Max Documents: -1
 - TCP retries: 1
 - TCP wait time: 5
 - Accept-Language:
Trying to retrieve robots.txt file
Making HTTP request on http://www.example.com/robots.txt
Header line: HTTP/1.1 200 OK
Header line: Date: Sun, 22 Oct 2006 23:56:19 GMT
Header line: Server: Apache/2.0.55 (Unix) DAV/2
Header line: Last-Modified: Fri, 31 Mar 2006 17:59:25 GMT
Header line: ETag: "1e9d-1a-3389b940"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 26
Header line: Content-Type: text/plain
Request time: 0 secs
Header line: HTTP/1.1 200 OK
Header line: Date: Sun, 22 Oct 2006 23:56:19 GMT
Header line: Server: Apache/2.0.55 (Unix) DAV/2
Header line: Last-Modified: Fri, 31 Mar 2006 17:59:25 GMT
Header line: ETag: "1e9d-1a-3389b940"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 26
Header line: Content-Type: text/plain
Request time: 0 secs
Parsing robots.txt file using myname = htdig
Robots.txt line: User-agent: *
Found 'user-agent' line: *
Robots.txt line: Disallow:
Found 'disallow' line:
Pattern:
 pushed
   Rejected: forbidden by server robots.txt!
pick: www.example.com, # servers = 1
> www.example.com supports HTTP persistent connections (infinite)
ht://dig End Time: Sun Oct 22 19:56:19 2006

It was apparently finding the word "disallow" in the robots.txt file and then stopping. The contents of robots.txt are shown below:


User-agent: *
Disallow:

Though the word "disallow" was there, nothing was specified after it, and under the robots exclusion standard an empty Disallow value excludes nothing, so the file should have permitted indexing of the entire site. I removed that line and tried htdig -vvv again, but again I saw a message indicating that htdig was stopping because robots.txt was disallowing it.


Parsing robots.txt file using myname = htdig
Robots.txt line: User-agent: *
Found 'user-agent' line: *
Pattern:
 pushed
   Rejected: forbidden by server robots.txt!
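
For comparison, a standards-compliant parser treats an empty Disallow as excluding nothing. The sketch below uses Python's standard-library robotparser module purely as a reference implementation of the standard - it is not the parser ht://Dig uses - against the same two-line robots.txt:


from urllib import robotparser

# The same robots.txt that ht://Dig 3.2.0b5 balked at.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
])

# An empty Disallow value excludes nothing, so any URL should be
# crawlable, including by htdig.
print(rp.can_fetch("htdig", "http://www.example.com/"))       # True
print(rp.can_fetch("htdig", "http://www.example.com/main/"))  # True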

I then changed the User-agent: * line in robots.txt to User-agent: htdig to see if that would help. It didn't. However, when I changed the line back to User-agent: * and put the Disallow line back with a directory specified after it, htdig did index my site when run. E.g., when I used Disallow: /main, it worked. I next tried reverting to the original robots.txt with the addition of lines just for htdig:


User-agent: *
Disallow:

#  ht://Dig 3.2.0b5 is failing when there is nothing specified after "Disallow"
User-agent: htdig
Disallow: /abcde12345

That too allowed htdig to work. Since it is unlikely I will ever use abcde12345 in a file or directory name, I left robots.txt with those lines. I could simply have put "/abcde12345" after the first "Disallow", though, since I'm not really worried about other robots declining to index a directory with that name.
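
As a sanity check on the workaround file itself, the same reference robotparser (again, Python's standard library, not ht://Dig's own parser) confirms that htdig is barred only from the dummy path while other robots remain unrestricted:


from urllib import robotparser

# The workaround robots.txt left in place on the server (comment
# line omitted for brevity).
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
    "",
    "User-agent: htdig",
    "Disallow: /abcde12345",
])

# htdig matches its specific User-agent group, so only the dummy
# path is excluded.
print(rp.can_fetch("htdig", "http://www.example.com/"))             # True
print(rp.can_fetch("htdig", "http://www.example.com/abcde12345/"))  # False

# Other robots match the "*" group, which disallows nothing.
print(rp.can_fetch("Googlebot", "http://www.example.com/main/"))    # True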

When I searched the web with Google, I found others who had experienced similar problems with htdig indexing their websites. I have ht://Dig installed on another server with the same robots.txt file - in fact, I copied the robots.txt file from that other server - without any problem there. The other system is a Red Hat Linux 9 system with the following htdig RPM installed:


Version     : 3.2.0                             Vendor: Red Hat, Inc.
Release     : 16.20021103

I don't know why the problem occurred on the Solaris server with ht://Dig 3.2.0b5, since I should be able to have just "Disallow:" on a line by itself in robots.txt, but since I have a workaround and need to move on to other problems, I will just leave the workaround in place.
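
Should the problem crop up again after a future robots.txt change, a quick way to check what a standards-compliant parser makes of the live file (using the same placeholder hostname as above) is:


from urllib import robotparser

# Fetch the live robots.txt and test whether htdig may crawl the
# site root.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()
print(rp.can_fetch("htdig", "http://www.example.com/"))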

References:

  1. [htdig] Problem with finding server
    By Wolfgang Winkler
    Wed, 25 Jan 2006 01:58:06
  2. Site Administration: Understanding Robots and the Robot-Exclusion Standard
    By Shelley Powers
  3. A Standard for Robot Exclusion
    By Martijn Koster
  4. Setting up ht://Dig on a Solaris System
    By Jim Cameron
    October 22, 2006

 


Created: Sunday October 22, 2006 10:04 PM