Setting up ht://Dig

ht://Dig can be used to make the contents of your website searchable via a form. With it, visitors can enter search terms, much as they would with a search engine such as Google, but the search is restricted to your site. You can do the same with Google by putting site: followed by the site name in the search field, e.g. site:mysite.com, but the results will only be as current as the last Google update for your site, which could be many weeks old. With htdig, on the other hand, you can update the search database for your site as often as you wish, e.g. every night, or you can run htdig manually after significant changes to ensure the index is current.

The instructions below assume you've already downloaded and installed htdig. If you have a Red Hat Linux system, you can check whether the htdig package is installed with the rpm command by using rpm -qi htdig.
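
A quick check along the following lines can confirm the package is present before going further. This is only a minimal sketch that assumes an RPM-based system; check_htdig is a made-up helper name, and rpm -q is used rather than rpm -qi because it prints just the package name and version.

```shell
#!/bin/sh
# Report whether the htdig package is installed (RPM-based systems only).
# check_htdig is a hypothetical helper name, not part of ht://Dig.
check_htdig() {
    if ! command -v rpm >/dev/null 2>&1; then
        echo "rpm not available; check with your system's package manager"
        return 2
    fi
    if rpm -q htdig >/dev/null 2>&1; then
        echo "htdig is installed: $(rpm -q htdig)"
    else
        echo "htdig is not installed"
        return 1
    fi
}

check_htdig || true
```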

The instructions that follow are for setting up the software on a server that supports multiple websites and assume you have root access on that server.

  1. Set up a separate directory structure to support multiple websites. Since I have a "support" site, I created a directory called "support" under the main directory where htdig stores its databases.

    mkdir /var/lib/htdig/support

  2. If needed, put the htsearch executable in the directory where you store CGI executable files, e.g. cgi-bin. If you are using the default cgi-bin directory, it may already be there. You can check on a Linux system with the locate command.

    # locate htsearch
    /var/www/cgi-bin/htsearch
    /usr/bin/htsearch

    I can see from the above output that the program is already in the default cgi-bin directory. In my case, I have a separate cgi-bin directory for the website, so I copied the htsearch file to it as well.

  3. I want a separate htdig configuration file for my website, rather than the default configuration file that covers all websites, so I copy the default file to a site-specific file and then edit the site-specific file.

    cp /etc/htdig.conf /etc/htdig_support.conf

  4. In the htdig_support.conf file, I need to change the location of the database directory where htdig will store the information about the site's content to the subdirectory I created for the support site.
    
    database_dir:           /var/lib/htdig/support
    
  5. I also need to specify a common directory where htsearch will look for the unique search.html, header.html, footer.html, nomatch.html, syntax.html, and wrapper.html template files, which I can customize to match the layout of this particular site, if I wish.

    common_dir:              /usr/share/htdig/support

  6. I also need to change the start_url value to point to the appropriate website.

    start_url:              http://support.moonpoint.com

  7. I want to exclude a few other file types from being parsed by htdig, so I added .cab, .png, and .rar to the bad_extensions list, which now looks like the one below. You can add a "\" at the end of a line to continue the list on the next line.
    
    bad_extensions:         .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
            .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css \
            .cab .png .rar
    
  8. The config file can then be saved and I can create the template directory specified in it.

    mkdir /usr/share/htdig/support

  9. I then copy the default template and graphics files to that directory.

    cp /usr/share/htdig/* /usr/share/htdig/support

  10. I can then edit the search.html, header.html, footer.html, and nomatch.html files to match the layout of the site, if I desire.

    vi /usr/share/htdig/support/search.html
    vi /usr/share/htdig/support/header.html
    vi /usr/share/htdig/support/footer.html
    vi /usr/share/htdig/support/nomatch.html

  11. If you need to, you can edit the search form in search.html to point to the htsearch program on your site, e.g. point it to your cgi-bin directory, if it isn't in the default cgi-bin location.

    <form method="post" action="/cgi-bin/htsearch">

  12. You also need to edit the line in search.html that tells htsearch which configuration file to use. By default it will use htdig.conf, but since I created a separate htdig_support.conf file for the website, I need to tell htsearch to use it.

    <input type="hidden" name="config" value="htdig_support">

  13. I can then execute the first run of rundig to generate the database of the content on my site.

    rundig -c /etc/htdig_support.conf

    If you see the following then you may need to edit /usr/bin/rundig, which is a BASH script.

    # rundig -c /etc/htdig_support.conf
    htfuzzy: Unable to open word database /var/lib/htdig/db.words.db

    htfuzzy: Unable to open word database /var/lib/htdig/db.words.db

    My Red Hat Linux 9 system had version 3.2.0 release 16.20021103 of the htdig package. When I got the error messages above, it appeared that the -c /etc/htdig_support.conf option wasn't getting passed to the htfuzzy program. When I looked in the rundig script I saw the following two lines at the end of the file.

    /usr/bin/htfuzzy $verbose metaphone
    /usr/bin/htfuzzy $verbose soundex

    But the $verbose variable wasn't defined anywhere in the script. Instead, those two lines should have used the $opts variable, as other lines in the script do, e.g. the following two lines.

    $BINDIR/htfuzzy $opts endings
    $BINDIR/htfuzzy $opts synonyms

    The $opts variable refers to options passed to rundig on the command line.

    I've included the original copy of rundig and the corrected copy. When I ran rundig -c /etc/htdig_support.conf after making the corrections, I didn't receive any error messages.

    Another useful option to use when running rundig is the -s option, which will generate statistics about the process and will tell you if you have bad links to files on your site among your webpages as shown in the output below. The "Not found" lines indicate the bad links.

    
    # rundig -s -c /etc/htdig_support.conf
    htdig: Run complete
    htdig: 1 server seen:
    htdig:     support.moonpoint.com:80 496 documents
    
    htdig: Errors to take note of:
    Not found: http://support.moonpoint.com/pc/hardware/dell/4700/4700-drivers.html Ref: http://support.moonpoint.com/blog/blosxom
    Not found: http://support.moonpoint.com/downloads/windows/utilities/Backup/ Ref: http://support.moonpoint.com/downloads/windows/utilities/
    Not found: http://support.moonpoint.com/downloads/linux/redhat/9/i386/RPMS/clamav-0.83-1.0.html Ref: http://support.moonpoint.com/downloads/linux/redhat/9/i386/RPMS/clamd-0.83-1.0.html
    
    HTTP statistics
    ===============
     Persistent connections    : Yes
     HEAD call before GET      : No
     Connections opened        : 533
     Connections closed        : 533
     Changes of server         : 0
     HTTP Requests             : 533
     HTTP KBytes requested     : 573.693
     HTTP Average request time : 0.861163 secs
     HTTP Average speed        : 1.24988 KBytes/secs
    

    From the output above, I can see that I have three bad links that I need to correct on the site.

  14. After the rundig command has completed, I can test my site by placing a link to the search form, search.html, on my site. You can also run the htsearch program from the command line, e.g. htsearch -c /etc/htdig_support.conf. You will then be prompted for the words to search for and the format. If I enter "Dimension 4700" for the search words and just hit Enter at the format prompt, I will see all the matches htsearch finds.

    # htsearch -c /etc/htdig_support.conf
    Enter value for words: Dimension 4700
    Enter value for format:

  15. To have rundig run every night, you can set it up as a cron job. If you use the -a option, rundig builds the new databases in alternate work files and puts them in place of the old ones when it is finished. That allows people to keep searching while the databases are being rebuilt. Since I am using /etc/htdig_support.conf for my configuration file, I put the following line in the crontab file (you can edit the crontab file with crontab -e, if you are familiar with how to use the vi editor).

    30 1 * * * /usr/bin/rundig -c /etc/htdig_support.conf -a >>/var/log/htdig 2>&1

    That sets rundig to run at 1:30 every morning. The output, if any, will be appended to /var/log/htdig, and the 2>&1 sends any error messages to that file as well.
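
Steps 1 through 9 and the rundig fix from step 13 can be collected into a small script. What follows is only a sketch under stated assumptions, not part of ht://Dig: set_attr, make_site_conf, and patch_rundig are hypothetical helper names, the default paths are the ones used on my Red Hat system, and the script assumes each attribute sits on a single line of the configuration file. Step 7's bad_extensions change spans several lines, so it is left as a manual edit.

```shell
#!/bin/sh
# Sketch of steps 1-9 and the rundig fix from step 13.
# All function names here are hypothetical helpers, not ht://Dig commands.

# set_attr FILE NAME VALUE
# Replace the "NAME:" line in FILE, or append one if it is missing.
set_attr() {
    awk -v name="$2" -v value="$3" '
        index($0, name ":") == 1 { print name ": " value; found = 1; next }
        { print }
        END { if (!found) print name ": " value }
    ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# make_site_conf SITE URL [BASE_CONF] [DB_ROOT] [TEMPLATE_ROOT] [CONF_DIR]
# Create the per-site database and template directories and write
# CONF_DIR/htdig_SITE.conf based on the default configuration file.
make_site_conf() {
    site=$1
    url=$2
    base_conf=${3:-/etc/htdig.conf}
    db_root=${4:-/var/lib/htdig}
    tpl_root=${5:-/usr/share/htdig}
    conf_dir=${6:-/etc}
    new_conf="$conf_dir/htdig_$site.conf"

    mkdir -p "$db_root/$site" "$tpl_root/$site" || return 1

    # Copy the default template and graphics files, skipping directories
    # (including the site directory that was just created).
    for f in "$tpl_root"/*; do
        if [ -f "$f" ]; then
            cp "$f" "$tpl_root/$site/"
        fi
    done

    cp "$base_conf" "$new_conf" || return 1
    set_attr "$new_conf" database_dir "$db_root/$site" &&
    set_attr "$new_conf" common_dir "$tpl_root/$site" &&
    set_attr "$new_conf" start_url "$url"
}

# patch_rundig FILE
# On htfuzzy lines only, replace the undefined $verbose with $opts.
# The unmodified script is kept as FILE.orig.
patch_rundig() {
    cp "$1" "$1.orig" || return 1
    sed '/htfuzzy/s/\$verbose/$opts/' "$1.orig" > "$1"
}

# Example (run as root, using the paths from the steps above):
#   make_site_conf support http://support.moonpoint.com
#   patch_rundig /usr/bin/rundig
```

The optional arguments to make_site_conf make it possible to try the script against a scratch directory before touching /etc or /usr/share/htdig.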


 


Created: July 25, 2005