site: followed by the site name in the search field, e.g. site:mysite.com,
but the content returned will
only be as current as the last Google update for your site, which could be
many weeks old. On the other hand, you can tell htdig to update its search
database for your site as often as you wish, e.g. every night, or you could
manually run htdig after significant changes to your site to ensure you have
a current index for the site.
The instructions below assume you've already downloaded and installed htdig.
If you have a Red Hat Linux system, you can check whether the htdig package
is installed with the rpm command by using rpm -qi htdig.
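That check could be scripted roughly as follows. This is a minimal sketch that assumes an RPM-based system; the function name check_htdig is hypothetical and is not part of htdig.

```shell
#!/bin/sh
# Report whether the htdig package is installed (RPM-based systems only).
# check_htdig is a hypothetical helper name used for illustration.
check_htdig() {
    if ! command -v rpm >/dev/null 2>&1; then
        echo "rpm not available; cannot query the package database"
        return 2
    fi
    if rpm -q htdig >/dev/null 2>&1; then
        echo "htdig is installed"
        return 0
    fi
    echo "htdig is not installed"
    return 1
}

check_htdig || true
```

The return status (0 installed, 1 not installed, 2 no rpm command) lets the function be used in a larger setup script.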
The instructions that follow are for setting up the software on a server that supports multiple websites, and they assume you have root access on that server.
mkdir /var/lib/htdig/support
# locate htsearch
/var/www/cgi-bin/htsearch
/usr/bin/htsearch
cp /etc/htdig.conf /etc/htdig_support.conf
database_dir: /var/lib/htdig/support
common_dir: /usr/share/htdig/support
Change the start_url value to point to the appropriate website.
start_url: http://support.moonpoint.com
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
.jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css \
.cab .png .rar
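The configuration edits above can also be scripted. The sketch below works on a scratch file so it is safe to run anywhere; the sample file contents and the set_attr helper are assumptions made for illustration. On a real system the same substitutions would be applied to /etc/htdig_support.conf.

```shell
#!/bin/sh
# Demonstrate the per-site edits on a scratch file rather than in /etc.
conf=$(mktemp)

# Stand-in for the stock /etc/htdig.conf (contents are illustrative).
cat > "$conf" <<'EOF'
database_dir: /var/lib/htdig
common_dir: /usr/share/htdig
start_url: http://www.example.com/
EOF

# set_attr FILE NAME VALUE: rewrite the line that sets attribute NAME.
# Hypothetical helper; assumes one "name: value" pair per line.
set_attr() {
    sed "s|^$2:.*|$2: $3|" "$1" > "$1.new" && mv "$1.new" "$1"
}

set_attr "$conf" database_dir /var/lib/htdig/support
set_attr "$conf" common_dir   /usr/share/htdig/support
set_attr "$conf" start_url    http://support.moonpoint.com

cat "$conf"
```

Multi-line values such as bad_extensions, which use backslash continuations, are easier to edit by hand than with a one-line sed substitution.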
mkdir /usr/share/htdig/support
cp /usr/share/htdig/* /usr/share/htdig/support
vi /usr/share/htdig/support/search.html
vi /usr/share/htdig/support/header.html
vi /usr/share/htdig/support/footer.html
vi /usr/share/htdig/support/nomatch.html
<form method="post" action="/cgi-bin/htsearch">
<input type="hidden" name="config" value="htdig_support">
rundig -c /etc/htdig_support.conf
If you see the following then you may need to edit /usr/bin/rundig, which is a BASH script.
# rundig -c /etc/htdig_support.conf
htfuzzy: Unable to open word database /var/lib/htdig/db.words.db
htfuzzy: Unable to open word database /var/lib/htdig/db.words.db
My Red Hat Linux 9 system had version 3.2.0 release 16.20021103 of the
htdig package. When I got the error messages above, it appeared that
the -c /etc/htdig_support.conf option wasn't getting passed
to the htfuzzy program. When I looked in the rundig script,
I saw the following two lines at the end of the file.
/usr/bin/htfuzzy $verbose metaphone
/usr/bin/htfuzzy $verbose soundex
But the $verbose variable wasn't defined anywhere in the script.
Instead, the $opts variable should have been used on those
two lines, as it is elsewhere, e.g. in the following two lines.
$BINDIR/htfuzzy $opts endings
$BINDIR/htfuzzy $opts synonyms
The $opts variable refers to options passed to rundig on the command line.
I've included the original copy of rundig and the corrected copy.
When I ran rundig -c /etc/htdig_support.conf after making the corrections,
I didn't receive any error messages.
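The correction can be sketched with sed. The example runs against a scratch file holding only the two faulty lines quoted earlier, so it does not touch the real script. On a real system, back up /usr/bin/rundig before applying the same substitution to it.

```shell
#!/bin/sh
# Work on a scratch copy containing the two faulty lines from rundig.
script=$(mktemp)
cat > "$script" <<'EOF'
/usr/bin/htfuzzy $verbose metaphone
/usr/bin/htfuzzy $verbose soundex
EOF

# Replace $verbose with $opts, but only on lines that invoke htfuzzy.
sed '/htfuzzy/s/\$verbose/$opts/' "$script" > "$script.fixed"

cat "$script.fixed"
```

Restricting the substitution to htfuzzy lines avoids changing any other use of the name in a script whose full contents may differ from the version described here.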
Another useful option to use when running rundig is the -s
option, which will generate statistics about the process and will tell you
if you have bad links to files on your site among your webpages as shown
in the output below. The "Not found" lines indicate the bad links.
# rundig -s -c /etc/htdig_support.conf
htdig: Run complete
htdig: 1 server seen:
htdig: support.moonpoint.com:80 496 documents
htdig: Errors to take note of:
Not found: http://support.moonpoint.com/pc/hardware/dell/4700/4700-drivers.html Ref: http://support.moonpoint.com/blog/blosxom
Not found: http://support.moonpoint.com/downloads/windows/utilities/Backup/ Ref: http://support.moonpoint.com/downloads/windows/utilities/
Not found: http://support.moonpoint.com/downloads/linux/redhat/9/i386/RPMS/clamav-0.83-1.0.html Ref: http://support.moonpoint.com/downloads/linux/redhat/9/i386/RPMS/clamd-0.83-1.0.html
HTTP statistics
===============
Persistent connections : Yes
HEAD call before GET : No
Connections opened : 533
Connections closed : 533
Changes of server : 0
HTTP Requests : 533
HTTP KBytes requested : 573.693
HTTP Average request time : 0.861163 secs
HTTP Average speed : 1.24988 KBytes/secs
From the output above, I can see that I have three bad links that I need to correct on the site.
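On a larger site it may help to pull the bad links out of the statistics automatically. The sketch below filters the three "Not found" lines from the run above, stored in a scratch file; it assumes each such line has the form "Not found: URL Ref: URL" with no spaces inside the URLs.

```shell
#!/bin/sh
# Sample of rundig -s output, copied from the run shown above.
log=$(mktemp)
cat > "$log" <<'EOF'
htdig: Errors to take note of:
Not found: http://support.moonpoint.com/pc/hardware/dell/4700/4700-drivers.html Ref: http://support.moonpoint.com/blog/blosxom
Not found: http://support.moonpoint.com/downloads/windows/utilities/Backup/ Ref: http://support.moonpoint.com/downloads/windows/utilities/
Not found: http://support.moonpoint.com/downloads/linux/redhat/9/i386/RPMS/clamav-0.83-1.0.html Ref: http://support.moonpoint.com/downloads/linux/redhat/9/i386/RPMS/clamd-0.83-1.0.html
HTTP statistics
EOF

# Print "broken link <TAB> referring page" for each "Not found" line.
# Fields: $1="Not" $2="found:" $3=broken URL $4="Ref:" $5=referring page
broken=$(awk '$1 == "Not" && $2 == "found:" { print $3 "\t" $5 }' "$log")
printf '%s\n' "$broken"
```

The second column is the page that contains the bad link, which is the page that needs editing.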
To test the search from the command line, run
htsearch -c /etc/htdig_support.conf. You will then see a prompt for words
to search for and for the format. If I put "Dimension 4700" for the search
words and just hit Enter at the format prompt, I will see all the
matches htsearch finds.
# htsearch -c /etc/htdig_support.conf
Enter value for words: Dimension 4700
Enter value for format:
If you run rundig with the -a option, it will first make a copy
of its prior index of the site and then merge it with the new
one when it is finished. That allows people to use the search
functionality even while rundig is rebuilding its databases. Since I am
using /etc/htdig_support.conf for my configuration file, I put the
following line in the crontab file (you can edit the crontab file
with crontab -e, if you are familiar with how to use the vi editor).
30 1 * * * /usr/bin/rundig -c /etc/htdig_support.conf -a >>/var/log/htdig 2>&1
That sets rundig to run at 1:30 every morning. The output, if any, will be
appended to /var/log/htdig, and the 2>&1 sends any
error messages to that file as well.
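In a POSIX shell, the usual way to append both output streams to one file is >>file 2>&1, with stdout redirected first and stderr then duplicated onto it. A small sketch of that pattern follows, using a hypothetical noisy function in place of rundig and a scratch file in place of /var/log/htdig.

```shell
#!/bin/sh
log=$(mktemp)

# Stand-in for rundig: writes one line to stdout and one to stderr.
noisy() {
    echo "normal output"
    echo "error output" >&2
}

# Append stdout to the log, then send stderr to wherever stdout now points.
noisy >>"$log" 2>&1
noisy >>"$log" 2>&1

cat "$log"
```

Because the log is opened in append mode, the second run adds to the file rather than overwriting the first run's lines.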
Created: July 25, 2005