I ran ht://Dig to index the site today using the command
/usr/bin/rundig -c /etc/htdig_support.conf >>/var/log/htdig 2>&1
Indexing took a considerable amount of time, but once it completed, none
of my htdig searches of the site returned any results. When I checked
the output file for the rundig command, /var/log/htdig, I saw
the errors below:
# cat /var/log/htdig
FATAL ERROR:Compressor::get_vals invalid comptype
FATAL ERROR at file:WordBitCompress.cc line:827 !!!
/usr/bin/rundig: line 36: 23767 Segmentation fault      $BINDIR/htdig -i $opts $stats $alt
/usr/bin/rundig: line 81: 24766 Segmentation fault      /usr/bin/htfuzzy $opts metaphone
/usr/bin/rundig: line 82: 24767 Segmentation fault      /usr/bin/htfuzzy $opts soundex
I included a search feature on each page of the blog that uses Fletcher Penney's Find plugin to search the blog. Underneath the search box there is an "Advanced Search" link. Clicking on it redisplays the current blog page with the advanced search options visible. Because htdig follows those links when it indexes the site, it was returning the same page multiple times whenever I used it to search the entire site (the Find plugin searches only the blog, whereas I have htdig search the entire site).
I thought I might reduce the extraneous results for htdig queries, reduce the time rundig takes to index the site, and possibly eliminate the "FATAL ERROR:Compressor::get_vals invalid comptype" error message by having htdig exclude the "Advanced Search" links when indexing the site. Since the URL for that link always includes "advanced_search=1", I edited the htdig configuration file for the website, which is /etc/htdig_support.conf in this case, and added "advanced_search=1" to the exclude_urls list. So I now have the following line in that conf file (the "/cgi-bin/ .cgi" was there by default):
exclude_urls: /cgi-bin/ .cgi advanced_search=1
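To see which URLs a line like that would drop, here is a minimal sketch in POSIX shell. It assumes htdig rejects any URL that contains one of the space-separated patterns as a plain substring; the is_excluded function and the example.com URLs are hypothetical and are not part of ht://Dig.

```shell
# Hypothetical helper: succeeds (returns 0) if the URL contains any pattern.
# Assumption: patterns are plain substrings, not regular expressions.
is_excluded() {
    url=$1
    shift
    for pattern in "$@"; do
        case "$url" in
            *"$pattern"*) return 0 ;;
        esac
    done
    return 1
}

# Same patterns as the exclude_urls line above
patterns="/cgi-bin/ .cgi advanced_search=1"

for url in \
    "http://example.com/blog/index.html" \
    "http://example.com/blog/index.html?advanced_search=1" \
    "http://example.com/cgi-bin/search.cgi"
do
    # $patterns is deliberately unquoted so it splits on spaces
    if is_excluded "$url" $patterns; then
        echo "skip  $url"
    else
        echo "index $url"
    fi
done
```

Under that assumption, only the first URL would be indexed; the "Advanced Search" variant of the same page is skipped.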
I also added some file extensions to the list of filetypes htdig should exclude from its indexing process. I added ".mp3 .img .iso .dat .dll .scr" to the bad_extensions section, so I now have the following in that list:
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
.jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css \
.cab .png .rar .mp3 .img .iso .dat .dll .scr
There is no need for htdig to index binary files. If they aren't excluded, htdig will take longer to index the site, and the chances that it will fail during indexing increase greatly. If you use htdig and store other types of music or movie files on a site, you should add their extensions to the bad_extensions list as well.
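One way to find candidates for bad_extensions is to count the file extensions that actually exist under the web server's document root. The sketch below is only an illustration: the count_extensions function is hypothetical, and /var/www/html is an assumed document root that you would replace with your own.

```shell
# Hypothetical helper: print each file extension found under a directory,
# lowercased, with a count, most common first.
count_extensions() {
    find "$1" -type f -name '*.*' |
        sed 's/.*\.//' |
        tr 'A-Z' 'a-z' |
        sort | uniq -c | sort -rn
}

# Assumed document root; adjust for your site
DOCROOT=/var/www/html
if [ -d "$DOCROOT" ]; then
    count_extensions "$DOCROOT"
fi
```

Any binary type that shows up in the output but is missing from bad_extensions is worth considering for the list.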
When I reran rundig with the command
/usr/bin/rundig -c /etc/htdig_support.conf >/var/log/htdig 2>&1
it did not fail this time, and when I performed htdig searches of the site, I
no longer got duplicate results caused by the Blosxom Find
plugin's "Advanced Search" links.
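Since a failed run left me with an empty index and the only clue was in the log, a quick check of the log after each run can save time. This is a minimal sketch that looks for the two kinds of error lines I saw; the log_has_failures function is hypothetical, and other failures may produce different messages.

```shell
# Hypothetical helper: succeeds (returns 0) if the log contains either
# kind of error line seen in the failed run.
log_has_failures() {
    grep -E -q 'FATAL ERROR|Segmentation fault' "$1"
}

# Log file that the rundig output was redirected to
LOG=/var/log/htdig
if [ -r "$LOG" ] && log_has_failures "$LOG"; then
    echo "rundig appears to have failed; see $LOG" >&2
fi
```

Because the rerun used ">" rather than ">>", the log holds only the latest run, so old errors will not trigger the check.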
References:
- RE: [htdig] Segfault indexing a site with 3.2.0b2
  May 23, 2000
  ht://Dig 3.x list archive
- Error in zlib Compressor for WordDB
  July 30, 2002
  web.htdig.devel
- FindPlugin
  Author: Fletcher T. Penney