Renamed Website Files Still Being Crawled

Years ago I started using the server-side scripting language PHP in webpages for this site, so I started creating .php files rather than .html files. I later went back to some of the HTML files on the site, inserted PHP code in them, and renamed those files to have a .php extension rather than a .html one. In examining the system's Apache error log files, I still see attempts to access the old .html URLs, even though the files were renamed years ago and the references to them on the site itself were updated at that time to point to the newer PHP versions of the pages. I assume someone somewhere on the web placed a link to the original version in a web page, and that web crawlers, aka spiders or bots, that index content on the web are still following that link from another site when attempting to access an HTML version that hasn't existed for years. E.g., when examining the error log for today, I found an attempt to access /network/web/tools/search/htdig-setup.html, even though years ago I replaced the .html file with /network/web/tools/search/htdig-setup.php. I created the HTML version on July 25, 2005, and I believe I replaced it with the PHP version on October 22, 2006.

Yet today I saw a "File does not exist" entry in the error log for the htdig-setup.html file, along with another attempt to access it two days ago from a different IP address. The Apache CustomLog file records the user agent string, which identifies the browser or web crawler that requested a page or other file from the web server. Looking in that file, I could see that both IP addresses had used the following user agent string when attempting to access the file:

"Mozilla/5.0 (compatible; EasouSpider; +

Seeing "spider" or "bot" in the user agent string usually means that the entity requesting a page on the website is a web crawler like Google's Googlebot or Microsoft's bingbot. In this case EasouSpider appears to be the crawler of a Chinese search engine indexing the site: the Asia Pacific Network Information Centre (APNIC), which is the regional Internet registry for the Asia Pacific region, reports that the IP address block containing these addresses is allocated to the following entity:

descr: CHINANET Guangdong province network
descr: Data Communication Division
descr: China Telecom
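
The "spider" or "bot" check described above can be sketched in a few lines of Python. This is only a rough heuristic, not a reliable crawler detector: the hint list, the function names, and the sample user agent in the usage note are my own assumptions, and the regular expression assumes the user agent is the last double-quoted field on the log line and contains no escaped quotes.

```python
import re

# Substrings that often mark a crawler. This list is an illustrative
# assumption, not an authoritative catalogue.
CRAWLER_HINTS = ("spider", "bot", "crawler")


def extract_user_agent(log_line):
    """Return the last double-quoted field of a log line, or None.

    In Apache's combined log format the user agent is the final quoted
    field. User agents containing escaped quotes are not handled here.
    """
    quoted = re.findall(r'"([^"]*)"', log_line)
    return quoted[-1] if quoted else None


def looks_like_crawler(user_agent):
    """Case-insensitive check for any of the crawler hint substrings."""
    lowered = user_agent.lower()
    return any(hint in lowered for hint in CRAWLER_HINTS)
```

For a hypothetical user agent such as "Mozilla/5.0 (compatible; ExampleSpider/1.0)", looks_like_crawler returns True, while an ordinary Firefox user agent string returns False. A crawler that sends a browser-like user agent will slip past a check like this.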

A June 12, 2010 article at the Asia Times website, "Easou search skills trump Baidu" by Sherman So, states that Easou helps mobile-phone users in China search the Internet. Citing an internal report issued by China's leading mobile-phone operator, China Mobile, the article says Easou even eclipses China's largest Internet search company, Baidu, as the primary search service for the country's mobile users, with double Baidu's volume of mobile traffic.

The two attempts at accessing the file this month by Easou were the only attempts to access the old, no longer existing .html file this year, but in checking last year's logs, I found a number of attempts to access it by other web crawlers from the following 43 unique IP addresses in 2013:

Bots, one entry per IP address: Sosospider, bingbot, bingbot, bingbot, bingbot, bingbot, bingbot, bingbot, AhrefsBot, AhrefsBot, AhrefsBot, AhrefsBot, AhrefsBot, AhrefsBot, Baiduspider, Baiduspider, Baiduspider, Baiduspider, Baiduspider, Baiduspider, Baiduspider, Baiduspider, MJ12bot, YandexBot, zh-CN, zh-CN, MJ12bot, AhrefsBot, MJ12bot, AhrefsBot, AhrefsBot, MJ12bot, bingbot, bingbot, Googlebot, Googlebot, Googlebot, MJ12bot, MJ12bot, MJ12bot
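
A list like this one can be pulled from the access log with a short script rather than by eye. The following Python sketch is not the method I used at the time. It assumes the log is in Apache's combined format, and the function name clients_requesting is my own invention.

```python
import re
from collections import defaultdict

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def clients_requesting(lines, path, status="404"):
    """Map each client IP that received `status` for `path` to its user agents.

    Lines that do not match the combined format are skipped silently.
    """
    seen = defaultdict(set)
    for line in lines:
        match = COMBINED.match(line)
        if match and match.group("path") == path and match.group("status") == status:
            seen[match.group("ip")].add(match.group("agent"))
    return dict(seen)
```

Passing an open log file as lines and "/network/web/tools/search/htdig-setup.html" as path returns one dictionary key per unique IP address, so len() of the result gives the count of unique addresses.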

The two IP addresses listed with zh-CN in place of a bot name have the following user agent string:

Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv: Firefox/3.5.2

An nslookup on those two IP addresses yields the same name for both. There is no website accessible from that name or from the IP addresses, so I can't be certain that they correspond to a web crawler.
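
The same reverse lookup can be scripted when there are many addresses to check. Below is a minimal Python sketch built on the standard library's socket.gethostbyaddr. The function name reverse_names and its resolver parameter are my own additions, the latter so that the lookup can be swapped out when testing without network access.

```python
import socket


def reverse_names(ip_addresses, resolver=socket.gethostbyaddr):
    """Return {ip: hostname or None} from a reverse (PTR) lookup per address.

    socket.gethostbyaddr returns (hostname, aliaslist, ipaddrlist); only
    the primary hostname is kept. Addresses with no reverse record map
    to None.
    """
    names = {}
    for ip in ip_addresses:
        try:
            names[ip] = resolver(ip)[0]
        except OSError:  # socket.herror and socket.gaierror derive from OSError
            names[ip] = None
    return names
```

A reverse name on its own proves little, since whoever controls the address block sets the PTR record. It is a hint about who operates a client, not confirmation.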

Since I saw so many attempts by web crawlers to access the older version of the page, which no longer exists, I created a redirect from the old page to the new page, i.e., an .htaccess file in the directory where the old file was located, containing the following line:

Redirect 301 /network/web/tools/search/htdig-setup.html /network/web/tools/search/htdig-setup.php

I then added the following <Directory> section to the section of the Apache httpd.conf file pertaining to the website, so that the redirect in the .htaccess file is allowed to take effect in that directory, and restarted the web server with apachectl restart:

<Directory /home/jdoe/public_html/network/web/tools/search>
    AllowOverride FileInfo
</Directory>
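
After the restart, it is worth confirming that the old URL actually answers with a 301 and the new location. The Python sketch below does this with the standard library's http.client, which does not follow redirects on its own. The function name fetch_redirect is my own, and www.example.com in the usage comment is a placeholder for the site's real hostname.

```python
import http.client


def fetch_redirect(host, path, port=80, timeout=10):
    """Send a HEAD request and return (status, Location header value).

    The redirect is not followed, so the server's own answer for `path`
    is what comes back. Location is None when the header is absent.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


# Hypothetical usage; replace www.example.com with the real hostname:
# status, location = fetch_redirect(
#     "www.example.com", "/network/web/tools/search/htdig-setup.html")
```

If the redirect is working, status should be 301 and location should end with htdig-setup.php. A 404 suggests the .htaccess file is not being read, and a 500 usually means the directive is not permitted by the AllowOverride setting.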

So I'm going to have to remember to create redirects for any files that I move or rename on the site, if they've been on the site for more than a short period of time.
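
One way to make that easier to remember is to keep a list of renamed files and generate the .htaccess lines from it. This is only a sketch of the idea in Python: the function name redirect_lines and the format of the rename list are my own assumptions, and the output still has to be placed in the right .htaccess file by hand.

```python
def redirect_lines(renames, status=301):
    """Build one Redirect directive per (old_url_path, new_url_path) pair.

    Both values must be URL paths starting with "/", matching the form
    of the directive used for htdig-setup.html.
    """
    lines = []
    for old_path, new_path in renames:
        if not old_path.startswith("/") or not new_path.startswith("/"):
            raise ValueError(f"URL paths must start with '/': {old_path!r}, {new_path!r}")
        lines.append(f"Redirect {status} {old_path} {new_path}")
    return lines


renamed = [
    ("/network/web/tools/search/htdig-setup.html",
     "/network/web/tools/search/htdig-setup.php"),
]
for line in redirect_lines(renamed):
    print(line)
```

For the single rename above, this prints the same Redirect 301 line that I added to the .htaccess file by hand.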


  1. Easou search skills trump Baidu
    Asia Times
    By: Sherman So
    Date: June 12, 2010
  2. Redirecting a URL on an Apache Web Server
    MoonPoint Support
    Date: March 9, 2014



Created: Saturday March 22, 2014