Renamed Website Files Still Being Crawled

Years ago I started using the server-side scripting language PHP in webpages for this site, so I began creating .php files rather than .html files. I later went back to some of the existing HTML files on the site, inserted PHP code in them, and renamed them to have a .php extension rather than a .html one. In examining the system's Apache error log files, I still see attempts to access the old .html URLs, even though the files were renamed years ago and the references to them on the site itself were updated at that time to point to the PHP versions. I assume someone somewhere on the web placed a link to the original version in a web page, and that web crawlers, aka spiders or bots, that index content on the web are still following that link from another site in attempting to access an HTML version that hasn't existed for years. E.g., when examining the error log for today, I found an attempt to access /network/web/tools/search/htdig-setup.html, even though years ago I replaced the .html file with /network/web/tools/search/htdig-setup.php. I created the HTML version on July 25, 2005 and I believe I replaced it with the PHP version on October 22, 2006.

Yet today, I saw a "File does not exist" entry in the error log for the htdig-setup.html file from the IP address 183.60.213.31, and an attempt to access it two days ago from 183.60.215.32. The Apache CustomLog file records the user agent string, which identifies the browser or web crawler that requested a webpage or other file from the web server. Looking there, I could see that both IP addresses had used the following user agent string when attempting to access the file:

Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)

Seeing "spider" or "bot" in the user agent string usually means that the entity requesting a page on the website is a web crawler, like Google's Googlebot or Microsoft's bingbot. In this case, EasouSpider appears to be the crawler for a Chinese search engine, since the Asia Pacific Network Information Centre (APNIC), which is the regional Internet registry for the Asia Pacific region, reports that the IP address block 183.0.0.0 - 183.63.255.255 is allocated to the following entity:

netname: CHINANET-GD
descr:   CHINANET Guangdong province network
descr:   Data Communication Division
descr:   China Telecom
country: CN
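The CustomLog lookup described above, finding which user agent requested a given file, can be sketched in a few lines of Python. This is only a rough sketch, assuming the access log uses Apache's "combined" LogFormat; the function name agents_for_path and the sample log line (its timestamp and byte count are made up) are illustrative and not taken from the actual logs.

```python
import re

# Apache "combined" LogFormat (assumed here):
#   %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def agents_for_path(lines, path):
    """Return (ip, status, user_agent) for each request whose URL path equals `path`."""
    hits = []
    for line in lines:
        m = COMBINED.match(line)
        if not m:
            continue  # skip lines that are not in combined format
        parts = m.group("request").split()
        # A request line looks like: GET /some/path HTTP/1.1
        if len(parts) >= 2 and parts[1].split("?")[0] == path:
            hits.append((m.group("ip"), int(m.group("status")), m.group("agent")))
    return hits

# Illustrative log line: the IP address and user agent come from the text,
# but the timestamp and byte count are invented for the example.
sample = [
    '183.60.213.31 - - [22/Mar/2014:10:15:42 -0400] '
    '"GET /network/web/tools/search/htdig-setup.html HTTP/1.1" 404 312 "-" '
    '"Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)"',
]

for ip, status, agent in agents_for_path(sample, "/network/web/tools/search/htdig-setup.html"):
    print(ip, status, agent)
```

To run it against a real log, pass an open file object instead of the sample list. A user agent that itself contains an escaped double quote would not match this simple pattern.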

According to a June 12, 2010 article by Sherman So on the Asia Times website, "Easou search skills trump Baidu", Easou helps mobile-phone users in China search the Internet and even eclipses China's largest Internet search company, Baidu, as the primary search service for the country's mobile users, with double Baidu's volume of mobile traffic. The article cites an internal report issued by China's leading mobile-phone operator, China Mobile.

The two attempts at accessing the file this month by Easou were the only attempts to access the old, no longer existing .html file this year, but in checking last year's logs, I found a number of attempts to access it by other web crawlers from the following 43 unique IP addresses in 2013:


IP Address    Bot
123.151.139.211 Sosospider
157.55.32.114 bingbot
157.55.32.28 bingbot
157.55.32.58 bingbot
157.55.35.86 bingbot
157.56.229.184 bingbot
157.56.92.144 bingbot
157.56.93.186 bingbot
173.199.114.147 AhrefsBot
173.199.114.211 AhrefsBot
173.199.115.59 AhrefsBot
173.199.116.179 AhrefsBot
173.199.117.251 AhrefsBot
173.199.119.139 AhrefsBot
180.76.5.162 Baiduspider
180.76.5.166 Baiduspider
180.76.5.178 Baiduspider
180.76.5.23 Baiduspider
180.76.5.27 Baiduspider
180.76.5.8 Baiduspider
180.76.6.225 Baiduspider
180.76.6.36 Baiduspider
199.127.227.203 MJ12bot
199.21.99.88 YandexBot
202.46.59.196 zh-CN
202.46.63.202 zh-CN
204.124.181.85 MJ12bot
208.167.230.59 AhrefsBot
23.20.57.138  
46.105.99.120 MJ12bot
5.10.83.28 AhrefsBot
5.10.83.90 AhrefsBot
5.9.7.208 MJ12bot
65.55.24.217 bingbot
65.55.55.230 bingbot
66.249.72.234  
66.249.75.234 Googlebot
66.249.76.174 Googlebot
66.249.76.234 Googlebot
85.17.29.107 MJ12bot
85.178.84.125 MJ12bot
88.190.44.26 MJ12bot
91.232.96.23  
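A list like the one above can be built by grouping the requesting IP addresses by a bot name pulled out of each user agent string. The following is a minimal sketch, assuming the (IP address, user agent) pairs have already been extracted from the log; the helper names bot_name and unique_ips_by_bot are hypothetical, and the rule of taking the first token containing "bot" or "spider" is a rough heuristic, not a reliable way to identify a crawler.

```python
import re
from collections import defaultdict

# First token in the user agent that contains "bot" or "spider" (heuristic).
BOT_TOKEN = re.compile(r"[A-Za-z0-9_-]*(?:bot|spider)[A-Za-z0-9_-]*", re.IGNORECASE)

def bot_name(agent):
    """Return a crawler name guessed from the user agent, or "" if none is found."""
    m = BOT_TOKEN.search(agent)
    return m.group(0) if m else ""

def unique_ips_by_bot(requests):
    """Map each guessed bot name to the sorted list of unique IPs that used it."""
    ips = defaultdict(set)
    for ip, agent in requests:
        ips[bot_name(agent)].add(ip)
    # Plain string sorting, so "180.76.5.23" sorts after "180.76.5.178"
    return {bot: sorted(addrs) for bot, addrs in ips.items()}

EASOU = "Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)"
GOOGLE = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
FIREFOX = "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.2) Firefox/3.5.2"

# Example input; repeated requests from one address are counted once.
requests = [
    ("183.60.213.31", EASOU),
    ("183.60.215.32", EASOU),
    ("183.60.213.31", EASOU),
    ("66.249.75.234", GOOGLE),
    ("202.46.59.196", FIREFOX),
]

for bot, addrs in unique_ips_by_bot(requests).items():
    for ip in addrs:
        print(ip, bot)
```

Addresses whose user agent has no "bot" or "spider" token end up with an empty name, which corresponds to the rows in the list with a blank or non-crawler entry.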

The IP addresses 202.46.59.196 and 202.46.63.202 have the following user agent string:

Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.2) Firefox/3.5.2

An nslookup on the IP addresses yields a name of ptr.cnsat.com.cn for both. There is no website accessible at that name or at the IP addresses, so I can't be certain that they correspond to a web crawler.
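The same reverse lookup can be scripted when there are many addresses to check. Here is a small sketch using Python's standard socket module; the function name reverse_lookup and its resolver parameter (there only so the lookup can be swapped out for testing) are my own additions for illustration.

```python
import socket

def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the hostname from the PTR record for `ip`, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return hostname
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError
        return None

# Example (needs network access and a working DNS resolver):
#   for ip in ("202.46.59.196", "202.46.63.202"):
#       print(ip, reverse_lookup(ip))
```

A PTR record is set by whoever controls the address block, so a name returned this way is a hint about the owner rather than proof of what made the request.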

Since I saw so many attempts by web crawlers to access the older version of the page that no longer exists, I created a redirect for the old page to point to the new page, i.e., an .htaccess file in the directory where the old file was located containing the following line:

Redirect 301 /network/web/tools/search/htdig-setup.html /network/web/tools/search/htdig-setup.php

I then added the following <Directory> section to the part of the Apache httpd.conf file pertaining to the website, so that redirects in an .htaccess file are honored in that directory, and restarted the web server with apachectl restart.

<Directory /home/jdoe/public_html/network/web/tools/search>
    AllowOverride FileInfo
</Directory>
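After the restart, it is worth confirming that the old URL now answers with a 301 rather than a 404. One way to check, sketched here with Python's standard http.client module, which does not follow redirects on its own; the function name fetch_redirect is hypothetical, and www.example.com in the usage comment is a placeholder for the real site's hostname.

```python
import http.client

def fetch_redirect(host, path, port=80, timeout=10):
    """Send a HEAD request and return (status, Location header) without following redirects."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

# Example (www.example.com is a placeholder hostname):
#   status, location = fetch_redirect(
#       "www.example.com", "/network/web/tools/search/htdig-setup.html")
#   print(status, location)
```

If the redirect is working, the status should be 301 and the Location header should point at the htdig-setup.php page. A 404 would suggest the .htaccess file is not being read, which is what the AllowOverride setting controls.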

So I'm going to have to remember to create redirects for any files that I move or rename on the site, if they've been on the site for more than a short period of time.
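To make that less of a memory exercise, the Redirect lines can be generated from a list of renamed files. This is just a sketch under the assumption that each rename is recorded as an (old URL path, new URL path) pair; the function name redirect_lines is made up for illustration.

```python
def redirect_lines(renames):
    """Build one mod_alias `Redirect 301` directive per (old_path, new_path) pair."""
    lines = []
    for old, new in renames:
        # This sketch only handles URL paths on the same site
        if not (old.startswith("/") and new.startswith("/")):
            raise ValueError(f"URL paths must start with '/': {old!r} -> {new!r}")
        lines.append(f"Redirect 301 {old} {new}")
    return lines

renames = [
    ("/network/web/tools/search/htdig-setup.html",
     "/network/web/tools/search/htdig-setup.php"),
]

print("\n".join(redirect_lines(renames)))
```

For the one rename above, this prints the same directive I placed in the .htaccess file. Paths containing spaces would need to be quoted in the directive, which this sketch does not handle.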

References:

  1. Easou search skills trump Baidu
    Asia Times
    By: Sherman So
    Date: June 12, 2010
  2. Redirecting a URL on an Apache Web Server
    MoonPoint Support
    Date: March 9, 2014

 


Created: Saturday March 22, 2014