Renamed Website Files Still Being Crawled

Years ago I started using the server-side scripting language PHP in webpages for this site, so I began creating .php files rather than .html files. I later went back to some of the existing HTML files, inserted PHP code in them, and renamed them to have a .php extension rather than a .html one. Examining the system's Apache error log files, I still see attempts to access the old .html URLs, even though the files were renamed years ago and the references to them on the site itself were updated at the time to point to the PHP versions. I assume that someone somewhere on the web placed a link to the original version in a web page and that web crawlers, aka spiders or bots, which index content on the web, are still following that link from another site when they try to access the HTML version that hasn't existed for years. E.g., when examining the error log for today, I found an attempt to access /network/web/tools/search/htdig-setup.html, even though years ago I replaced that .html file with /network/web/tools/search/htdig-setup.php. I created the HTML version on July 25, 2005, and I believe I replaced it with the PHP version on October 22, 2006.

Yet today I saw a "File does not exist" entry in the error log for the htdig-setup.html file from the IP address 183.60.213.31, and another for an attempt to access it two days ago from 183.60.215.32. The Apache CustomLog file records the user agent string, which identifies the browser or web crawler that requested a page or other file from the web server. Looking there, I could see that both IP addresses had used the following user agent string when attempting to access the file:

Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)

Seeing "spider" or "bot" in the user agent string usually means that the entity requesting a page on the website is a web crawler, like Google's Googlebot or Microsoft's bingbot. In this case, EasouSpider appears to be the crawler for a Chinese search engine indexing the site. The Asia Pacific Network Information Centre (APNIC), the regional Internet registry for the Asia Pacific region, reports that the IP address block 183.0.0.0 - 183.63.255.255 is allocated to the following entity:

netname:	CHINANET-GD
descr:	CHINANET Guangdong province network
descr:	Data Communication Division
descr:	China Telecom
country:	CN
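The reported range can be checked against the two addresses from the logs. The following is a minimal sketch using Python's standard ipaddress module; the helper name in_block is only illustrative and is not part of any tool mentioned here.

```python
import ipaddress

# Start and end of the block reported by APNIC.
first = ipaddress.ip_address("183.0.0.0")
last = ipaddress.ip_address("183.63.255.255")

# Collapse the start-end range into the CIDR network(s) covering it exactly.
networks = list(ipaddress.summarize_address_range(first, last))
print(networks)

def in_block(address):
    """Return True if the address falls inside the reported block."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in networks)

# The two addresses that requested the old .html file.
for address in ("183.60.213.31", "183.60.215.32"):
    print(address, in_block(address))
```

Both addresses should report True, since the range is a single /10 network.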

According to a June 12, 2010 article on the Asia Times website, "Easou search skills trump Baidu" by Sherman So, Easou helps mobile-phone users in China search the Internet and even eclipses China's largest Internet search company, Baidu, as the primary search service for the country's mobile users, with double Baidu's volume of mobile traffic, according to an internal report issued by China's leading mobile-phone operator, China Mobile.

The two attempts by Easou this month were the only attempts to access the old, no longer existing .html file this year, but when I checked last year's logs, I found a number of attempts to access it by other web crawlers from the following 43 unique IP addresses in 2013:

IP Address	Bot
123.151.139.211	Sosospider
157.55.32.114	bingbot
157.55.32.28	bingbot
157.55.32.58	bingbot
157.55.35.86	bingbot
157.56.229.184	bingbot
157.56.92.144	bingbot
157.56.93.186	bingbot
173.199.114.147	AhrefsBot
173.199.114.211	AhrefsBot
173.199.115.59	AhrefsBot
173.199.116.179	AhrefsBot
173.199.117.251	AhrefsBot
173.199.119.139	AhrefsBot
180.76.5.162	Baiduspider
180.76.5.166	Baiduspider
180.76.5.178	Baiduspider
180.76.5.23	Baiduspider
180.76.5.27	Baiduspider
180.76.5.8	Baiduspider
180.76.6.225	Baiduspider
180.76.6.36	Baiduspider
199.127.227.203	MJ12bot
199.21.99.88	YandexBot
202.46.59.196	zh-CN
202.46.63.202	zh-CN
204.124.181.85	MJ12bot
208.167.230.59	AhrefsBot
23.20.57.138
46.105.99.120	MJ12bot
5.10.83.28	AhrefsBot
5.10.83.90	AhrefsBot
5.9.7.208	MJ12bot
65.55.24.217	bingbot
65.55.55.230	bingbot
66.249.72.234
66.249.75.234	Googlebot
66.249.76.174	Googlebot
66.249.76.234	Googlebot
85.17.29.107	MJ12bot
85.178.84.125	MJ12bot
88.190.44.26	MJ12bot
91.232.96.23
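A list like the one above can be built from the access log with a short script. The sketch below assumes the CustomLog uses Apache's common "combined" format, where the user agent is the last quoted field. The sample lines are made up for illustration: the timestamps, status, and byte counts are placeholders, and the Googlebot and bingbot agent strings are the ones those crawlers commonly send rather than values taken from my logs. The function names are hypothetical.

```python
import re
from collections import defaultdict

OLD_PATH = "/network/web/tools/search/htdig-setup.html"

# Combined log format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Heuristic only: the first word in the agent containing "bot" or "spider".
CRAWLER_TOKEN = re.compile(r"[A-Za-z0-9_-]*(?:bot|spider)[A-Za-z0-9_-]*",
                           re.IGNORECASE)

def crawler_name(agent):
    """Return the crawler-like token in a user agent, or '' if none."""
    match = CRAWLER_TOKEN.search(agent)
    return match.group(0) if match else ""

def unique_ips_by_crawler(lines, path=OLD_PATH):
    """Map crawler name -> set of IPs that requested the given path."""
    seen = defaultdict(set)
    for line in lines:
        match = COMBINED.match(line)
        if match and match.group("path") == path:
            seen[crawler_name(match.group("agent"))].add(match.group("ip"))
    return seen

# Hypothetical sample entries; only the IPs and the path come from the post.
TEMPLATE = ('{ip} - - [01/Jun/2013:00:00:00 -0400] '
            '"GET {path} HTTP/1.1" 404 300 "-" "{agent}"')
GOOGLE = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BING = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
FIREFOX = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.2) "
           "Firefox/3.5.2")

sample_lines = [
    TEMPLATE.format(ip="66.249.75.234", path=OLD_PATH, agent=GOOGLE),
    TEMPLATE.format(ip="66.249.76.174", path=OLD_PATH, agent=GOOGLE),
    TEMPLATE.format(ip="66.249.75.234", path=OLD_PATH, agent=GOOGLE),  # repeat
    TEMPLATE.format(ip="157.55.32.114", path=OLD_PATH, agent=BING),
    TEMPLATE.format(ip="202.46.59.196", path=OLD_PATH, agent=FIREFOX),
    TEMPLATE.format(ip="157.55.32.114", path="/index.php", agent=BING),
]

result = unique_ips_by_crawler(sample_lines)
for name in sorted(result):
    print(name or "(no crawler token)", sorted(result[name]))
```

Matching on "bot" or "spider" is only a heuristic; agents that look like ordinary browsers, such as the Firefox string in the sample, end up with an empty name and need a closer look by hand.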

The IP addresses 202.46.59.196 and 202.46.63.202 have the following user agent string:

Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.2) Firefox/3.5.2

An nslookup on the IP addresses yields the name ptr.cnsat.com.cn for both. No website is accessible at that name or at either IP address, so I can't be certain that they correspond to a web crawler.
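The same reverse lookup can be scripted when there are many addresses to check. This is a rough sketch using only Python's standard library; the function names ptr_query_name and reverse_lookup are hypothetical.

```python
import ipaddress
import socket

def ptr_query_name(address):
    """Return the in-addr.arpa name that a reverse (PTR) lookup queries."""
    return ipaddress.ip_address(address).reverse_pointer

def reverse_lookup(address):
    """Return the host name for an IP address, or None if it has none.

    This performs a real DNS query, so it needs a working resolver.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(address)
        return hostname
    except OSError:  # covers socket.herror and socket.gaierror
        return None

for address in ("202.46.59.196", "202.46.63.202"):
    print(address, ptr_query_name(address))
```

Calling reverse_lookup("202.46.59.196") requires working DNS, and the name returned now may differ from the one I saw when I ran nslookup.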

Since I saw so many attempts by web crawlers to access the older version of the page that no longer exists, I created a redirect from the old page to the new one, i.e., I placed an .htaccess file containing the following line in the directory where the old file was located:

Redirect 301 /network/web/tools/search/htdig-setup.html /network/web/tools/search/htdig-setup.php

I then added the following <Directory> section to the part of the Apache httpd.conf file pertaining to the website, so that the redirect in the .htaccess file is honored in that directory, and restarted the web server with apachectl restart.

<Directory /home/jdoe/public_html/network/web/tools/search>
    AllowOverride FileInfo
</Directory>

So I'm going to have to remember to create redirects for any files that I move or rename on the site, if they've been on the site for more than a short period of time.
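To make that less error-prone, the Redirect lines could be generated from a list of renamed files. The sketch below assumes each renamed page keeps the same path and only swaps the .html extension for .php, as htdig-setup did; the function names are hypothetical.

```python
import posixpath

def redirect_line(old_url_path, new_url_path, status=301):
    """Format one mod_alias Redirect directive for an .htaccess file."""
    return f"Redirect {status} {old_url_path} {new_url_path}"

def html_to_php_redirects(old_paths):
    """Build Redirect lines for pages renamed from .html to .php."""
    lines = []
    for old in old_paths:
        root, ext = posixpath.splitext(old)
        if ext == ".html":  # skip anything that was not an .html page
            lines.append(redirect_line(old, root + ".php"))
    return lines

# URL paths of pages that were renamed; extend this list as files move.
renamed = ["/network/web/tools/search/htdig-setup.html"]
print("\n".join(html_to_php_redirects(renamed)))
```

The output can be pasted into the .htaccess file for the relevant directory, provided AllowOverride FileInfo is set for that directory as shown above.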

References:

Easou search skills trump Baidu
Asia Times
By: Sherman So
Date: June 12, 2010
Redirecting a URL on an Apache Web Server
MoonPoint Support
Date: March 9, 2014

Created: Saturday March 22, 2014