Web Spiders Crawling Site on 2016-04-23
When I viewed a page on the site this morning, it took a long time to load
in my browser. I checked the Apache access log to see what it showed
regarding site activity, since I had noticed several web crawlers, also
known as web spiders, accessing the site yesterday. Checking the site's
Apache access log for today, I saw that it was being crawled by four web
spiders simultaneously: baiduspider, bingbot, MegaIndex.ru, and linkdexbot.
The first, baiduspider, is associated with the Chinese search engine company
Baidu, and the second, bingbot, is Microsoft's web crawler, which is used by
its Bing search engine. The latter two, MegaIndex.ru and linkdexbot, are
associated with search engine optimization (SEO) companies.
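To see which crawlers are hitting the site hardest, the access log can be
tallied by user agent. The Python sketch below is one way to do that; it
assumes the common Apache combined log format and a hypothetical log path
of /var/log/httpd/access_log, so adjust both for your own server.

    #!/usr/bin/env python
    # Tally requests per user agent in an Apache combined-format access log.
    # The log path below is a placeholder; adjust it for your server.
    import re
    from collections import Counter

    LOG = "/var/log/httpd/access_log"

    # In the combined log format, the user agent is the last quoted field.
    pattern = re.compile(r'"([^"]*)"\s*$')

    counts = Counter()
    with open(LOG) as f:
        for line in f:
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1

    # Print the ten most frequent user agents, busiest first.
    for agent, hits in counts.most_common(10):
        print("%6d  %s" % (hits, agent))

When a crawl is under way, crawler user agents such as baiduspider and
bingbot typically stand out near the top of the list.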
[ More Info ]
[/network/web/crawlers]
permanent link
Renamed Website Files Still Being Crawled
I've noticed in the site's error logs that files that haven't existed on
the site for years are producing error entries when web crawlers still
attempt to access them. Apparently, there are still links elsewhere on the
web pointing to the nonexistent files, which has led me to conclude that I
need to create redirects for those files on the site that I move or rename,
if the files have been on the site for any significant length of time.
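On an Apache server, a permanent (301) redirect for a moved or renamed file
can be set up with a mod_alias Redirect directive in the server
configuration or an .htaccess file. The lines below are only a sketch with
made-up file names; substitute the actual old and new locations.

    # Send visitors (and crawlers) from the old location to the new one
    # with a permanent (301) redirect. The paths here are placeholders.
    Redirect permanent /network/old-article.php /network/web/new-article.php

    # A regular-expression form (RedirectMatch) for a renamed directory:
    RedirectMatch permanent ^/olddir/(.*)$ /newdir/$1

A 301 status tells search engine crawlers that the move is permanent, so
they should eventually stop requesting the old URL.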
[ More Info ]
[/network/web/crawlers]
permanent link
Turnitin Crawler
While troubleshooting a problem with a website, I was capturing HTTP
traffic with Wireshark. I noticed a connection from 65.98.224.2; the
contents of the first packet received from that address showed the
software accessing my support website identifying itself as shown below:
User-Agent: TurnitinBot/2.1 (http://www.turnitin.com/robot/crawlerinfo.html)
Checking the URL listed, I found the following:
Chances are that you are reading this because you found a reference
to this web page from your web server logs. This reference was left
by Turnitin.com's web crawling robot, also known as TurnitinBot. This
robot collects content from the Internet for the sole purpose of helping
educational institutions prevent plagiarism. In particular, we compare
student papers against the content we find on the Internet to see if
we can find similarities. For more information on this service, please
visit www.turnitin.com
The Wikipedia article on Turnitin states that it is "an Internet-based
plagiarism-detection service created by iParadigms, LLC. Institutions
(typically universities and high schools) buy licenses to submit essays
to the Turnitin website, which checks the document for plagiarism."
I had read that many schools now use such services to deter students from
submitting plagiarized papers. I've seen services offering pre-written
papers for students to submit for classes, so I can see the need for
teachers to use such detection services. I didn't realize this service
crawled websites to index materials on the web as part of its detection
efforts, but it makes sense to me that the service would do so. This is
the first time I've noticed this particular web crawler.
[/network/web/crawlers]
permanent link