Web Spiders Crawling Site on 2016-04-23
When I viewed a page on the site this morning, it took a long time to load
in my browser. I checked the Apache access log to see what it showed
regarding site activity, since I had noticed several web crawlers, also
known as web spiders, accessing the site yesterday. Checking the site's
Apache access log for today, I saw that it was being crawled by four web
spiders simultaneously: baiduspider, bingbot, MegaIndex.ru, and linkdexbot.
The first, baiduspider, is associated with the Chinese search engine company
Baidu, and the second, bingbot, is Microsoft's web crawler, which is used by
its Bing search engine. The latter two, MegaIndex.ru and linkdexbot, are
associated with search engine optimization (SEO) companies.
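To see which crawlers are hitting the site hardest, the access log can be
tallied by user agent. The Python sketch below is one way to do that; it
assumes the common Apache combined log format and a hypothetical log path
of /var/log/httpd/access_log, so adjust both for your own server.

    #!/usr/bin/env python
    # Tally requests per user agent in an Apache combined-format access log.
    # The log path below is a placeholder; adjust it for your server.
    import re
    from collections import Counter

    LOG = "/var/log/httpd/access_log"

    # In the combined log format, the user agent is the last quoted field.
    pattern = re.compile(r'"([^"]*)"\s*$')

    counts = Counter()
    with open(LOG) as f:
        for line in f:
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1

    # Print the ten most frequent user agents, busiest first.
    for agent, hits in counts.most_common(10):
        print("%6d  %s" % (hits, agent))

When a crawl is under way, crawler user agents such as baiduspider and
bingbot typically stand out near the top of the list.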
[ More Info ]
[/network/web/crawlers]
permanent link
Renamed Website Files Still Being Crawled
I've noticed in the site's error logs that files that haven't existed on
the site for years are producing error entries when web crawlers still
attempt to access them. Apparently, there are still links elsewhere on the
web pointing to the nonexistent files, which has led me to conclude that I
need to create redirects for those files on the site that I move or rename,
if the files have been on the site for any significant length of time.
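On an Apache server, a permanent (301) redirect for a moved or renamed file
can be set up with a mod_alias Redirect directive in the server
configuration or an .htaccess file. The lines below are only a sketch with
made-up file names; substitute the actual old and new locations.

    # Send visitors (and crawlers) from the old location to the new one
    # with a permanent (301) redirect. The paths here are placeholders.
    Redirect permanent /network/old-article.php /network/web/new-article.php

    # A regular-expression form (RedirectMatch) for a renamed directory:
    RedirectMatch permanent ^/olddir/(.*)$ /newdir/$1

A 301 status tells search engine crawlers that the move is permanent, so
they should eventually stop requesting the old URL.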
[ More Info ]
[/network/web/crawlers]
permanent link
Turnitin Crawler
While troubleshooting a problem with a website, I was capturing HTTP
traffic with Wireshark. I noticed a connection from 65.98.224.2; the
contents of the first packet received from that address showed the
software accessing my support website identifying itself as shown below:
User-Agent: TurnitinBot/2.1 (http://www.turnitin.com/robot/crawlerinfo.html)
Checking the URL listed, I found the following:
Chances are that you are reading this because you found a reference
to this web page from your web server logs. This reference was left
by Turnitin.com's web crawling robot, also known as TurnitinBot. This
robot collects content from the Internet for the sole purpose of helping
educational institutions prevent plagiarism. In particular, we compare
student papers against the content we find on the Internet to see if
we can find similarities. For more information on this service, please
visit www.turnitin.com
The Wikipedia article on Turnitin states that it is "an Internet-based
plagiarism-detection service created by iParadigms, LLC. Institutions
(typically universities and high schools) buy licenses to submit essays
to the Turnitin website, which checks the document for plagiarism."
I had read that many schools now use such services to deter students from
submitting plagiarized papers. I've seen services offering pre-written
papers for students to submit for classes, so I can see the need for
teachers to use such detection services. I didn't realize this service
crawled websites to index materials on the web as part of its detection
efforts, but it makes sense to me that the service would do so. This is
the first time I've noticed this particular web crawler.
[/network/web/crawlers]
permanent link