Web Spiders Crawling Site on 2016-04-23

When I viewed a page on the site this morning, the page took a long time to load in my browser. I checked the Apache access log to see what it showed regarding site activity, since I had noticed several web crawlers, also known as web spiders, accessing the site yesterday. Using the tail command, I saw the following as the last few entries in the log at the time I checked it:

- - [23/Apr/2016:09:28:55 -0400] "GET /downloads/windows/network/Email/simplecheck.php HTTP/1.1" 200 7580 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
- - [23/Apr/2016:09:28:34 -0400] "GET /blog//blosxom/2014/03/06/ HTTP/1.1" 200 14776 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
- - [23/Apr/2016:09:28:44 -0400] "GET /blog//blosxom/2013/06/29 HTTP/1.1" 200 16491 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
- - [23/Apr/2016:09:28:58 -0400] "GET /blog/blosxom/2010/ HTTP/1.1" 200 263330 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
- - [23/Apr/2016:09:29:06 -0400] "GET /blog//blosxom/2013/06/28 HTTP/1.1" 200 15890 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
- - [23/Apr/2016:09:29:18 -0400] "GET /info/tools/mt500w_work_light/697054-2T.jpg HTTP/1.1" 200 6532 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
- - [23/Apr/2016:09:29:20 -0400] "GET /blog//blosxom/2013/06/23 HTTP/1.1" 200 13266 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
- - [23/Apr/2016:09:29:42 -0400] "GET /template/ads/commjunction/oreilly_200x200 HTTP/1.1" 200 334 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
- - [23/Apr/2016:09:29:35 -0400] "GET /blog//blosxom/os/windows/software/network/snmp HTTP/1.1" 200 17988 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
- - [23/Apr/2016:09:29:53 -0400] "GET / HTTP/1.1" 200 2972 "-" "Mozilla/5.0 (compatible; linkdexbot/2.0; +http://www.linkdex.com/bots/)"

That is, all of the entries were for web crawlers. The first one, Baiduspider, is run by Baidu, a Chinese web services and search engine company. Currently, Alexa Internet, a subsidiary of Amazon.com that rates domain names by popularity worldwide and within the country associated with the domain name, ranks baidu.com as the fourth most popular website in the world and number one in China.

The next crawler I saw was bingbot, Microsoft's web crawler. A web crawler is an instance of a software agent that is sometimes referred to as a "bot", shorthand for "software robot", which is why web crawlers often have "bot" as part of their name.
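If you want a quick tally of which crawlers appear in a log excerpt like the one above, something along the following lines may help. It is only a rough sketch: the function name count_crawlers is my own, and it assumes each line is in Apache's combined log format with the User-Agent as the last quoted field and that the crawler identifies itself in the "(compatible; Name/version; ...)" style seen in these entries. Anything that does not fit that pattern is lumped under "(other)".

```python
import re
from collections import Counter

# In the combined log format, the User-Agent is the last double-quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

# The crawlers in the excerpt announce themselves as
# "(compatible; Name/version; +URL)"; capture the Name part.
BOT_TOKEN = re.compile(r'compatible;\s*([^/;\s]+)/')


def count_crawlers(lines):
    """Return a Counter mapping crawler name to number of requests.

    Hypothetical helper for illustration; lines whose User-Agent does not
    follow the "compatible; Name/version" pattern are counted as "(other)".
    """
    counts = Counter()
    for line in lines:
        ua = UA_FIELD.search(line)
        if not ua:
            continue  # not a combined-format line; skip it
        bot = BOT_TOKEN.search(ua.group(1))
        counts[bot.group(1) if bot else "(other)"] += 1
    return counts
```

You could pass it an open file object for the access log, or the lines produced by tail, and then call most_common() on the result to see which crawler is busiest.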

Note: a spider accessing your site may not actually be associated with the entity it claims to represent. Microsoft has an article, How to Verify that Bingbot is Bingbot, on how you can use reverse DNS lookups to check whether an IP address in your logs that purports to belong to Microsoft's web crawler, Bingbot, is actually associated with that bot.
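The reverse DNS check can be sketched in a few lines of Python. This is a minimal illustration, not Microsoft's own procedure verbatim: the function name is_verified_crawler and its parameters are hypothetical, and the ".search.msn.com" suffix in the usage note is my understanding of what the Microsoft article specifies, so confirm it against the article before relying on it. The idea is to look up the hostname for the IP address, check that it falls under the crawler operator's domain, and then resolve that hostname forward again to confirm it maps back to the same address. Note that socket.gethostbyname_ex only handles IPv4.

```python
import socket


def is_verified_crawler(ip, allowed_suffixes,
                        reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname_ex):
    """Return True if ip passes a reverse-then-forward DNS check.

    Hypothetical helper for illustration. The lookup functions are
    parameters so the logic can be exercised without network access.
    """
    try:
        # Reverse lookup: IP address -> primary hostname.
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no PTR record or the lookup failed

    hostname = hostname.rstrip(".").lower()
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False  # hostname is not in the operator's domain

    try:
        # Forward lookup: hostname -> list of IPv4 addresses.
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False

    # The forward lookup must lead back to the address we started with.
    return ip in addresses
```

For an address from your log that claims to be Bingbot, you might call is_verified_crawler(address, [".search.msn.com"]). A False result does not prove malice, since DNS lookups can fail for other reasons, but it is a reason to look more closely.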

Linkdex is a search engine optimization (SEO) company founded in 2009 with personnel in London, New York, and Los Angeles.

MegaIndex is also an SEO company, which notes on its website:

It is an inbound marketing automation system that provides strong web analytics data. The service has its own crawler, which collects data about websites all over the Internet. MegaIndex analyses and processes the data and offers it to you in the form of comprehensive online marketing and SEO reports containing the key online performance indicators of any website including inbound links profile, SERP visibility, views by keyword and more. MegaIndex contains SERP history for more than 120,000,000 keywords.

I noted that the log entries associated with it refer to MegaIndex.ru. RU is the country code for Russia, but when I performed a reverse DNS lookup on the IP address, I saw a country code of DE, which is the country code for Germany. RIPE, the European regional internet registry, shows that the IP address I saw in the site's access log has been assigned to Hetzner Online AG in Germany; see the RIPE Database Query for that address.

Alexa is currently reporting a worldwide ranking of 33,379 for megaindex.com and a rank in Russia of 2,483. The MegaIndex site lets a website owner or developer see the backlinks MegaIndex has found, i.e., the links from other sites to his or her site.

Some details for the web spiders I saw crawling the site are listed below:

IP Address | FQDN | Spider       | Entity    | Alexa World Ranking
           |      | Baiduspider  | Baidu     | 4
           |      | Bingbot      | Microsoft | 39
           |      | MegaIndex.ru | MegaIndex | 33,379
           |      | linkdexbot   | Linkdex   | 116,184

Note: the Alexa world ranking for a site may fluctuate daily. Lower numbers indicate greater popularity. The caveat is that Alexa usually estimates the ranking from the behavior of its own users, so the ranking is a somewhat selective measure (see How are Alexa’s traffic rankings determined?). Take the numbers with a grain of salt, but they can give you a rough idea of a site's popularity relative to other sites.


