MoonPoint Support Weblog

YandexBot Web Crawler

When checking my website logs to see whether the site had been "crawled", i.e., indexed, by DuckDuckGo, I found no entries for 2013 or 2014 from any of the IP addresses used by DuckDuckBot, DuckDuckGo's web crawler. DuckDuckGo's Sources webpage explains that although the search engine has its own web crawler, it relies heavily on indexes produced by the web crawlers of other search engines. The page states:

DuckDuckGo gets its results from over one hundred sources, including DuckDuckBot (our own crawler), crowd-sourced sites (like Wikipedia, which are stored in our own index), Yahoo! (through BOSS), Yandex, WolframAlpha, and Bing.

DuckDuckGo's page states that the company applies its own algorithm to rank the results it obtains from the other search engines on which it relies for data.

One of the search engines mentioned is Yandex, whose search engine, Yandex Search, can be accessed at www.yandex.com. According to the Wikipedia articles for Yandex and Yandex Search, the company operates the largest search engine in Russia, with about a 60% market share there; its search engine generated 64% of all Russian web search traffic in 2010. The article on the company also states:

Yandex ranked as the 4th largest search engine worldwide, based on information from Comscore.com, with more than 150 million searches per day as of April 2012, and more than 50.5 million visitors (all company's services) daily as of February 2013.

The article also indicates that Yandex is heavily used in Ukraine and Kazakhstan, where it provides nearly one-third of all search results, and in Belarus, where it provides 43% of all search results.

When I searched this year's logs for this website, I found quite a few entries indicating that the site had been indexed by the Yandex web crawler, i.e., there were many entries containing the following user agent string:

"Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"

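A log search like the one above can be sketched in a few lines of PHP. This is only a minimal illustration, not the method I actually used: the function name countBotHits is hypothetical, and the two sample log lines are made-up entries in the common "combined" log format rather than real entries from this site's logs.

```php
<?php
// Count the log lines that contain a given crawler token.
// The match is a case-insensitive substring search; $lines can be an
// array of strings or any other iterable that yields one line at a time.
function countBotHits(iterable $lines, string $botToken): int
{
    $hits = 0;
    foreach ($lines as $line) {
        if (stripos($line, $botToken) !== false) {
            $hits++;
        }
    }
    return $hits;
}

// Hypothetical sample entries (documentation IP addresses, invented requests).
$sample = [
    '192.0.2.10 - - [03/Mar/2014:17:02:11 -0500] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"',
    '192.0.2.20 - - [03/Mar/2014:17:05:42 -0500] "GET /blog/ HTTP/1.1" 200 8192 "-" "Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0"',
];

echo countBotHits($sample, 'YandexBot'), "\n"; // prints 1
```

To run it against a real log, an SplFileObject opened on the log file could be passed in place of $sample, since it yields the file one line at a time; the location of the log file depends on the web server's configuration.
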
On the homepage for this site, I include PHP code to notify me whenever Google's Googlebot indexes the site, so I updated that code to include a check that sends me an email alert whenever the YandexBot indexes the site as well.


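<?php
// Address that receives the crawler alerts.
$email = "me@example.com";

// The User-Agent header may be missing, so fall back to an empty string.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : "";

// stripos() does a case-insensitive substring search. It replaces eregi(),
// which was deprecated in PHP 5.3 and removed in PHP 7.
if( stripos($userAgent, "googlebot") !== false )
{
    mail($email, "Googlebot Alert",
            "Google just indexed your following page: " .
            $_SERVER['REQUEST_URI']);
}

if( stripos($userAgent, "YandexBot") !== false )
{
    mail($email, "Yandex Alert",
            "Yandex just indexed your following page: " .
            $_SERVER['REQUEST_URI']);
}

?>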


Removing a site from search results

If you don't want any results returned from a particular site when you search with Google, Bing, Yahoo, or DuckDuckGo, you can add the -site: operator, followed by the site's name, to the search line. E.g., if I wished to search for "accessing deleted wikipedia pages", but didn't want any results returned from Wikipedia.org, I could use the following search terms:

accessing deleted wikipedia pages -site:wikipedia.org

If you wish to include only results from a particular site, put the site's name after site: instead, e.g., if I wished to search just moonpoint.com, I could use the following:

accessing deleted wikipedia pages site:moonpoint.com

When you restrict a search with the site option and give a domain name such as moonpoint.com, results are also returned for any domain name that ends with the specified domain name, e.g., in this case anything on www.moonpoint.com or support.moonpoint.com would also be returned. The same is true of the -site option, i.e., in the first example no results would be returned for en.wikipedia.org or www.wikipedia.org.
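
To make the matching concrete, here is a minimal PHP sketch that models which hostnames a restriction such as site:moonpoint.com would cover. The function name hostMatchesSite is hypothetical, and the rule that a match must fall on a dot boundary is my assumption about the behavior described above, not a description of how any search engine actually implements the option.

```php
<?php
// Rough model of the matching described above: a host matches DOMAIN when
// it equals DOMAIN or ends with ".DOMAIN". The dot-boundary rule is an
// assumption; search engines do not publish their implementation.
function hostMatchesSite(string $host, string $domain): bool
{
    $host = strtolower($host);
    $domain = strtolower($domain);

    if ($host === $domain) {
        return true;
    }

    // Compare the tail of the host name with ".DOMAIN".
    $suffix = '.' . $domain;
    return substr($host, -strlen($suffix)) === $suffix;
}

var_dump(hostMatchesSite('www.moonpoint.com', 'moonpoint.com'));     // bool(true)
var_dump(hostMatchesSite('support.moonpoint.com', 'moonpoint.com')); // bool(true)
var_dump(hostMatchesSite('en.wikipedia.org', 'wikipedia.org'));      // bool(true)
var_dump(hostMatchesSite('www.example.com', 'moonpoint.com'));       // bool(false)
```

With site:, the hosts for which the function returns true are the ones kept in the results; with -site:, they are the ones dropped.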


Mon, Mar 03, 2014 7:31 pm

Mon, Mar 03, 2014 5:17 pm