MoonPoint Support Weblog

Make wget Pretend to Be Internet Explorer

I have a script that I manually run to download a particular webpage based on a parameter that I submit to the script. The script downloads the webpage with wget then parses the webpage for specific information and displays only that information. The script had been running fine until today, but produced an error message when I ran it today. When I checked the information being retrieved by wget, I found that instead of the desired webpage, I was getting "Sorry. This page may not be spidered."

When a browser retrieves a webpage, it sends a set of values to the webserver. Those values, which are called "headers", include a "user-agent" header that identifies the browser to the server. E.g. a particular version of Internet Explorer may identify itself as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)".

Some websites may use the user-agent header for statistical purposes, e.g. to determine which browsers are most commonly used to access the website. Such information may help a web developer tailor the site to the ones most commonly used to view the site. Or the the website developer can use the information to tailor its output to the browser being used by a particular user. E.g., if a browser doesn't support a particular feature used in the code on the website, the website software can present the viewer with an alternative webpage.

Wget identifies itself as "wget x.y.z", where x.y.z is the version of wget in use, e.g. "wget 1.8.2". So, if you retrieve a webpage with wget, the webserver might see User-Agent: Wget/1.8.2" as one of the headers submitted to it by the browser.

In this case the website, where the page resided I wanted to access, was seeing User-Agent: Wget/1.8.2" and denying access to the page. Fortunately, you can use the --user-agent argument for wget to specify that wget announce itself to a webserver as any browser you might wish to emulate.

-U agent-string
       --user-agent=agent-string
           Identify as agent-string to the HTTP server.

           The HTTP protocol allows the clients to identify themselves using a
           "User-Agent" header field.  This enables distinguishing the WWW
           software, usually for statistical purposes or for tracing of proto-
           col violations.  Wget normally identifies as Wget/version, version
           being the current version number of Wget.

           However, some sites have been known to impose the policy of tailor-
           ing the output according to the "User-Agent"-supplied information.
           While conceptually this is not such a bad idea, it has been abused
           by servers denying information to clients other than "Mozilla" or
           Microsoft "Internet Explorer".  This option allows you to change
           the "User-Agent" line issued by Wget.  Use of this option is dis-
           couraged, unless you really know what you are doing.

I had wget pretend to be Internet Explorer by using the command below:

wget --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" --quiet --output-document=$outfile $url

After editing my script to use the --user-agent option, the script was able to download the webpage as before, placing the output in the file designated by the $outfile variable in the script and using the URL I specified as an argument to the script.

References:

Masquerading Your Browser
By Eric Giguere
September 19, 2003
Updated October 28, 2004
ericgiguère.com resources for software developers

[/network/web/tools/wget] permanent link

Thu, Jan 31, 2008 4:59 pm