I have a script that I run manually to download a particular webpage based on a parameter that I pass to the script. The script downloads the webpage with wget, then parses the webpage for specific information and displays only that information.
The script had been running fine, but today it produced an error message. When I checked what wget was actually retrieving, I found that instead of the desired webpage I was getting the message "Sorry. This page may not be spidered."
When a browser retrieves a webpage, it sends a set of values to the webserver. Those values, which are called "headers", include a "user-agent" header that identifies the browser to the server. E.g. a particular version of Internet Explorer may identify itself as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)".
Some websites use the user-agent header for statistical purposes, e.g. to determine which browsers are most commonly used to access the website. Such information may help a web developer tailor the site to the browsers most commonly used to view it. Or the website developer can use the information to tailor the site's output to the browser being used by a particular visitor. E.g., if a browser doesn't support a particular feature used in the code on the website, the website software can present the viewer with an alternative webpage.
Wget identifies itself as "Wget/x.y.z", where x.y.z is the version of wget in use, e.g. "Wget/1.8.2". So, if you retrieve a webpage with wget, the webserver might see "User-Agent: Wget/1.8.2" as one of the headers submitted to it by the client.
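A site that refuses spiders could be doing something as simple as pattern-matching on that header. The following is only a sketch of the idea, not the actual check used by the site I was accessing; the function name classify_agent and the patterns it matches are assumptions made up for illustration.

```shell
#!/bin/sh
# Hypothetical server-side check: classify a client by its User-Agent value.
# The patterns are illustrative; a real site may use a different rule.
classify_agent() {
    case "$1" in
        Wget/*|wget/*) echo "spider" ;;   # wget's default identification
        Mozilla/*)     echo "browser" ;;  # most browsers start with "Mozilla/"
        *)             echo "unknown" ;;
    esac
}

classify_agent "Wget/1.8.2"
classify_agent "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
```

With a rule like this, the first call prints "spider" and the second prints "browser", so the server's decision depends entirely on the string the client chooses to send.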
In this case, the website hosting the page I wanted to access was seeing "User-Agent: Wget/1.8.2" and denying access to the page. Fortunately, you can use wget's --user-agent option to have wget announce itself to a webserver as any browser you might wish to emulate. The wget man page describes the option as follows:
-U agent-string
--user-agent=agent-string
    Identify as agent-string to the HTTP server.

    The HTTP protocol allows the clients to identify themselves using a
    "User-Agent" header field. This enables distinguishing the WWW
    software, usually for statistical purposes or for tracing of protocol
    violations. Wget normally identifies as Wget/version, version
    being the current version number of Wget.

    However, some sites have been known to impose the policy of tailoring
    the output according to the "User-Agent"-supplied information.
    While conceptually this is not such a bad idea, it has been abused
    by servers denying information to clients other than "Mozilla" or
    Microsoft "Internet Explorer". This option allows you to change
    the "User-Agent" line issued by Wget. Use of this option is
    discouraged, unless you really know what you are doing.
I had wget pretend to be Internet Explorer by using the command below:
wget --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" --quiet --output-document="$outfile" "$url"
After editing my script to use the --user-agent option, the script was able to download the webpage as before, placing the output in the file designated by the $outfile variable in the script and using the URL I specified as an argument to the script.
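Since the failure only showed up when the parsing step broke, it may be worth having the script check for the refusal notice itself. Here is a rough sketch of how the download step could be wrapped; fetch_page and is_refused are helper names I made up for this example, not part of my original script, and the check assumes the site keeps using the same refusal text.

```shell
#!/bin/sh
# Sketch only: fetch_page and is_refused are hypothetical helper names.

# Browser identity wget should announce (the same string used above).
agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

# Download the URL in $1 into the file named by $2, masquerading as a browser.
fetch_page() {
    wget --user-agent="$agent" --quiet --output-document="$2" "$1"
}

# Succeed (exit status 0) if the file in $1 contains the refusal notice.
is_refused() {
    grep -q "This page may not be spidered" "$1"
}

# Example use inside a script (left commented out so the sketch has no side effects):
# fetch_page "$url" "$outfile" || { echo "download failed" >&2; exit 1; }
# if is_refused "$outfile"; then echo "server refused the request" >&2; exit 1; fi
```

Quoting "$outfile" and "$url" keeps the command intact if either value contains spaces or characters such as "&" or "?", which are common in URLs.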
References:
- Masquerading Your Browser, by Eric Giguere, September 19, 2003 (updated October 28, 2004). ericgiguère.com: resources for software developers