Using Firefox Cookies with Wget
If you need to use wget to access a site that relies on HTTP cookies to
control access, you can log into the site with Firefox and use the Firefox
add-on Export Cookies to export all of the cookies stored by Firefox to a
file, e.g. cookies.txt. After installing the add-on, restart Firefox. You
can then click on Tools and choose Export Cookies. Note: if you put Firefox
in private browsing mode, you may not get the cookie you need.
You can then use the cookies file you just exported with wget. E.g.,
presuming the cookies file was named cookies.txt and was in the directory
from which you run wget, you could use the following:
wget --load-cookies=cookies.txt http://example.com/somepage.html
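The exported file is in the Netscape cookies.txt format, which other tools can read as well. As a minimal sketch, Python's standard library can load the same file; the file name, domain, and cookie values below are made up for illustration:

```python
# Sketch: loading a Netscape-format cookies.txt (as produced by the
# Export Cookies add-on) with Python's standard library.
from http.cookiejar import MozillaCookieJar

# Write a minimal cookies.txt in the Netscape format for the demo.
# Fields: domain, include-subdomains, path, secure, expires, name, value
with open("cookies.txt", "w") as f:
    f.write("# Netscape HTTP Cookie File\n")
    f.write(".example.com\tTRUE\t/\tFALSE\t2147483647\tsession\tabc123\n")

jar = MozillaCookieJar("cookies.txt")
jar.load()  # raises LoadError if the file is not in the expected format

for cookie in jar:
    print(cookie.name, cookie.value)  # session abc123
```

A jar loaded this way can be passed to urllib via an HTTPCookieProcessor, mirroring what wget does with --load-cookies.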
[/network/web/tools/wget]
permanent link
Determining if Wget Supports SSL
I needed to write a Bash script that will use Wget to download webpages
from a secure website using HTTPS. To use Wget for this purpose, you need
a version of Wget compiled with SSL support. You can determine whether
wget on a particular system was compiled with SSL support using the
command wget --help | grep HTTPS.
Output on a system where wget has SSL support:
$ wget --help | grep HTTPS
HTTPS (SSL/TLS) options:
Output on a system where wget does not have SSL support:
$ wget --help | grep HTTPS
$
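The same check can be scripted. Here is a minimal sketch in Python; the canned help-text samples are made up so the sketch is self-contained, and in a real script you would capture wget's actual output:

```python
# Sketch: detect SSL support by looking for the "HTTPS (SSL/TLS) options"
# section in wget's --help output, mirroring the grep above.

def wget_has_ssl(help_text: str) -> bool:
    """Return True if the --help output advertises HTTPS options."""
    return "HTTPS" in help_text

# In a real script you would capture wget's own output, e.g. with
# subprocess.run(["wget", "--help"], capture_output=True, text=True).stdout.
# Canned samples stand in for that here:
with_ssl = "Startup:\n  -V, --version\nHTTPS (SSL/TLS) options:\n"
without_ssl = "Startup:\n  -V, --version\n"

print(wget_has_ssl(with_ssl))     # True
print(wget_has_ssl(without_ssl))  # False
```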
[/network/web/tools/wget]
permanent link
Make wget Pretend to Be Internet Explorer
I have a script that I manually run to download a particular webpage based
on a parameter that I submit to the script. The script downloads the
webpage with wget, then parses the webpage for specific information and
displays only that information.
The script had been running fine until today, but produced an error message
when I ran it today. When I checked the information being retrieved by
wget, I found that instead of the desired webpage, I was getting
"Sorry. This page may not be spidered."
When a browser retrieves a webpage, it sends a set of values to the webserver.
Those values, which are called "headers", include a "user-agent" header
that identifies the browser to the server. E.g. a particular version of
Internet Explorer may identify itself as "Mozilla/4.0 (compatible; MSIE 6.0;
Windows NT 5.0)".
Some websites may use the user-agent header for statistical purposes, e.g.
to determine which browsers are most commonly used to access the website.
Such information may help a web developer tailor the site to the browsers
most commonly used to view it. Or the website developer can use the
information to tailor its output to the browser being used by a particular
user. E.g., if a browser doesn't support a particular feature used in the
code on the website, the website software can present the viewer with
an alternative webpage.
Wget identifies itself as "Wget/x.y.z", where x.y.z is the version of wget
in use, e.g. "Wget/1.8.2". So, if you retrieve a webpage with wget, the
webserver might see "User-Agent: Wget/1.8.2" as one of the headers
submitted to it by the client.
In this case the website where the page I wanted to access resided was
seeing "User-Agent: Wget/1.8.2" and denying access to the page.
Fortunately, you can use the --user-agent argument for wget to have wget
announce itself to a webserver as any browser you might wish to emulate.
-U agent-string
--user-agent=agent-string
    Identify as agent-string to the HTTP server.

    The HTTP protocol allows the clients to identify themselves using a
    "User-Agent" header field. This enables distinguishing the WWW
    software, usually for statistical purposes or for tracing of protocol
    violations. Wget normally identifies as Wget/version, version being
    the current version number of Wget.

    However, some sites have been known to impose the policy of tailoring
    the output according to the "User-Agent"-supplied information. While
    conceptually this is not such a bad idea, it has been abused by
    servers denying information to clients other than "Mozilla" or
    Microsoft "Internet Explorer". This option allows you to change the
    "User-Agent" line issued by Wget. Use of this option is discouraged,
    unless you really know what you are doing.
I had wget pretend to be Internet Explorer by using the command below:
wget --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" --quiet --output-document=$outfile $url
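The same override can be done from other HTTP clients. As a sketch for comparison, Python's standard urllib lets you set the User-Agent header on a request; the URL here is just a placeholder:

```python
# Sketch: sending a custom User-Agent header with Python's urllib,
# analogous to wget's --user-agent option.
import urllib.request

ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
req = urllib.request.Request(
    "http://example.com/somepage.html",
    headers={"User-Agent": ua},  # replaces urllib's default agent string
)

# urllib normalizes header names to "Xxxx-yyyy" capitalization internally.
print(req.get_header("User-agent"))
# urllib.request.urlopen(req) would then fetch the page with this header.
```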
After editing my script to use the --user-agent
option, the
script was able to download the webpage as before, placing the output
in the file designated by the $outfile
variable in the
script and using the URL I specified as an argument to the script.
References:
- Masquerading Your Browser, by Eric Giguere, September 19, 2003
  (updated October 28, 2004), ericgiguere.com resources for software
  developers
[/network/web/tools/wget]
permanent link