Using Firefox Cookies with Wget
If you need to use wget to access a site that relies on HTTP cookies to
control access, you can log into the site with Firefox and use the Firefox
add-on Export Cookies to export all of the cookies stored by Firefox to a
file, e.g. cookies.txt. After installing the add-on, restart Firefox. You
can then click on Tools and choose Export Cookies. Note: if you put Firefox
in private browsing mode, you may not get the cookie you need.
You can then use the cookies file you just exported with wget. E.g.,
presuming the cookies file was named cookies.txt and was in the directory
from which you run wget, you could use the following:
wget --load-cookies=cookies.txt http://example.com/somepage.html
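The exported file is in the Netscape cookies.txt format, which other tools can read as well. As a minimal sketch, Python's standard library can load the same file; the file name, domain, and cookie values below are made up for illustration:

```python
# Sketch: loading a Netscape-format cookies.txt (as produced by the
# Export Cookies add-on) with Python's standard library.
from http.cookiejar import MozillaCookieJar

# Write a minimal cookies.txt in the Netscape format for the demo.
# Fields: domain, include-subdomains, path, secure, expires, name, value
with open("cookies.txt", "w") as f:
    f.write("# Netscape HTTP Cookie File\n")
    f.write(".example.com\tTRUE\t/\tFALSE\t2147483647\tsession\tabc123\n")

jar = MozillaCookieJar("cookies.txt")
jar.load()  # raises LoadError if the file is not in the expected format

for cookie in jar:
    print(cookie.name, cookie.value)  # session abc123
```

A jar loaded this way can be passed to urllib via an HTTPCookieProcessor, mirroring what wget does with --load-cookies.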
[/network/web/tools/wget]
permanent link
Determining if Wget Supports SSL
I needed to write a Bash script that will use Wget to download webpages
from a secure website using HTTPS. To use Wget for this purpose, you need
a version of Wget compiled with SSL support. You can determine whether
wget on a particular system was compiled with SSL support using the
command wget --help | grep HTTPS.
Output on a system where wget has SSL support:
$ wget --help | grep HTTPS
HTTPS (SSL/TLS) options:
Output on a system where wget does not have SSL support:
$ wget --help | grep HTTPS
$
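The same check can be scripted. Here is a minimal sketch in Python; the canned help-text samples are made up so the sketch is self-contained, and in a real script you would capture wget's actual output:

```python
# Sketch: detect SSL support by looking for the "HTTPS (SSL/TLS) options"
# section in wget's --help output, mirroring the grep above.

def wget_has_ssl(help_text: str) -> bool:
    """Return True if the --help output advertises HTTPS options."""
    return "HTTPS" in help_text

# In a real script you would capture wget's own output, e.g. with
# subprocess.run(["wget", "--help"], capture_output=True, text=True).stdout.
# Canned samples stand in for that here:
with_ssl = "Startup:\n  -V, --version\nHTTPS (SSL/TLS) options:\n"
without_ssl = "Startup:\n  -V, --version\n"

print(wget_has_ssl(with_ssl))     # True
print(wget_has_ssl(without_ssl))  # False
```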
[/network/web/tools/wget]
permanent link
Make wget Pretend to Be Internet Explorer
I have a script that I manually run to download a particular webpage based
on a parameter that I submit to the script. The script downloads the
webpage with wget, then parses the webpage for specific information and
displays only that information.
The script had been running fine until today, but produced an error message
when I ran it today. When I checked the information being retrieved by
wget, I found that instead of the desired webpage, I was getting
"Sorry. This page may not be spidered."
When a browser retrieves a webpage, it sends a set of values to the webserver.
Those values, which are called "headers", include a "user-agent" header
that identifies the browser to the server. E.g. a particular version of
Internet Explorer may identify itself as "Mozilla/4.0 (compatible; MSIE 6.0;
Windows NT 5.0)".
Some websites may use the user-agent header for statistical purposes, e.g.
to determine which browsers are most commonly used to access the website.
Such information may help a web developer tailor the site to the browsers
most commonly used to view it. Or the website developer can use the
information to tailor its output to the browser being used by a particular
user. E.g., if a browser doesn't support a particular feature used in the
code on the website, the website software can present the viewer with
an alternative webpage.
Wget identifies itself as "Wget/x.y.z", where x.y.z is the version of wget
in use, e.g. "Wget/1.8.2". So, if you retrieve a webpage with wget, the
webserver might see "User-Agent: Wget/1.8.2" as one of the headers
submitted to it by the client.
In this case the website where the page I wanted to access resided was
seeing "User-Agent: Wget/1.8.2" and denying access to the page.
Fortunately, you can use the --user-agent argument for wget to have wget
announce itself to a webserver as any browser you might wish to emulate.
-U agent-string
--user-agent=agent-string
    Identify as agent-string to the HTTP server.

    The HTTP protocol allows the clients to identify themselves using a
    "User-Agent" header field. This enables distinguishing the WWW
    software, usually for statistical purposes or for tracing of protocol
    violations. Wget normally identifies as Wget/version, version being
    the current version number of Wget.

    However, some sites have been known to impose the policy of tailoring
    the output according to the "User-Agent"-supplied information. While
    conceptually this is not such a bad idea, it has been abused by
    servers denying information to clients other than "Mozilla" or
    Microsoft "Internet Explorer". This option allows you to change the
    "User-Agent" line issued by Wget. Use of this option is discouraged,
    unless you really know what you are doing.
I had wget pretend to be Internet Explorer by using the command below:
wget --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" --quiet --output-document=$outfile $url
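The same override can be done from other HTTP clients. As a sketch for comparison, Python's standard urllib lets you set the User-Agent header on a request; the URL here is just a placeholder:

```python
# Sketch: sending a custom User-Agent header with Python's urllib,
# analogous to wget's --user-agent option.
import urllib.request

ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
req = urllib.request.Request(
    "http://example.com/somepage.html",
    headers={"User-Agent": ua},  # replaces urllib's default agent string
)

# urllib normalizes header names to "Xxxx-yyyy" capitalization internally.
print(req.get_header("User-agent"))
# urllib.request.urlopen(req) would then fetch the page with this header.
```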
After editing my script to use the --user-agent
option, the
script was able to download the webpage as before, placing the output
in the file designated by the $outfile
variable in the
script and using the URL I specified as an argument to the script.
References:
- Masquerading Your Browser, by Eric Giguere, September 19, 2003
  (updated October 28, 2004), ericgiguere.com resources for software
  developers
[/network/web/tools/wget]
permanent link