If you wish to download a web page with a Python script, you can import the
urllib2 module into a Python script, as explained at
Downloading a web page with Python. I've modified the script posted there
to allow the webpage URL and output file name to be specified as command
line arguments to the
script:

#!/usr/bin/python

# download_page
# download a webpage to a specified file. The script takes two parameters:
# the URL of the page to download and a file name to be used to hold
# the downloaded web page.

import urllib2, sys

try:
    sys.argv[1]
except IndexError:
    print "Error - URL missing! Usage: ./download_page.py download_page_url outfile"
    sys.exit(1)
else:
    url = sys.argv[1]

try:
    sys.argv[2]
except IndexError:
    print "Error - missing output file name! Usage: ./download_page.py download_page_url outfile"
    sys.exit(1)
else:
    outfile = sys.argv[2]

page = urllib2.urlopen(url)
source = page.read()
downloadFile = open(outfile, 'w')
downloadFile.write(source)
downloadFile.close()

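
Note that urllib2 exists only in Python 2; in Python 3 the equivalent
urlopen function lives in urllib.request. As a rough sketch of how the same
idea might look under Python 3 (this is an adaptation, not the script from
the original post, and the names download_page, main and USAGE are
illustrative choices):

```python
#!/usr/bin/env python3
# Python 3 sketch of the same idea. download_page(), main() and USAGE are
# illustrative names, not part of the original script.
import sys
import urllib.request

USAGE = "Usage: ./download_page.py download_page_url outfile"


def download_page(url, outfile):
    """Fetch url and write the raw response bytes to outfile."""
    with urllib.request.urlopen(url) as page:
        source = page.read()
    # read() returns bytes in Python 3, so the file is opened in binary mode.
    with open(outfile, "wb") as download_file:
        download_file.write(source)
    return len(source)


def main(argv):
    """Check the arguments the same way the original script does."""
    if len(argv) < 2:
        print("Error - URL missing! " + USAGE)
        return 1
    if len(argv) < 3:
        print("Error - missing output file name! " + USAGE)
        return 1
    download_page(argv[1], argv[2])
    return 0

# When saved as a script, finish with: sys.exit(main(sys.argv))
```

Checking len(argv) replaces the try/except IndexError blocks; the error
messages and the exit status of 1 are kept the same as in the script above.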
The sys module is imported to check the command line arguments using
sys.argv[x], where x is the number specifying the argument. sys.argv[0] is
always the name of the script itself, in this case download_page.py, so
sys.argv[1] should be the URL of the webpage to be saved and sys.argv[2]
the file name for the output file. The file name can contain a location for
the output file, e.g., mydir/somepage.html. If a directory is specified with
the file name, the script doesn't check that the directory exists and will
exit with a Python "No such file or directory" error message if it does
not. If no directory path is included with the file name, the downloaded
webpage is stored in the directory from which the script is run.
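
The missing-directory case could be avoided by creating the directory
before opening the output file. A minimal sketch in Python 3 syntax,
assuming it is acceptable to create the directory silently; open_output is a
hypothetical helper and is not part of the script above:

```python
import os


def open_output(outfile, mode="wb"):
    """Open outfile for writing, creating its parent directory if needed.

    Hypothetical helper for illustration only.
    """
    parent = os.path.dirname(outfile)
    # dirname() returns '' when outfile has no directory part; the file then
    # goes in the current working directory and nothing needs creating.
    if parent:
        os.makedirs(parent, exist_ok=True)
    return open(outfile, mode)
```

With this in place, a call such as open_output("mydir/somepage.html") would
create mydir if it does not already exist rather than failing.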
The script will print an error message if the URL or the output file name
is omitted from the command line. It can be run using
python ./download_page.py or, if you have first changed the file
permissions on the program to mark it as executable, e.g., with
chmod 755 download_page.py, using ./download_page.py.
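
The manual checks for missing arguments could also be handed to the
standard argparse module, which prints a usage message and exits with
status 2 when a required positional argument is absent. A sketch in Python 3
syntax; parse_args and the argument names url and outfile are illustrative
and do not come from the original script:

```python
import argparse


def parse_args(argv=None):
    """Parse the two required positional arguments (names are illustrative)."""
    parser = argparse.ArgumentParser(
        prog="download_page.py",
        description="Download a webpage to a specified file.",
    )
    parser.add_argument("url", help="URL of the page to download")
    parser.add_argument("outfile", help="file name for the downloaded page")
    # With argv=None, argparse reads the arguments from sys.argv[1:].
    return parser.parse_args(argv)
```

The returned object exposes the values as args.url and args.outfile, in
place of sys.argv[1] and sys.argv[2].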