I need to download and view Excel workbooks, Portable Document Format (PDF) files, and sometimes other types of documents related to work requests from a website. When I view the webpage for a particular work request, some of the documents may have been posted weeks or months ago while others are more recent, but when I download them they all get the timestamp of the moment I downloaded them. I would like to change those timestamps to match the ones shown on the website. On the webpage for a particular work request, I see the following information for files:
| Type | Name | Site | Modified | Modified By |
|---|---|---|---|---|
| | CRQ000000884164_PDSO | -None- | 3/6/2017 4:53 PM | Smith, Gary |
| | CRQ000000884164_PDSO | -None- | 3/6/2017 4:52 PM | Smith, Gary |
| | CRQ000000884164_DCS | -None- | 6/12/2017 9:29 AM | Doe, Mike |
| | CRQ 884164_SDP | -None- | 6/12/2017 9:30 AM | Doe, Mike |
I click on the name to download the file and it is stored on my system with a timestamp reflecting the time I downloaded the file. A webpage for a request may have many more files than just those four listed in the above example. As a first step, I wanted to extract the file names from the webpage, so I downloaded a webpage for a request; in the Firefox browser you can do that by clicking on File then selecting Save Page As. I selected "Web Page, HTML only" as the format, since I don't need any graphics, etc. from the page, just the HTML code. To extract the file names I use the following code:
    #!/usr/bin/python

    import os, re, sys

    try:
        sys.argv[1]
    except IndexError:
        print "Error - missing CRQ number! Usage ./getTimeStamps.py crq_num infile"
        sys.exit(1)
    else:
        crq = sys.argv[1]

    try:
        sys.argv[2]
    except IndexError:
        print "Error - missing input file name! Usage ./getTimeStamps.py crq_num infile"
        sys.exit(1)
    else:
        infile = sys.argv[2]

    # Check on whether file exists and is accessible
    if not os.path.isfile(infile):
        print "Error - input file", infile, "is not accessible!"
        sys.exit(1)
    else:
        f = open(infile, "r")

    # The lines containing links to files are in the form
    # href="/sites/cso/crew/Shared Documents/CRQ000000884164/CRQ000000884164_PDSO.xlsm"
    searchString = "/sites/cso/crew/Shared Documents/" + crq + "/"

    for line in f:
        for m in re.finditer(searchString, line):
            # The file name runs from the end of the match to the next double quote
            substring = line[m.end():None]
            quotePos = substring.find('"')
            filename = substring[0:quotePos]
            print filename
When I run the script, it produces output like the following, listing the files linked from the page I downloaded, which I named CRQ000000884164-Documents.html in this case:
    $ ../getTimeStamps.py CRQ000000884164 ../CRQ000000884164-Documents.html
    CRQ000000884164_PDSO.xlsm
    CRQ000000884164_PDSO.pdf
    CRQ000000884164_DCS.pdf
    CRQ 884164_SDP.xlsm
    884164_PDSO-3.xlsm
    884164_PDSO-2.xlsm
    $
I import the os and sys modules so that I can check for the existence of the file specified on the command line and exit from the Python script with an error message if it isn't found or isn't accessible by the script. I specify the Change Request (CRQ) number on the command line because the number associated with each work request is part of the directory path in the Uniform Resource Locator (URL) for downloading a file associated with that request. That allows me to search on the request number to find the places in the HTML code where there are links for downloading the files for a request. The first argument to the script is the CRQ number and the second is the location and name of the file that holds the HTML code for the webpage I downloaded.
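The same two-argument check can also be sketched with the standard library's argparse module, which generates the usage message on its own. This is an alternative under my own assumptions, not what the script above does: the function name parse_args and the argument names crq_num and infile are hypothetical, chosen to match the usage text.

```python
import argparse
import os


def parse_args(argv):
    """Parse the CRQ number and the saved HTML file name from argv."""
    parser = argparse.ArgumentParser(
        description="List the files linked from a saved work request page.")
    parser.add_argument("crq_num", help="CRQ number, e.g. CRQ000000884164")
    parser.add_argument("infile", help="saved HTML file for the request page")
    args = parser.parse_args(argv)
    # Mirror the script's os.path.isfile() check on the input file.
    if not os.path.isfile(args.infile):
        parser.error("input file %s is not accessible!" % args.infile)
    return args
```

In a script it would be called as parse_args(sys.argv[1:]). When an argument is missing or the file can't be found, argparse prints the usage line and exits with a nonzero status.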
After opening the input file containing the HTML code, for every line in the file I search for searchString, which is the string, i.e., text, that identifies the URL for the file. The file name immediately follows searchString and is followed by a double quote character ("), which is followed by </a>, but I only need to look for that double quotation mark. The HTML code on the page may have multiple instances of files on one line, i.e., there's no clean break of one file per line in the code, so I use the re module; "re" is shorthand for "regular expression." Regular expressions provide a powerful way to parse strings. The re module provides many functions, including the following one; see the Python documentation page for re, "Regular expression operations", for others.
re.finditer(pattern, string, flags=0)
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.
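To make that behavior concrete, here is a small sketch that runs re.finditer over a made-up line holding two links. The sample line is my own invention; the real page has far more markup around each link.

```python
import re

# Hypothetical one-line sample with two links, mimicking the saved page.
prefix = "/sites/cso/crew/Shared Documents/CRQ000000884164/"
line = ('<a href="' + prefix + 'CRQ000000884164_PDSO.xlsm">x</a>'
        '<a href="' + prefix + 'CRQ000000884164_PDSO.pdf">y</a>')

# Each match object records where one occurrence of the prefix starts
# and where it ends (the index just past its last character).
spans = [(m.start(), m.end()) for m in re.finditer(prefix, line)]
```

Both occurrences on the single line are found, in left-to-right order, which is what lets the script cope with several links on one line.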
The line of code for m in re.finditer(searchString, line): finds all instances of the search string in a particular line in the input file, with m.start() noting the character position in the line where the search string starts and m.end() noting the position just past its last character. E.g., if I added a print command, such as the one below, immediately after for m in re.finditer(searchString, line):
    for m in re.finditer(searchString, line):
        print "Start: ", m.start(), " End: ", m.end()
Then, if I ran the script I would see the output below:
    $ ../getTimeStamps.py CRQ000000884164 ../CRQ000000884164-Documents.html
    Start: 13272 End: 13321
    CRQ000000884164_PDSO.xlsm
    Start: 16085 End: 16134
    CRQ000000884164_PDSO.pdf
    Start: 18852 End: 18901
    CRQ000000884164_DCS.pdf
    Start: 21634 End: 21683
    CRQ 884164_SDP.xlsm
    Start: 30663 End: 30712
    884164_PDSO-3.xlsm
    Start: 33443 End: 33492
    884164_PDSO-2.xlsm
    $
The difference between the start and end values in each case is 49 characters, e.g. 13321 - 13272 = 49. That is because the string I'm searching on is forty-nine characters long. I.e.:
    $ python
    Python 2.7.10 (default, Oct 23 2015, 19:19:21)
    [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> crq = "CRQ000000884164"
    >>> searchString = "/sites/cso/crew/Shared Documents/" + crq + "/"
    >>> len(searchString)
    49
    >>> exit()
    $
But I'm only interested in m.end() for each occurrence of searchString, because I know that the file name starts at position m.end(). I don't know how long the file name may be, but I know it is terminated by a double quote character ("). So I look for the position of that character in the variable named substring, which holds the rest of the line from m.end() onward, and set quotePos to that position. I can then set the file name to the part of substring from position zero up to, but not including, the double quote, and display the file name.
    quotePos = substring.find('"')
    filename = substring[0:quotePos]
    print filename
By looping through every line in the file, I can find all occurrences of the search string throughout the file, including multiple occurrences on a single line.
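The search-then-slice steps can also be folded into a single pattern with a capture group. The sketch below is a variation I'm assuming would behave the same on this page, not the code the script uses; the function name extract_filenames is hypothetical. It also passes the prefix through re.escape, since the CRQ number comes from the command line and ends up inside a regular expression.

```python
import re


def extract_filenames(html_text, crq):
    """Return the file names that follow the request's folder in html_text."""
    prefix = "/sites/cso/crew/Shared Documents/" + crq + "/"
    # re.escape() makes the prefix match literally; ([^"]*) captures
    # everything up to, but not including, the closing double quote.
    pattern = re.compile(re.escape(prefix) + r'([^"]*)"')
    return pattern.findall(html_text)
```

One difference from the script: a link with no closing double quote on the same line is skipped here, whereas the find() approach would return -1 and drop the last character of the line.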
After getting the file names from the webpage, I want to get the timestamps displayed on the page for each file as well. To do so, I close and reopen the input file and check every line again for a timestamp in the form m/d/yyyy h:mm followed by AM or PM, e.g., 3/6/2017 4:53 PM; only one digit is used for days or months that are less than 10. Looping through the file again may not be the most efficient means of extracting the timestamp for each file, but since it is unlikely that there will even be a dozen files on the page, I'm not concerned about Central Processing Unit (CPU) time or the time to read lines from the file for such a small number of files. So I modified the code for the Python script as follows:
    #!/usr/bin/python

    import os, re, sys

    try:
        sys.argv[1]
    except IndexError:
        print "Error - missing CRQ number! Usage ./getTimeStamps.py crq_num infile"
        sys.exit(1)
    else:
        crq = sys.argv[1]

    try:
        sys.argv[2]
    except IndexError:
        print "Error - missing input file name! Usage ./getTimeStamps.py crq_num infile"
        sys.exit(1)
    else:
        infile = sys.argv[2]

    # Check on whether file exists and is accessible
    if not os.path.isfile(infile):
        print "Error - input file", infile, "is not accessible!"
        sys.exit(1)
    else:
        f = open(infile, "r")

    # The lines containing links to files are in the form
    # href="/sites/cso/crew/Shared Documents/CRQ000000884164/CRQ000000884164_PDSO.xlsm"
    searchString = "/sites/cso/crew/Shared Documents/" + crq + "/"

    # The timestamps on files are in the form 7/27/2017 11:43 AM
    regexTimeStamp = "\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2} [AP]M"

    # First pass: collect the file names
    filenameList = []
    for line in f:
        for m in re.finditer(searchString, line):
            substring = line[m.end():None]
            quotePos = substring.find('"')
            filename = substring[0:quotePos]
            filenameList.append(filename)
    f.close()

    # Second pass: collect the timestamps
    timestampList = []
    f = open(infile, "r")
    for line in f:
        for m in re.finditer(regexTimeStamp, line):
            timestamp = line[m.start():m.end()]
            timestampList.append(timestamp)
    f.close()

    # Print each file name next to the timestamp at the same list position
    i = 0
    while i < len(filenameList):
        print filenameList[i], timestampList[i]
        i += 1
Since the timestamps on files are in the form 7/27/2017 11:43 AM I use the following regular expression (regexp) pattern for the search:
regexTimeStamp = "\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2} [AP]M"
The \d represents any digit, i.e., 0 through 9, and the two numbers in curly braces after \d indicate the minimum and maximum occurrences expected for the preceding element, so \d{1,2}/ tells Python to look for one or two digits followed by a slash character. Since four digits are used for the year, I use \d{4} to specify that exactly four digits must be found; in the pattern the year is followed by a space rather than a slash. The square brackets are used to enclose characters when any of the characters within the brackets is an acceptable match, so [AP]M will match "AM" or "PM". For more information on regular expression matching you can read the Wikipedia regular expression article or see Regular expression syntax.
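As a quick check of the pattern, the sketch below runs it against a made-up fragment and converts each match to a datetime object. The function name find_timestamps and the sample text in the usage note are hypothetical. The strptime format string is my assumption about how to read these stamps, and %p depends on the locale recognizing AM and PM. The pattern is written as a raw string (r"...") so the backslashes reach the re module unchanged.

```python
import re
from datetime import datetime

# Same pattern as in the script, written as a raw string.
TIMESTAMP_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2} [AP]M")


def find_timestamps(text):
    """Return datetime objects for every m/d/yyyy h:mm AM/PM stamp in text."""
    # %I is the 12-hour clock hour and %p the AM/PM marker.
    return [datetime.strptime(stamp, "%m/%d/%Y %I:%M %p")
            for stamp in TIMESTAMP_RE.findall(text)]
```

For example, find_timestamps("3/6/2017 4:53 PM and 6/12/2017 9:29 AM") yields two datetime values, the first with the hour converted to 16.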
Since the timestamp values only occur next to a file name on the webpage, I create two lists: one of file names, built the first time I loop through the file, and one of timestamps, built on the second pass. I then print the file name and timestamp together for each file, so the output will be as in the example below:
    $ ~/Documents/getTimeStamps.py CRQ000000884164 CRQ000000884164-Documents.html
    CRQ000000884164_PDSO.xlsm 3/6/2017 4:53 PM
    CRQ000000884164_PDSO.pdf 3/6/2017 4:52 PM
    CRQ000000884164_DCS.pdf 6/12/2017 9:29 AM
    CRQ 884164_SDP.xlsm 6/12/2017 9:30 AM
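With file names and timestamps paired up, one possible next step toward the original goal is to apply each timestamp to the downloaded file with os.utime. The sketch below is only an illustration under several assumptions: the function names apply_timestamp and apply_all are hypothetical, the page's times are treated as local time, the two lists are assumed to be in the same order, and the downloaded files are assumed to keep the names shown in the links.

```python
import os
import time
from datetime import datetime


def apply_timestamp(path, stamp):
    """Set the access and modification times of path from a page timestamp."""
    parsed = datetime.strptime(stamp, "%m/%d/%Y %I:%M %p")
    # Assumption: the page shows local time, so convert with time.mktime().
    epoch = time.mktime(parsed.timetuple())
    os.utime(path, (epoch, epoch))
    return epoch


def apply_all(directory, filenames, timestamps):
    """Apply each timestamp to the file at the same position in the lists."""
    # zip() stops at the shorter list, so a missing timestamp cannot
    # raise IndexError the way indexing two lists with one counter can.
    for name, stamp in zip(filenames, timestamps):
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # skip files that were not downloaded
            apply_timestamp(path, stamp)
```

It would be called with the directory holding the downloads plus the two lists the script builds, e.g. apply_all(".", filenameList, timestampList).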