Find all occurrences of a string in a file using Python

I need to download and view Excel workbooks, Portable Document Format (PDF) files, and sometimes other types of documents related to work requests from a website. When I view the webpage for a particular work request, some of the documents may have been posted weeks or months ago while others are more recent. When I download them, however, they all get the timestamp of the time I downloaded them, and I would like to change those timestamps to match the ones shown on the website. On the webpage for a particular work request, I see the following information for files:

Type       Name                  Site    Modified           Modified By
xlsm icon  CRQ000000884164_PDSO  -None-  3/6/2017 4:53 PM   Smith, Gary
pdf icon   CRQ000000884164_PDSO  -None-  3/6/2017 4:52 PM   Smith, Gary
pdf icon   CRQ000000884164_DCS   -None-  6/12/2017 9:29 AM  Doe, Mike
xlsm icon  CRQ 884164_SDP        -None-  6/12/2017 9:30 AM  Doe, Mike

I click on the name to download the file and it is stored on my system with a timestamp reflecting the time I downloaded the file. A webpage for a request may have many more files than just those four listed in the above example. As a first step, I wanted to extract the file names from the webpage, so I downloaded a webpage for a request; in the Firefox browser you can do that by clicking on File then selecting Save Page As. I selected "Web Page, HTML only" as the format, since I don't need any graphics, etc. from the page, just the HTML code. To extract the file names I use the following code:



#!/usr/bin/python

import os, re, sys

# The CRQ number is the first command-line argument
try:
    sys.argv[1]
except IndexError:
    print "Error - missing CRQ number! Usage ./getTimeStamps.py crq_num infile"
    sys.exit(1)
else:
    crq = sys.argv[1]

# The saved HTML file is the second command-line argument
try:
    sys.argv[2]
except IndexError:
    print "Error - missing input file name! Usage ./getTimeStamps.py crq_num infile"
    sys.exit(1)
else:
    infile = sys.argv[2]

# Check on whether file exists and is accessible
if not os.path.isfile(infile):
    print "Error - input file", infile, "is not accessible!"
    sys.exit(1)
else:
    f = open(infile, "r")


# The lines containing links to files are in the form
# href="/sites/cso/crew/Shared Documents/CRQ000000884164/CRQ000000884164_PDSO.xlsm"
searchString = "/sites/cso/crew/Shared Documents/" + crq + "/"

for line in f:
    for m in re.finditer(searchString, line):
        # The file name runs from the end of the match to the next double quote
        substring = line[m.end():]
        quotePos = substring.find('"')
        filename = substring[0:quotePos]
        print filename
f.close()

When I run the script, it will produce output like the following displaying the files listed in the page I downloaded and named CRQ000000884164-Documents.html in this case:

$ ../getTimeStamps.py CRQ000000884164 ../CRQ000000884164-Documents.html
CRQ000000884164_PDSO.xlsm
CRQ000000884164_PDSO.pdf
CRQ000000884164_DCS.pdf
CRQ 884164_SDP.xlsm
884164_PDSO-3.xlsm
884164_PDSO-2.xlsm
$

I import the os and sys modules so I can check for the existence of the file specified on the command line and exit from the Python script with an error message if it isn't found or isn't accessible by the script. I specify the Change Request (CRQ) number on the command line because the number associated with each work request is part of the directory path in the Uniform Resource Locator (URL) for downloading a file associated with that request. That allows me to search on the request number to find the places in the HTML code where there are links for downloading the files for a request. The first argument to the script is the CRQ number and the second is the location and name of the file that holds the HTML code for the webpage I downloaded.
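The two separate argument checks could also be collapsed into a single length test. The following is only a sketch of that idea, not the code the script uses; the function name get_args and the choice to raise ValueError are assumptions made for illustration. It uses print() call syntax so that it runs under Python 3 as well as Python 2.7.

```python
USAGE = "Usage: ./getTimeStamps.py crq_num infile"

def get_args(argv):
    """Return (crq, infile) from an argv-style list, or raise ValueError."""
    # argv[0] is the script name, so two arguments mean a length of at least 3
    if len(argv) < 3:
        raise ValueError("Error - missing argument! " + USAGE)
    return argv[1], argv[2]

# Demonstration with an explicit list standing in for sys.argv
crq, infile = get_args(["getTimeStamps.py", "CRQ000000884164",
                        "CRQ000000884164-Documents.html"])
print(crq + " " + infile)
```

In a real script the list passed in would be sys.argv, and the ValueError would be caught to print the message and call sys.exit(1).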

After opening the input file containing the HTML code, for every line in the file, I search for the "searchString", which is the string, i.e., text, that identifies the URL for the file. The file name immediately follows searchString and the file name is followed by a double quote character ("), which is followed by </a>, but I only need to look for that double quotation mark. The HTML code on the page may have multiple instances of files on one line, i.e., there's no clean break of one file per line in the code, so I use the re module; "re" is shorthand for "regular expression." Regular expressions provide a powerful way to parse strings. The re module provides many functions, including the following one - see re — Regular expression operations for others.

re.finditer(pattern, string, flags=0)

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.
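As a small illustration of that behavior, the sketch below runs re.finditer over a single line that holds two links. The line of HTML and the shortened /docs/CRQ1/ prefix are made up for the example. It also wraps the prefix in re.escape(), which the script itself does not do; that is an optional precaution so that any character in the prefix that has a special meaning in a regular expression is treated literally.

```python
import re

# Hypothetical line of HTML with two links on the same line
line = ('<a href="/docs/CRQ1/a.xlsm">a</a>'
        '<a href="/docs/CRQ1/b.pdf">b</a>')

# Treat the prefix as literal text rather than as regex syntax
pattern = re.escape("/docs/CRQ1/")

# One (start, end) pair per non-overlapping match, in left-to-right order
positions = [(m.start(), m.end()) for m in re.finditer(pattern, line)]
print(positions)
```

Each pair gives the position where the prefix begins and the position just past its last character, which is where the file name starts.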

The line of code for m in re.finditer(searchString, line): finds all instances of the search string in a particular line in the input file, with m.start() giving the character position in the line where the search string starts and m.end() giving the position immediately after its last character. E.g., suppose I added a print command, such as the one below, immediately after for m in re.finditer(searchString, line):

     for m in re.finditer(searchString, line):
         print "Start: ", m.start(), " End: ", m.end()

Then, if I ran the script I would see the output below:

$ ../getTimeStamps.py CRQ000000884164 ../CRQ000000884164-Documents.html
Start:  13272  End:  13321
CRQ000000884164_PDSO.xlsm
Start:  16085  End:  16134
CRQ000000884164_PDSO.pdf
Start:  18852  End:  18901
CRQ000000884164_DCS.pdf
Start:  21634  End:  21683
CRQ 884164_SDP.xlsm
Start:  30663  End:  30712
884164_PDSO-3.xlsm
Start:  33443  End:  33492
884164_PDSO-2.xlsm
$

The difference between the start and end values in each case is 49 characters, e.g. 13321 - 13272 = 49. That is because the string I'm searching on is forty-nine characters long. I.e.:

$ python
Python 2.7.10 (default, Oct 23 2015, 19:19:21) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> crq = "CRQ000000884164"
>>> searchString = "/sites/cso/crew/Shared Documents/" + crq + "/"
>>> len(searchString)
49
>>> exit()
$

But I'm only interested in m.end() for each occurrence of searchString, because I know that the file name starts at position m.end(). I don't know how long the file name may be, but I know it is terminated by a double quote, i.e., ". So I look for the position of that character in the variable named substring, which holds the rest of the line after the match, and set the value of quotePos to that position. I can then set the file name to be the portion of substring from position zero up to, but not including, the position of the double quote. I can then display the file name.

quotePos = substring.find('"')
filename = substring[0:quotePos]
print filename
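The same steps can be gathered into a function, sketched below. This is an illustration rather than the script's code: the function name extract_filenames is hypothetical, the sample HTML is invented from the link format shown earlier, and the check for a missing quote is an addition. The check is there because find() returns -1 when the character is absent, and a slice ending at -1 would silently drop the last character instead of failing.

```python
import re

def extract_filenames(line, prefix):
    """Return the file names that follow each occurrence of prefix in line."""
    names = []
    for m in re.finditer(re.escape(prefix), line):
        rest = line[m.end():]
        quote_pos = rest.find('"')
        if quote_pos == -1:
            # No closing quote on this line, so there is no complete name
            continue
        names.append(rest[:quote_pos])
    return names

prefix = "/sites/cso/crew/Shared Documents/CRQ000000884164/"
sample = ('<a href="' + prefix + 'CRQ000000884164_PDSO.xlsm">x</a>'
          '<a href="' + prefix + 'CRQ 884164_SDP.xlsm">y</a>')
print(extract_filenames(sample, prefix))
```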

By looping through every line in the file and using re.finditer on each one, I find all occurrences of the string throughout the file, including cases where the string occurs multiple times on a single line.

After getting the file names from the webpage, I want to get the timestamps displayed on the page for each file as well. To do so, I close and reopen the input file and check every line again for timestamps in the form m/d/yyyy h:mm followed by AM or PM, e.g., 3/6/2017 4:53 PM; only one digit is used for months, days, or hours that are less than 10. Looping through the file again may not be the most efficient means of extracting the timestamp for each file, but since it is unlikely that there will even be a dozen files on the page, I'm not concerned about Central Processing Unit (CPU) time or the time needed to read the file a second time. So I modified the code for the Python script as follows:

#!/usr/bin/python

import os, re, sys

# The CRQ number is the first command-line argument
try:
    sys.argv[1]
except IndexError:
    print "Error - missing CRQ number! Usage ./getTimeStamps.py crq_num infile"
    sys.exit(1)
else:
    crq = sys.argv[1]

# The saved HTML file is the second command-line argument
try:
    sys.argv[2]
except IndexError:
    print "Error - missing input file name! Usage ./getTimeStamps.py crq_num infile"
    sys.exit(1)
else:
    infile = sys.argv[2]

# Check on whether file exists and is accessible
if not os.path.isfile(infile):
    print "Error - input file", infile, "is not accessible!"
    sys.exit(1)
else:
    f = open(infile, "r")

# The lines containing links to files are in the form
# href="/sites/cso/crew/Shared Documents/CRQ000000884164/CRQ000000884164_PDSO.xlsm"
searchString = "/sites/cso/crew/Shared Documents/" + crq + "/"
# The timestamps on files are in the form 7/27/2017 11:43 AM
regexTimeStamp = r"\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2} [AP]M"

filenameList = []
for line in f:
    for m in re.finditer(searchString, line):
        substring = line[m.end():]
        quotePos = substring.find('"')
        filename = substring[0:quotePos]
        filenameList.append(filename)
f.close()

# Second pass over the file to collect the timestamps
timestampList = []
f = open(infile, "r")
for line in f:
    for m in re.finditer(regexTimeStamp, line):
        timestamp = line[m.start():m.end()]
        timestampList.append(timestamp)
f.close()

# Print each file name with the timestamp found at the same position;
# zip() stops at the shorter list, so a mismatch in counts cannot raise IndexError
for filename, timestamp in zip(filenameList, timestampList):
    print filename, timestamp

Since the timestamps on files are in the form 7/27/2017 11:43 AM I use the following regular expression (regexp) pattern for the search:

regexTimeStamp = r"\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2} [AP]M"

The \d represents any digit, i.e., 0 through 9, and the two numbers in curly braces after \d indicate the minimum and maximum number of occurrences expected for the preceding element, so \d{1,2}/ tells Python to look for one or two digits followed by a slash character. Since four digits are used for the year, I use \d{4} to specify that exactly four digits must be found; in the pattern it is followed by the space that separates the date from the time. The square brackets are used to enclose characters when any of the characters within the brackets is an acceptable match, so [AP]M will match "AM" or "PM". For more information on regular expression matching you can read the Wikipedia regular expression article or see Regular expression syntax.
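A quick way to check a pattern like this is to try it on a short string before running it against the whole page. The sketch below does that with re.findall; the sample text is invented for the example, and the pattern is the same one used in the script, written as a raw string so the backslashes reach the re module unchanged.

```python
import re

# Same pattern as in the script, written as a raw string
regexTimeStamp = r"\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2} [AP]M"

# Hypothetical text mixing two timestamps with other content
text = "Modified 3/6/2017 4:53 PM by Smith; later 6/12/2017 9:29 AM by Doe"

# findall returns the matched text for every non-overlapping match
found = re.findall(regexTimeStamp, text)
print(found)

# A date in a different format should not match at all
print(re.search(regexTimeStamp, "2017-03-06 16:53") is None)
```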

Since the timestamp values only occur next to a file name on the webpage, the timestamps should appear in the same order as the file names. I therefore create two lists: one of file names during the first pass through the file and one of timestamps during the second pass. I then print the file names and timestamps together for each file, so the output will be as in the example below:

$ ~/Documents/getTimeStamps.py CRQ000000884164 CRQ000000884164-Documents.html
CRQ000000884164_PDSO.xlsm 3/6/2017 4:53 PM
CRQ000000884164_PDSO.pdf 3/6/2017 4:52 PM
CRQ000000884164_DCS.pdf 6/12/2017 9:29 AM
CRQ 884164_SDP.xlsm 6/12/2017 9:30 AM
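
The script stops at printing the pairs. One possible next step toward the original goal of making the downloaded files carry the website's timestamps is sketched below. It is not part of the script above and rests on several assumptions: the function names to_epoch and apply_timestamps are hypothetical, the timestamps are taken to be in the local time zone, and the %p directive is assumed to recognize AM and PM, which depends on the locale. The demonstration creates an empty file in a temporary directory rather than touching real downloads.

```python
import os
import tempfile
import time
from datetime import datetime

def to_epoch(timestamp):
    """Convert a string such as '3/6/2017 4:53 PM' to seconds since the epoch.

    The string is interpreted as local time, which is an assumption.
    """
    parsed = datetime.strptime(timestamp, "%m/%d/%Y %I:%M %p")
    return time.mktime(parsed.timetuple())

def apply_timestamps(directory, filenames, timestamps):
    """Set access and modification times on files found in directory.

    Returns the names that were actually updated; missing files are skipped.
    """
    updated = []
    for name, stamp in zip(filenames, timestamps):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            seconds = to_epoch(stamp)
            os.utime(path, (seconds, seconds))
            updated.append(name)
    return updated

# Demonstration on an empty file in a temporary directory
demo_dir = tempfile.mkdtemp()
demo_name = "CRQ000000884164_PDSO.xlsm"
open(os.path.join(demo_dir, demo_name), "w").close()
print(apply_timestamps(demo_dir, [demo_name], ["3/6/2017 4:53 PM"]))
```

With the lists built by the script, the call would be apply_timestamps(download_dir, filenameList, timestampList), where download_dir is whatever directory holds the downloaded files.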