I need to review and approve firewall rule requests. I can get a list of requests by category: those pending approval, those requesting a modification to existing rules, those pending removal because they've expired (all rules must be reviewed at least yearly), those on hold, those in a "clarification required" state because of a question about the rules being requested, those approved for implementation but not yet implemented, and those awaiting removal. But the page that displays the requests doesn't give a count for each category, so I wrote a Python script that reads a downloaded copy of that web page, parses the HTML, and reports the number of requests in each category.
The script is named count_queued.py. The name of the HTML file that was downloaded should be provided on the command line. E.g.:
$ ./count_queued.py ~/Documents/Work/queued/Request.html
Request Status

Pending Approval: 69
Modified: 36
Pending Removal 43
On Hold: 0
Clarification Required: 28
Waiting Implementation: 26
Waiting Removal: 12

Total requiring review: 176
$
If no input file is provided on the command line, an error message will be displayed.
$ ./count_queued.py
Error - missing input file name! Usage ./count_queued.py filename
The script code is shown below:
#!/usr/bin/python

# Version: 1.1
# Last modified: 2017-04-04
# Purpose: Count the number of requests in the various queues, e.g.,
# modified, awaiting implementation, etc.

import re, sys

# Name of the section (queue) currently being read; empty until the
# first section heading is seen.
section=""

# One counter per queue, keyed by the id used in the HTML.
sections={}
sections["Pending-Approval"] = 0
sections["Modified"] = 0
sections["Pending-Removal"] = 0
sections["On Hold"] = 0
sections["Clarification-Required"] = 0
sections["Waiting-Implementation"] = 0
sections["Waiting-Removal"] = 0

# The input file name must be given as the first argument.
try:
    sys.argv[1]
except IndexError:
    print "Error - missing input file name! Usage ./count_queued.py filename"
    sys.exit(1)
else:
    requestfile = sys.argv[1]

with open(requestfile, "r") as f:
    # Sections are begun by a line like:
    # <div class="standardbold" id="Pending-Approval">Requests Pending Approval:</div>
    for line in f:
        if "standardbold" in line:
            section = re.search('standardbold" id="(.+?)">', line).group(1)
        else:
            # Compare strings with !=, not "is not", which tests identity.
            if section != "" and "a href=\"/Request/" in line:
                sections[section]+=1

print "Request Status\n"
print "Pending Approval: ", sections["Pending-Approval"]
print "Modified: ", sections["Modified"]
print "Pending Removal ", sections["Pending-Removal"]
print "On Hold: ", sections["On Hold"]
print "Clarification Required:", sections["Clarification-Required"]
print "Waiting Implementation:", sections["Waiting-Implementation"]
print "Waiting Removal: ", sections["Waiting-Removal"]

total = sections["Pending-Approval"] + sections["Modified"] + sections["Pending-Removal"] + sections["Clarification-Required"]
print "\nTotal requiring review:", total
I use a Python dictionary to track the number of entries in each category. The dictionary is named sections and is established with sections={}. I used a dictionary since it can be indexed by keys that are strings, such as "Pending-Approval", "Modified", etc. I assign all dictionary entries a value of zero initially.
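As a small illustration of the dictionary described above, the sketch below builds the same zero-initialized counters from a list of names and then increments one of them. It uses Python 3 syntax, whereas the script itself is written for Python 2, and the queue_names list is a name introduced here only for the example.

```python
# Queue names used as dictionary keys (the same strings as in the script).
queue_names = [
    "Pending-Approval", "Modified", "Pending-Removal", "On Hold",
    "Clarification-Required", "Waiting-Implementation", "Waiting-Removal",
]

# Start every counter at zero.
sections = {name: 0 for name in queue_names}

# A string key indexes the dictionary directly.
sections["Modified"] += 1
print(sections["Modified"])
print(sections["On Hold"])
```

Writing the assignments out one per line, as the script does, gives the same result; the list form only saves some typing.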
I import the sys module to check whether an argument has been passed to the script on the command line. There should be one argument, the name of the file containing the contents of the web page I downloaded. The location and name of the input file, which is the argument sys.argv[1] provided on the command line, is stored in the variable named requestfile. The statement with open(requestfile, "r") as f: opens the input file for reading, and the loop for line in f: reads the file line by line.
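The argument check and the line-by-line read can also be tried in isolation. The following is a minimal Python 3 sketch, not the script's own code: the helper names input_file_from_args and count_lines are hypothetical, and the argument list is passed in explicitly so the example runs without a real command line (in the script that list would be sys.argv).

```python
def input_file_from_args(argv):
    """Return the first command-line argument, or None if it is missing."""
    if len(argv) < 2:
        return None
    return argv[1]


def count_lines(path):
    """Open the file for reading and walk through it one line at a time."""
    count = 0
    with open(path, "r") as f:
        for _line in f:
            count += 1
    return count


# No file name supplied: the caller should print the usage message and exit.
print(input_file_from_args(["./count_queued.py"]))

# File name supplied: it is the second element of the list.
print(input_file_from_args(["./count_queued.py", "Request.html"]))
```

Testing the length of the argument list is an alternative to the try/except IndexError approach used in the script; both detect the same missing argument.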
I import the re module because it can be used to search for a substring within a string. "RE" stands for "regular expression." In this case, I need to search for specific text on lines in the file the script reads. I can use re.search('searchString', string_to_search).group(1) to find the text I'm looking for on each line. Here I set the variable section to the result of the search re.search('standardbold" id="(.+?)">', line).group(1), since the text I'm searching for occurs after standardbold" id=" on any line that marks the beginning of a section, where a section will be Pending-Removal, Modified, etc. I'm interested in the value of id on the line; one of those section names will follow id= within double quotes, e.g., id="Modified". The section name is followed by ">. For example, a line might contain <div class="standardbold" id="Modified">, but I only want to extract the section, i.e., the queue name, that appears within the double quotes. I don't know specifically how many characters will be in the name, e.g., it could be "Modified" or "Waiting-Removal". I can represent that substring with .+?, which is a regular expression. The period represents any character and the plus sign indicates that there can be one or more occurrences of any character. The question mark that follows the plus sign makes the match non-greedy, so it stops at the first "> rather than the last one on the line. The plus sign, like the question mark in regular expressions, is a "quantifier", which has the following meaning:
A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are the question mark ?, the asterisk * (derived from the Kleene star), and the plus sign + (Kleene plus).
? The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".
* The asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.
+ The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".
{n} The preceding item is matched exactly n times.
{min,} The preceding item is matched min or more times.
{min,max} The preceding item is matched at least min times, but not more than max times.
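The quantifier examples quoted above can be checked directly with Python's re module. This is a small Python 3 sketch; the matches helper is a name made up for the example, and it uses re.fullmatch so that the whole string has to match the pattern.

```python
import re


def matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# ? : zero or one "u"
print(matches(r"colou?r", "color"), matches(r"colou?r", "colour"))

# * : zero or more "b"
print(matches(r"ab*c", "ac"), matches(r"ab*c", "abbbc"))

# + : one or more "b", so "ac" does not match
print(matches(r"ab+c", "ac"), matches(r"ab+c", "abbc"))

# {n} and {min,} : exactly two "b", and at least two "b"
print(matches(r"ab{2}c", "abbc"), matches(r"ab{2,}c", "abc"))
```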
Putting parentheses around the .+? assigns whatever text is found to a "group". In this case there is only one set of parentheses, so there is only one group.
(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
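To see what the group holds, the sketch below runs the search on the example line given earlier and prints both the whole match and the first group. It is written for Python 3. The second line, stored in the variable tricky, is a made-up example with more than one "> on it, included only to show how a greedy .+ differs from the non-greedy .+? in that situation.

```python
import re

line = '<div class="standardbold" id="Modified">'

match = re.search('standardbold" id="(.+?)">', line)
whole = match.group(0)    # everything the pattern matched
section = match.group(1)  # only the text inside the parentheses
print(whole)
print(section)

# Hypothetical line containing a second "> further along.
tricky = '<div class="standardbold" id="Modified"><a href="#top">top</a></div>'

greedy = re.search('id="(.+)">', tricky).group(1)  # runs to the last ">
lazy = re.search('id="(.+?)">', tricky).group(1)   # stops at the first ">
print(greedy)
print(lazy)
```

On a line with a single "> both forms return the same text, so the difference only matters if the page ever puts more markup on the heading line.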
So I assign the characters found to the variable named section. If a line is not a section identifier line, but a section identifier line has already been found and the line contains the text that identifies a request entry, i.e., a href="/Request/, then one is added to the dictionary entry for the current section.
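The counting logic just described can be tried on a few lines without downloading the page. The following Python 3 sketch wraps it in a function named count_requests, which is a hypothetical name, and the sample lines are invented to resemble the page, so the request numbers and link text are assumptions. Unlike the script, it creates a counter when a heading is first seen instead of initializing all of them beforehand.

```python
import re


def count_requests(lines):
    """Count request links found under each section heading."""
    counts = {}
    section = ""
    for line in lines:
        if "standardbold" in line:
            # A heading line: remember which section the next links belong to.
            section = re.search('standardbold" id="(.+?)">', line).group(1)
            counts.setdefault(section, 0)
        elif section != "" and 'a href="/Request/' in line:
            # A request line inside a known section.
            counts[section] += 1
    return counts


# Invented sample input in the same shape as the downloaded page.
sample = [
    '<a href="/Request/0999">ignored, no section seen yet</a>',
    '<div class="standardbold" id="Pending-Approval">Requests Pending Approval:</div>',
    '<a href="/Request/1001">1001</a>',
    '<a href="/Request/1002">1002</a>',
    '<div class="standardbold" id="Modified">Requests Modified:</div>',
    '<a href="/Request/1003">1003</a>',
]

print(count_requests(sample))
```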
When all lines of the input file have been read, I print the contents of each dictionary entry and then print a total of all entries for which I may need to take action, so they can be implemented or the rules associated with the request removed from relevant firewalls.
I could also have used the following code, instead, to print the entries in the dictionary, i.e., the number of requests in each category.
print "Request Status\n"
for name, value in sections.items():
    print name + ":", value
That code assigns the name of each key to the variable name and the value for that key to the variable value, looping through each entry to print its name and value. But that doesn't allow me to control the order in which entries are printed.
Request Status

On Hold: 0
Pending-Approval: 69
Modified: 36
Pending-Removal: 43
Waiting-Implementation: 26
Waiting-Removal: 12
Clarification-Required: 28

Total requiring review: 176
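One way to keep the loop and still control the order is to loop over a separate list of keys rather than over the dictionary itself. The Python 3 sketch below illustrates that idea using the counts from the sample output above; display_order, review_queues and report_lines are names introduced for the example and are not part of the script. In Python 3.7 and later a dictionary also keeps its insertion order, but that was not the case for the Python 2 dictionary used here.

```python
# Counts taken from the sample output shown earlier.
sections = {
    "Pending-Approval": 69, "Modified": 36, "Pending-Removal": 43,
    "On Hold": 0, "Clarification-Required": 28,
    "Waiting-Implementation": 26, "Waiting-Removal": 12,
}

# Hypothetical list that fixes the print order independently of the dictionary.
display_order = [
    "Pending-Approval", "Modified", "Pending-Removal", "On Hold",
    "Clarification-Required", "Waiting-Implementation", "Waiting-Removal",
]


def report_lines(counts, order):
    """Return one "name: value" line per key, in the order given."""
    return ["%s: %d" % (name, counts[name]) for name in order]


print("Request Status\n")
for line in report_lines(sections, display_order):
    print(line)

# The four queues the script adds up for its review total.
review_queues = ["Pending-Approval", "Modified", "Pending-Removal",
                 "Clarification-Required"]
total = sum(sections[name] for name in review_queues)
print("\nTotal requiring review:", total)
```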