Counting queued items with a Python script

I need to review and approve firewall rule requests. The web page I use lists the requests by category: those pending approval, those requesting a modification to existing rules, those pending removal because they've expired (all rules must be reviewed at least once a year), those on hold, those in a "clarification required" state because of a question about the rules being requested, those approved but not yet implemented, and those awaiting removal. The page doesn't show a count for each category, though, so I wrote a Python script that reads a downloaded copy of the page, parses its HTML, and reports the number of requests in each category.

The script is named count_queued.py. The name of the HTML file that was downloaded should be provided on the command line. E.g.:

$ ./count_queued.py ~/Documents/Work/queued/Request.html
Request Status

Pending Approval:       69
Modified:               36
Pending Removal         43
On Hold:                0
Clarification Required: 28
Waiting Implementation: 26
Waiting Removal:        12

Total requiring review: 176
$

If no input file is provided on the command line, an error message will be displayed.

$ ./count_queued.py
Error - missing input file name! Usage ./count_queued.py filename

The script code is shown below:

#!/usr/bin/python

# Version: 1.1
# Last modified: 2017-04-04
# Purpose: Count the number of requests in the various queues, e.g.,
# modified, awaiting implementation, etc.

import re, sys

section=""
sections={}
sections["Pending-Approval"] = 0
sections["Modified"] = 0
sections["Pending-Removal"] = 0
sections["On Hold"] = 0
sections["Clarification-Required"] = 0
sections["Waiting-Implementation"] = 0
sections["Waiting-Removal"] = 0

try:
   sys.argv[1]
except IndexError:
   print "Error - missing input file name! Usage ./count_queued.py filename"
   sys.exit(1)
else:
   requestfile = sys.argv[1]

with open(requestfile, "r") as f:

# Sections are begun by a line like:
# <div class="standardbold" id="Pending-Approval">Requests Pending Approval:</div>

   for line in f:
      if "standardbold" in line:
         # Capture the id value, e.g., Pending-Approval, as the current section
         section = re.search('standardbold" id="(.+?)">', line).group(1)
      else:
         # Count request links only after a section heading has been seen
         if section != "" and "a href=\"/Request/" in line:
            sections[section]+=1

print "Request Status\n"
print "Pending Approval:      ", sections["Pending-Approval"]
print "Modified:              ", sections["Modified"]
print "Pending Removal        ", sections["Pending-Removal"]
print "On Hold:               ", sections["On Hold"]
print "Clarification Required:", sections["Clarification-Required"]
print "Waiting Implementation:", sections["Waiting-Implementation"]
print "Waiting Removal:       ", sections["Waiting-Removal"]

total = sections["Pending-Approval"] + sections["Modified"] + sections["Pending-Removal"] + sections["Clarification-Required"]
print "\nTotal requiring review:", total

I use a Python dictionary to track the number of entries in each category. The dictionary is named sections and is created with sections={}. I used a dictionary because its entries can be indexed by string keys, such as "Pending-Approval", "Modified", etc. I initially assign every dictionary entry a value of zero.
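The zero-initialization described above can also be written more compactly. The following is a minimal sketch, not the script's own code: it assumes the same seven key names, and the names queue_names and counts are made up for illustration.

```python
# Key names copied from the script; queue_names and counts are
# illustrative names that do not appear in the original script.
queue_names = [
    "Pending-Approval",
    "Modified",
    "Pending-Removal",
    "On Hold",
    "Clarification-Required",
    "Waiting-Implementation",
    "Waiting-Removal",
]

# dict.fromkeys builds a dictionary in which every key has the same value.
counts = dict.fromkeys(queue_names, 0)

# Entries are then incremented by key, as in the script.
counts["Modified"] += 1
print(counts["Modified"])
print(counts["On Hold"])
```

Sharing one starting value across all keys is safe here because integers are immutable; the same shortcut would not be appropriate for a mutable starting value such as a list.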

I import the sys module to use for checking whether an argument has been passed to the script on the command line. There should be one argument, the name of the file containing the contents of the web page I downloaded. The location and name of the input file, which is the argument sys.argv[1] provided on the command line, is stored in the variable named requestfile. The line with open(requestfile, "r") as f: opens the input file for reading and for line in f: reads in the file line by line.
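The argument check could also be done by testing the length of the argument list rather than catching IndexError. The sketch below is only an illustration of that alternative: the function names get_request_file and count_lines are hypothetical, the argument lists are passed in explicitly instead of being read from sys.argv, and a temporary file stands in for the downloaded page.

```python
import os
import tempfile

def get_request_file(argv):
    """Return the first command-line argument, or None if it is missing."""
    if len(argv) < 2:
        return None
    return argv[1]

def count_lines(path):
    """Open the file for reading and walk through it line by line."""
    total = 0
    with open(path, "r") as f:
        for _line in f:
            total += 1
    return total

# argv[0] is always the script name, so a lone entry means "no file given".
print(get_request_file(["./count_queued.py"]))
print(get_request_file(["./count_queued.py", "Request.html"]))

# A throwaway four-line file stands in for the downloaded web page.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".html", delete=False)
tmp.write("<html>\n<body>\n</body>\n</html>\n")
tmp.close()
line_total = count_lines(tmp.name)
os.remove(tmp.name)
print(line_total)
```

In a real script the list passed to get_request_file would be sys.argv, and a None result would lead to the error message and sys.exit(1).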

I import the re module because it can be used to search for a substring within a string; "re" stands for "regular expression." In this case, I need to find specific text on lines in the file the script reads. I can use re.search('searchString', string_to_search).group(1) to find the text I'm looking for on each line. I set the variable section to the result of re.search('standardbold" id="(.+?)">', line).group(1), since the text I want occurs after standardbold" id=" on any line that marks the beginning of a section, where a section will be Pending-Removal, Modified, etc.

I'm interested in the value of id on the line; one of the section names will appear after id= within the double quotes that follow, e.g., id="Modified". The section name is followed by ">. A line might contain <div class="standardbold" id="Modified">, but I only want to extract the section, i.e., the queue name, that appears within the double quotes. I don't know specifically how many characters will be in the name, e.g., it could be "Modified" or "Waiting-Removal", so I represent that substring with the regular expression .+?. The period represents any character (other than a newline), and the plus sign indicates one or more occurrences of it. The question mark that follows the plus sign makes the match non-greedy, so it stops at the first "> on the line rather than the last one. The plus sign, like a question mark that directly follows a character, is a "quantifier", which has the following meaning:

A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are the question mark ?, the asterisk * (derived from the Kleene star), and the plus sign + (Kleene plus).

? The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".
* The asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.
+ The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".
{n} The preceding item is matched exactly n times.
{min,} The preceding item is matched min or more times.
{min,max} The preceding item is matched at least min times, but not more than max times.
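The quantifiers listed above can be tried out directly with the re module. This is a small sketch for experimenting; the helper whole_match is a hypothetical name, and the {n} test strings are my own rather than taken from the quoted text.

```python
import re

def whole_match(pattern, text):
    """Return True when pattern matches all of text, not just a prefix."""
    return re.match("(?:" + pattern + r")\Z", text) is not None

# (pattern, text, whether the whole text should match)
checks = [
    ("colou?r", "color", True),    # ? allows zero occurrences of "u"
    ("colou?r", "colour", True),   # ? allows one occurrence of "u"
    ("ab*c", "ac", True),          # * allows zero occurrences of "b"
    ("ab*c", "abbc", True),
    ("ab+c", "abc", True),
    ("ab+c", "ac", False),         # + needs at least one "b"
    ("a{2}", "aa", True),
    ("a{2}", "aaa", False),        # exactly two
    ("a{2,}", "aaaa", True),       # two or more
    ("a{2,}", "a", False),
    ("a{2,3}", "aaa", True),       # between two and three
    ("a{2,3}", "aaaa", False),
]

for pattern, text, expected in checks:
    print("{0!r} against {1!r}: {2}".format(pattern, text, whole_match(pattern, text)))
```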

Putting parentheses around the .+? assigns whatever text is matched to a "group". In this case there is only one set of parentheses, and so only one group, which is what group(1) returns.

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
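To see what the group holds, the search can be run against a single line. The first sample line below is the one shown in the script's comment; the second line, named tricky, is a made-up example that is not from the real page and is there only to show how a greedy .+ and a non-greedy .+? differ when a line contains more than one ">.

```python
import re

# Sample section line in the format shown in the script's comment.
line = '<div class="standardbold" id="Pending-Approval">Requests Pending Approval:</div>'

match = re.search(r'standardbold" id="(.+?)">', line)
print(match.group(0))  # everything the pattern matched
print(match.group(1))  # only the text captured by the parentheses

# Hypothetical line with a second "> further along.
tricky = '<div class="standardbold" id="Modified"><span title="x">Modified:</span></div>'

greedy = re.search(r'standardbold" id="(.+)">', tricky).group(1)
lazy = re.search(r'standardbold" id="(.+?)">', tricky).group(1)
print(greedy)  # runs on to the last "> on the line
print(lazy)    # stops at the first ">
```

On the first sample line both forms give the same answer, because that line contains only one ">.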

So I assign the characters that were matched to the variable named section. If a line isn't a section identifier line, the script checks two things: that a section identifier line has already been found, and that the line contains the text that marks a request number, i.e., a href="/Request/. If both are true, one is added to the dictionary entry for the current section.
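The counting logic can be exercised without a downloaded page by feeding it a list of lines. The sketch below is a variation on the script, not a copy of it: the function name count_requests is hypothetical, it creates dictionary entries as sections are found instead of predefining them, it skips a "standardbold" line that has no id instead of raising an error, and the request link lines are invented samples, since I only know that real ones contain a href="/Request/.

```python
import re

def count_requests(lines):
    """Count request links under each section heading."""
    counts = {}
    section = ""
    for line in lines:
        if "standardbold" in line:
            found = re.search(r'standardbold" id="(.+?)">', line)
            if found:
                section = found.group(1)
                counts.setdefault(section, 0)
        elif section != "" and 'a href="/Request/' in line:
            counts[section] += 1
    return counts

# Invented sample lines; only the section heading format comes from the script.
sample = [
    '<a href="/Request/100">100</a>',  # before any heading, so not counted
    '<div class="standardbold" id="Pending-Approval">Requests Pending Approval:</div>',
    '<a href="/Request/101">101</a>',
    '<a href="/Request/102">102</a>',
    '<div class="standardbold" id="Modified">Modified Requests:</div>',
    '<a href="/Request/103">103</a>',
    '<p>unrelated line</p>',
]

result = count_requests(sample)
print(result["Pending-Approval"])
print(result["Modified"])
```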

When all lines of the input file have been read, I print the contents of each dictionary entry and then print a total for the categories in which I may need to take action (Pending Approval, Modified, Pending Removal, and Clarification Required), so that the requests can be implemented or the rules associated with them removed from the relevant firewalls.

I could also have used the following code, instead, to print the entries in the dictionary, i.e., the number of requests in each category.

print "Request Status\n"
for name, value in sections.items():
    print name + ":", value

That code loops through the dictionary, assigning each key to the variable name and the value for that key to the variable value, and prints both. But that doesn't allow me to control the order in which entries are printed, as the following output shows:

Request Status

On Hold: 0
Pending-Approval: 69
Modified: 36
Pending-Removal: 43
Waiting-Implementation: 26
Waiting-Removal: 12
Clarification-Required: 28

Total requiring review: 176
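One way to keep the loop and still control the order is to loop over a separate list of key names rather than over the dictionary itself. The following is a sketch under that assumption: the names display_order and format_report are made up for illustration, and the counts are copied from the sample output above.

```python
# Counts copied from the sample output; the key order here is irrelevant.
sections = {
    "On Hold": 0,
    "Pending-Approval": 69,
    "Modified": 36,
    "Pending-Removal": 43,
    "Waiting-Implementation": 26,
    "Waiting-Removal": 12,
    "Clarification-Required": 28,
}

# display_order is a hypothetical list that fixes the printing order.
display_order = [
    "Pending-Approval",
    "Modified",
    "Pending-Removal",
    "On Hold",
    "Clarification-Required",
    "Waiting-Implementation",
    "Waiting-Removal",
]

def format_report(counts, order):
    """Build the report text, one line per key, in the order given."""
    lines = ["Request Status", ""]
    for name in order:
        lines.append("{0}: {1}".format(name, counts[name]))
    return "\n".join(lines)

report = format_report(sections, display_order)
print(report)
```

collections.OrderedDict would be another option, since it remembers the order in which keys were added.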