I need to track work requests that are associated with particular projects. The work requests are submitted through a website, but I want to be able to run queries on the data that aren't available to me from the website. So I decided to add a projects table to an existing SQLite database I have on my MacBook Pro laptop. I didn't want to manually enter a list of about 200 projects into the table, however, so I copied the projects list from the website into a text file. The webpage that lists all of the projects includes projects that are no longer valid and I want to include those in my database table, but note that they are longer valid. When I copy the data from the web page, I have lines like the following ones:
450 SN/GN Tech Edit Delete ACE Edit Delete ADO Edit Undelete AGO Edit Delete AGS Edit Delete AIM Edit Delete Artemis Edit Undelete ASF Edit Delete ATSC Edit Undelete AXAF Edit Undelete BATSE/Gro Edit Undelete CANDOS Edit Undelete CARA Edit Delete CD Manager Edit Delete CMF Edit Undelete CMOC Edit Delete CMS Edit Undelete
The entries that are no longer valid have "Undelete" at the end of the line indicating the projects were deleted, but they can be undeleted.
I created a table in the database with the following columns all of which contain text data:
Name | Description | Added | Removed | Deleted | Notes |
---|
The Name column I'll fill from the names at the beginning of each line. I've declared it as a "unique" field in the database, since a project name should never occur twice in the table. I'll add descriptions later for some projects. The Added and Removed fields will hold dates for when I add new projects or remove existing ones. The Deleted column will hold a "Y" or "N" based on whether "Delete" or "Undelete" appears on a line. If "Undelete" appears on a line the project has been deleted and I'll put a "Y" in the Deleted column, but will otherwise place an "N" there.
To read the data from the file and import it into the table in the database, I used the following code:
#!/usr/bin/python import os, sqlite3, sys try: sys.argv[1] except IndexError: print "Error - missing input file name! Usage ./add_projects.py infile" sys.exit(1) else: projectsFile = sys.argv[1] # Check on whether file exists and is accessible if not os.path.isfile(projectsFile): print "Error - input file", projectsFile, "is not accessible!" sys.exit(1) else: f = open(projectsFile, "r") conn = sqlite3.connect('/Users/jasmith1/Documents/www/SGRS/SGRS.db') cursor = conn.cursor() count = 0 for line in f: if "Undelete" in line: Deleted = "Y" else: Deleted = "N" projectName = line.split("Edit")[0] projectName = projectName.rstrip() try: cursor.execute('Insert INTO Projects (Name, Deleted) VALUES (?, ?)', (projectName, Deleted)) except: print "A duplicate record was found for project ", projectName count += 1 conn.commit() print count, "records processed" f.close()
The import os, sqlite3, sys
line imports the os module so that
I can check on the existence of the input file, the sqlite3 module provides
functions for accessing
SQLite databases, and the sys module is for sys.exit
.
The script should be called with the name and location of the input file that is the text file containing the list of projects. If that isn't provided, the script displays the error message below:
$ ./add-projects.py Error - missing input file name! Usage ./add_projects.py infile $
The line below opens a connection to the specified SQLite database file.
conn = sqlite3.connect('/Users/jasmith1/Documents/www/SGRS/SGRS.db')
The next line cursor = conn.cursor()
establishes a
"cursor" object for executing all of the
SQL statements.
The for line in f:
statement will read in each line of the
input file line by line. Since each line of the file will have a project
name at the beginning of the line that may or may not have spaces within
the name, but will have "Edit" after the project name, I can split the
lines using the split
method with the statement below:
projectName = line.split("Edit")[0]
The split method splits a string according to the delimiter that is
specified. The delimiter will be the space character if it is not specified.
I've specified the delimiter to be the word Edit
in this
case. Split returns list of substrings.
The statement splits the line at the point it encounters the word "Edit".
The first part of the line is in [0]
and the rest of the
line would be in [1]
. I'm only interested in the part before
"Edit", i.e., in [0]
, so I set the
variable
projectName to be that value. But, since there will be spaces between the
project name and the word "Edit", the variable will contain those spaces as
well, so I want to remove them which I can do with the
rstrip()
method. By default, this method returns a copy of the string in which all
whitespace
characters have been stripped from the end of the string. The syntax
for the method is str.rstrip([chars])
where chars are
the characters to be trimmed from the end of the line; if no characters
are specified, whitespace characters will be removed from the end of the
line.
The line below will insert records into the table. I'm only inserting
values into the Name
and Deleted
columns. The
contents of the projectName
and Deleted
variables
will be the values represented by question marks in the line below:
cursor.execute('Insert INTO Projects (Name, Deleted) VALUES (?, ?)', (projectName, Deleted))
If there are duplicate lines in the text file used for input, an error message such as the one below will be printed, since the Name field must be unique in the table:
$ ./add-projects.py projects.txt Traceback (most recent call last): File "./add-projects.py", line 31, in <module> cursor.execute('Insert INTO Projects (Name, Deleted) VALUES (?, ?)', (projectName, Deleted)) sqlite3.IntegrityError: UNIQUE constraint failed: Projects.Name $
I could add a print statement after the line.split line that assigns a value to the projectName variable to see when the error occurs.
projectName = line.split("Edit")[0] print projectName
Alternatively, I've chosen to add try
and except
statements to handle such cases so that I don't see every project printed
line by line until the script comes to the duplicated one and aborts, since
I want the script to ignore duplicates and keep on going processing each line
until the end of the input file.
On Linux and OS X/macOS systems the uniq
command can be used to remove duplicate adjacent lines from a file.
E.g., uniq inputfile outputfile
where inputfile is
the file to be checked and outputfile is the file in which to
store only the unique lines from the input file. The input file should
be sorted first which can be done by the
sort command.
E.g., sort inputfile > outputfile
. So I could also process the
input file outside of the script first to remove duplicate entries, which
do occur in the list I get from the website. But I decided to deal with
any duplicates in the Python script.
I use the count
variable to keep track of the number of lines
in the input file processed, so count += 1
simply increments
the count variable by one each time a new line from the input file is
processed.
If didn't include a commit statement, the script could run without an
error, but I wouldn't actually see any records added to the database. It
would still be empty when the script finished. The conn.commit()
"commits" the changes to the databse. I then print the number of input lines
processed and close the input file.
Related articles:
References: