Importing data from a text file into an SQLite database with Python

I need to track work requests that are associated with particular projects. The work requests are submitted through a website, but I want to be able to run queries on the data that aren't available to me from the website. So I decided to add a projects table to an existing SQLite database I have on my MacBook Pro laptop. I didn't want to manually enter a list of about 200 projects into the table, however, so I copied the projects list from the website into a text file. The webpage that lists all of the projects includes projects that are no longer valid and I want to include those in my database table, but note that they are longer valid. When I copy the data from the web page, I have lines like the following ones:

450 SN/GN Tech  Edit Delete
ACE     Edit Delete
ADO     Edit Undelete
AGO     Edit Delete
AGS     Edit Delete
AIM     Edit Delete
Artemis         Edit Undelete
ASF     Edit Delete
ATSC    Edit Undelete
AXAF    Edit Undelete
BATSE/Gro       Edit Undelete
CANDOS  Edit Undelete
CARA    Edit Delete
CD Manager      Edit Delete
CMF     Edit Undelete
CMOC    Edit Delete
CMS     Edit Undelete

The entries that are no longer valid have "Undelete" at the end of the line indicating the projects were deleted, but they can be undeleted.

I created a table in the database with the following columns all of which contain text data:

The Name column I'll fill from the names at the beginning of each line. I've declared it as a "unique" field in the database, since a project name should never occur twice in the table. I'll add descriptions later for some projects. The Added and Removed fields will hold dates for when I add new projects or remove existing ones. The Deleted column will hold a "Y" or "N" based on whether "Delete" or "Undelete" appears on a line. If "Undelete" appears on a line the project has been deleted and I'll put a "Y" in the Deleted column, but will otherwise place an "N" there.

To read the data from the file and import it into the table in the database, I used the following code:

#!/usr/bin/python

import os, sqlite3, sys

try:
   sys.argv[1]
except IndexError:
   print "Error - missing input file name! Usage ./add_projects.py infile"
   sys.exit(1)
else:
   projectsFile = sys.argv[1]

# Check on whether file exists and is accessible
if not os.path.isfile(projectsFile):
   print "Error - input file", projectsFile, "is not accessible!" 
   sys.exit(1)
else:   
    f = open(projectsFile, "r")

conn = sqlite3.connect('/Users/jasmith1/Documents/www/SGRS/SGRS.db')
cursor = conn.cursor()

count = 0
for line in f:
   if "Undelete" in line:
      Deleted = "Y"
   else:
       Deleted = "N"
   projectName = line.split("Edit")[0]
   projectName = projectName.rstrip()
   try: 
      cursor.execute('Insert INTO Projects (Name, Deleted) VALUES (?, ?)', (projectName, Deleted))
   except:
      print "A duplicate record was found for project ", projectName
   count += 1 
conn.commit()
print count, "records processed"  
f.close()

The import os, sqlite3, sys line imports the os module so that I can check on the existence of the input file, the sqlite3 module provides functions for accessing SQLite databases, and the sys module is for sys.exit.

The script should be called with the name and location of the input file that is the text file containing the list of projects. If that isn't provided, the script displays the error message below:

$ ./add-projects.py
Error - missing input file name! Usage ./add_projects.py infile
$

conn = sqlite3.connect('/Users/jasmith1/Documents/www/SGRS/SGRS.db')

The next line cursor = conn.cursor() establishes a "cursor" object for executing all of the SQL statements. The for line in f: statement will read in each line of the input file line by line. Since each line of the file will have a project name at the beginning of the line that may or may not have spaces within the name, but will have "Edit" after the project name, I can split the lines using the split method with the statement below:

projectName = line.split("Edit")[0]

The split method splits a string according to the delimiter that is specified. The delimiter will be the space character if it is not specified. I've specified the delimiter to be the word Edit in this case. Split returns list of substrings.

The statement splits the line at the point it encounters the word "Edit". The first part of the line is in [0] and the rest of the line would be in [1]. I'm only interested in the part before "Edit", i.e., in [0], so I set the variable projectName to be that value. But, since there will be spaces between the project name and the word "Edit", the variable will contain those spaces as well, so I want to remove them which I can do with the rstrip() method. By default, this method returns a copy of the string in which all whitespace characters have been stripped from the end of the string. The syntax for the method is str.rstrip([chars]) where chars are the characters to be trimmed from the end of the line; if no characters are specified, whitespace characters will be removed from the end of the line.

The line below will insert records into the table. I'm only inserting values into the Name and Deleted columns. The contents of the projectName and Deleted variables will be the values represented by question marks in the line below:

cursor.execute('Insert INTO Projects (Name, Deleted) VALUES (?, ?)', (projectName, Deleted))

If there are duplicate lines in the text file used for input, an error message such as the one below will be printed, since the Name field must be unique in the table:

$ ./add-projects.py projects.txt
Traceback (most recent call last):
  File "./add-projects.py", line 31, in <module>
    cursor.execute('Insert INTO Projects (Name, Deleted) VALUES (?, ?)', (projectName, Deleted))
sqlite3.IntegrityError: UNIQUE constraint failed: Projects.Name
$

I could add a print statement after the line.split line that assigns a value to the projectName variable to see when the error occurs.

projectName = line.split("Edit")[0]
print projectName

Alternatively, I've chosen to add try and except statements to handle such cases so that I don't see every project printed line by line until the script comes to the duplicated one and aborts, since I want the script to ignore duplicates and keep on going processing each line until the end of the input file.

On Linux and OS X/macOS systems the uniq command can be used to remove duplicate adjacent lines from a file. E.g., uniq inputfile outputfile where inputfile is the file to be checked and outputfile is the file in which to store only the unique lines from the input file. The input file should be sorted first which can be done by the sort command. E.g., sort inputfile > outputfile. So I could also process the input file outside of the script first to remove duplicate entries, which do occur in the list I get from the website. But I decided to deal with any duplicates in the Python script.

I use the count variable to keep track of the number of lines in the input file processed, so count += 1 simply increments the count variable by one each time a new line from the input file is processed.

If didn't include a commit statement, the script could run without an error, but I wouldn't actually see any records added to the database. It would still be empty when the script finished. The conn.commit() "commits" the changes to the databse. I then print the number of input lines processed and close the input file.