With spreadsheet applications you normally have rows sequenced by numbers and columns sequenced by alphabetical characters. With xlrd, though, both rows and columns are specified by numbers with the first row starting at 0 and the first column starting at 0 as well.
There are .xlsm workbook files that I normally work with that have a worksheet within them named "General Information". From each wokkbook, I want to extract the contents of cell D5, which contains a Change Request (CRQ) number, the contents of D7, which contains a summary of the work to be done, and J3, which contains the date of the request. I can do that with the following Python script:
#!/usr/bin/python # Extract data from CRQ workbooks # Used to check if file exists import os.path from xlrd import open_workbook,cellname spreadsheet = raw_input("Enter file name: ") if os.path.isfile(spreadsheet): workbook = open_workbook(spreadsheet) worksheet = workbook.sheet_by_name('General Information') # The CRQ number cell is D5; column and row numbers start at 0, so the # row is 4 and the column is 3 row_index = 4 col_index = 3 crqnum = worksheet.cell(row_index,col_index) # The summary cell is D7; column and row numbers start at 0, so the # row is 4 and the column is 3 row_index = 6 col_index = 3 summary = worksheet.cell(row_index,col_index) # The date of request cell is J3 row_index = 2 col_index = 9 requestDate = worksheet.cell(row_index,col_index) print crqnum print summary print requestDate
When I run the script it will prompt me for the file name for the workbook I want to process at that moment and then display output like the following results:
$ ~/Documents/bin/extract-crq-data.py Enter file name: CRQ-1429793 SDP.xlsm number:1429793.0 text:u"Provide cost estimate and ATP to activate LPA's-0255, 0256 and 0257 for B-25 and the NIC." xldate:43083.0 $
The CRQ number is stoed in cell D5 as a number, while the summary information is stored as text, and the request date is stored as a date. When I print the contents of the crqnum, requestDate, and summary variables, the type of the data is displayed as well as the value, i.e., "number", "text", and "xldate". But I only want the value displayed, so I can use the following print statements instead of the ones above:
print "CRQ Number: ", crqnum.value print "Summary: ", summary.value print "Date of Request: ", requestDate.value
The output will then look like that shown below:
$ ~/Documents/bin/extract-crq-data.py Enter file name: CRQ-1429793 SDP.xlsm CRQ Number: 1429793.0 Summary: Provide cost estimate to activate LPA's-0255, 0256 and 0257 Date of Request: 43083.0 $
But that's not exactly what I want, either. The request number shouldn't
have any fractional part, so I want to discard the period and following zero
for that number and I want the date to be displayed as a calendar date not
as a number reflecting
Excel's serial date
format. Excel stores dates and times as a number representing the number
of days since 1900-Jan-0 plus a fractional part of a 24 hour day, i.e.,
ddddd.tttttt
where ddddd is the number representing the
day and tttttt is the fractional part of the day. And as noted at
Dates And Times In
Excel:
Actually, this number is one greater than the actual number of days. This is because Excel behaves as if the date 1900-Feb-29 existed. It did not. The year 1900 was not a leap year (the year 2000 is a leap year). In Excel, the day after 1900-Feb-28 is 1900-Feb-29. In reality, the day after 1900-Feb-28 was 1900-Mar-1 . This is not a "bug". Indeed, it is by design. Excel works this way because it was truly a bug in Lotus 123. When Excel was introduced, 123 has nearly the entire market for spreadsheet software. Microsoft decided to continue Lotus' bug, in order to fully compatible. Users who switched from 123 to Excel would not have to make any changes to their data. As long as all your dates later than 1900-Mar-1, this should be of no concern.
Other spreadsheet programs, such as Apache OpenOffice Calc and Google Sheets didn't carry forward the Lotus 1-2-3 bug, though as noted in Workday function.
So how do you convert a date in Excel's serial date format to a calendar date such as 01/26/2018? You can use xlrd's xldate module as mentioned in How do I read a date in Excel format in Python? and convert xldate to python datetime. So, for my case, I can modify the import line for xlrd to the following:
from xlrd import open_workbook,cellname,xldate
And for converting the request date to a more understandable calendar date, I can use the following code:
# The date of request cell is J3 row_index = 2 col_index = 9 requestDate = worksheet.cell(row_index,col_index) requestDate = requestDate.value displayDate = xldate.xldate_as_datetime(requestDate, workbook.datemode)
Then when I use print "Date of Request: ", displayDate
, I
see the following output:
$ ~/Documents/bin/extract-crq-data.py Enter file name: CRQ-1429793 SDP.xlsm CRQ Number: 1429793.0 Summary: Provide cost estimate to activate LPA's-0255, 0256 and 0257 Date of Request: 2017-12-14 00:00:00 $
That's much closer, but I don't need the time, which is displayed as
00:00:00 in this case, just the date, so I can discard the time and retain
just the date by using
strftime
from the datetime
module. So I need to import that module
by modifying the import line I had in the code to be as follows:
import datetime, os.path
Now I can print the date with the following line of code:
print "Date of Request:", displayDate.strftime('%Y-%m-%d')
The output would then be as follows:
$ ~/Documents/bin/extract-crq-data.py Enter file name: CRQ-1429793 SDP.xlsm CRQ Number: 1429793.0 Summary: Provide cost estimate to activate LPA's-0255, 0256 and 0257 Date of Request: 2017-12-14 $
If I wanted the date, instead in the format, mm/dd/yyyy, I could modify the strftime line to be as follows:
print "Date of Request:", displayDate.strftime('%m/%d/%Y')
Or, if I wanted only a 2-digit year displayed, I could use %y
,
i.e., a lower-case "y" rather than an uppercase "Y". But, since I prefer the
ISO 8601
date format of YYYY-MM-DD
, I'll maintain the first version. I
can convert the
floating-point number, 1429793.0 to an integer with
crqnum = int(crqnum.value)
, which would result in
1429793 appearing in the output rather than 1429793.0. For the new code, I
have the following:
#!/usr/bin/python # Extract data from CRQ workbooks # Used for date conversion and to check if file exists import datetime, os.path from xlrd import open_workbook,cellname,xldate spreadsheet = raw_input("Enter file name: ") if os.path.isfile(spreadsheet): workbook = open_workbook(spreadsheet) worksheet = workbook.sheet_by_name('General Information') # The CRQ number cell is D5; column and row numbers start at 0, so the # row is 4 and the column is 3 row_index = 4 col_index = 3 crqnum = worksheet.cell(row_index,col_index) crqnum = int(crqnum.value) # The summary cell is D7; column and row numbers start at 0, so the # row is 4 and the column is 3 row_index = 6 col_index = 3 summary = worksheet.cell(row_index,col_index) summary = summary.value # The date of request cell is J3 row_index = 2 col_index = 9 requestDate = worksheet.cell(row_index,col_index) requestDate = requestDate.value displayDate = xldate.xldate_as_datetime(requestDate, workbook.datemode) print "CRQ Number:", crqnum print "Summary:", summary print "Date of Request:", displayDate.strftime('%Y-%m-%d') else: print spreadsheet, " - file not found!"
Related articles: