A Microsoft Excel file with an .xlsx or .xlsm
filename extension is an
Open XML (OpenXML) zipped, XML-based file. The OpenXML format was developed by Microsoft for
spreadsheets, charts, presentations and word processing documents. If you
change the file extension to .zip by renaming the file, you can
extract the contents of the zip file as you would with any other
zip file - see Zipping
and unzipping Excel xlsx files. Excel workbooks can contain other documents
embedded within them using
Object Linking and Embedding (OLE) technology - see
Using olefile to
obtain metadata from an OLE CDF V2 file. I often need to extract an
embedded PowerPoint slide or
diagram from Excel .xlsm files, so I've been renaming the files to
zip files and unzipping them as I would any other zip file. Since
I want to automate the process and extract just specific embedded
files for further processing within a Python script, I created the
script below to extract the embedded files, which are contained in the
xl/embeddings subdirectory within the .xlsm
zip files. The script uses the
zipfile module to deal with the zip files. The
os module is used to check for the existence of the destination
directory and create it if it doesn't yet exist.
#!/usr/bin/python

import os, zipfile

dirToExtract = "xl/embeddings/"
destinationDir = "embedded"

infile = raw_input("Enter zipfile: ")
archive = zipfile.ZipFile(infile)

if not os.path.exists(destinationDir):
    os.makedirs(destinationDir)

for file in archive.namelist():
    if file.startswith(dirToExtract):
        archive.extract(file, destinationDir)
The script prompts for the file to be unzipped and then extracts just the "xl/embeddings" folder and the files within it to a new directory, named "embedded", that it creates within the current working directory. After extracting the contents of the "xl/embeddings" directory to the newly created "embedded" folder, I had the files below for the particular .xlsm file I used for this example.
$ ./extractdir.py
Enter zipfile: CRQ-1224294 SDP.xlsm
$ ls embedded
xl
$ ls embedded/xl
embeddings
$ ls embedded/xl/embeddings
Microsoft_Visio_Drawing1.vsdx   Microsoft_Visio_Drawing5.vsdx
Microsoft_Visio_Drawing10.vsdx  Microsoft_Visio_Drawing6.vsdx
Microsoft_Visio_Drawing11.vsdx  Microsoft_Visio_Drawing7.vsdx
Microsoft_Visio_Drawing12.vsdx  Microsoft_Visio_Drawing8.vsdx
Microsoft_Visio_Drawing2.vsdx   Microsoft_Visio_Drawing9.vsdx
Microsoft_Visio_Drawing3.vsdx   oleObject1.bin
Microsoft_Visio_Drawing4.vsdx   oleObject2.bin
$
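Before extracting anything, it can be handy to see which embedded objects a workbook actually contains. The following is a minimal sketch, assuming Python 3 (the script above is Python 2); the helper name list_embedded is hypothetical and not part of the original script. It only reads the archive's member list and writes nothing to disk.

```python
import zipfile

EMBED_PREFIX = "xl/embeddings/"  # where the embedded objects live inside the workbook


def list_embedded(workbook):
    """Return the names of the archive members stored under xl/embeddings/.

    `workbook` may be a path to an .xlsx/.xlsm file or a file-like object.
    list_embedded is a hypothetical helper name used for illustration.
    """
    with zipfile.ZipFile(workbook) as archive:
        return [
            name
            for name in archive.namelist()
            # Skip explicit directory entries, which end with a slash.
            if name.startswith(EMBED_PREFIX) and not name.endswith("/")
        ]
```

Calling list_embedded("CRQ-1224294 SDP.xlsm") should return the member names with their full xl/embeddings/ prefix, in the order they are stored in the archive.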
But I don't need the directory structure maintained, i.e., I would prefer to
have the embedded files extracted directly into the embedded
directory rather than into an
xl/embeddings subdirectory within
that directory. So I used code provided by Gerhard Götz at
Extract files from zip without keeping the structure using python ZipFile?
That code relies on
ZipFile.infolist() rather than ZipFile.namelist(). The Python documentation describes the two methods as follows:
ZipFile.infolist(): Return a list containing a ZipInfo object for each member of the archive. The objects are in the same order as their entries in the actual ZIP file on disk if an existing archive was opened.
ZipFile.namelist(): Return a list of archive members by name.
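To make the difference between the two listing methods concrete, here is a small sketch, assuming Python 3. It builds a throwaway archive in memory (the member names are made up for illustration) and shows that namelist() yields plain strings while infolist() yields ZipInfo objects that also carry metadata such as file_size.

```python
import io
import zipfile

# Build a tiny archive in memory so the example does not need a real workbook.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
    archive.writestr("xl/embeddings/oleObject1.bin", b"\x00" * 16)

with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()  # list of str
    infos = archive.infolist()  # list of zipfile.ZipInfo, same order as names

print(names)
for info in infos:
    # A ZipInfo exposes the member name plus per-member metadata.
    print(info.filename, info.file_size)
```

The ZipInfo objects matter here because ZipFile.extract() accepts one in place of a name, and the object's filename attribute can be changed before extraction so that the member is written under a different path.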
Using that code and some additional code to accept a file name provided as an argument on the command line, I then have the code below in the script extractdir.py:
#!/usr/bin/python

# Name: extractdir.py
# Version: 0.2
# Created: 2018-02-10
# Last modified: 2018-02-10
# Purpose: Extract any files embedded in an Excel spreadsheet, e.g.,
# Microsoft Visio or PowerPoint files, to a directory named "embedded" beneath
# the current working directory. The Excel spreadsheet file name can be
# provided on the command line; if it isn't, the script will prompt for the
# file name.

import os, sys, zipfile

dirToExtract = "xl/embeddings/"
destinationDir = "embedded"

# Check to see if the file name was entered on the command line.
# If it wasn't, prompt for the file name.
try:
    sys.argv[1]
except IndexError:
    infile = raw_input("Zip file: ")
else:
    infile = sys.argv[1]

if not os.path.exists(destinationDir):
    os.makedirs(destinationDir)

with zipfile.ZipFile(infile) as zip:
    for zip_info in zip.infolist():
        if zip_info.filename.startswith(dirToExtract):
            # Drop the directory part so the file lands directly in destinationDir.
            zip_info.filename = os.path.basename(zip_info.filename)
            zip.extract(zip_info, destinationDir)
Now when I run the script, I get the extracted files in the
embedded directory beneath the current working directory rather
than two levels down in a subdirectory.
$ ./extractdir.py "CRQ-1224294 SDP.xlsm"
$ ls embedded
Microsoft_Visio_Drawing1.vsdx   Microsoft_Visio_Drawing5.vsdx
Microsoft_Visio_Drawing10.vsdx  Microsoft_Visio_Drawing6.vsdx
Microsoft_Visio_Drawing11.vsdx  Microsoft_Visio_Drawing7.vsdx
Microsoft_Visio_Drawing12.vsdx  Microsoft_Visio_Drawing8.vsdx
Microsoft_Visio_Drawing2.vsdx   Microsoft_Visio_Drawing9.vsdx
Microsoft_Visio_Drawing3.vsdx   oleObject1.bin
Microsoft_Visio_Drawing4.vsdx   oleObject2.bin
$
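For use from within another Python script, the same flattening idea can be wrapped in a function. The sketch below is one possible variant, assuming Python 3; the function name extract_flat and its suffixes parameter are hypothetical additions, not part of extractdir.py. Instead of rewriting zip_info.filename, it streams each member to the destination with ZipFile.open(), skips directory entries (whose base name is empty), and can optionally keep only files with given extensions.

```python
import os
import shutil
import zipfile


def extract_flat(workbook, dest_dir="embedded", prefix="xl/embeddings/", suffixes=None):
    """Copy members stored under `prefix` directly into `dest_dir`.

    The archive's directory structure is discarded. `suffixes` is an optional
    iterable of lowercase extensions, e.g. (".vsdx",), limiting what is copied.
    Returns the list of paths written. Hypothetical helper for illustration.
    """
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(workbook) as archive:
        for info in archive.infolist():
            name = os.path.basename(info.filename)
            if not info.filename.startswith(prefix) or not name:
                continue  # outside the prefix, or a directory entry
            if suffixes and not name.lower().endswith(tuple(suffixes)):
                continue  # extension not requested
            target = os.path.join(dest_dir, name)
            # Stream the member's bytes to a file named after its base name.
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

For example, extract_flat("CRQ-1224294 SDP.xlsm", suffixes=(".vsdx",)) would copy only the Visio drawings into the embedded directory. Because the directory structure is flattened, two members with the same base name in different subdirectories would overwrite each other.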