Extracting the contents of a directory in a zipfile using Python

A Microsoft Excel file with an .xlsx or .xlsm filename extension is an Office Open XML (OpenXML) file: a zip archive of XML-based parts. Microsoft developed the OpenXML format for spreadsheets, charts, presentations and word processing documents. If you rename the file so that its extension is .zip, you can extract its contents as you would with any other zip file - see Zipping and unzipping Excel xlsx files. Excel workbooks can also contain other documents embedded within them using Object Linking and Embedding (OLE) technology - see Using olefile to obtain metadata from an OLE CDF V2 file.

I often need to extract an embedded PowerPoint slide or Visio diagram from Excel .xlsm files, so I've been renaming the files to zip files and unzipping them as I would other zip files. But since I want to automate the process and extract just specific embedded files for further processing within a Python script, I created the script below to extract the embedded files, which are stored in an xl/embeddings subdirectory within the .xlsm zip file. The script uses the zipfile module to read the zip file. Python's os module is used to check whether the destination directory exists and to create it if it doesn't.

#!/usr/bin/python

import os, zipfile

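# Zip member names always use forward slashes, regardless of platform.
dirToExtract = "xl/embeddings/"
destinationDir = "embedded"
infile = raw_input("Enter zipfile: ")  # Python 2; use input() on Python 3
archive = zipfile.ZipFile(infile)

# Create the destination directory if it doesn't exist yet.
if not os.path.exists(destinationDir):
    os.makedirs(destinationDir)

# Extract only the members stored under xl/embeddings/, keeping their paths.
for name in archive.namelist():
    if name.startswith(dirToExtract):
        archive.extract(name, destinationDir)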

The script prompts for the file to be unzipped and then extracts just the "xl/embeddings" folder and the files within it to a directory named "embedded", which it creates in the current working directory if it doesn't already exist. For the particular .xlsm file I used for this example, I ended up with the files below.

$ ./extractdir.py
Enter zipfile: CRQ-1224294 SDP.xlsm
$ ls embedded
xl
$ ls embedded/xl
embeddings
$ ls embedded/xl/embeddings
Microsoft_Visio_Drawing1.vsdx	Microsoft_Visio_Drawing5.vsdx
Microsoft_Visio_Drawing10.vsdx	Microsoft_Visio_Drawing6.vsdx
Microsoft_Visio_Drawing11.vsdx	Microsoft_Visio_Drawing7.vsdx
Microsoft_Visio_Drawing12.vsdx	Microsoft_Visio_Drawing8.vsdx
Microsoft_Visio_Drawing2.vsdx	Microsoft_Visio_Drawing9.vsdx
Microsoft_Visio_Drawing3.vsdx	oleObject1.bin
Microsoft_Visio_Drawing4.vsdx	oleObject2.bin
$
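
The same prefix test can serve as a dry run before anything is written to disk. The following is a minimal sketch, assuming Python 3; list_embedded is a hypothetical helper name, not part of the script above. It opens the workbook directly, without renaming it to .zip, and returns the names stored under xl/embeddings/.

```python
import zipfile


def list_embedded(workbook_path, prefix="xl/embeddings/"):
    """Return the member names stored under `prefix`, skipping directory entries."""
    # zipfile does not care about the file extension, so an .xlsx or .xlsm
    # file can be opened directly without renaming it to .zip.
    with zipfile.ZipFile(workbook_path) as archive:
        return [
            name
            for name in archive.namelist()
            if name.startswith(prefix) and not name.endswith("/")
        ]
```

Calling list_embedded("CRQ-1224294 SDP.xlsm") on the workbook used here should return the same names that appear in the listing above, each with its xl/embeddings/ prefix.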

But I don't need the directory structure maintained; i.e., I would prefer to have the embedded files extracted directly into the embedded directory rather than into an xl/embeddings subdirectory within it. So I used code provided by Gerhard Götz at "Extract files from zip without keeping the structure using python ZipFile?" That code relies on ZipFile.infolist() rather than ZipFile.namelist(). The Python documentation describes the two methods as follows:

ZipFile.infolist()

Return a list containing a ZipInfo object for each member of the archive. The objects are in the same order as their entries in the actual ZIP file on disk if an existing archive was opened.

ZipFile.namelist()

Return a list of archive members by name.
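
The practical difference is that namelist() gives plain strings, while infolist() gives ZipInfo objects whose attributes, including filename, can be read and reassigned. ZipFile.extract() accepts either form. The sketch below, assuming Python 3 and using a hypothetical function name describe_members, shows both views of the same archive side by side.

```python
import zipfile


def describe_members(archive_path):
    """Return (filename, file_size) pairs from the ZipInfo objects of an archive."""
    with zipfile.ZipFile(archive_path) as archive:
        # namelist() yields plain strings; infolist() yields ZipInfo objects
        # that carry the name plus metadata such as the uncompressed size.
        names = archive.namelist()
        infos = archive.infolist()
        # Both lists describe the same members in the same order.
        assert names == [info.filename for info in infos]
        return [(info.filename, info.file_size) for info in infos]
```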

Using that code and some additional code to accept a file name provided as an argument on the command line, I then have the code below in the script extractdir.py:



#!/usr/bin/python

# Name: extractdir.py
# Version: 0.2
# Created: 2018-02-10
# Last modified: 2018-02-10
# Purpose: Extract any files embedded in an Excel spreadsheet, e.g.,
# Microsoft Visio or PowerPoint files, to a directory named "embedded" beneath
# the current working directory. The Excel spreadsheet file name can be
# provided on the command line; if it isn't the script will prompt for the
# file name.

import os, sys, zipfile

dirToExtract = "xl/embeddings/"
destinationDir = "embedded"

# Use the file name given on the command line, if any;
# otherwise prompt for it.
if len(sys.argv) > 1:
    infile = sys.argv[1]
else:
    infile = raw_input("Zip file: ")

if not os.path.exists(destinationDir):
    os.makedirs(destinationDir)

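with zipfile.ZipFile(infile) as archive:
    for zip_info in archive.infolist():
        if zip_info.filename.startswith(dirToExtract):
            # Drop the leading directories so the file lands directly
            # in destinationDir.
            zip_info.filename = os.path.basename(zip_info.filename)
            # A directory entry such as "xl/embeddings/" has no base name.
            if zip_info.filename:
                archive.extract(zip_info, destinationDir)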

Now when I run the script, I get the extracted files in the embedded directory beneath the current working directory rather than two levels down in a subdirectory.

$ ./extractdir.py "CRQ-1224294 SDP.xlsm"
$ ls embedded
Microsoft_Visio_Drawing1.vsdx	Microsoft_Visio_Drawing5.vsdx
Microsoft_Visio_Drawing10.vsdx	Microsoft_Visio_Drawing6.vsdx
Microsoft_Visio_Drawing11.vsdx	Microsoft_Visio_Drawing7.vsdx
Microsoft_Visio_Drawing12.vsdx	Microsoft_Visio_Drawing8.vsdx
Microsoft_Visio_Drawing2.vsdx	Microsoft_Visio_Drawing9.vsdx
Microsoft_Visio_Drawing3.vsdx	oleObject1.bin
Microsoft_Visio_Drawing4.vsdx	oleObject2.bin
$
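
Assigning to zip_info.filename works, but it alters the ZipInfo object held by the open archive. An alternative, sketched below under the assumption of Python 3, streams each member to a file named after its base name and leaves the ZipInfo untouched. The function name extract_flat and its suffixes parameter are hypothetical additions for illustration; the filter is there because only specific embedded file types may be wanted.

```python
import os
import shutil
import zipfile


def extract_flat(archive_path, dest_dir="embedded",
                 prefix="xl/embeddings/", suffixes=None):
    """Copy members under `prefix` into `dest_dir` without subdirectories.

    `suffixes` is an optional tuple such as (".vsdx",); None keeps every file.
    Returns the list of paths written.
    """
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            name = os.path.basename(info.filename)
            # Skip members outside the prefix and directory entries
            # (their base name is empty).
            if not info.filename.startswith(prefix) or not name:
                continue
            if suffixes and not name.lower().endswith(suffixes):
                continue
            target = os.path.join(dest_dir, name)
            # Stream the member to disk; the ZipInfo object is not modified.
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

For example, extract_flat("CRQ-1224294 SDP.xlsm", suffixes=(".vsdx",)) would copy only the Visio drawings. Members in different subdirectories that share a base name would overwrite one another; the sketch does not guard against that.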

Related articles:

  1. Zipping and unzipping Excel xlsx files
  2. Excel 2016 - Workbook Protected
  3. Can't insert worksheet in Microsoft Excel for Mac 2016
  4. Using olefile to obtain metadata from an OLE CDF V2 file