Extracting embedded Microsoft Office files from an Excel spreadsheet

I work with Excel workbooks on my MacBook Pro laptop that have PowerPoint slides embedded on some worksheets. The workbooks, which I need to review, are created by others. When I review them, I use Python to extract information from the workbooks into an SQLite database, and I have also begun extracting the files embedded in them by Object Linking and Embedding (OLE), as noted in Extracting the contents of a directory in a zipfile using Python. Some of the embedded files are PowerPoint files, but they are extracted with a .bin extension, and I can't open them in PowerPoint without changing the filename extension from .bin to .ppt. To automate the renaming, I created a Python script, extract_embedded.py, that extracts the embedded files to an "embedded" directory beneath the current working directory and then renames any .bin files that are PowerPoint files to have a .ppt extension. The script is shown below.

#!/usr/bin/python

# Name: extract_embedded.py
# Version: 0.1
# Created: 2018-02-14
# Last modified: 2018-02-14
# Description: Accepts the name of an Excel spreadsheet file, such as a
# .xlsx or .xlsm file, which is an Office Open XML (OOXML) file, as a command
# line argument or, if one is not provided on the command line, will prompt
# for the input file name. OOXML is a zipped, XML-based file format developed
# by Microsoft for representing spreadsheets, charts, presentations and word
# processing documents.
#
# The contents of the xl/embeddings subdirectory within the OOXML file will
# be extracted to a directory named "embedded" beneath the current working
# directory. That directory will be created, if it doesn't already exist.
# After the files are extracted, any .bin files that are PowerPoint files
# will be renamed to have a .ppt extension. The olefile module is used to
# determine if a .bin file is a PowerPoint file.

import os, sys

def extractdir(infile):

# Extract all of the embedded files from an Excel .xlsx or .xlsm workbook

   import zipfile

   dirToExtract = "xl/embeddings/"
   destinationDir = "embedded"
   
   if not os.path.exists(destinationDir):
       os.makedirs(destinationDir)
   
   with zipfile.ZipFile(infile) as zip:
       for zip_info in zip.infolist():
           if zip_info.filename.startswith(dirToExtract):
               zip_info.filename = os.path.basename(zip_info.filename)
               zip.extract(zip_info, destinationDir)

def fixExtension(infile):

# If infile is a PowerPoint file with a .bin extension, it will be renamed to
# have a .ppt filename extension.

   import olefile

   # Determine if the first bytes of the file contain the magic number for OLE 
   # files, before opening it. isOleFile returns True if it is an OLE file, 
   # False otherwise 
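   if olefile.isOleFile(infile):
      ole = olefile.OleFileIO(infile)
      # Check for the stream that marks a PowerPoint file, then close the
      # file so that it isn't still open when it is renamed
      is_powerpoint = ole.exists('PowerPoint Document')
      ole.close()
      if is_powerpoint:
         path_and_filename, file_extension = os.path.splitext(infile)
         # Replace the .bin extension with a .ppt one
         os.rename(infile, path_and_filename + ".ppt")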

# The directory where the embedded files will be stored
path = "embedded"

# Check to see if the file name was entered on the command line.
# If it wasn't, prompt for the file name
try:
   sys.argv[1]
except IndexError:
   infile = raw_input("Zip file: ")
else:
   infile = sys.argv[1]

# Extract all embedded files to an "embedded" directory beneath the current
# working directory.
extractdir(infile)
# For all files in the "embedded" directory, if the file is a PowerPoint file
# with a .bin extension, rename the file to have a .ppt extension
for file in os.listdir(path):
    current_file = os.path.join(path, file)
    fixExtension(current_file)
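
The comments in fixExtension mention the magic number that olefile.isOleFile looks for. As a rough illustration of that idea only, the sketch below compares the first eight bytes of a file with the OLE signature D0 CF 11 E0 A1 B1 1A E1. The function name has_ole_signature is hypothetical, and olefile itself may perform further checks, so this is not a replacement for isOleFile.

```python
# The 8-byte signature at the start of an OLE compound file
OLE_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def has_ole_signature(filename):
    """Return True if the file begins with the OLE signature.

    Hypothetical helper for illustration; it checks only the signature.
    """
    with open(filename, "rb") as handle:
        return handle.read(len(OLE_SIGNATURE)) == OLE_SIGNATURE
```

A file that passes this check is only known to be some kind of OLE file; the script still needs the 'PowerPoint Document' stream test to decide whether it is a PowerPoint file.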

The script relies upon the zipfile and olefile modules. If an Excel spreadsheet file, such as a .xlsx or .xlsm file, is provided as an argument on the command line, it will be used as the input file. If one is not specified on the command line, the script will prompt for the file name.
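
The script uses raw_input for the prompt, which exists only in Python 2. If the script needs to run under Python 3 as well, the file name handling could be written along the lines of the sketch below. The function name get_input_filename and its prompt parameter are my own assumptions for illustration and are not part of the script above.

```python
def get_input_filename(argv, prompt=None):
    """Return argv[1] if it was given; otherwise ask for a file name.

    prompt is a hypothetical hook so the function can be tried without
    typing anything; by default the interpreter's own prompt function
    is used.
    """
    if len(argv) > 1:
        return argv[1]
    if prompt is None:
        # raw_input exists only in Python 2; Python 3 renamed it to input
        try:
            prompt = raw_input
        except NameError:
            prompt = input
    return prompt("Excel workbook: ")
```

In the script, the try/except/else block could then be replaced with infile = get_input_filename(sys.argv).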

Two functions are used. The first, extractdir, extracts all of the embedded files from the Excel workbook, placing them in a directory named embedded within the current working directory. That directory will be created if it doesn't already exist. After the files are extracted, the script calls the second function, fixExtension, on each file to give any PowerPoint file the .ppt extension it needs.
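
The extraction step can also be tried on its own. The sketch below is a variation on extractdir, not the function from the script: it reads each member under xl/embeddings/ and writes it out by its base name, skipping any bare directory entry that the archive might contain, and it returns the paths it wrote. The name extract_embeddings and its parameters are assumptions made for this example.

```python
import os
import zipfile


def extract_embeddings(workbook, destination="embedded",
                       prefix="xl/embeddings/"):
    """Copy every file under prefix in the workbook into destination.

    Returns the list of paths that were written.
    """
    if not os.path.isdir(destination):
        os.makedirs(destination)
    written = []
    with zipfile.ZipFile(workbook) as archive:
        for name in archive.namelist():
            base = os.path.basename(name)
            # Skip members outside the prefix and bare directory entries
            if not name.startswith(prefix) or not base:
                continue
            target = os.path.join(destination, base)
            with open(target, "wb") as out:
                out.write(archive.read(name))
            written.append(target)
    return written
```

Calling extract_embeddings("workbook.xlsx") on a workbook with embedded objects should leave files such as oleObject1.bin in the embedded directory, ready for the renaming step.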

Related articles:

  1. Using Python to extract data from a spreadsheet and put it in a database
  2. Extracting the contents of a directory in a zipfile using Python
  3. Using olefile to obtain metadata from an OLE CDF V2 file
  4. Determining a file's type from within a Python script
  5. Using Python scripts with Apache on OS X El Capitan