On my MacBook Pro laptop, I work with Excel workbooks that have embedded PowerPoint slides on some worksheets. The workbooks, which I need to review, are created by others. When I review them, I extract information from the Excel workbooks to an SQLite database with Python, and I have also begun extracting information embedded by Object Linking and Embedding (OLE) into files, as noted in "Extracting the contents of a directory in a zipfile using Python". Some of the embedded files are PowerPoint files, but when they are extracted they have a .bin extension, and I can't open them in PowerPoint without changing the filename extension from .bin to .ppt. To automate the renaming process, I created a Python script, extract_embedded.py, that extracts the embedded information to files in an "embedded" directory beneath the current working directory and then renames any .bin files that are PowerPoint files to have a .ppt extension. The script is shown below.
#!/usr/bin/python

# Name: extract_embedded.py
# Version: 0.1
# Created: 2018-02-14
# Last modified: 2018-02-14
# Description: Accepts the name of an Excel spreadsheet file, such as a
# .xlsx or .xlsm file, which is an Office Open XML (OOXML) file, as a command
# line argument or, if one is not provided on the command line, will prompt
# for the input filename. OOXML is a zipped, XML-based file format developed
# by Microsoft for representing spreadsheets, charts, presentations and word
# processing documents.
#
# The contents of the xl/embeddings subdirectory within the OOXML file will
# be extracted to a directory named "embedded" beneath the current working
# directory. That directory will be created, if it doesn't already exist.
# After the files are extracted, any .bin files that are PowerPoint files
# will be renamed to have a .ppt extension. The olefile module is used to
# determine if a .bin file is a PowerPoint file.

import os, sys

def extractdir(infile):
    # Extract all of the embedded files from an Excel .xlsx or .xlsm workbook
    import zipfile

    dirToExtract = "xl/embeddings/"
    destinationDir = "embedded"

    if not os.path.exists(destinationDir):
        os.makedirs(destinationDir)

    with zipfile.ZipFile(infile) as zf:
        for zip_info in zf.infolist():
            if zip_info.filename.startswith(dirToExtract):
                # Drop the xl/embeddings/ prefix so the file is written
                # directly into destinationDir
                zip_info.filename = os.path.basename(zip_info.filename)
                # Skip a bare directory entry, which has no base name
                if zip_info.filename:
                    zf.extract(zip_info, destinationDir)

def fixExtension(infile):
    # If infile is a PowerPoint file with a .bin extension, it will be renamed to
    # have a .ppt filename extension.
    import olefile

    # Determine if the first bytes of the file contain the magic number for OLE
    # files, before opening it. isOleFile returns True if it is an OLE file,
    # False otherwise
    if olefile.isOleFile(infile):
        ole = olefile.OleFileIO(infile)
        isPowerPoint = ole.exists('PowerPoint Document')
        # Close the file before renaming it
        ole.close()
        if isPowerPoint:
            path_and_filename, file_extension = os.path.splitext(infile)
            # Replace the .bin extension with a .ppt one
            os.rename(infile, path_and_filename + ".ppt")

# The directory where the embedded files will be stored
path = "embedded"

# Check to see if the file name was entered on the command line.
# If it wasn't, prompt for the file name
try:
    sys.argv[1]
except IndexError:
    try:
        infile = raw_input("Zip file: ")  # Python 2
    except NameError:
        infile = input("Zip file: ")      # Python 3
else:
    infile = sys.argv[1]

# Extract all embedded files to an "embedded" directory beneath the current
# working directory.
extractdir(infile)

# For all files in the "embedded" directory, if the file is a PowerPoint file
# with a .bin extension, rename the file to have a .ppt extension
for name in os.listdir(path):
    current_file = os.path.join(path, name)
    fixExtension(current_file)
The script relies upon the zipfile and olefile modules. If an Excel spreadsheet file, such as a .xlsx or .xlsm file, is provided as an argument on the command line, it will be used as the input file. If one is not specified on the command line, the script will prompt for the file name.
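Before extracting anything, it can be handy to see which entries a workbook actually holds under xl/embeddings/. The following is only a minimal sketch using the standard zipfile module; list_embedded and make_sample_workbook are hypothetical helper names that are not part of extract_embedded.py, and the sample workbook is a made-up in-memory zip file rather than a real Excel file.

```python
import io
import zipfile

EMBED_PREFIX = "xl/embeddings/"  # same folder the script extracts from


def list_embedded(workbook):
    """Return the names of entries stored under xl/embeddings/.

    `workbook` may be a path or a binary file-like object
    (hypothetical helper, not part of extract_embedded.py).
    """
    with zipfile.ZipFile(workbook) as zf:
        return [
            name for name in zf.namelist()
            if name.startswith(EMBED_PREFIX) and not name.endswith("/")
        ]


def make_sample_workbook():
    """Build a tiny in-memory zip that mimics a workbook layout (made-up data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("xl/workbook.xml", "<workbook/>")
        zf.writestr("xl/embeddings/oleObject1.bin", b"placeholder")
        zf.writestr("xl/embeddings/oleObject2.bin", b"placeholder")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for entry in list_embedded(make_sample_workbook()):
        print(entry)
```

With a real workbook, the path to the .xlsx or .xlsm file would be passed to list_embedded in place of the in-memory sample.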
Two functions are used. The first, extractdir, extracts all of the embedded files from the Excel workbook, placing them in a directory named embedded within the current working directory. That directory will be created if it doesn't already exist. After the files are extracted, the script calls the second function, fixExtension, on each file in that directory; it renames any .bin file that is a PowerPoint file so that it has a .ppt extension.
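The magic-number test that the script delegates to olefile can be roughly approximated with the standard library alone, which may help illustrate what is being checked. This is only a sketch: has_ole_signature and with_ppt_extension are hypothetical helper names, and the check is coarser than the script's, since it only confirms the eight-byte OLE signature at the start of the file and does not look for the 'PowerPoint Document' stream, which still requires olefile.

```python
import os

# Signature found in the first eight bytes of OLE compound files
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def has_ole_signature(path):
    """Return True if the file starts with the OLE signature (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read(len(OLE_MAGIC)) == OLE_MAGIC


def with_ppt_extension(path):
    """Return `path` with its extension replaced by .ppt (hypothetical helper)."""
    root, _ext = os.path.splitext(path)
    return root + ".ppt"


if __name__ == "__main__":
    # Show the name an extracted file would receive after renaming
    print(with_ppt_extension(os.path.join("embedded", "oleObject1.bin")))
```

A file that passes has_ole_signature is not necessarily a presentation; other embedded OLE objects, such as older Word or Excel files, carry the same signature, which is why the script also checks for the PowerPoint stream before renaming.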