Microsoft's Object Linking and Embedding (OLE) technology allows embedding and linking to documents and other objects. OLE allows the addition of different kinds of data to a document from different applications, such as a text editor and an image editor. The result is stored in the Compound File Binary Format (CFBF), also known as a Compound File, Compound Document format, or Composite Document File V2 (CDF V2) document.
On my MacBook Pro laptop, which is currently running the OS X El Capitan (10.11.6) operating system, I often need to extract embedded documents from an Excel .xlsm file. I do that by renaming the file to have a .zip rather than a .xlsm file extension. I can then extract the files it contains just as I would from any zip file. Within the directory structure created by unzipping the file there is an xl/embeddings subdirectory that holds the embedded objects, including .bin files.
$ ls xl/embeddings
Microsoft_Visio_Drawing1.vsdx   oleObject2.bin
Microsoft_Visio_Drawing2.vsdx   oleObject3.bin
oleObject1.bin
$
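The rename-and-unzip step can also be scripted. The following is a minimal sketch, assuming Python 3 and only the standard library's zipfile module; the function name extract_embeddings and its arguments are hypothetical names chosen for illustration. Because zipfile does not care about the file extension, the .xlsm file can be opened directly without renaming it.

```python
import zipfile


def extract_embeddings(workbook_path, dest_dir):
    """Extract every file stored under xl/embeddings/ in the workbook.

    Returns the list of paths that were written below dest_dir.
    """
    extracted = []
    with zipfile.ZipFile(workbook_path) as archive:
        for name in archive.namelist():
            # Skip directory entries; keep only files in xl/embeddings/.
            if name.startswith("xl/embeddings/") and not name.endswith("/"):
                extracted.append(archive.extract(name, dest_dir))
    return extracted
```

Calling extract_embeddings("workbook.xlsm", "unzipped") (both names are placeholders) would recreate the xl/embeddings subdirectory under the destination directory.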
I can check the file type for those files with the file command.
$ file xl/embeddings/oleObject1.bin
xl/embeddings/oleObject1.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.1, Code page: 1252, Title: PowerPoint Presentation, Author: Tracy Williams, Last Saved By: Rehman, Waheeda R. (ASFC-100)[ACME CORP], Revision Number: 9, Name of Creating Application: Microsoft Office PowerPoint, Total Editing Time: 06:02:54, Create Time/Date: Mon Jan 7 00:03:32 2013, Last Saved Time/Date: Tue Jul 4 16:34:12 2017, Number of Words: 89
$
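A compound file can be recognized by the signature in its first eight bytes. As a rough sketch of that check in Python 3, assuming the standard compound file signature D0 CF 11 E0 A1 B1 1A E1 (the names OLE_MAGIC and looks_like_ole are hypothetical, and this tests only the signature, not the rest of the file):

```python
# Signature found at the start of an OLE / CDF V2 compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(path):
    """Return True if the file at path starts with the compound file signature."""
    with open(path, "rb") as handle:
        return handle.read(len(OLE_MAGIC)) == OLE_MAGIC
```

A file that passes this test may still be damaged or truncated, so it is only a quick first filter.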
In this case, since the creating application was Microsoft Office PowerPoint, I could change the .bin extension to .ppt and open the file in PowerPoint. But since I want to automate the extraction of information from those files, I looked for a Python module that could handle CDF V2 documents and found the olefile package.
To install the software on an OS X system, for example, you can issue the command sudo easy_install olefile in a Terminal window; see Easy Install Python Module on OS X. Alternatively, if you have the pip package manager installed, you can install it with pip install olefile.
$ sudo easy_install olefile
Enter PIN for 'JOHN DOE':
Searching for olefile
Reading https://pypi.python.org/simple/olefile/
Best match: olefile 0.45.1
Downloading https://pypi.python.org/packages/d3/8a/e0f0e56d6a542dd987f9290ef7b5164636ee597ce8c2932c19c78292d5ec/olefile-0.45.1.zip#md5=f70c0688320548ae0f1b4785e7aefcb9
Processing olefile-0.45.1.zip
Writing /tmp/easy_install-Ahie_l/olefile-0.45.1/setup.cfg
Running olefile-0.45.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-Ahie_l/olefile-0.45.1/egg-dist-tmp-ttoP2p
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'python_requires'
  warnings.warn(msg)
no previously-included directories found matching 'doc/_build'
zip_safe flag not set; analyzing archive contents...
Adding olefile 0.45.1 to easy-install.pth file

Installed /Library/Python/2.7/site-packages/olefile-0.45.1-py2.7.egg
Processing dependencies for olefile
Finished processing dependencies for olefile
$
The olefile package can be used to extract metadata about the document, e.g., the author of the document, who last saved it, the time the document was created, and the time it was last saved.
Information on using olefile can be found at How to use olefile - API overview. A Python script I created to use those capabilities, checkOLE.py, is shown below.
#!/usr/bin/python

import olefile, os.path, sys

# Check to see if the file name was entered on the command line.
# If it wasn't, prompt for the file name.
try:
    sys.argv[1]
except IndexError:
    infile = raw_input("Enter file name: ")
else:
    infile = sys.argv[1]

if os.path.isfile(infile):
    # Determine if the first bytes of the file contain the magic number for OLE
    # files, before opening it. isOleFile returns True if it is an OLE file,
    # False otherwise.
    if olefile.isOleFile(infile):
        ole = olefile.OleFileIO(infile)
        meta = ole.get_metadata()
        print('Author:', meta.author)
        print('Last saved by:', meta.last_saved_by)
        print('Title:', meta.title)
        print('Creation date:', meta.create_time)
        # print all metadata:
        meta.dump()
    else:
        print infile, "is not an OLE file"
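checkOLE.py is written for Python 2 (it uses raw_input and the print statement). As a rough sketch of how the same lookup might be written for Python 3, the function below returns a few fields instead of printing them. The name read_ole_summary and the returned dictionary are illustrative assumptions and not part of olefile; only isOleFile, OleFileIO, get_metadata, and close are olefile calls. Under Python 3, olefile may return string properties as bytes rather than str.

```python
import os.path


def read_ole_summary(path):
    """Return selected metadata fields, or None if path is not an OLE file."""
    if not os.path.isfile(path):
        return None
    # Third-party package; imported here so this module can still be
    # imported on a system where olefile is not installed.
    import olefile

    if not olefile.isOleFile(path):
        return None
    ole = olefile.OleFileIO(path)
    try:
        meta = ole.get_metadata()
        return {
            "author": meta.author,
            "last_saved_by": meta.last_saved_by,
            "title": meta.title,
            "create_time": meta.create_time,
        }
    finally:
        # Always release the file handle, even if reading metadata fails.
        ole.close()
```

Returning a dictionary rather than printing makes it easier to loop over every .bin file in xl/embeddings and collect the results.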
I can display individual elements from the metadata, such as author, last_saved_by, create_time, etc., or dump all of the metadata with meta.dump().
E.g., after installing olefile, I ran the script against one of the .bin files extracted from the renamed and unzipped .xlsm file and saw the following information:
$ ./checkOLE.py xl/embeddings/oleObject1.bin
('Author:', 'Tracy Williams')
('Last saved by:', 'Rehman, Waheeda R. (ASFC-100)[ACME CORP]')
('Title:', 'PowerPoint Presentation')
('Creation date:', datetime.datetime(2013, 1, 8, 0, 3, 32, 886000))
Properties from SummaryInformation stream:
- codepage: 1252
- title: 'PowerPoint Presentation'
- subject: None
- author: 'Tracy Williams'
- keywords: None
- comments: None
- template: None
- last_saved_by: 'Rehman, Waheeda R. (ASFC-100)[ACME CORP]'
- revision_number: '9'
- total_edit_time: 21774L
- last_printed: None
- create_time: datetime.datetime(2013, 1, 8, 0, 3, 32, 886000)
- last_saved_time: datetime.datetime(2017, 7, 5, 15, 34, 12, 771000)
- num_pages: None
- num_words: 89
- num_chars: None
- thumbnail: None
- creating_application: 'Microsoft Office PowerPoint'
- security: None
Properties from DocumentSummaryInformation stream:
- codepage_doc: 1252
- category: None
- presentation_target: 'On-screen Show (4:3)'
- bytes: 40010
- lines: None
- paragraphs: 7
- slides: 1
- notes: 0
- hidden_slides: 0
- mm_clips: 0
- scale_crop: False
- heading_pairs: None
- titles_of_parts: None
- manager: None
- company: ''
- links_dirty: False
- chars_with_spaces: None
- unused: None
- shared_doc: False
- link_base: None
- hlinks: None
- hlinks_changed: False
- version: 983040
- dig_sig: None
- content_type: None
- content_status: None
- language: None
- doc_version: None
$
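The dates are displayed as Python datetime.datetime values, whose fields are in the order year, month, day, hour, minute, second, and microsecond.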
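These raw values can be turned into more readable text with the standard library alone. The sketch below uses the create_time and total_edit_time values from the output above; format_edit_time is a hypothetical helper name, and it treats total_edit_time as a count of seconds, which is consistent with the 06:02:54 editing time that the file command reported for the same document.

```python
from datetime import datetime


def format_edit_time(seconds):
    """Format a duration given in seconds as HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, secs)


# Values copied from the metadata dump shown above.
create_time = datetime(2013, 1, 8, 0, 3, 32, 886000)
total_edit_time = 21774

print(create_time.strftime("%Y-%m-%d %H:%M:%S"))  # 2013-01-08 00:03:32
print(format_edit_time(total_edit_time))          # 06:02:54
```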