Using olefile to obtain metadata from an OLE CDF V2 file

Microsoft's Object Linking and Embedding (OLE) technology allows embedding and linking to documents and other objects. With OLE, data produced by different applications, such as a text editor and an image editor, can be added to a single document. The result is stored in the Compound File Binary Format (CFBF), also known as a Compound File, Compound Document format, or Composite Document File V2 (CDF V2) document.

While using my MacBook Pro laptop, which is currently running the OS X El Capitan (10.11.6) operating system, I often need to extract embedded documents from an Excel .xlsm file. I do that by renaming the file to have a .zip rather than a .xlsm file extension. I can then extract the files contained within the .zip file just as I would any zip file. Within the directory structure created by unzipping the zip file there is an xl/embeddings subdirectory with .bin files within it.

$ ls xl/embeddings
Microsoft_Visio_Drawing1.vsdx	oleObject2.bin
Microsoft_Visio_Drawing2.vsdx	oleObject3.bin
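
The rename-and-unzip step can also be scripted. The following is a minimal sketch, assuming Python 3 and its standard zipfile module; the function name extract_embeddings and the file names in the usage note are hypothetical. Because zipfile inspects the file contents rather than the extension, the .xlsm file does not have to be renamed first.

```python
import zipfile


def extract_embeddings(workbook_path, dest_dir):
    """Extract every file under xl/embeddings/ and return the member names."""
    with zipfile.ZipFile(workbook_path) as archive:
        # Keep only the embedded objects, skipping any directory entries.
        members = [
            name for name in archive.namelist()
            if name.startswith("xl/embeddings/") and not name.endswith("/")
        ]
        archive.extractall(dest_dir, members=members)
    return members
```

Calling extract_embeddings("workbook.xlsm", "extracted") would then place the embedded files under extracted/xl/embeddings.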

I can check the file type for those files with the file command.

$ file xl/embeddings/oleObject1.bin
xl/embeddings/oleObject1.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.1, Code page: 1252, Title: PowerPoint Presentation, Author: Tracy Willams, Last Saved By: Rehman, Waheeda R. (ASFC-100)[ACME CORP], Revision Number: 9, Name of Creating Application: Microsoft Office PowerPoint, Total Editing Time: 06:02:54, Create Time/Date: Mon Jan  7 00:03:32 2013, Last Saved Time/Date: Tue Jul  4 16:34:12 2017, Number of Words: 89

In this case, the creating application was Microsoft Office PowerPoint, so I could change the .bin extension to .ppt and open the file in PowerPoint. But since I want to automate the extraction of information from those files, I looked for a Python module that can handle CDF V2 documents and found the olefile package.
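
The file type check itself can be approximated in Python without any third-party package. The sketch below assumes Python 3 and relies on the fact that compound files begin with the eight-byte signature D0 CF 11 E0 A1 B1 1A E1; the helper name looks_like_ole is hypothetical, and the check is deliberately simpler than what the file command does.

```python
# Eight-byte signature found at the start of every OLE compound file.
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def looks_like_ole(path):
    """Return True if the file at path starts with the OLE signature."""
    with open(path, "rb") as handle:
        return handle.read(len(OLE_MAGIC)) == OLE_MAGIC
```

A matching signature only shows that the container is a compound file. It says nothing about which application created it, so the metadata is still needed for that.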

On an OS X system, you can install the package by issuing the command sudo easy_install olefile in a Terminal window (see Easy Install Python Module on OS X). Alternatively, if you have the pip package manager installed, you can install it with pip install olefile.

$ sudo easy_install olefile
Enter PIN for 'JOHN DOE': 
Searching for olefile
Best match: olefile 0.45.1
Writing /tmp/easy_install-Ahie_l/olefile-0.45.1/setup.cfg
Running olefile-0.45.1/ -q bdist_egg --dist-dir /tmp/easy_install-Ahie_l/olefile-0.45.1/egg-dist-tmp-ttoP2p
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/ UserWarning: Unknown distribution option: 'python_requires'
no previously-included directories found matching 'doc/_build'
zip_safe flag not set; analyzing archive contents...
Adding olefile 0.45.1 to easy-install.pth file

Installed /Library/Python/2.7/site-packages/olefile-0.45.1-py2.7.egg
Processing dependencies for olefile
Finished processing dependencies for olefile

The olefile package can be used to extract metadata about the document, e.g., the author of the document, who last saved it, the time the document was created, and the time it was last saved. Information on using olefile can be found at How to use olefile - API overview. A Python script I created to use the capabilities of olefile is shown below.

# Written for Python 2.7 (uses raw_input and the print statement).
import olefile, os.path, sys

# Check to see if the file name was entered on the command line.
# If it wasn't, prompt for the file name.
try:
   infile = sys.argv[1]
except IndexError:
   infile = raw_input("Enter file name: ")

if os.path.isfile(infile):
   # Determine if the first bytes of the file contain the magic number for OLE
   # files, before opening it. isOleFile returns True if it is an OLE file,
   # False otherwise.
   if olefile.isOleFile(infile):
      ole = olefile.OleFileIO(infile)
      meta = ole.get_metadata()
      print('Author:', meta.author)
      print('Last saved by:', meta.last_saved_by)
      print('Title:', meta.title)
      print('Creation date:', meta.create_time)
      # print all metadata:
      meta.dump()
      ole.close()
   else:
      print infile, "is not an OLE file"
else:
   print infile, "does not exist or is not a regular file"

I can display individual elements from the metadata, such as author, last_saved_by, create_time, etc. or dump all of the metadata with meta.dump().
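
When the values are needed by other code rather than printed, they can be collected into a dictionary. This is a rough sketch for Python 3, not part of the script above: the helper names summarize and read_summary are hypothetical, and the cp1252 default is an assumption that happens to match the code page of my sample file. Under Python 3, olefile may return some string properties as bytes, which is why the sketch decodes them.

```python
def summarize(meta, names=("author", "last_saved_by", "title", "create_time"),
              encoding="cp1252"):
    """Collect the named metadata attributes into a plain dict."""
    summary = {}
    for name in names:
        value = getattr(meta, name, None)  # None if the attribute is absent
        if isinstance(value, bytes):
            # Assumption: string properties use the given code page.
            value = value.decode(encoding, errors="replace")
        summary[name] = value
    return summary


def read_summary(path):
    """Open an OLE file with olefile and return its summarized metadata."""
    import olefile  # third-party package, imported only when needed

    ole = olefile.OleFileIO(path)
    try:
        return summarize(ole.get_metadata())
    finally:
        ole.close()
```

Keeping summarize separate from the file handling means it can be tried on any object that exposes the same attribute names.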

For example, when I ran the script against one of the .bin files extracted from the .xlsm file (after renaming it to .zip and unzipping it), I saw the following output:

$ ./ xl/embeddings/oleObject1.bin
('Author:', 'Tracy Williams')
('Last saved by:', 'Rehman, Waheeda R. (ASFC-100)[ACME CORP]')
('Title:', 'PowerPoint Presentation')
('Creation date:', datetime.datetime(2013, 1, 8, 0, 3, 32, 886000))
Properties from SummaryInformation stream:
- codepage: 1252
- title: 'PowerPoint Presentation'
- subject: None
- author: 'Tracy Williams'
- keywords: None
- comments: None
- template: None
- last_saved_by: 'Rehman, Waheeda R. (ASFC-100)[ACME CORP]'
- revision_number: '9'
- total_edit_time: 21774L
- last_printed: None
- create_time: datetime.datetime(2013, 1, 8, 0, 3, 32, 886000)
- last_saved_time: datetime.datetime(2017, 7, 5, 15, 34, 12, 771000)
- num_pages: None
- num_words: 89
- num_chars: None
- thumbnail: None
- creating_application: 'Microsoft Office PowerPoint'
- security: None
Properties from DocumentSummaryInformation stream:
- codepage_doc: 1252
- category: None
- presentation_target: 'On-screen Show (4:3)'
- bytes: 40010
- lines: None
- paragraphs: 7
- slides: 1
- notes: 0
- hidden_slides: 0
- mm_clips: 0
- scale_crop: False
- heading_pairs: None
- titles_of_parts: None
- manager: None
- company: ''
- links_dirty: False
- chars_with_spaces: None
- unused: None
- shared_doc: False
- link_base: None
- hlinks: None
- hlinks_changed: False
- version: 983040
- dig_sig: None
- content_type: None
- content_status: None
- language: None
- doc_version: None

The dates are shown as datetime.datetime objects, whose arguments are year, month, day, hour, minute, second, and microsecond.
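
The raw values can be made easier to read with a little formatting. The sketch below assumes Python 3 and reuses two values from the output above; the helper name format_edit_time is hypothetical. It treats total_edit_time as a number of seconds, which agrees with the Total Editing Time reported by the file command for the same document.

```python
import datetime


def format_edit_time(seconds):
    """Convert a duration in seconds to an HH:MM:SS string."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return "%02d:%02d:%02d" % (hours, minutes, secs)


# Values copied from the metadata dump above.
create_time = datetime.datetime(2013, 1, 8, 0, 3, 32, 886000)
print(create_time.strftime("%Y-%m-%d %H:%M:%S"))  # 2013-01-08 00:03:32
print(format_edit_time(21774))                    # 06:02:54
```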

Related articles:

  1. Easy Install Python Module on OS X
  2. Extracting embedded documents from an Excel .xlsm file
  3. Zipping and unzipping Excel xlsx files
  4. Using Python to extract data from a spreadsheet and put it in a database