Extracting information from a .msg file with Python

I received a .msg file as an attachment to an email message in Microsoft Outlook for Mac, which is part of Microsoft Office 2016, on my MacBook Pro laptop. When I double-clicked on the attachment in Outlook to view the contents of the file, I saw "There is no application specified to open the document Re_ Netbond.msg."

Outlook also displayed a window giving me an option to "Search App Store" with the message "Search the App Store for an application that can open this document, or choose an existing application on your computer."

I saved the attachment to the system's disk drive and checked on the file type with the file command, which reported it was a Composite Document File V2 (CDF V2) document.

$ file Re_Netbond.msg 
Re_Netbond.msg: CDF V2 Document, No summary info
$
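
The same check can be made from Python without the file command. Below is a minimal sketch, assuming the standard 8-byte signature that begins OLE2 (CDF V2) files; the function name is_ole_file is my own choice for illustration and is not part of any script mentioned here.

```python
import io

# Signature found in the first 8 bytes of an OLE2 / CDF V2 file.
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def is_ole_file(path):
    """Return True if the file at path starts with the OLE2 signature."""
    with io.open(path, "rb") as f:
        return f.read(len(OLE_MAGIC)) == OLE_MAGIC
```

Calling is_ole_file("Re_Netbond.msg") should return True for a file that the file command reports as a CDF V2 document. It only inspects the header, so it cannot tell a .msg file apart from other OLE2 files, such as older .doc or .xls files.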

When I tried displaying the contents of the file with the more command, I was warned the file might be a binary file, and when I chose to display the contents anyway, the output was unreadable to me.

Since I didn't have any application on the laptop, which runs OS X El Capitan, to view msg files, I decided to perform a Google search for a Python script that would be able to decode the message. I found ExtractMsg.py, currently at version 0.3, on GitHub. The Readme file for the script notes:

Extracts emails and attachments saved in Microsoft Outlook's .msg files

The python script ExtractMsg.py automates the extraction of key email data (from, to, cc, date, subject, body) and the email's attachments.

To use it

python ExtractMsg.py example.msg

This will produce a new folder named according to the date, time and subject of the message (for example "2013-07-24_0915 Example"). The email itself can be found inside the new folder along with the attachments. As of version 0.2, it is capable of extracting both ASCII and Unicode data.

The script uses Philippe Lagadec's Python module that reads Microsoft OLE2 files (also called Structured Storage, Compound File Binary Format or Compound Document File Format). This is the underlying format of Outlook's .msg files. This library currently supports up to Python 2.7 and 3.4.

The script was built using Peter Fiskerstrand's documentation of the .msg format. Redemption's discussion of the different property types used within Extended MAPI was also useful. For future reference, I note that Microsoft have opened up their documentation of the file format.

So I tried that script. When I tried to decode the .msg file, though, I saw the error message below:

$ ./ExtractMsg.py.sav Re_Netbond.msg
Error with file 'Re_Netbond.msg': Traceback (most recent call last):
  File "./ExtractMsg.py.sav", line 539, in <module>
    msg.save(toJson, useFileName)
  File "./ExtractMsg.py.sav", line 434, in save
    f.write(self.body)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 172: ordinal not in range(128)

$

The script did create a directory with files in it, though. The directory name consisted of the message's date and time, which was today's date, followed by the message subject. Within that directory was a message.text file and another directory.

$ ls -ldg 2018-02-02_1317\ Re\ Netbond/
drwxr-xr-x  4 1286109195  136 Feb  2 15:50 2018-02-02_1317 Re Netbond/
$ ls -l 2018-02-02_1317\ Re\ Netbond/
total 8
-rw-r--r--   1 jasmith1  1286109195   217 Feb  2 15:50 message.text
drwxr-xr-x  58 jasmith1  1286109195  1972 Feb  2 15:50 raw
$

The message.text file contained the "from", "to", "subject" and "date" lines for the email message contained in the .msg file, but the body text was missing. The error message that was produced when I ran the script referenced line 434 in the script:

  File "./ExtractMsg.py.sav", line 434, in save
    f.write(self.body)

I was able to eliminate the error by substituting the following three lines of code for the line containing f.write(self.body), based on the suggestion by agf at "UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)":

bodytext = self.body
bodytext = bodytext.encode('utf-8')
f.write(bodytext)
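
The change works because, under Python 2, writing a unicode object to a file opened with the built-in open() triggers an implicit ASCII encode, which fails on a character such as U+2019, the right single quotation mark named in the traceback. An alternative, which is not what the script does, is to open the output file with io.open and an explicit encoding. The sketch below assumes the body is already a Unicode string; the helper name write_text and the sample text are hypothetical.

```python
import io
import os
import tempfile


def write_text(path, text, encoding="utf-8"):
    """Write a Unicode string to path using an explicit encoding."""
    # io.open behaves the same way under Python 2.7 and Python 3.
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)


# Sample text containing U+2019, the character that caused the error.
body = u"It\u2019s working"
path = os.path.join(tempfile.mkdtemp(), "message.text")
write_text(path, body)
```

With this approach the encoding is stated once, where the file is opened, rather than at each write call.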

When I deleted the directory previously created by the script and all its contents and reran the script, I didn't see any error message.

$ ./ExtractMsg.py Re_Netbond.msg
$

The output directory then contained only the message.text file and that file contained the "from", "to", "subject", and "date" for the email message contained in the .msg file and, this time, the body of the message as well.

After that worked, I looked at the pull requests for the project on GitHub and found that Tyler Williamson had submitted a pull request 3 days ago, "Change str() to .encode('utf-8') to avoid UnicodeEncodeError", regarding the error. He had changed the f.write(self.body) line to f.write(self.body.encode('utf-8')). When I used his version of the script, it worked fine.

As with my version, I saw "^M" at the end of lines in the body of the message in the message.text file when I edited it with the vi text editor. Looking at the contents of the file with the od command, I could see that there was a carriage return and a newline, aka line feed, at the end of each line, i.e., \r\n, which is the way Microsoft Windows systems mark line endings, whereas OS X uses just a line feed character, i.e., hexadecimal 0A (it was the classic Mac OS, version 9 and earlier, that used a lone carriage return, hexadecimal 0D) - see OS X Line Endings. I was able to get rid of the ^M characters using the vi editor command below. Note: you have to type Ctrl-v Ctrl-m to enter the ^M rather than hitting the ^ and M keys.

:1,$ s/^M//g
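
The same cleanup can be scripted in Python. Below is a minimal sketch, assuming the file is small enough to read into memory; the function names crlf_to_lf and convert_file are my own. Unlike the vi command, which removes every carriage return, this version only replaces carriage return and line feed pairs.

```python
import io


def crlf_to_lf(data):
    """Convert Windows CRLF line endings in a bytes object to LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path):
    """Rewrite the file at path in place with LF line endings."""
    with io.open(path, "rb") as f:
        data = f.read()
    with io.open(path, "wb") as f:
        f.write(crlf_to_lf(data))
```

Working on bytes rather than text avoids having to know the file's encoding, since the bytes 0D 0A represent the line ending in both ASCII and UTF-8.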

Until Matthew Walker incorporates the change into his version of the script, you can avoid the Unicode error message by using Tyler Williamson's version, available at msg-extractor. I've also placed copies of the zip file from his repository and just the script at the links below:

msg-extractor-master.zip
ExtractMsg.py

Note: to use either version, you will need to have the olefile Python package installed. That package can be installed on a Mac OS X system with sudo easy_install olefile (see Using olefile to obtain metadata from an OLE CDF V2 file) or with pip, if you have that package manager installed. I tested the script on an OS X system with Python 2.7.10 installed.