Determining a file's type from within a Python script

I needed a way to determine a file's type within a Python script when I can't rely on the file's extension to identify the format. I'll be running the script on a MacBook Pro running OS X El Capitan. OS X/macOS, like Linux, comes with the file command, so I could run that command at a shell prompt to have the utility check the magic number in the files I'm interested in. But I want to do some additional processing of the files within the Python script, so I want to perform the format check within Python as well.

Python's subprocess module provides the capability to "spawn new processes, connect to their input/output/error pipes, and obtain their return codes." So I can call the file utility from within Python using that module. To get the results from running an external command, you can use subprocess.Popen(), passing stdout=subprocess.PIPE so that the command's output is captured. The .communicate() method then returns the command's standard output and standard error, which you can store in variables and print, as shown below. The script expects the name of the file to be checked to be provided as an argument on the command line.

#!/usr/bin/python

import subprocess as sub, sys

try:
   sys.argv[1]
except IndexError:
   print "Error - missing input file name! Usage ./filetype.py infile"
   sys.exit(1)
else:
   fileName = sys.argv[1]

p = sub.Popen(['file',fileName],stdout=sub.PIPE,stderr=sub.PIPE)
output, errors = p.communicate()
print output
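
The script above uses Python 2 syntax. As a rough sketch of the same idea for Python 3 (an assumption on my part, since the original was written for the Python 2 that ships with El Capitan), the call could be wrapped in a small function. The function name describe and its tool parameter are hypothetical names chosen for illustration; subprocess.run() with universal_newlines=True returns the output as text rather than bytes.

```python
import subprocess


def describe(path, tool="file"):
    """Run `tool` on `path` and return what it wrote to standard output.

    `describe` and `tool` are illustrative names, not part of the original
    script; `tool` defaults to the `file` utility used in the post.
    """
    result = subprocess.run(
        [tool, path],
        stdout=subprocess.PIPE,   # capture standard output
        stderr=subprocess.PIPE,   # capture standard error as well
        universal_newlines=True,  # decode the captured bytes to str
    )
    # Strip the trailing newline so print() doesn't emit a blank line.
    return result.stdout.strip()
```

With this in place, print(describe("oleObject6.bin")) should show the same description that running file oleObject6.bin shows at the shell prompt, without the extra blank line the Python 2 version prints.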

I wanted to use the script to determine which files with a .bin extension are actually Microsoft Visio files. When I extract embedded documents from an Excel .xlsm file, the files appear in an embeddings directory. Some of the files are Microsoft PowerPoint files and some are Visio files, but all of them have a .bin extension. E.g.:

$ file *.bin
oleObject1.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code p
age: 1252, Title: PowerPoint Presentation, Author: Tracy Willard, Last Saved By:
 Windows User, Revision Number: 11, Name of Creating Application: Microsoft Offi
ce PowerPoint, Total Editing Time: 03:42:00, Create Time/Date: Mon Jan  7 00:03:
32 2013, Last Saved Time/Date: Thu Jun 15 18:53:09 2017, Number of Words: 50
oleObject2.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code p
age: 1252, Title: PowerPoint Presentation, Author: Tracy Willard, Last Saved By:
 Windows User, Revision Number: 7, Name of Creating Application: Microsoft Offic
e PowerPoint, Total Editing Time: 03:08:48, Create Time/Date: Mon Jan  7 00:04:0
6 2013, Last Saved Time/Date: Thu Jun 15 17:15:32 2017, Number of Words: 21
oleObject3.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code p
age: 1252, Title: PowerPoint Presentation, Author: Tracy Willard, Last Saved By:
 Windows User, Revision Number: 14, Name of Creating Application: Microsoft Offi
ce PowerPoint, Total Editing Time: 03:28:03, Create Time/Date: Mon Jan  7 00:03:
32 2013, Last Saved Time/Date: Thu Jun 15 17:15:32 2017, Number of Words: 33
oleObject4.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code p
age: 1252, Author: Windows User, Last Saved By: Windows User, Name of Creating A
pplication: Microsoft Visio, Last Saved Time/Date: Thu Jun 15 17:15:31 2017
oleObject5.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code p
age: 1252, Author: Windows User, Last Saved By: Windows User, Name of Creating A
pplication: Microsoft Visio, Last Saved Time/Date: Thu Jun 15 17:15:31 2017
oleObject6.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code p
age: 1252, Author: Windows User, Last Saved By: Windows User, Name of Creating A
pplication: Microsoft Visio, Last Saved Time/Date: Thu Jun 15 17:12:19 2017
oleObject7.bin: CDF V2 Document, No summary info
$

I can tell that oleObject6.bin is a Microsoft Visio file, since I see "Name of Creating Application: Microsoft Visio" for that file in the output of the file command. And I can get that same information from within the Python script using the subprocess module.

$ file oleObject6.bin
oleObject6.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code p
age: 1252, Author: Windows User, Last Saved By: Windows User, Name of Creating A
pplication: Microsoft Visio, Last Saved Time/Date: Thu Jun 15 17:12:19 2017
$ ./filetype.py oleObject6.bin
oleObject6.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code p
age: 1252, Author: Windows User, Last Saved By: Windows User, Name of Creating A
pplication: Microsoft Visio, Last Saved Time/Date: Thu Jun 15 17:12:19 2017

$

Since I'm only interested in the file format, i.e., I only want to know which .bin files are actually PowerPoint or Visio files, I modified the script to display just that information for those types of files.

#!/usr/bin/python

import subprocess as sub, sys

try:
   sys.argv[1]
except IndexError:
   print "Error - missing input file name! Usage ./filetype.py infile"
   sys.exit(1)
else:
   fileName = sys.argv[1]

p = sub.Popen(['file',fileName],stdout=sub.PIPE,stderr=sub.PIPE)
output, errors = p.communicate()

searchString = "Name of Creating Application: "
creatingAppIndex = output.find(searchString)
if creatingAppIndex != -1:
   startAppIndex = creatingAppIndex + len(searchString)
   endAppIndex = output.find(",",startAppIndex)
   creatingApp = output[startAppIndex:endAppIndex]
   print creatingApp
else: 
   print output

The find method can be used with strings to determine whether a particular substring occurs within a string. If the substring is present, find returns the index in the string where the substring starts (the first character in the string is at position zero). If the substring isn't present, find returns -1. So, if I look for "Name of Creating Application: " and the index returned is not -1, I can add the length of "Name of Creating Application: ", which is 30 characters, to that index to get the position where the name of the creating application begins.

I can then look for the comma that appears immediately after the name of the creating application by calling find again on the contents of the variable named output, this time specifying that the search should start at startAppIndex. The find method has the syntax str.find(sub, start, end), where start defaults to the beginning of the string and end defaults to the end of the string if they are not specified. Finally, creatingApp = output[startAppIndex:endAppIndex] sets the variable creatingApp to just the portion of the output from the file command that extends from the beginning of the application name up to, but not including, the comma.
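
The index arithmetic described above can be tried on its own, without running the file command, by feeding the function a sample line. This is a minimal Python 3 sketch; the function name creating_app and the constant SEARCH_STRING are hypothetical names, and the handling of a missing comma is an addition that the original script does not have.

```python
SEARCH_STRING = "Name of Creating Application: "


def creating_app(file_output):
    """Return the creating application named in `file` output, or None."""
    start = file_output.find(SEARCH_STRING)
    if start == -1:
        return None                     # the field is not present
    start += len(SEARCH_STRING)         # skip past the 30-character label
    end = file_output.find(",", start)  # comma that follows the name
    if end == -1:
        end = len(file_output)          # no comma: take the rest of the text
    return file_output[start:end].strip()


sample = ("oleObject6.bin: CDF V2 Document, Little Endian, Os: Windows, "
          "Version 6.0, Code page: 1252, Author: Windows User, "
          "Last Saved By: Windows User, "
          "Name of Creating Application: Microsoft Visio, "
          "Last Saved Time/Date: Thu Jun 15 17:12:19 2017")

print(creating_app(sample))  # prints: Microsoft Visio
print(creating_app("oleObject7.bin: CDF V2 Document, No summary info"))  # prints: None
```

Returning None when the field is missing leaves it to the caller to decide whether to fall back to printing the full output, as the script above does.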

The output I get for the files is now as shown below:

$ ./filetype.py oleObject1.bin
Microsoft Office PowerPoint
$ ./filetype.py oleObject2.bin
Microsoft Office PowerPoint
$ ./filetype.py oleObject3.bin
Microsoft Office PowerPoint
$ ./filetype.py oleObject4.bin
Microsoft Visio
$ ./filetype.py oleObject5.bin
Microsoft Visio
$ ./filetype.py oleObject6.bin
Microsoft Visio
$ ./filetype.py oleObject7.bin
oleObject7.bin: CDF V2 Document, No summary info

$
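
Since the goal was to find which .bin files are Visio files, the per-file check could also be run over a whole directory in one pass. The following is only a sketch under a few assumptions: it targets Python 3, the names run_file and find_by_app are hypothetical, and the describe parameter exists solely so another function can be substituted for the call to the file utility.

```python
import glob
import os
import subprocess


def run_file(path):
    """Return the description the `file` utility prints for `path`."""
    result = subprocess.run(
        ["file", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # return str instead of bytes
    )
    return result.stdout


def find_by_app(directory, app_name, describe=run_file):
    """List the .bin files in `directory` whose description names `app_name`.

    This is a plain substring match, so "Microsoft Office" would also match
    files created by "Microsoft Office PowerPoint".
    """
    marker = "Name of Creating Application: " + app_name
    matches = []
    for path in sorted(glob.glob(os.path.join(directory, "*.bin"))):
        if marker in describe(path):
            matches.append(path)
    return matches
```

For the files shown earlier, find_by_app("embeddings", "Microsoft Visio") would be expected to return the paths for oleObject4.bin, oleObject5.bin, and oleObject6.bin, assuming they sit in a directory named embeddings.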