I needed a way to determine a file's type within a Python script when I couldn't rely on the file's extension to indicate the file format. I'll be running the script on a MacBook Pro laptop running the OS X El Capitan operating system.
OS X/macOS, like Linux, comes with the file command, so I could run that command at a shell prompt to have the utility check the magic number in the files I'm interested in. But I want to do some additional processing of the files within the Python script, so I want to perform the format check within Python. Python's subprocess module provides the capability to "spawn new processes, connect to their input/output/error pipes, and obtain their return codes," so I can call the file utility from within Python using that module. To capture the output of a command, you create a subprocess.Popen() object with stdout and stderr set to subprocess.PIPE. Calling its .communicate() method returns the command's standard output and standard error, which you can store in variables and print, as shown below. The script expects the name of the file to be checked to be provided as an argument on the command line.

#!/usr/bin/python

import subprocess as sub, sys

try:
   sys.argv[1]
except IndexError:
   print "Error - missing input file name! Usage ./filetype.py infile"
   sys.exit(1)
else:
   fileName = sys.argv[1]

p = sub.Popen(['file',fileName],stdout=sub.PIPE,stderr=sub.PIPE)
output, errors = p.communicate()
print output

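On a system where Python 3.7 or later is available, the same call can be written more compactly with subprocess.run(). The following is a minimal sketch under that assumption, not a replacement for the script above. The function names run_command, describe_file, and main are hypothetical names chosen for illustration.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments and return
    (stdout, stderr, return code), with output decoded to text."""
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.stdout, completed.stderr, completed.returncode


def describe_file(path):
    """Return the description printed by the file utility for path.

    Assumes the file command is installed and on the PATH.
    """
    output, errors, returncode = run_command(["file", path])
    if returncode != 0:
        raise RuntimeError(errors.strip() or "file command failed")
    return output.strip()


def main(argv):
    """Command-line entry point; returns an exit status."""
    if len(argv) < 2:
        print("Error - missing input file name! Usage ./filetype.py infile",
              file=sys.stderr)
        return 1
    print(describe_file(argv[1]))
    return 0
```

Ending a script with sys.exit(main(sys.argv)) would make it behave much like filetype.py. Passing the arguments as a list, as the original script also does, avoids invoking a shell, so file names containing spaces need no quoting.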
I wanted to use the script to determine which files that have a .bin extension are actually Microsoft Visio files. When I extract embedded documents from an Excel .xlsm file, the files appear in an embeddings directory. Some of the files are Microsoft PowerPoint files and some are Visio files, but all of them have a .bin extension. E.g.:

$ file *.bin
oleObject1.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code page: 1252, Title: PowerPoint Presentation, Author: Tracy Willard, Last Saved By: Windows User, Revision Number: 11, Name of Creating Application: Microsoft Office PowerPoint, Total Editing Time: 03:42:00, Create Time/Date: Mon Jan 7 00:03:32 2013, Last Saved Time/Date: Thu Jun 15 18:53:09 2017, Number of Words: 50
oleObject2.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code page: 1252, Title: PowerPoint Presentation, Author: Tracy Willard, Last Saved By: Windows User, Revision Number: 7, Name of Creating Application: Microsoft Office PowerPoint, Total Editing Time: 03:08:48, Create Time/Date: Mon Jan 7 00:04:06 2013, Last Saved Time/Date: Thu Jun 15 17:15:32 2017, Number of Words: 21
oleObject3.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code page: 1252, Title: PowerPoint Presentation, Author: Tracy Willard, Last Saved By: Windows User, Revision Number: 14, Name of Creating Application: Microsoft Office PowerPoint, Total Editing Time: 03:28:03, Create Time/Date: Mon Jan 7 00:03:32 2013, Last Saved Time/Date: Thu Jun 15 17:15:32 2017, Number of Words: 33
oleObject4.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code page: 1252, Author: Windows User, Last Saved By: Windows User, Name of Creating Application: Microsoft Visio, Last Saved Time/Date: Thu Jun 15 17:15:31 2017
oleObject5.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code page: 1252, Author: Windows User, Last Saved By: Windows User, Name of Creating Application: Microsoft Visio, Last Saved Time/Date: Thu Jun 15 17:15:31 2017
oleObject6.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code page: 1252, Author: Windows User, Last Saved By: Windows User, Name of Creating Application: Microsoft Visio, Last Saved Time/Date: Thu Jun 15 17:12:19 2017
oleObject7.bin: CDF V2 Document, No summary info
$

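The "CDF V2 Document" label that file prints for every one of these files comes from the magic number at the start of the file. As a rough illustration of what that check involves, the sketch below compares the first eight bytes of a file with the signature used by Composite Document File (OLE2 compound document) containers. It assumes Python 3, and the names CDF_MAGIC, is_cdf_header, and is_cdf_file are hypothetical. A match only says the file is a compound document. It does not distinguish Visio from PowerPoint, which is why the summary information reported by file is still needed.

```python
# Signature found in the first eight bytes of a Composite Document File
# (OLE2 compound document).
CDF_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def is_cdf_header(header):
    """Return True if the given bytes start with the CDF signature."""
    return header.startswith(CDF_MAGIC)


def is_cdf_file(path):
    """Read the first eight bytes of the file at path and check them."""
    with open(path, "rb") as handle:
        return is_cdf_header(handle.read(len(CDF_MAGIC)))
```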
I can tell that oleObject6.bin
is a Microsoft Visio file, since
I see "Name of Creating Application: Microsoft Visio" for that file in
the output of the file
command. And I can get that same information from within the Python
script using the subprocess module.

$ file oleObject6.bin
oleObject6.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code page: 1252, Author: Windows User, Last Saved By: Windows User, Name of Creating Application: Microsoft Visio, Last Saved Time/Date: Thu Jun 15 17:12:19 2017
$ ./filetype.py oleObject6.bin
oleObject6.bin: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code page: 1252, Author: Windows User, Last Saved By: Windows User, Name of Creating Application: Microsoft Visio, Last Saved Time/Date: Thu Jun 15 17:12:19 2017
$

Since I'm only interested in the file format, i.e., I only want to know which .bin files are actually PowerPoint or Visio files, I modified the script to display just that information for those types of files.

#!/usr/bin/python

import subprocess as sub, sys

try:
   sys.argv[1]
except IndexError:
   print "Error - missing input file name! Usage ./filetype.py infile"
   sys.exit(1)
else:
   fileName = sys.argv[1]

p = sub.Popen(['file',fileName],stdout=sub.PIPE,stderr=sub.PIPE)
output, errors = p.communicate()
searchString = "Name of Creating Application: "
creatingAppIndex = output.find(searchString)
if creatingAppIndex != -1:
   startAppIndex = creatingAppIndex + len(searchString)
   endAppIndex = output.find(",",startAppIndex)
   creatingApp = output[startAppIndex:endAppIndex]
   print creatingApp
else:
   print output

The find method can be used with strings to determine whether a particular substring occurs within a string. If the substring is present, find returns the index in the string where the substring starts (the first character in the string is at position zero). If the substring isn't present, find returns -1. So, if I look for "Name of Creating Application: " and the index returned is not -1, I can add the length of "Name of Creating Application: ", which is 30 characters, to the index value to get the position where the actual name of the creating application begins. I can then look for the comma that appears immediately after the name of the creating application by calling find on the contents of the variable named output, specifying that I want to start the search at startAppIndex. The find method has the syntax str.find(sub, start, end), where start defaults to the beginning of the string and end defaults to the end of the string if not specified. I can then get the name of the creating application with creatingApp = output[startAppIndex:endAppIndex], which sets the variable creatingApp to just the portion of the output from the file command that extends from the beginning of the application name up to, but not including, the comma.
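The index arithmetic described above can be collected into a small function. This is a sketch assuming Python 3 and assuming the output of file has already been decoded to a string. The names SEARCH_STRING and extract_creating_app are hypothetical. It also checks explicitly for the case where no comma follows the application name and reads to the end of the text instead.

```python
SEARCH_STRING = "Name of Creating Application: "


def extract_creating_app(output):
    """Return the creating application named in the output of file,
    or None if the output has no such field."""
    creating_app_index = output.find(SEARCH_STRING)
    if creating_app_index == -1:
        return None
    # Skip past the label to the first character of the application name.
    start = creating_app_index + len(SEARCH_STRING)
    # The name ends at the next comma, or at the end of the text if
    # the field is the last one.
    end = output.find(",", start)
    if end == -1:
        end = len(output)
    return output[start:end].strip()
```

Given the line that file prints for oleObject6.bin, this function returns "Microsoft Visio". For oleObject7.bin, which has no summary information, it returns None, leaving the caller to decide whether to print the full description instead.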
The output I get for the files is now as shown below:

$ ./filetype.py oleObject1.bin
Microsoft Office PowerPoint
$ ./filetype.py oleObject2.bin
Microsoft Office PowerPoint
$ ./filetype.py oleObject3.bin
Microsoft Office PowerPoint
$ ./filetype.py oleObject4.bin
Microsoft Visio
$ ./filetype.py oleObject5.bin
Microsoft Visio
$ ./filetype.py oleObject6.bin
Microsoft Visio
$ ./filetype.py oleObject7.bin
oleObject7.bin: CDF V2 Document, No summary info
$