The breakdown is this: give me a relatively recent Microsoft Word Document (.doc) and I can tell you what word processor last edited it.
I studied information leakage. There are plenty of cases where information that wasn’t supposed to be revealed was. One could argue that it’s the user’s fault for not correctly sanitizing their documents, but I blame WYSIWYG editors. Back in the day of pen and paper, if someone wanted to redact information from a document she was releasing, all she had to do was take a black marker and cross it out. For extra security, she could make a photocopy of the original and only release the photocopy. WYSIWYG editors try to imitate paper in that the document being edited is in theory the one being published, but especially with redacting information, there’s a failure to communicate to the user what’s actually going on. In a WYSIWYG editor, one can’t just put a black box over information to redact it, and the same goes for putting a black background on text. The problem: the information is still there.
The stories about information being incorrectly redacted are more high profile and glamorous, but metadata leakage can also be embarrassing. Metadata can be thought of as data about data. When you create a file, the program that created it stores some identifying information–for example title, author, date of creation. It stores data about the data you just made. I talked earlier about how technology can be seen as magic that just works. Again, the problem is that if one thinks of technology this way, privacy and security are never questioned. In this project I examined Microsoft Word Documents–one of the most common file formats for editing and publishing text documents. Word stores metadata, and in a world increasingly worried about metadata, Microsoft offers advice on how to sanitize documents of it. While clicking around Microsoft’s help pages, I came across the following snippet:
“Some metadata is readily accessible through the user interface of each Office program. Other metadata is only accessible through extraordinary means, such as opening a document in a low-level, binary file editor.”
Extraordinary means? Thus I set forth trying to determine whether this Computer Science undergraduate could find the metadata Microsoft referred to using “extraordinary means” (also known as the Unix tools strings and octal dump).
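To give a feel for how un-extraordinary these means are, here is a minimal Python sketch of what a strings-style tool does: it scans raw bytes for runs of printable characters. The function name `extract_strings`, the `min_len` default, and the choice to look for both single-byte ASCII and UTF-16LE runs are assumptions made for illustration; this is not the exact procedure used in the project.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return printable runs of at least min_len characters found in data.

    ASCII runs are listed first, then UTF-16LE runs (printable ASCII
    characters each followed by a zero byte).
    """
    # Runs of printable single-byte ASCII characters.
    ascii_pat = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # Runs of printable characters stored as two bytes each (UTF-16LE).
    utf16_pat = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)

    found = [m.group().decode("ascii") for m in ascii_pat.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in utf16_pat.finditer(data)]
    return found
```

Reading a file with `open(path, "rb").read()` and passing the bytes to `extract_strings` is enough to list the readable text hiding inside it, whether or not the word processor's interface shows it.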
What I found was quite fun. Microsoft Word Documents (of the .doc variety–.docx is an entirely different beast) differ enough on the binary/octal level that I can identify Word files created by Microsoft Office 2003, 2004, 2007, 2008, OpenOffice, and Google Docs. A quick tip on identifying Office version: Microsoft always releases the Windows version the year before the Mac version. Thus Office 2003 and 2007 are the Windows versions and Office 2004 and 2008 are the Mac versions. There are major differences in structure between Windows and Mac Office-produced Word documents and definitely differences between each version. Microsoft Office is a minor nightmare from a backwards compatibility standpoint, so I don’t blame Microsoft for having convoluted file formats (fun fact: Word documents alternate between UTF-8 and UTF-16 encoding). It turns out that when one version of Office (say 2004) opens and saves a Word file created by another version of Office (say 2003), the file structure will be converted from 2003 to 2004. It is possible to create an operating-system-neutral word processor though: I couldn’t tell the difference between OpenOffice Word files created on Windows computers and those created on Macs. It goes without saying that OpenOffice and Google Docs produced Word files that look very different on a binary level from the Microsoft ones.
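As a rough illustration of this kind of fingerprinting, the Python sketch below checks that a file starts with the eight-byte signature of the OLE2 compound file container used by binary .doc files, then looks for caller-supplied marker strings in both single-byte and UTF-16LE form. The helper names and the idea of matching on marker strings are assumptions for the sake of the example; the differences I actually relied on were structural, and the marker names in the usage note are made up.

```python
# Signature at the start of every OLE2 compound file, the container
# format used by binary .doc files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def is_ole2(data: bytes) -> bool:
    """Return True if data begins with the OLE2 compound file signature."""
    return data.startswith(OLE2_MAGIC)


def find_markers(data: bytes, markers: dict[str, str]) -> set[str]:
    """Return the labels whose marker text occurs anywhere in data.

    markers maps a marker string to a label. Each marker is searched
    for as plain ASCII bytes and as UTF-16LE bytes. The markers are
    supplied by the caller; none are built in.
    """
    hits = set()
    for text, label in markers.items():
        if text.encode("ascii") in data or text.encode("utf-16-le") in data:
            hits.add(label)
    return hits
```

With a hypothetical table such as `{"ExampleWriter": "writer-a", "OtherSuite": "suite-b"}`, `find_markers` returns the set of labels whose marker appears in the file. A real table would have to be built by comparing sample documents saved by each word processor.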
I recognize that looking at Word documents this closely is beyond most Word users’ abilities or desires, but I’m also surprised how easy it was to find differences in the file formats. Microsoft Word stores unintended metadata about what word processor you used to last edit a document. This is troubling since Microsoft has tools that are supposed to strip metadata from documents, but this just goes to show that metadata is embedded deep into documents. I’m guessing that one of the reasons Word moved to a .docx format was because .doc was becoming too cumbersome to deal with. It’s very possible that .docx is neutral with respect to operating system and Office version. I definitely don’t think that Microsoft was sloppy in creating the .doc format; I just believe that in most moderately complicated file formats constructed in an environment where privacy isn’t paramount, there will be traces of hidden metadata.
This was one of the two projects I did at Princeton. The other, on RFID security, can be found here.



