Your Files are a Lot Less Flat Than You Think
By Elizabeth Thede, Special for The Daily Blaze
The ordinary files and emails that populate your data are a lot less flat and a lot more multidimensional than you may realize. When you think of a single file or email, you probably think of how that file or email appears inside its native application. So you may think of PDFs as they appear in a PDF viewer like Adobe Reader; Microsoft Office files as they appear in Word, Access, PowerPoint, Excel and OneNote; and emails as they appear in Outlook.
But these files and emails have a whole alternate reality apart from how they appear in their native applications: when they are just sitting on your hard drive or network, or available in another context, such as embedded as so-called BLOB data in a database. That alternate reality is their binary format existence, and this is what a search engine like dtSearch® sees. The same text that is easily readable in its native application can look like a blur of seemingly random codes in binary format. In fact, in binary format, it can be difficult or even impossible to pick out any regular sentences or other standard text at all.
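As a rough illustration of why readable text can vanish at the byte level, the sketch below compresses a short passage with Python's standard zlib module. This is only an assumed stand-in for what a real file format does: the sample sentence is made up, and actual formats layer their own structure on top of any compression they use.

```python
import zlib

# A made-up sample passage, repeated so that compression has something to work with.
text = ("The quarterly budget review is scheduled for Friday. " * 3).encode("utf-8")

# Store the passage the way many formats store their content: compressed.
stored = zlib.compress(text)

# The phrase is plainly visible in the original bytes...
print(b"budget" in text)
# ...but a naive byte scan of the stored form does not find it.
print(b"budget" in stored)
# Nothing is lost: decoding with the right method recovers the text exactly.
print(zlib.decompress(stored) == text)
```

The text is still there in the stored bytes, but only software that knows how to decode them can read it.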
These binary files are anything but flat. To start with, you can have a binary file within another binary file, like a Microsoft Excel spreadsheet embedded in a Microsoft Word document. Or you can have multilevel nested attachments, like an email message with a ZIP or a RAR attachment containing files which themselves may be multidimensional. A search engine needs to parse all of the different binary format containers to search the full text of everything.
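The nesting described above can be sketched in a few lines of Python using only the standard zipfile module. This is a minimal illustration under stated assumptions, not how dtSearch works internally: it handles only ZIP containers, the function names walk_container and make_zip are hypothetical, and the sample file names are invented.

```python
import io
import zipfile


def walk_container(data: bytes, path: str = "") -> list[tuple[str, bytes]]:
    """Return (path, content) pairs for every leaf file, descending into nested ZIPs."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        # Not a ZIP container, so treat it as a leaf file.
        return [(path, data)]
    leaves = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            child_path = f"{path}/{name}" if path else name
            # Recurse: the member may itself be another container.
            leaves.extend(walk_container(archive.read(name), child_path))
    return leaves


def make_zip(members: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP archive from a name-to-content mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# A ZIP inside a ZIP, standing in for a nested email attachment.
inner = make_zip({"notes.txt": b"quarterly budget"})
outer = make_zip({"attachment.zip": inner, "readme.txt": b"hello"})

for leaf_path, content in walk_container(outer):
    print(leaf_path, content)
```

The recursive call is the key point: the walker keeps unpacking for as many levels as the data contains. A real parser would apply the same idea across many more container types, such as RAR archives, email attachments and embedded Office objects.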
To parse these containers, a search engine needs to go through every level and figure out the correct specification to apply. The specifications for each type of Microsoft Office document are themselves very different. These are in turn very different from the specifications for PDFs, which are in turn very different from the specifications for email files. And if you add another layer like ZIP or RAR compression, that too requires separate parsing.
Using the file extensions to parse the data would depend on everyone giving each layer the correct extension. And conversely, it would allow someone to hide a file from a search engine by just giving it the wrong extension, like naming a PowerPoint with a .PDF extension. So a search engine like dtSearch has to figure out the multidimensional layers by reviewing the binary code itself apart from any filename extensions.
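One common way to identify a file from its contents is to check the first few bytes against known format signatures. The Python sketch below shows that general technique with a handful of well-known signatures. It is an assumption-laden illustration rather than dtSearch's actual method, and the names SIGNATURES and sniff_format are hypothetical.

```python
# Well-known leading byte signatures for a few formats mentioned in this article.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP container (also used by .docx, .xlsx, .pptx)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy .doc, .xls, .ppt)"),
    (b"Rar!\x1a\x07", "RAR archive"),
]


def sniff_format(data: bytes) -> str:
    """Guess a format from the leading bytes, ignoring any filename extension."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


# A file named "slides.pdf" whose bytes actually begin like a ZIP-based Office file.
misnamed = b"PK\x03\x04" + b"\x00" * 16
print(sniff_format(misnamed))

# A genuine PDF header.
print(sniff_format(b"%PDF-1.7\n"))
```

Because the check never looks at the filename, renaming a PowerPoint file with a .PDF extension changes nothing. A production parser would go further, since newer Office formats share the ZIP signature and have to be told apart by what is inside the container.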
Turning to metadata, while the default native application display of a file will typically make the main text perfectly clear, the native application view can sometimes obscure some of the metadata associated with that file. But the binary format makes this metadata readily apparent to a search engine, even if it does not by default appear in a native application file display.
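To make this concrete, newer Microsoft Office files are ZIP packages that keep properties such as author and title in a part named docProps/core.xml. The Python sketch below builds a tiny stand-in package in memory and reads those properties back. The sample package is not a complete Office document, the author and title values are invented, and the helper names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of an Office Open XML package.
DC_NS = "http://purl.org/dc/elements/1.1/"
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"

# Invented sample metadata, shaped like a real docProps/core.xml part.
CORE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    f'<cp:coreProperties xmlns:cp="{CP_NS}" xmlns:dc="{DC_NS}">'
    "<dc:creator>Jane Example</dc:creator>"
    "<dc:title>Draft contract</dc:title>"
    "</cp:coreProperties>"
)


def build_sample_package() -> bytes:
    """Build a minimal in-memory package holding only the metadata part (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", CORE_XML)
    return buffer.getvalue()


def read_core_metadata(package_bytes: bytes) -> dict[str, str]:
    """Read the creator and title from a package's core properties part."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        "creator": root.findtext(f"{{{DC_NS}}}creator", default=""),
        "title": root.findtext(f"{{{DC_NS}}}title", default=""),
    }


print(read_core_metadata(build_sample_package()))
```

None of these values need to appear on the page when the document is open, yet they sit in the file as ordinary text for any program that knows where to look.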
One other point about binary files. Sometimes people try to hide information in a file by printing white text on a white background or black text on a black background. While white on white or black on black text may appear invisible when you look at a file in its native application, in binary format, this type of text is as apparent as normal black on white writing to a search engine.
dtSearch offers enterprise and developer applications to instantly search terabytes of “Office” files, PDFs, popular compression formats, emails, databases and Internet or Intranet data. dtSearch software can run “on premises” or in a cloud environment like Azure or AWS. Because dtSearch can instantly search terabytes, many dtSearch customers are large enterprises like Fortune 100 companies and federal, state and international government agencies. But anyone can download a fully-functional 30-day evaluation version at dtSearch.com to instantly search data.
You can see why running dtSearch is an alternative to getting organized. From dtSearch.com (https://www.dtsearch.com), you can immediately download and try a fully-functional 30-day evaluation version to instantly search terabytes of your own data.