By the Price of Business Show, Hosted by Kevin Price.  The Price of Business is a media partner of this site. 

When you think of your files, you are probably thinking of what they look like inside their associated applications, such as Adobe Reader or Microsoft Word, Excel, Access, PowerPoint, OneNote, Outlook or Exchange. A search engine like dtSearch, however, has a very different perspective on your files and emails.

To search all of your files at once, a search engine cannot individually retrieve each file in its associated application. That would take “forever.” Instead, a search engine has to look at documents, emails and the like in their binary formats. But while a word processing document is easily readable in Microsoft Word, in binary format you’d be hard-pressed to pick out any text at all. In fact, to the naked eye, modern binary formats read largely like gibberish.
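To make the "gibberish" point concrete, here is a minimal Python sketch. It assumes only that a modern .docx file is a ZIP container holding compressed XML parts; the helper names and the sample sentence are made up for illustration, the container it builds is not a valid Word document, and none of this is how dtSearch itself is implemented.

```python
import io
import zipfile


def build_docx_like_container(text: str) -> bytes:
    """Build a tiny ZIP that mimics where a .docx keeps its body text.

    Hypothetical helper for illustration only: the result is NOT a valid
    Word document, just a compressed container with one XML part.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        xml = f"<w:document><w:body><w:p><w:t>{text}</w:t></w:p></w:body></w:document>"
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


def extract_body_xml(raw: bytes) -> str:
    """Parse the container the way a format-aware reader would."""
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return archive.read("word/document.xml").decode("utf-8")


# Made-up sample text, repeated so that compression clearly kicks in.
SAMPLE_TEXT = "The quarterly budget is confidential. " * 20
raw_bytes = build_docx_like_container(SAMPLE_TEXT)

# Scanning the raw bytes for the word fails, because the text is compressed.
text_visible_in_raw = b"confidential" in raw_bytes

# After applying the container format, the same word is easy to find.
text_visible_after_parsing = "confidential" in extract_body_xml(raw_bytes)

print("found in raw bytes:    ", text_visible_in_raw)
print("found after unpacking: ", text_visible_after_parsing)
```

The only readable string left in the raw bytes is the internal part name, which ZIP stores uncompressed. Everything else has to be decoded according to the format before any text can be searched.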

To parse those binary formats, a search engine like dtSearch uses a software component called document filters, which apply the correct specification to each binary format type. Different file types have very different specifications. The Microsoft Excel specification is very different from the Microsoft Outlook specification, which is in turn very different from the PDF specification.

You might think that the filename extension tells a search engine’s document filters which file format specification to apply. So, a .DOCX filename extension would indicate a Microsoft Word document and a .PDF extension would indicate a PDF file. But what if someone applies the wrong extension to a file, giving a Microsoft Word document a .PDF extension?

The dtSearch document filters figure out which specification to apply by looking at the binary file header. That way, regardless of the filename extension, dtSearch can determine the document or email type correctly. Also, keep in mind that file types evolve. For example, there is a brand new PDF 2.0 specification, which is similar to the PDF 1.x specification that has applied for decades, but not exactly the same.
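As a rough illustration of header-based detection, the Python sketch below checks the first bytes of a file against three widely documented signatures: `%PDF-` for PDF, `PK\x03\x04` for ZIP-based containers such as .docx and .xlsx, and the eight-byte OLE2 compound file signature used by legacy Office formats. The function names and labels are hypothetical. This is a simplified sketch of the general technique, not dtSearch's actual detection logic, which presumably inspects far more than a few leading bytes.

```python
import os
import re
import tempfile
from typing import Optional

# Well-known leading byte signatures ("magic numbers").
# The labels on the right are made up for this example.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip-container"),  # .docx, .xlsx, .pptx and plain .zip
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2-compound"),  # legacy .doc, .xls, .msg
]


def sniff_format(header: bytes) -> str:
    """Guess the format family from the leading bytes, ignoring any filename."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def pdf_version(header: bytes) -> Optional[str]:
    """Return the version declared in a PDF header such as b'%PDF-2.0', if any."""
    match = re.match(rb"%PDF-(\d+\.\d+)", header)
    return match.group(1).decode("ascii") if match else None


def sniff_file(path: str) -> str:
    """Read only the first few bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(16))


# Demo: a ZIP-based file that someone saved with a misleading .pdf extension.
with tempfile.TemporaryDirectory() as folder:
    mislabeled = os.path.join(folder, "report.pdf")
    with open(mislabeled, "wb") as handle:
        handle.write(b"PK\x03\x04" + b"\x00" * 26)
    detected = sniff_file(mislabeled)

print("extension says pdf, header says:", detected)
print("declared PDF version:", pdf_version(b"%PDF-2.0\n"))
```

A ZIP signature alone cannot distinguish a Word document from an Excel workbook or an ordinary archive, so a real filter would need to look inside the container as well. The version check hints at why evolving specifications matter: a `%PDF-1.7` file and a `%PDF-2.0` file share a signature but not every rule.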

Because a search engine goes to the binary format of your data instead of the associated application view, data that may be obscure in the associated application view would nonetheless be readily apparent to a search engine. That includes text that may be beyond the ordinary page view of a PDF file. It includes an email that has a ZIP attachment containing a Microsoft Word document, with an Excel file embedded inside that. It also includes obscure metadata, and even “black on black” or “white on white” text.
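The nested-container case can be sketched in a few lines of Python. The example below assumes that every layer is ZIP-based (a ZIP attachment, a .docx inside it, and an .xlsx embedded in that), builds such a stack in memory with made-up file names, and walks it recursively. It is a simplified illustration rather than dtSearch's method: a real email would first have to be parsed to get at its attachments, and legacy formats would need their own parsers.

```python
import io
import zipfile


def make_zip(members: dict) -> bytes:
    """Hypothetical helper: pack {name: bytes} into an in-memory ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def list_nested_members(data: bytes, prefix: str = "") -> list:
    """List every member of a ZIP, descending into members that are ZIPs themselves."""
    paths = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            member = archive.read(name)
            path = prefix + name
            paths.append(path)
            # Decide by content, not by extension, whether to descend further.
            if zipfile.is_zipfile(io.BytesIO(member)):
                paths.extend(list_nested_members(member, prefix=path + "!/"))
    return paths


# Build the stack from the inside out: spreadsheet, then document, then attachment.
embedded_sheet = make_zip({"xl/workbook.xml": b"<workbook/>"})
word_doc = make_zip({
    "word/document.xml": b"<w:document/>",
    "word/embeddings/sheet1.xlsx": embedded_sheet,
})
attachment = make_zip({"report.docx": word_doc})

found = list_nested_members(attachment)
for path in found:
    print(path)
```

Someone opening the attachment in a file manager would see a single file, `report.docx`. The recursive walk also surfaces the workbook part buried two levels down, which is the kind of content a binary-level view can reach.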

For all these reasons, it is important to be aware that the binary file view that a search engine sees may be considerably broader than the associated application view that you see. Large enterprises like government agencies and 4 out of 5 of the Fortune 500’s largest Aerospace and Defense companies use dtSearch enterprise and developer products to instantly search terabytes of binary “Office” files, emails, databases and web data. However, even if you just want to search binary files and emails on your own PC, you can download a fully-functional 30-day evaluation version of dtSearch Desktop at dtSearch.com.
