Why do search engines struggle inside a company?

Whenever we need an everyday answer in our private lives, a search engine is our primary source of information. We simply type in a few keywords and the answer is just a few clicks away. Why can't we do the same in the office?

There are several ways in which the public web differs from an enterprise environment. The two most obvious are security and proprietary content formats: very few business applications have a web interface that allows them to be crawled (visited and indexed by a search engine). However, if these were the only two issues, enterprise search would merely be expensive and technologically challenging to implement, and I don't think that's where the real problem lies. (In fact, the software mega-vendors do offer such solutions, and the obstacles of security and integration can be overcome.)

When you search the Internet, you're satisfied with the first document that addresses the topic in question. On the other hand, when you're looking for a specific price list or the up-to-date version of project X's charter, you're not happy with the first similar price list or project charter you find - you want the EXACT one. This makes the search task a completely different story: probabilistic models and heuristics cannot deliver the answer, because they can never deliver 100% accurate results.

Let's say that we lower our expectations about search results and don't insist on an exact match. We'd probably still expect the result list to be short - for me that would be 20-30 relevant documents to choose from. Unfortunately, a pure fulltext search isn't enough to reach this level of accuracy. Commonly used business words (e.g. project, revenue, cost, report …) simply appear too often to narrow down a result set sufficiently. On top of that, the number of documents and the user traffic on them are too small for statistics to help us out.

The answer to building an efficient enterprise search engine lies in combining fulltext search with the ability to query the content's structural information, such as format, origin, and relationships to other content. For example: I need some quarterly sales numbers and I remember that at the last quarterly meeting they were presented in an Excel file by a guy from controlling. I would query "quarterly sales" and then refine the search to include only reports in Excel format owned by the controlling department. If I were looking for more detailed sales numbers and wanted to drill down to the level of the individual deals, I might focus instead on a data table in a data warehouse or ERP.
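
To make the "quarterly sales" example concrete, here is a minimal sketch in Python. It is only an illustration, not a description of any particular product: the Document fields, the search function and the sample data are all made up, and the text matching is a naive substring check rather than a real fulltext index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical indexed item: free text plus structural metadata."""
    title: str
    text: str
    file_format: str  # e.g. "xlsx", "pdf"
    owner: str        # owning department


def search(docs, query, **filters):
    """Keep documents that contain every query term AND match every metadata filter."""
    terms = query.lower().split()
    results = []
    for doc in docs:
        haystack = f"{doc.title} {doc.text}".lower()
        # Naive fulltext step: every term must appear somewhere in title or text.
        if not all(term in haystack for term in terms):
            continue
        # Structural step: every requested metadata field must match exactly.
        if all(getattr(doc, field) == value for field, value in filters.items()):
            results.append(doc)
    return results


# Made-up sample data, for illustration only.
DOCS = [
    Document("Q3 review", "Quarterly sales by region", "xlsx", "controlling"),
    Document("Sales newsletter", "Quarterly sales highlights for all staff", "pdf", "marketing"),
    Document("Pipeline export", "Quarterly sales pipeline, raw export", "xlsx", "sales"),
    Document("Budget plan", "Cost and revenue forecast", "xlsx", "controlling"),
]

broad = search(DOCS, "quarterly sales")
refined = search(DOCS, "quarterly sales", file_format="xlsx", owner="controlling")
print(len(broad), [d.title for d in refined])  # 3 ['Q3 review']
```

The keywords alone return three candidates; adding the format and the owning department leaves exactly one. In a real system the metadata would come from the source applications, and the filters would more likely be offered as facets to click on than as function arguments.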

If this sounds like science fiction to you, send me a message and I'd be happy to demonstrate that this approach really works, and that you can spend your working hours doing something other than clicking through shared folders looking for that bloody file that should be there somewhere :)