One size does not fit all when it comes to building search
applications - it is important to think about the business domain and
user expectations. Here's a classic example from recruitment search
(a domain which has absorbed six years of my life already...) -
imagine you are a candidate searching for IT jobs
on your favourite job board.
Recall how a full-text index works as implemented in Solr or
Elasticsearch - the job posting documents are treated as a bag of
words (i.e. the order of the words doesn't matter in the first
instance). When indexing each job, the search engine tokenizes the
document to get a list of which words are included. Then, for each
individual word we create a list of the documents that include it.
Normally you tell the indexer to exclude so-called "stopwords"
which do not provide any useful information to the searcher -
e.g. "a", "is", "it", "to", "and". These terms are present in most if
not all documents, so would take up a huge amount of space in your
index for little benefit. The same stopwords are excluded from
queries to reduce the complexity of the search problem.
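
As a rough illustration of the two paragraphs above, here is a minimal Python sketch of a bag-of-words inverted index with stopword removal applied to both documents and queries. This is not how Solr or Elasticsearch are implemented internally; the function names (analyze, build_index, search), the sample jobs and the tiny stopword list are all made up for the example.

```python
import re
from collections import defaultdict

# Illustrative stopword list, taken from the examples in the text.
STOPWORDS = {"a", "is", "it", "to", "and"}

def analyze(text):
    """Lowercase the text, split it into word tokens and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def build_index(docs):
    """Map each remaining term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every non-stopword query term."""
    terms = analyze(query)
    if not terms:
        return set()  # the query consisted only of stopwords
    return set.intersection(*(index.get(term, set()) for term in terms))

# Hypothetical job postings, keyed by document id.
jobs = {
    1: "Support engineer to join a growing team",
    2: "Sales manager and account executive",
}
index = build_index(jobs)

print(search(index, "support engineer"))  # {1}
print("to" in index)                      # False: stopwords are never indexed
```

Word order plays no part in matching here, and the stopwords take up no space in the index because they are dropped before anything is stored.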
However, look at the word "it". It matches the term "IT"
case-insensitively - and it's quite common for candidates to use
lowercase when entering queries. So we want the query [it] to return
jobs containing "IT" - this means "it" cannot be a stopword for this application.
To solve this in Solr, we end up doing something much more involved:
- First, "it" is not included in our stopwords list.
- At index time, the term "IT" is mapped to "informationtechnology",
case-sensitively. (I believe this is so that phrase matches might
work? You can ensure that the phrase "Information Technology" maps to
the same token.)
- At query time, the term "it" and similar variants are mapped to the same token.
To implement this in Solr, use separate index-time and query-time
analyzers on the field, each pointing at a different synonym file.
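
The asymmetry between index time and query time can be simulated in a few lines of Python. This is only a sketch of the idea, not Solr configuration: the two functions stand in for the two analyzers, the hard-coded mappings stand in for the two synonym files, and every name apart from the "informationtechnology" token is an assumption made for the example.

```python
import re

CANONICAL = "informationtechnology"  # the shared token named in the text

# Note that "it" is deliberately absent from this illustrative stopword list.
STOPWORDS = {"a", "is", "to", "and"}

def tokenize(text):
    """Split into word tokens, preserving the original case."""
    return re.findall(r"[A-Za-z0-9]+", text)

def merge_phrase(tokens):
    """Collapse the phrase 'information technology' into the canonical token."""
    out, i = [], 0
    while i < len(tokens):
        pair = [t.lower() for t in tokens[i:i + 2]]
        if pair == ["information", "technology"]:
            out.append(CANONICAL)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def index_analyzer(text):
    """Index time: only upper-case 'IT' is mapped (case-sensitive)."""
    tokens = merge_phrase(tokenize(text))
    mapped = [CANONICAL if t == "IT" else t.lower() for t in tokens]
    return [t for t in mapped if t not in STOPWORDS]

def query_analyzer(text):
    """Query time: 'it' in any case is mapped to the canonical token."""
    tokens = merge_phrase(tokenize(text))
    mapped = [CANONICAL if t.lower() == "it" else t.lower() for t in tokens]
    return [t for t in mapped if t not in STOPWORDS]

def matches(job_text, query):
    """True if every analyzed query term appears in the analyzed job text."""
    indexed = set(index_analyzer(job_text))
    terms = query_analyzer(query)
    return bool(terms) and all(term in indexed for term in terms)

print(matches("IT Manager", "it manager"))                # True
print(matches("Head of Information Technology", "it"))    # True
print(matches("Great product and it sells itself", "it")) # False
```

In this sketch the pronoun "it" still ends up in the index as the plain token "it", but no query can reach it, because every query-side "it" has already been rewritten to the canonical token.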
While the implementation is quite ugly, the principle is simple:
the recruiter and the candidate intended different things when writing
the job posting versus the query, and we need to handle each according
to the intention of the author. For a different application that had
nothing to do with IT, you could safely ignore the word "it".