Tokenizing IT jobs | Tim Retout

One size does not fit all when it comes to building search applications - it is important to think about the business domain and user expectations. Here's a classic example from recruitment search (a domain which has absorbed six years of my life already...) - imagine you are a candidate searching for IT jobs on your favourite job board.

Recall how a full-text index works as implemented in Solr or Elasticsearch - the job posting documents are treated as a bag of words (i.e. the order of the words doesn't matter in the first instance). When indexing each job, the search engine tokenizes the document to get a list of which words are included. Then, for each individual word we create a list of which documents include each word.

Normally you tell the indexer to exclude so-called "stopwords" which do not provide any useful information to the searcher - e.g. "a", "is", "it", "to", "and". These terms are present in most if not all documents, so would take up a huge amount of space in your index for little benefit. The same stopwords are excluded from queries to reduce the complexity of the search problem.

However, look at the word "it". It matches the term "IT" case-insensitively - and it's quite common for candidates to use lowercase when entering queries. So we want the query [it] to return jobs containing "IT" - this means "it" cannot be a stopword for queries!

To solve this in Solr, we end up doing something much more complicated:

First, "it" is not included in our stopwords list.
At index time, the term "IT" is mapped to "informationtechnology", case-sensitively. (I believe this is so that phrase matches might work? You can ensure that the phrase "Information Technology" maps to the same token.)
At query time, the term "it" and similar is mapped to the same token.

To implement this in Solr, use a separate analyzer for index/query time on the field, pointing at different synonym files.

While the implementation is quite ugly, the principle is simple: the recruiter and the candidate intended different things when writing the job posting versus the query, and we need to handle each according to the intention of the author. For a different application that had nothing to do with IT, you could safely ignore the word "it".