Tim Retout's www presence

Tue, 29 May 2018

Tokenizing IT jobs

One size does not fit all when it comes to building search applications - it is important to think about the business domain and user expectations. Here's a classic example from recruitment search (a domain which has absorbed six years of my life already...) - imagine you are a candidate searching for IT jobs on your favourite job board.

Recall how a full-text index works, as implemented in Solr or Elasticsearch - the job posting documents are treated as a bag of words (i.e. the order of the words doesn't matter in the first instance). When indexing each job, the search engine tokenizes the document to get the list of words it contains. Then, for each word, it records which documents contain that word - this is the inverted index that queries are run against.
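As a rough illustration of the indexing step described above, here is a minimal sketch in plain Perl. It is not how Solr or Elasticsearch are implemented internally; the sample postings, the naive tokenizer and the helper names (tokenize, build_index, lookup) are all made up for the example.

```perl
use strict;
use warnings;

# Hypothetical job postings, keyed by document id.
my %docs = (
    1 => 'Java developer for IT consultancy',
    2 => 'IT support engineer',
    3 => 'Head chef for busy kitchen',
);

# Naive tokenizer: lowercase, then split on anything that is not a
# letter or digit. Real analyzers are considerably more careful.
sub tokenize {
    my ($text) = @_;
    return grep { length($_) } split /[^a-z0-9]+/, lc $text;
}

# Build the inverted index: term => { doc id => 1 }.
# Word order is discarded, which is the "bag of words" part.
sub build_index {
    my (%documents) = @_;
    my %index;
    for my $id (keys %documents) {
        $index{$_}{$id} = 1 for tokenize($documents{$id});
    }
    return \%index;
}

# Look up one term and return the matching doc ids in numeric order.
sub lookup {
    my ($index, $term) = @_;
    my $key = lc $term;
    my $postings = $index->{$key} || {};
    return sort { $a <=> $b } keys %$postings;
}

my $index = build_index(%docs);
print join(',', lookup($index, 'it')),   "\n";   # prints 1,2
print join(',', lookup($index, 'chef')), "\n";   # prints 3
```

Note that this toy index has no stopword list at all, so the query [it] finds both IT jobs.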

Normally you tell the indexer to exclude so-called "stopwords" which do not provide any useful information to the searcher - e.g. "a", "is", "it", "to", "and". These terms are present in most if not all documents, so would take up a huge amount of space in your index for little benefit. The same stopwords are excluded from queries to reduce the complexity of the search problem.

However, look at the word "it". It matches the term "IT" case-insensitively - and it's quite common for candidates to use lowercase when entering queries. So we want the query [it] to return jobs containing "IT" - this means "it" cannot be a stopword for queries!
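To see the problem concretely, here is a small sketch in plain Perl (again, not Solr itself). The analyze helper and both stopword lists are hypothetical; the only point is what happens to the query [it] under each list.

```perl
use strict;
use warnings;

# Lowercase, split into words, then drop anything in the given stopword set.
sub analyze {
    my ($text, $stopwords) = @_;
    return grep { length($_) && !$stopwords->{$_} }
           split /[^a-z0-9]+/, lc $text;
}

# A typical default list, and the same list with "it" taken out.
my %default_stopwords = map { $_ => 1 } qw(a is it to and);
my %jobs_stopwords    = map { $_ => 1 } qw(a is to and);

my @lost = analyze('it', \%default_stopwords);   # nothing left to search for
my @kept = analyze('it', \%jobs_stopwords);      # the term survives

printf "default list: %d term(s)\n", scalar @lost;   # prints "default list: 0 term(s)"
printf "jobs list: %d term(s)\n",    scalar @kept;   # prints "jobs list: 1 term(s)"
```

With the default list, the candidate's whole query is thrown away before the search even starts.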

To solve this in Solr, we end up doing something much more complicated:

  1. First, "it" is not included in our stopwords list.
  2. At index time, the term "IT" is mapped to "informationtechnology", case-sensitively. (I believe this is so that phrase matches might work? You can ensure that the phrase "Information Technology" maps to the same token.)
  3. At query time, the term "it" (and similar variants) is mapped to the same token, case-insensitively.

To implement this in Solr, define separate index-time and query-time analyzers on the field, each pointing at its own synonym file.
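The three steps can be sketched outside Solr to show why the asymmetry works. This is plain Perl standing in for the two analyzers, not Solr configuration; the function names, the regex-based "synonym" rules and the stopword list are assumptions made for the example.

```perl
use strict;
use warnings;

my $TOKEN = 'informationtechnology';

# Step 1: "it" is deliberately absent from the stopword list.
my %stopwords = map { $_ => 1 } qw(a is to and);

# Shared tail of both analyzers: lowercase, split, drop stopwords.
sub finish {
    my ($text) = @_;
    return grep { length($_) && !$stopwords{$_} }
           split /[^a-z0-9]+/, lc $text;
}

# Step 2: at index time, map only the upper-case acronym (and the
# spelled-out phrase) to the token. The pronoun "it" is left alone.
sub analyze_for_index {
    my ($text) = @_;
    $text =~ s/\bInformation Technology\b/$TOKEN/g;
    $text =~ s/\bIT\b/$TOKEN/g;
    return finish($text);
}

# Step 3: at query time, map "it" in any case to the same token,
# because candidates often type the acronym in lowercase.
sub analyze_for_query {
    my ($text) = @_;
    $text =~ s/\binformation technology\b/$TOKEN/gi;
    $text =~ s/\bit\b/$TOKEN/gi;
    return finish($text);
}

my @indexed = analyze_for_index('IT support: is it for you?');
my @queried = analyze_for_query('it jobs');

print "index: @indexed\n";   # prints "index: informationtechnology support it for you"
print "query: @queried\n";   # prints "query: informationtechnology jobs"
```

The pronoun in the posting is still indexed as plain "it", but no query can ever produce that token, so [it] only matches postings where the recruiter wrote "IT" or "Information Technology".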

While the implementation is quite ugly, the principle is simple: the recruiter and the candidate intended different things when writing the job posting versus the query, and we need to handle each according to the intention of the author. For a different application that had nothing to do with IT, you could safely ignore the word "it".

Posted: 29 May 2018 09:55

Mon, 02 Dec 2013

How not to parse search queries

While I remember, I have uploaded the slides from my talk about Solr and Perl at the London Perl Workshop.

This talk was inspired by having seen and contributed to at least five different sets of Solr search code at my current job, all of which (I now believe) were doing it wrong. I distilled this hard-won knowledge into a 20-minute talk, which - funny story - I actually delivered twice to work around a cock-up in the printed schedule. I don't believe any video was successfully taken, but I may be proved wrong later.

I have also uploaded the Parse::Yapp grammar mentioned in the talk.

In case you don't have time to read the slides, the right way to present Solr via Perl is to use the 'edismax' parser, and write your code a bit like this:

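use WebService::Solr;
use WebService::Solr::Query;

my $solr = WebService::Solr->new($url);

# Pass the user's query string through exactly as typed.
my $s = $query->param('q');

# WebService::Solr::Query objects are useful for
# 'fq' params, but avoid them for main 'q' param.
my $options = {
    # Select edismax per request; it could equally be set as the
    # request handler's default on the Solr side.
    defType => 'edismax',
    fq      => [ WebService::Solr::Query->new(...) ],
};

# $options is already a hash reference, so pass it directly.
$solr->search($s, $options);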

The key thing here is not to put any complicated parsing code in between the user and Solr. Avoid Search::QueryParser at all costs.

Posted: 02 Dec 2013 22:58

Contact

Tim Retout tim@retout.co.uk
JabberID: tim@retout.co.uk

Comments

I'm afraid I have turned off comments for this blog, because of all the spam. Let's face it, I didn't read them anyway. Feel free to email me.

Copyright © 2007-2014 Tim Retout