Tim Retout's www presence

Tue, 29 May 2018

Tokenizing IT jobs

One size does not fit all when it comes to building search applications - it is important to think about the business domain and user expectations. Here's a classic example from recruitment search (a domain which has absorbed six years of my life already...) - imagine you are a candidate searching for IT jobs on your favourite job board.

Recall how a full-text index works as implemented in Solr or Elasticsearch - the job posting documents are treated as a bag of words (i.e. the order of the words doesn't matter in the first instance). When indexing each job, the search engine tokenizes the document to get a list of the words it contains. Then, for each word, it records a list of the documents that include that word.
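As a rough illustration of that idea, here is a minimal inverted index in Python. This is only a sketch: the `tokenize` and `build_index` helpers and the sample jobs are made up for this post, and real engines like Solr do far more (scoring, positions, and so on).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


# Hypothetical job postings, keyed by document id.
jobs = {
    1: "IT support engineer",
    2: "Support worker in a care home",
    3: "Software engineer",
}

index = build_index(jobs)
print(sorted(index["support"]))   # [1, 2]
print(sorted(index["engineer"]))  # [1, 3]
```

Word order is thrown away here, which is the "bag of words" part; no stopwords are removed yet.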

Normally you tell the indexer to exclude so-called "stopwords" which do not provide any useful information to the searcher - e.g. "a", "is", "it", "to", "and". These terms are present in most if not all documents, so would take up a huge amount of space in your index for little benefit. The same stopwords are excluded from queries to reduce the complexity of the search problem.

However, look at the word "it". It matches the term "IT" case-insensitively - and it's quite common for candidates to use lowercase when entering queries. So we want the query [it] to return jobs containing "IT" - this means "it" cannot be a stopword for queries!

To solve this in Solr, we end up doing something much more complicated:

  1. First, "it" is not included in our stopwords list.
  2. At index time, the term "IT" is mapped to "informationtechnology", case-sensitively. (I believe this is so that phrase matches might work? You can ensure that the phrase "Information Technology" maps to the same token.)
  3. At query time, the term "it" (and similar variants) is mapped to the same token.

To implement this in Solr, configure separate index-time and query-time analyzers on the field, each pointing at a different synonym file.
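To make the asymmetry concrete, here is a small Python stand-in for the two analyzers. It is not Solr configuration, just a sketch of the behaviour: the two dictionaries play the role of the synonym files, and all the names (`INDEX_SYNONYMS`, `QUERY_SYNONYMS`, `analyze`) are invented for illustration.

```python
# Stopword list with "it" deliberately left out.
STOPWORDS = {"a", "is", "to", "and"}

# Hypothetical stand-ins for the two synonym files.
INDEX_SYNONYMS = {"IT": "informationtechnology"}  # matched case-sensitively
QUERY_SYNONYMS = {"it": "informationtechnology"}  # matched after lowercasing


def analyze(text, synonyms, case_sensitive):
    """Split on whitespace, apply synonyms, lowercase, drop stopwords."""
    tokens = []
    for word in text.split():
        word = word.strip(".,!?")
        key = word if case_sensitive else word.lower()
        token = synonyms.get(key, word.lower())
        if token and token not in STOPWORDS:
            tokens.append(token)
    return tokens


def index_analyzer(text):
    """Index time: only an upper-case "IT" becomes the special token."""
    return analyze(text, INDEX_SYNONYMS, case_sensitive=True)


def query_analyzer(text):
    """Query time: "it" in any case becomes the special token."""
    return analyze(text, QUERY_SYNONYMS, case_sensitive=False)


print(index_analyzer("IT manager wanted"))  # ['informationtechnology', 'manager', 'wanted']
print(index_analyzer("Is it a good job"))   # ['it', 'good', 'job']
print(query_analyzer("it manager"))         # ['informationtechnology', 'manager']
```

The pronoun "it" still ends up in the index as a plain token, but the query [it] never looks for it, so only postings that wrote "IT" will match.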

While the implementation is quite ugly, the principle is simple: the recruiter and the candidate intended different things when writing the job posting versus the query, and we need to handle each according to the intention of the author. For a different application that had nothing to do with IT, you could safely ignore the word "it".

Posted: 29 May 2018 09:55

Mon, 14 Aug 2017

Jenkins milestone steps do not work yet

Public Service Announcement for anyone relying on Jenkins for continuous deployment - the milestone step plugin as of version 1.3.1 will not function correctly if you could have more than two builds running at once - older builds could get deployed after newer builds.

See JENKINS-46097.

A possible workaround is to add an initial milestone at the start of the pipeline, which will then allow builds to be killed early. (Builds are only killed early once they have passed their first milestone.)

Going by the source history, I reckon this bug has been present since the milestone-step plugin was created.

Posted: 14 Aug 2017 15:17

Tue, 25 Apr 2017

Packet.net arm64 servers

Packet.net offer an ARMv8 server with 96 cores for $0.50/hour. I signed up and tried building LibreOffice to see what would happen. Debian isn't officially supported there yet, but they offer Ubuntu, which suffices for testing the hardware.

[Screenshot: htop showing one core in use and 95 idle.]

Final build time: around 12 hours, compared to 2hr 55m on the official arm64 buildd.

Most of the LibreOffice build appeared to consist of "touch /some/file" repeated endlessly - I have a suspicion that the I/O performance might be low on this server (although I have no further evidence to offer for this). I think the next thing to try is building on a tmpfs, because the server has 128GB RAM available, and it's a shame not to use it.

Posted: 25 Apr 2017 12:38

Sun, 01 Jan 2017

Happy New Year!

Happy New Year!

Apparently I failed to write a blog entry in all of 2016, and almost all of 2015. Probably says something profound about the rise of social media, or perhaps I was just very busy. I bet my writing has suffered.

I have spent the last few days tidying up and clearing out clothes, bits of paper, and wires. I think there's light at the end of the tunnel.

Posted: 01 Jan 2017 23:17

Sat, 17 Jan 2015

CPAN PR Challenge - January - IO-Digest

I signed up to the CPAN Pull Request Challenge - apparently I'm entrant 170 of a few hundred.

My assigned dist for January was IO-Digest - this seems a fairly stable module. To get the ball rolling, I fixed the README, but this was somehow unsatisfying. :)

To follow up, I added Travis-CI support, with a view to validating the other open pull request - but that one looks likely to be a platform-specific problem.

Then I extended the Travis file to generate coverage reports, and separately realised the docs weren't quite complete, so I fixed this and added a test.

Two of these have already been merged by the author, who was very responsive.

Part of me worries that GitHub is a centralized, proprietary platform that we now trust most of our software source code to. But activities such as this are surely a good thing - how much harder would it be to co-ordinate 300 volunteers to submit patches in a distributed fashion? I suppose you could do something similar with the list of Debian source packages and metadata about the upstream VCS, say...

Posted: 17 Jan 2015 22:01

Thu, 15 Jan 2015

Docker London Meetup - January 2015

Last week, I visited London for the January Docker meetup, which was the first time I'd attended this group.

It was a talk-oriented format, with around 200 attendees packed into Shoreditch Village Hall; free pizza and beer were provided thanks to the sponsors, which was awesome (and makes logistics easier when you're travelling there from work).

There were three talks.

First, Andrew Martin from British Gas spoke about how they use Docker for testing and continuous deployment of their Node.js microservices - buzzword bingo! But it's helpful to see how companies approach these things.

Second, Johan Euphrosine from Google gave a short demo of Google Cloud Platform for running Docker containers (mostly around Container Engine, but also briefly App Engine). This was relevant to my interests, but I'd already seen this sort of talk online.

Third, Dan Williams presented his holiday photos featuring a journey on a container ship, which wins points from me for liberal interpretation of the meetup topic, and was genuinely very entertaining/interesting - I just regret having to leave to catch a train halfway through.

In summary, this was worth attending, but as someone just getting started with containers I'd love some sort of smaller meetings with opportunities for interaction/activity. There's such a variety of people/use cases for Docker that I'm not sure how much everyone had in common with each other; it would be interesting to find out.

Posted: 15 Jan 2015 07:45

Fri, 02 Jan 2015


Kate's been reading some book or other by KonMari. Hence we've rehomed lots of clothes, books and DVDs to charity and various places.

I am told the key is to ask, "Does this item bring me joy?" Then if it doesn't bring you enough joy, it goes. The nice thing was, it was actually exciting to reveal the gems among my bookshelves, which were previously hidden by a load of second-rate books.

True story: I was sitting downstairs deciding whether to splash out £25 for a particular book. Was called upstairs to make some 'joy decisions', and saw the very same book on the shelf already. Fast delivery!

Posted: 02 Jan 2015 21:52

Thu, 01 Jan 2015

Looking back at 2014

I have a tendency to forget what I've been up to - so I made a list for 2014.

I started the year having recently watched many 30c3 videos online - these were fantastic, and I really should get round to the ones from 31c3. January is traditionally the peak time for the recruitment industry, so at work we were kept busy dealing with all the traffic. We'd recently switched the main job search to use Solr rather than MySQL, which helped - but we did spend a lot of time during the early months of the year converting tables from MyISAM to InnoDB.

At the start of February was FOSDEM, and Kate and I took Sophie (then aged 10 months) to her first software conference. I grabbed a spot in the Go devroom for the Sunday afternoon, which was awesome. Downside: we got horribly ill while in Brussels.

At work I was sorting out configuration management - this led to some Perl module backporting for Debian, and I uploaded ZooKeeper at some point during the year as well. We currently make use of Vagrant, Chef and a combination of Debian packages and cpanm for Perl modules, but I have big plans to improve on that this year.

Over a break from work I hacked up apt-transport-tor, which lets you install Debian packages over the Tor network. (This was inspired by videos from 30c3 and/or LibrePlanet, I think?) Continuing the general theme of paranoia, I attended the Don't Spy On Us campaign's day of action in June.

Over the summer at work I was experimenting with StatsD and Graphite for monitoring. I also wrote Toggle, a Perl module for feature flags. In July I attended a London.pm meeting for the first time, and heard Thomas Klausner talk about OX - this nudged me into various talks at LPW (see below). Pubs have a lot to answer for.

At some point I got an IPv6 tunnel working at home (although my ISP-provided router's wireless doesn't forward it), and I had an XBMC install going on a Raspberry Pi as another fun hack.

In August and September I worked on packaging pump.io for Debian, and attended IndieWebCamp Brighton, where I delivered a talk/workshop on setting up TLS. (This all ties in to the paranoia theme.) I stalled the work on pump.io, partly because of licensing issues at build-dependency time (if you want to run all the tests) - but I expect I'll pick this up in 2015 once jessie is released.

November was the London Perl Workshop, where I presented my work from the summer on StatsD/Graphite and Toggle, and a Bread::Board lightning talk. LPW was more enjoyable for me this year than in previous years, probably because of the interesting people discussing various aspects of how feature flags ought to work. The Cambridge MiniDebConf was on at the same time (why do these always clash?), where I think I fixed at least one RC bug.

This is not an exhaustive list of everything I've done this year - there are more changes now lined up for 2015 which I haven't shared yet. But looking back, I'm pleased that the many small experiments I get up to do add up to something over time, and I can see that I'm achieving something. Here's to another year!

Posted: 01 Jan 2015 21:38

Sun, 31 Aug 2014

Website revamp

This weekend I moved my blog to a different server. This meant I could:

I've tested it, and it's working. I'm hoping that I can swap out the Node.js modules one-by-one for the Debian-packaged versions.

Posted: 31 Aug 2014 22:04

Thu, 28 Aug 2014

Pump.io update 1

[The story so far: I'm packaging pump.io for Debian.]

4 packages uploaded to NEW:

  • node-webfinger
  • validator.js
  • websocket-driver
  • node-openid

2 packages eliminated as not needed:

  • set-immediate - deprecated
  • crypto-cacerts - not needed on Debian

1 package in progress:

  • node-databank

Got my eye on:

  • oauth-evanp - this is a fork with two patches, so I need to investigate the status of those.
  • node-iconv-lite - needs files downloaded from the internet, so I'm considering how to add them to the source package
  • dateformat/moment - there's an open discussion about combining Node.js modules, and I'm wondering if these are affected.


Currently I'm averaging around one package upload a day, I think? Which would mean ~1 month to go? But there may be challenges around getting packages through the NEW queue in time to build-depend on them.

Someone has asked my temporary Twitter account whether I have a pump.io account. Technically, yes, I do - but I don't post anything on it, because I want to run my own server in the long term. I always find running my own server easier if I'm installing software from Debian packages. Hence this work. Sledgehammer, meet nut.

Posted: 28 Aug 2014 13:59


Tim Retout tim@retout.co.uk
JabberID: tim@retout.co.uk


I'm afraid I have turned off comments for this blog, because of all the spam. Let's face it, I didn't read them anyway. Feel free to email me.


Copyright © 2007-2014 Tim Retout