Tim Retout's www presence

Sat, 15 Feb 2014

Backporting some Perl modules

I've started backporting some Perl modules to wheezy-backports - for starters, libbread-board-perl, which is now waiting in BACKPORTS-NEW.

At work I've recently been trying to automate the deployment of our platform, and was originally trying to use Carton to manage the CPAN dependencies for us. It seems like it ought to be possible to make this work using CPAN-only tools. However, in practice, I've seen two strong negatives with this approach:

  • it's a lot of work for developers to manage the entire dependency chain, and
  • it takes forever to get the environment running.

Consider what happens when you spin up a fresh VM: you need to build Perl from source, then compile every CPAN module you depend on - including all the modules needed to run the test suites. That's not going to be fast. And after all that, you still need a solution that works with the distro's package management, because the build dependencies have to be installed from somewhere.
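
For reference, the Carton side of this looks roughly like the sketch below - a cpanfile listing the direct dependencies (the module names here are purely illustrative, not our real list), which 'carton install' then resolves, builds and tests locally on every machine:

# cpanfile - module names are illustrative only
requires 'Bread::Board';
requires 'Plack', '>= 1.0';
requires 'DBIx::Class';

# then, on each machine that needs the environment:
#   carton install    # downloads, builds and tests the whole chain under local/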

So I'm trying a new approach: proper Debian packages, backported where necessary. If someone else benefits from the packages I backport, even better.

Posted: 15 Feb 2014 22:20 | Tags: ,

Thu, 06 Feb 2014

FOSDEM 2014

I attended FOSDEM this year. As always, it was very busy, and the Brussels transport system was as confusing as ever. This time it was nice to accidentally bump into so many people I know from years past.

Lunar's talk on reproducible builds of Debian packages was interesting - being able to independently verify that a particular binary package was built from a particular source package is quite attractive.

Mailpile also announced an alpha release. The part of the talk that was new to me was the description of exactly how the search function works. Seeing how they integrate GPG into the contacts/compose features brought home how lacking most (all?) other mail clients are when it comes to usable encryption.

On Sunday afternoon I managed to grab a seat in the Go devroom, to hear about crazy things like how YouTube are putting a daemon called Vitess in front of MySQL to abstract away sharding (at the cost of some transaction guarantees). You would have thought Google would already have a scalable data store of some sort?

Other bits I remember: Michael Meeks talking about GPU-enabling spreadsheet formulae calculations. And hearing Wietse Venema talk about Postfix was pretty awesome.

Posted: 06 Feb 2014 08:23 | Tags:

Thu, 02 Jan 2014

OpenVPN and easy-rsa

One of those enlightenment moments that I should have had sooner: every time I have seen someone set up an OpenVPN VPN, they have generated all the certificates on the VPN server as root using easy-rsa. This is kind of strange, because you end up with an incredibly sensitive directory on the VPN server containing every private key for every client.

Another angle is whether you trust the random number generators used to create all these keys - does your hosting provider use a weak RNG?

Instead, you could set up your CA using easy-rsa on a separate machine - perhaps even air-gapped. Then a private key can be generated on each machine that wants to join the VPN, and only a certificate request sent to the CA, which signs it and returns the certificate. (The easy-rsa package has been split out of the openvpn package in Debian unstable, which makes this separation more natural.)
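
To make that concrete, here is a minimal Perl sketch of the client side of that workflow - it shells out to openssl rather than using easy-rsa itself, and the filenames are made up - the point being that the private key never leaves the machine joining the VPN:

#!/usr/bin/perl
# Hypothetical sketch: generate the key and CSR locally; only the CSR
# is copied to the (ideally offline) CA machine for signing.
use strict;
use warnings;

my $name = shift || 'client1';

system('openssl', 'req', '-new', '-nodes',
       '-newkey', 'rsa:2048',
       '-keyout', "$name.key",
       '-out',    "$name.csr",
       '-subj',   "/CN=$name") == 0
    or die "openssl req failed\n";

chmod 0600, "$name.key" or die "chmod $name.key: $!\n";
print "Send $name.csr to the CA for signing; $name.key stays here.\n";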

Is there a security benefit? You could argue that if your VPN server has been compromised, then you are already in trouble. But I'm thinking about a setup where I could run multiple VPN servers for redundancy, signed by the same CA - then if one server gets broken into, you could kill it without having to revoke all the client keys.

By the way, the default RSA key size used by easy-rsa is 1024 bits at the time of writing (fixed upstream: Debian bug #733905). This is simple to change, but you need to know to do it. One of the 30c3 lightning talks was about bettercrypto.org - a guide to which cryptography settings to choose for commonly used software.

Posted: 02 Jan 2014 20:56 | Tags: ,

Wed, 01 Jan 2014

2014

So, happy new year. :)

I watched many 30c3 talks via the streams over Christmas - they were awesome. I especially enjoyed finding out (in the Tor talk) that the Internet Watch Foundation need to use Tor when checking out particularly dodgy links online, else people just serve them up pictures of kittens.

Today's fail: deciding to set up OpenVPN, then realising the OpenVZ VPS I was planning to use would not support /dev/net/tun.

I'm back at work tomorrow, preparing for the January surge of people looking for jobs. Tonight, the first Southampton Perl Mongers meeting of the year.

Posted: 01 Jan 2014 18:02 | Tags:

Mon, 02 Dec 2013

How not to parse search queries

While I remember, I have uploaded the slides from my talk about Solr and Perl at the London Perl Workshop.

This talk was inspired by having seen and contributed to at least five different sets of Solr search code at my current job, all of which (I now believe) were doing it wrong. I distilled this hard-won knowledge into a 20 minute talk, which - funny story - I actually delivered twice to work around a cock-up in the printed schedule. I don't believe any video was successfully taken, but I may be proved wrong later.

I have also uploaded the Parse::Yapp grammar mentioned in the talk.

In case you don't have time to read the slides, the right way to present Solr via Perl is to use the 'edismax' parser, and write your code a bit like this:

use WebService::Solr;
use WebService::Solr::Query;

my $solr = WebService::Solr->new($url);
my $s    = $query->param('q');

# WebService::Solr::Query objects are useful for
# 'fq' params, but avoid them for the main 'q' param.
my %options = (
    fq => [ WebService::Solr::Query->new(...) ],
);

$solr->search($s, \%options);
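
If the request handler on the Solr side is not already configured to use edismax by default, my understanding (an assumption on my part, not something covered in the talk) is that you can select it per request through the same options hash:

my %options = (
    defType => 'edismax',
    fq      => [ WebService::Solr::Query->new(...) ],
);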

The key thing here is not to put any complicated parsing code in between the user and Solr. Avoid Search::QueryParser at all costs.

Posted: 02 Dec 2013 22:58 | Tags: , ,

Questhub.io

At the London Perl Workshop last Saturday, one of the lightning talks was about Questhub.io, formerly known as "play-perl.org".

It's social gamification for your task list, or something like that. Buzzword-tastic! But most importantly, there seems to be a nice community of programming types to procrastinate with you on your quests. This means I can finally get to work refuting lamby's prediction about gamification of Debian development!

Tasks are referred to as "Quests", and are pursued in themed "Realms", for that World of Warcraft feeling. For example, there's a "Perl" realm, and a "Lisp" realm, and a "Haskell" realm, but also non-programming realms like "Fitness" and "Japanese".

Of course, part of me now wants to construct a federated version which can be self-hosted. :) Another downside of questhub currently is the lack of SSL support - your session cookies are sent in plain text. I hope this changes soon.

Posted: 02 Dec 2013 21:55 | Tags: , ,

Sun, 16 Jun 2013

Sophie

It's my first Father's Day! Sophie was born 2 months ago (3345g or 7lb 6oz), and I've been on a blogging hiatus for quite a bit longer than that. She's very cute.

I am getting into the swing of fatherhood - lots of nappy changing. :) I took my two weeks of paternity leave, but spread the second "week" over two weeks by working just afternoons, which gave me lots of time with mummy and baby. We watched a DVD called "The Happiest Baby on the Block", and mastered the techniques therein (mainly swaddling and white noise). So all things considered, we're getting quite a bit of sleep.

Sophie is very curious about my typing, and leans towards anything she's interested in... so she's currently suspended at an angle beside me. Maybe she'll be interested in what her parents do, when she grows up. :) But for now, we're enjoying that she's learned to smile.

Posted: 16 Jun 2013 22:45 | Tags:

Thu, 03 Jan 2013

New Year

Another year. 2012 was busy - I moved house twice, changed jobs, and got married. In 2013, I should become a father, fingers crossed (due mid-April). Change is a familiar friend now.

I just listened to Tom Armitage speaking about coding on Radio 4 - I /think/ the podcast mp3 link will work for people outside the UK, but the iPlayer probably won't. If you can get hold of it, it's worth the 20 minutes of your time.

If I had to make a New Year's resolution, it would be to listen to more Radio 4 - there's such a lot of it, though. I'm going to try subscribing to some of their podcasts and listening to them on my commute - timeshifting some of the best bits. Might work.

Posted: 03 Jan 2013 23:09 | Tags: , ,

Fri, 21 Dec 2012

Perl Forking, Reference Counting and Copy-on-Write

I have been dealing with an interesting forking issue at work. It happens to involve Perl, but don't let that put you off.

So, suppose you need to perform an I/O-bound task that is eminently parallelizable (in our case, generating and sending lots of emails). You have learnt from previous such attempts, and broken out Parallel::Iterator from CPAN to give you easy fork()ing goodness. Forking can be very memory-efficient, at least under the Linux kernel, because pages are shared between the parent and the children via a copy-on-write system.

Further suppose that you want to generate and share a large data structure between the children, so that you can iterate over it. Copy-on-write pages should be cheap, right?

use Parallel::Iterator qw( iterate );

my $large_array_ref = get_data();

my $iter = iterate( sub {
    my $i = $_[1];
    my $element = $large_array_ref->[$i];

    ...
}, [0..1000000] );

Sadly, when you run your program, it gobbles up memory until the OOM killer steps in.

Our first problem was that the system malloc implementation was worse suited to this particular workload than Perl's built-in malloc. Not a problem - we were using perlbrew anyway, so a few quick experimental rebuilds later, this was solved.

More interesting was the slow, 60MB/s leak that we saw after that. There were no circular references, and everything was going out of scope at the end of the function, so what was happening?

Recall that Perl uses reference counting to track memory allocation. In the children, because we took a reference to an element of the large shared data structure, we were effectively writing to the relevant page in memory, so it would get copied. Over time, as we iterated through the entire structure, the children would end up copying almost every page! This would double our memory costs. (We confirmed the diagnosis using 'smem', incidentally. Very useful.)

The copy-on-write semantics of fork() do not play well with reference-counted interpreted languages such as Perl or CPython. Apparently a similar issue occurs with some mark-and-sweep garbage-collection implementations - but Ruby 2.0 is reputed to be COW-friendly.

All was not lost, however - we just needed to avoid taking any references! The fix was to implement a deep copy that never stores a reference to any part of the shared structure along the way. This can be a bit long-winded, but it works.

my $large_array_ref = get_data();

my $iter = iterate( sub {
    my $i = $_[1];
    my %clone;

    # Copy plain values field by field, without ever storing a
    # reference to any part of the shared structure.
    $clone{id}  = $large_array_ref->[$i]{id};
    $clone{foo} = $large_array_ref->[$i]{foo};
    ...
}, [0..1000000] );

This could be improved if we wrote an XS CPAN module that cloned data structures without incrementing any reference counts - I presume this is possible. We tried the most common deep-copy modules from CPAN, but have not yet found one that avoids reference counting.

This same problem almost certainly shows up when using the Apache prefork MPM and mod_perl - even read-only global variables can become unshared.

I would be very interested to learn of any other approaches people have found to solve this sort of problem - do email me.

Posted: 21 Dec 2012 22:38 | Tags:

Sat, 27 Oct 2012

Recruiting

On Monday, I need to start hiring a Perl programmer - or, at least, a programmer willing to write Perl. I work for a website where people post their CVs, which tends to help - although it also means my boss wants me to do it without going through recruiters. Which is fine. I just have to use the search interface that recruiters normally use.

And looking through all these CVs, it dawned on me that I don't have a clue whether any of the people are suitable for the job. I have to search for keywords that we think might be relevant - "Perl", I guess - and then sort through the hundreds of people who come back from the search. It's very painful, because you can't really judge a CV without reading it - and even that won't necessarily tell you the important things about that person. Do they actually write good code? Do they work well in a team?

When searching for a piece of information, you probably need just one website to answer your question; when searching for job candidates, I guess you need to see a range of CVs. And then you need to interview them; this could take weeks.

Sucks to be me.

Posted: 27 Oct 2012 17:21 | Tags:

Copyright © 2007-2012 Tim Retout