I have been dealing with an interesting forking issue at work. It happens to involve Perl, but don't let that put you off.
So, suppose you need to perform an I/O-bound task that is eminently parallelizable (in our case, generating and sending lots of emails). You have learnt from previous such attempts, and broken out Parallel::Iterator from CPAN to give you easy fork()ing goodness. Forking can be very memory-efficient, at least under the Linux kernel, because pages are shared between the parent and the children via a copy-on-write system.
Further suppose that you want to generate a large data structure and share it between the children, so that you can iterate over it. Copy-on-write pages should be cheap, right?
    use Parallel::Iterator qw( iterate );

    my $large_array_ref = get_data();

    my $iter = iterate( sub {
        my $i       = $_[1];
        my $element = $large_array_ref->[$i];
        ...
    }, [0 .. 1000000] );
Sadly, when you run your program, it gobbles up memory until the OOM killer steps in.
Our first problem was that the system malloc implementation performed worse on this particular task than Perl's built-in malloc. Not a problem: we were using perlbrew anyway, so a few quick experimental rebuilds later this was solved.
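Something along these lines, assuming a perlbrew setup (the version number here is illustrative; -Dusemymalloc=y is the stock Configure flag that selects Perl's built-in malloc):

    # Build a perl that uses Perl's own malloc rather than the system one.
    perlbrew install perl-5.16.3 -Dusemymalloc=y

    # Check which malloc a given perl was built with:
    perl -V:usemymalloc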
More interesting was the slow, 60MB/s leak that we saw after that. There were no circular references, and everything was going out of scope at the end of the function, so what was happening?
Recall that Perl uses reference counting to track memory allocation, and that each value's reference count lives in that value's own header, right alongside the data. In the children, taking a reference to an element of the large shared structure incremented that element's reference count - a write to the page holding it, so the kernel duly copied the page. Over time, as we iterated through the entire structure, the children would end up copying almost every page, doubling our memory costs. (We confirmed the diagnosis using 'smem', incidentally. Very useful.)
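You can watch the mechanism at work with Devel::Peek, which dumps a scalar's internals, reference count included:

    use Devel::Peek;

    my @data = ('hello');
    Dump $data[0];          # REFCNT = 1

    my $ref = \$data[0];    # merely taking a reference...
    Dump $data[0];          # ...REFCNT = 2: the scalar's header was written to

In a forked child, that one write is enough to make the kernel copy the whole page the scalar lives on.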
The copy-on-write semantics of fork() do not play well with reference-counted interpreted languages such as Perl or CPython. Apparently a similar issue occurs with some mark-and-sweep garbage-collection implementations, which write mark bits into the objects themselves - but Ruby 2.0, which keeps those bits in a separate bitmap, is reputed to be COW-friendly.
All was not lost, however - we just needed to avoid taking any references! Instead, perform a deep copy by hand, copying each field out by value without saving any intermediate references to the shared structure along the way. This can be a bit long-winded, but it works.
    my $large_array_ref = get_data();

    my $iter = iterate( sub {
        my $i = $_[1];
        my %clone;
        $clone{id}  = $large_array_ref->[$i]{id};
        $clone{foo} = $large_array_ref->[$i]{foo};
        ...
    }, [0 .. 1000000] );
This could be improved by an XS CPAN module that cloned data structures without incrementing any reference counts - I presume this is possible. We tried the most common deep-copy modules from CPAN, but have not yet found one that avoids touching reference counts.
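For what it is worth, here is an untested pure-Perl sketch of the idea (clone_noref is a made-up name, and it handles only nests of hashes, arrays and plain scalars). It leans on the fact that @_ aliases a sub's arguments rather than copying them, so the source is only ever reached through $_[0] and - in theory - its reference counts are never incremented:

    use strict;
    use warnings;

    # Speculative: deep-copy a structure without bumping its refcounts.
    # We never assign any part of the source to a variable of our own;
    # we only ever look at it through the @_ alias.
    sub clone_noref {
        my $type = ref $_[0];
        if ( $type eq 'HASH' ) {
            return { map { $_ => clone_noref( $_[0]{$_} ) } keys %{ $_[0] } };
        }
        if ( $type eq 'ARRAY' ) {
            return [ map { clone_noref( $_[0][$_] ) } 0 .. $#{ $_[0] } ];
        }
        return $_[0];    # plain scalar: copied by value into the new structure
    }

Whether this actually keeps the shared pages clean would need verifying with smem; an XS implementation could walk the structure the same way with stronger guarantees.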
This same problem almost certainly shows up when using the Apache prefork MPM and mod_perl - even read-only global variables can become unshared.
I would be very interested to learn of any other approaches people have found to solve this sort of problem - do email me.